Anonymization of Legal Documents using Large Language Models with Continued Pretraining and Finetuning

Keywords:

anonymization, entity extraction, continued pretraining, finetuning, legal domain

Abstract

To perform inference and text generation with large language models trained on datasets containing court rulings and legal documents, it is essential to ensure the confidentiality of personal data and the protection of sensitive information. In this work, we propose a methodology for the anonymization of legal databases based on entity extraction using advanced language models. Two open-source language models, LLaMA 3.1 (8B) and Qwen 2.5 (7B), are evaluated. Each language model is trained in two stages: first, a continued pretraining phase in which the model is adapted to legal language, improving its ability to understand and generate text in this specialized domain. To this end, we use a corpus of more than 26,000 legal documents composed of legislation, legal doctrine, and case law. The impact of the pretraining phase is evaluated with metrics such as BLEU, BERTScore, and perplexity. In the second stage, a task-specific finetuning is performed for anonymization and entity extraction, using a dataset of 150 segments. The finetuning was evaluated on a test set of 50 segments, achieving 92.79% correct anonymization with Qwen 2.5 (7B) and 91.58% with LLaMA 3.1 (8B), improvements of 4.73% and 12.87% respectively over the corresponding base models with finetuning only, highlighting the influence of continued pretraining as a preliminary step. Both training phases, continued pretraining and finetuning, were conducted using LoRA.
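Both training phases rely on LoRA, which freezes the base model weights and learns a low-rank additive update. The following is a minimal NumPy sketch of the LoRA reparameterization only (an illustration of the general technique, not the authors' implementation; dimensions and scaling are illustrative):

```python
import numpy as np

def lora_delta(A, B, alpha, r):
    # LoRA weight update: scaled product of the low-rank factors,
    # delta_W = (alpha / r) * B @ A, with rank at most r.
    return (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

# At initialization B is zero, so the adapted weight equals the base weight.
W_adapted = W + lora_delta(A, B, alpha, r)
assert np.allclose(W_adapted, W)

# After training, B is nonzero; the update is still at most rank r,
# so far fewer parameters are trained than in full finetuning.
B_trained = rng.standard_normal((d_out, r))
delta = lora_delta(A, B_trained, alpha, r)
assert np.linalg.matrix_rank(delta) <= r
```

In practice such adapters are attached to the attention projection matrices of the transformer, so only the small A and B factors are updated during both continued pretraining and finetuning.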

Published

2025-10-15

Section

ASAID - Argentine Symposium on Artificial Intelligence and Data Science

How to Cite

Ortman, S. O., Canteros, L. B., Vargas, F., Escalante, G., González Coene, A., & Pulido, M. (2025). Anonymization of Legal Documents using Large Language Models with Continued Pretraining and Finetuning. JAIIO, Jornadas Argentinas De Informática, 11(1), 325-339. https://revistas.unlp.edu.ar/JAIIO/article/view/19829