Anonymization of Legal Documents using Large Language Models with Continued Pretraining and Finetuning
Keywords:
anonymization, entity extraction, continued pretraining, finetuning, legal domainAbstract
To perform inference and text generation with large language models trained on datasets containing court rulings and legal documents, it is essential to ensure the confidentiality of personal data and the protection of sensitive information. In this work, we propose a methodology for the anonymization of legal databases based on entity extraction using advanced language models. Two open-source language models, LLaMA 3.1 (8B) and Qwen 2.5 (7B) are evaluated. Each language model is trained in two stages: first, a continued pretraining phase in which the model is adapted to legal language, improving its ability to understand and generate text in this specialized domain. With this end, we use a corpus of more than 26,000 legal documents composed of legislation, legal doctrine, and case law. The impact of the pretraining phase is evaluated with metrics such as BLEU, BERTScore, and perplexity. In the second stage, a task-specific finetuning is performed for anonymization and entity extraction. This finetuning is conducted using a dataset consisting of 150 segments. The finetuning was evaluated on a test set of 50 segments, achieving 92.79% correct anonymization with Qwen 2.5 (7B) and 91.58% with LLaMA 3.1 (8B), improving by 4.73% and 12.87% respectively compared to the base model with finetuning, highlighting the influence of continued pretraining as a preliminary step. Both training phases, continued pretraining and finetuning, were conducted using LoRA.
Downloads
Published
Issue
Section
License
Copyright (c) 2025 Sofia Ornella Ortman, Luciana Belen Canteros, Francisco Vargas, Gaston Escalante, Alejandro González Coene, Manuel Pulido

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Acorde a estos términos, el material se puede compartir (copiar y redistribuir en cualquier medio o formato) y adaptar (remezclar, transformar y crear a partir del material otra obra), siempre que a) se cite la autoría y la fuente original de su publicación (revista y URL de la obra), b) no se use para fines comerciales y c) se mantengan los mismos términos de la licencia.











