Investigando la distorsión de frecuencia en Word Embeddings y su impacto en métricas de sesgo

Francisco Valentini; Juan Cruz Sosa; Diego Slezak; Edgar Altszyler

Investigating the Frequency Distortion of Word Embeddings and its Impact on Bias Metrics

Authors

Francisco Valentini Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina
Juan Cruz Sosa Universidad de Buenos Aires, Argentina
Diego Slezak Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina
Edgar Altszyler Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

Keywords:

natural language processing, word embedding, bias

Abstract

Recent research has shown that static word embeddings can encode word frequency. However, this behavior has received little study. In the present work, we study how frequency and semantic similarity relate to one another in static word embeddings, and we assess the impact of this relationship on embedding-based bias metrics. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled, and holds for different hyperparameter settings. This indicates that the patterns we find are due neither to real semantic associations nor to specific parameter choices, but are an artifact produced by the word embeddings. To illustrate how frequencies can affect the measurement of biases related to gender, ethnicity, and affluence, we carry out a controlled experiment that shows that biases can even change sign or reverse their order when word frequencies change.
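As a rough illustration of the kind of measurement the abstract describes, the sketch below groups words into frequency bins and computes the mean cosine similarity between every pair of bins. It is not the authors' code: the function name `mean_similarity_by_frequency_bin`, the choice of three equally sized bins, and the random toy vectors and Zipf-distributed toy frequencies are all assumptions made for illustration. With random vectors no frequency pattern is expected; the effect reported in the abstract concerns trained embeddings.

```python
import numpy as np


def mean_similarity_by_frequency_bin(vectors, frequencies, n_bins=3):
    """Return an (n_bins, n_bins) matrix of mean cosine similarities
    between words grouped into frequency bins (bin 0 = rarest words)."""
    vectors = np.asarray(vectors, dtype=float)
    frequencies = np.asarray(frequencies)

    # Normalize rows so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    # Rank words by frequency and split the ranks into equally sized bins.
    ranks = np.argsort(np.argsort(frequencies))
    bins = ranks * n_bins // len(frequencies)

    sims = unit @ unit.T
    np.fill_diagonal(sims, np.nan)  # exclude each word's similarity to itself

    result = np.empty((n_bins, n_bins))
    for i in range(n_bins):
        for j in range(n_bins):
            block = sims[np.ix_(bins == i, bins == j)]
            result[i, j] = np.nanmean(block)
    return result


# Toy data (hypothetical): random vectors and Zipf-like word counts.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(300, 50))
toy_frequencies = rng.zipf(2.0, size=300)

table = mean_similarity_by_frequency_bin(toy_vectors, toy_frequencies)
print(np.round(table, 3))
```

Applied to trained Skip-gram, GloVe or FastText vectors together with corpus word counts, a table of this kind is one way to check whether the high-frequency bin shows higher within-bin similarity than the other combinations.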

Published

2025-10-15

Issue

Vol. 11 No. 1 (2025): ASAID – Argentine Symposium on Artificial Intelligence and Big Data

Section

Original papers

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Under these terms, the material may be shared (copied and redistributed in any medium or format) and adapted (remixed, transformed, and built upon to create another work), provided that a) the authorship and the original source of publication (journal and URL of the work) are cited, b) it is not used for commercial purposes, and c) the same license terms are maintained.

How to Cite

Valentini, F., Sosa, J. C., Slezak, D., & Altszyler, E. (2025). Investigating the Frequency Distortion of Word Embeddings and its Impact on Bias Metrics. JAIIO, Jornadas Argentinas De Informática, 11(1), 85-86. https://revistas.unlp.edu.ar/JAIIO/article/view/19756