Investigating the Frequency Distortion of Word Embeddings and its Impact on Bias Metrics
Keywords:
natural language processing, word embedding, biasAbstract
Recent research has shown that static word embeddings can encode words’ frequencies. However, little has been studied about this behavior. In the present work, we study how frequency and semantic similarity relate to one another in static word embeddings, and we assess the impact of this relationship on embedding-based bias metrics. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled, and holds for different hyperparameter settings. This proves that the patterns we find are neither due to real semantic associations nor to specific parameters choices, and are an artifact produced by the word embeddings. To illustrate how frequencies can affect the measurement of biases related to gender, ethnicity, and affluence, we carry out a controlled experiment that shows that biases can even change sign or reverse their order when word frequencies change.
Downloads
Published
Issue
Section
License
Copyright (c) 2025 Francisco Valentini, Juan Cruz Sosa, Diego Slezak, Edgar Altszyler

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Acorde a estos términos, el material se puede compartir (copiar y redistribuir en cualquier medio o formato) y adaptar (remezclar, transformar y crear a partir del material otra obra), siempre que a) se cite la autoría y la fuente original de su publicación (revista y URL de la obra), b) no se use para fines comerciales y c) se mantengan los mismos términos de la licencia.











