Imputación de genotipos faltantes mediante algoritmos de machine learning

M. Agustina Raschia; Pablo J. Ríos; Marcela E. Cordoba; M. Eugenia Caffaro; M. Valeria Donzelli; Daniel O. Maizon; Mario A. Poli

Imputation of missing genotypes using machine learning algorithms

Authors

M. Agustina Raschia Universidad Nacional de La Plata, Instituto Nacional de Tecnología Agropecuaria, Argentina https://orcid.org/0000-0002-0662-4397
Pablo J. Ríos Universidad de Buenos Aires, Argentina https://orcid.org/0000-0002-9768-7587
Marcela E. Cordoba Instituto Nacional de Tecnología Agropecuaria, Argentina https://orcid.org/0009-0000-9774-6295
M. Eugenia Caffaro Instituto Nacional de Tecnología Agropecuaria, Argentina https://orcid.org/0000-0002-5814-2293
M. Valeria Donzelli Universidad Nacional de Lomas de Zamora, Instituto Nacional de Tecnología Agropecuaria, Argentina https://orcid.org/0009-0009-4243-4652
Daniel O. Maizon Universidad Nacional de La Pampa, Instituto Nacional de Tecnología Agropecuaria, Argentina https://orcid.org/0000-0002-2701-4109
Mario A. Poli Universidad del Salvador, Instituto Nacional de Tecnología Agropecuaria, Argentina https://orcid.org/0000-0001-8775-2333

Keywords:

imputation, machine learning, random forest, single nucleotide polymorphism

Abstract

Missing genotypes can be imputed, that is, inferred from correlations between variants observed in reference panels, either with dedicated programs that use family and/or population genetic information or with machine learning algorithms. The objective of this study was to evaluate the imputation accuracy of different machine learning strategies by comparing imputed genotypes with those obtained from a medium-density SNP microarray. To compare the performance of three imputation strategies based on the random forest algorithm, we analyzed a database of 966 sheep genotyped at 57,876 SNPs, in which 53.4% of the genotypes were missing. A subset of the imputed genotypes, corresponding to 232 animals at 30,924 SNPs, was compared with the microarray genotypes. Concordance was approximately 60% for all three strategies. This low concordance can be attributed to the large number of missing genotypes in the source file. One way to increase imputation accuracy would be to enlarge the reference population and thereby reduce the proportion of missing genotypes in the data set.
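The abstract reports a missing-data proportion and a percentage of concordance but does not say how either was computed. The following is a minimal sketch of one plausible way to obtain both figures, not the authors' implementation. It assumes genotypes are coded 0, 1 or 2 (copies of one allele) with `None` for a missing call, and that concordance is the share of compared calls where the imputed and the microarray genotype are identical. The function names `missing_rate` and `concordance` and the toy matrices are hypothetical.

```python
from typing import Optional, Sequence

# Assumed coding: 0, 1, 2 = copies of one allele; None = missing call.
Genotype = Optional[int]
Matrix = Sequence[Sequence[Genotype]]  # rows = animals, columns = SNPs


def missing_rate(matrix: Matrix) -> float:
    """Fraction of genotype calls that are missing in the matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("empty genotype matrix")
    missing = sum(1 for row in matrix for g in row if g is None)
    return missing / total


def concordance(imputed: Matrix, observed: Matrix) -> float:
    """Percentage of compared calls where imputed equals observed.

    Positions where the observed (microarray) genotype is missing
    are skipped, since there is nothing to compare against.
    """
    if len(imputed) != len(observed):
        raise ValueError("matrices must have the same number of animals")
    matches = 0
    compared = 0
    for imp_row, obs_row in zip(imputed, observed):
        if len(imp_row) != len(obs_row):
            raise ValueError("rows must have the same number of SNPs")
        for imp, obs in zip(imp_row, obs_row):
            if obs is None:
                continue
            compared += 1
            if imp == obs:
                matches += 1
    if compared == 0:
        raise ValueError("no comparable genotype calls")
    return 100.0 * matches / compared


# Toy data (hypothetical): 2 animals x 3 SNPs.
observed = [[0, 1, 2], [2, None, 1]]
imputed = [[0, 1, 1], [2, 0, 1]]

print(f"missing rate: {missing_rate(observed):.3f}")
print(f"concordance: {concordance(imputed, observed):.1f}%")
```

In the toy data, five calls are comparable and four agree, so the sketch reports 80% concordance. Stricter or looser definitions (for example, allele-level rather than genotype-level agreement) would give different values, so the 60% figure in the abstract should be read against whichever definition the full paper adopts.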


Published

2025-09-30

Issue

Vol. 11 No. 3 (2025): CAI - Argentine Agroinformatics Congress

Section

CAI - Congreso Argentino de AgroInformática

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Under these terms, the material may be shared (copied and redistributed in any medium or format) and adapted (remixed, transformed, and built upon to create another work), provided that a) the authorship and the original source of publication (journal and URL of the work) are credited, b) it is not used for commercial purposes, and c) the same license terms are maintained.

How to Cite

Raschia, M. A., Ríos, P. J., Cordoba, M. E., Caffaro, M. E., Donzelli, M. V., Maizon, D. O., & Poli, M. A. (2025). Imputation of missing genotypes using machine learning algorithms. JAIIO, Jornadas Argentinas De Informática, 11(3), 155-165. https://revistas.unlp.edu.ar/JAIIO/article/view/19680
