Imputation of missing genotypes using machine learning algorithms

Authors

M. A. Raschia, P. J. Ríos, M. E. Cordoba, M. E. Caffaro, M. V. Donzelli, D. O. Maizon, M. A. Poli

Keywords

imputation, machine learning, random forest, single nucleotide polymorphism

Abstract

Missing genotypes can be imputed, or inferred, from correlations between variants observed in reference panels, either with dedicated programs that use family and/or population genetic information or with machine learning algorithms. The objective of this study was to evaluate the imputation accuracy of different machine learning strategies by comparing imputed genotypes with those obtained by genotyping with a medium-density SNP microarray. To compare the performance of three imputation strategies based on the random forest algorithm, we analyzed a data set of 966 sheep genotyped at 57,876 SNPs, in which 53.4% of the genotypes were missing. A subset of the imputed genotypes, corresponding to 232 animals at 30,924 SNPs, was compared with the genotypes obtained from the microarray. Concordance was approximately 60% for the three strategies. This low concordance can be attributed to the large proportion of missing genotypes in the source file. Imputation accuracy could be improved by increasing the number of animals in the reference population, thereby reducing the proportion of missing genotypes in the data set.
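The concordance figure reported in the abstract can be read as the share of compared genotype calls on which the imputed and the microarray genotypes agree. The following is a minimal sketch of that calculation, not the authors' code. It assumes genotypes are coded as 0, 1 or 2 copies of an allele, that missing calls are represented by None, and that both inputs are animal-by-SNP tables in the same order; the function name concordance_percent and the toy data are hypothetical.

```python
from typing import List, Optional

# Assumed coding: 0, 1, 2 = allele copies; None = missing call.
Genotype = Optional[int]


def concordance_percent(imputed: List[List[Genotype]],
                        observed: List[List[Genotype]]) -> float:
    """Percentage of genotype calls that agree between two animal-by-SNP tables.

    Positions where either table has a missing call are skipped, so the
    denominator is the number of calls that could actually be compared.
    """
    compared = 0
    matches = 0
    for imputed_row, observed_row in zip(imputed, observed):
        for imp, obs in zip(imputed_row, observed_row):
            if imp is None or obs is None:
                continue  # nothing to compare at this position
            compared += 1
            if imp == obs:
                matches += 1
    if compared == 0:
        raise ValueError("no genotype pairs available for comparison")
    return 100.0 * matches / compared


# Toy example (made-up values, two animals at four SNPs).
toy_imputed = [[0, 1, 2, 1], [2, 2, 0, None]]
toy_observed = [[0, 1, 1, 1], [2, 0, 0, 1]]
toy_concordance = concordance_percent(toy_imputed, toy_observed)
```

In the toy data, seven calls can be compared and five agree, so the result is 5/7, or about 71.4%. Whether the study excluded missing microarray calls from the denominator in this way is an assumption of the sketch.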

Published

2025-09-30

Section

CAI - Congreso Argentino de AgroInformática

How to Cite

Raschia, M. A., Ríos, P. J., Cordoba, M. E., Caffaro, M. E., Donzelli, M. V., Maizon, D. O., & Poli, M. A. (2025). Imputation of missing genotypes using machine learning algorithms. JAIIO, Jornadas Argentinas De Informática, 11(3), 155-165. https://revistas.unlp.edu.ar/JAIIO/article/view/19680