Evaluating transfer learning for classification of proteins in bioinformatics

Rosario Vitale; Georgina Stegmayer

Authors

Rosario Vitale sinc(i)-FICH-UNL
Georgina Stegmayer sinc(i)-FICH-UNL

Keywords:

Machine learning, Transfer learning, Classification, Protein family

Abstract

This study uses transfer learning to improve the classification of proteins into families or domains. UniProtKB contains more than 229 million proteins, yet only 0.25% of them have been annotated and classified into one of over 17,000 possible families. Deep learning (DL) models have recently been proposed for this task. However, DL models require large amounts of training data, and most protein families have only a few examples. To tackle this issue, we propose applying Transfer Learning (TL) to the classification problem. The TL approach uses self-supervised learning on large, unlabeled datasets to generate a numerical embedding for each data point. The learned representation can then be used with supervised learning on a small, labeled dataset for a specific classification task. The results of this study indicate that using TL for protein family classification can reduce the prediction error by 55% compared to standard methods and by 32% compared to DL models with simple input representations such as one-hot encoding. This study demonstrates that transfer learning is an effective and promising technique for improving protein classification and annotation in large, still unannotated databases.
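The two-stage idea described in the abstract (a fixed embedding followed by a supervised classifier trained on few labeled examples) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the embedding model or the classifier, so `embed` below is a hypothetical stand-in that returns an amino-acid composition vector where a real pipeline would call a frozen, self-supervised pretrained model, and the nearest-centroid classifier, the function names, and the toy sequences and family labels are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence):
    """Stand-in embedding (assumption): amino-acid composition vector.

    In a real transfer-learning pipeline this function would return the
    embedding produced by a frozen, self-supervised pretrained model.
    """
    counts = Counter(sequence)
    total = max(len(sequence), 1)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def fit_centroids(sequences, labels):
    """Supervised stage: average the embeddings of each labeled family."""
    sums, sizes = {}, {}
    for seq, label in zip(sequences, labels):
        vec = embed(seq)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            sizes[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        sizes[label] += 1
    return {label: [s / sizes[label] for s in sums[label]] for label in sums}


def predict(centroids, sequence):
    """Assign the family whose centroid is closest in embedding space."""
    vec = embed(sequence)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy labeled set (hypothetical sequences and family names).
train_sequences = ["AAAAGAAA", "AAGAAAAA", "KKKKLKKK", "KLKKKKKK"]
train_labels = ["family_A", "family_A", "family_K", "family_K"]

centroids = fit_centroids(train_sequences, train_labels)
print(predict(centroids, "AAAAAGGA"))
print(predict(centroids, "KKLLKKKK"))
```

Only the small classifier is fitted on the labeled data; the embedding function stays fixed, which is what lets the supervised stage work with a handful of examples per family. Swapping `embed` for a pretrained model would leave `fit_centroids` and `predict` unchanged.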

Published

2023-07-07

Issue

Vol. 9 No. 2 (2023): ASAI - Argentine Symposium on Artificial Intelligence

Section

ASAI - Simposio Argentino de Inteligencia Artificial

License

Under these terms, the material may be shared (copied and redistributed in any medium or format) and adapted (remixed, transformed, and built upon to create another work), provided that a) the authorship and the original source of publication (journal and URL of the work) are credited, b) it is not used for commercial purposes, and c) the same license terms are maintained.

How to Cite

Vitale, R., & Stegmayer, G. (2023). Evaluating transfer learning for classification of proteins in bioinformatics. JAIIO, Jornadas Argentinas De Informática, 9(2), 25-36. https://revistas.unlp.edu.ar/JAIIO/article/view/18083