Ajuste de modelos de difusión para la generación de audio

Santiago Fiorino; Pablo Riera

Fine-Tuning Diffusion Models for Audio Generation

Authors

Santiago Fiorino Universidad de Buenos Aires, Argentina
Pablo Riera Universidad de Buenos Aires, Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

Keywords:

music, synthesis, diffusion, transformers

Abstract

Music has evolved alongside technological advancements, from primitive percussion to modern digital synthesis tools. Today, artificial intelligence plays an important role in music generation, using state-of-the-art architectures such as transformers and diffusion models to generate complete songs from natural language prompts. Proprietary models by Udio and Suno AI demonstrate great potential but limit scientific research due to their closed nature. In June 2024, Stability AI released Stable Audio Open (SAO), an open-source diffusion-based audio synthesis model, democratizing research in this field. While SAO excels in sound effect generation, its musical capabilities are limited by scarce open-license training data. Our research enhances SAO’s musical generation capabilities through fine-tuning on a specialized dataset, addressing its inability to generate certain instruments, difficulties with specified musical elements, and inconsistencies in tempo and tonality. We developed a custom dataset-creation pipeline that synthesizes audio from MIDI files, enriches metadata using APIs such as Spotify and LastFM, and generates natural language prompts via large language models. This pipeline produced a 9-hour (538 minutes) music dataset comprising 1023 audio clips, which includes monophonic, polyphonic, and instrumental YouTube audio subsets in equal parts, spanning various genres, tempos, and tonalities. Results show significant improvements in the fine-tuned model (“Instrumental Finetune”) over the original SAO, particularly in sound quality, instrument reproduction accuracy, genre adherence, and tempo adherence (95.3% accuracy vs. 77.6%). Although tone and scale accuracy remain challenging, embedding-based metrics (KL-Passt, CLAP Score) indicate that our model matches or surpasses both SAO and the commercial MusicGen, maintaining generalization despite domain-specific optimization.
Audio examples illustrating these improvements and confirming the absence of memorization are available on the project website.

Downloads

pdf (Spanish)

Published

2025-10-15

Issue

Vol. 11 No. 1 (2025): ASAID – Argentine Symposium on Artificial Intelligence and Big Data

Section

ASAID - Argentine Symposium on Artificial Intelligence and Data Science

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Under these terms, the material may be shared (copied and redistributed in any medium or format) and adapted (remixed, transformed, and built upon to create another work), provided that a) the authorship and the original source of publication (journal and URL of the work) are cited, b) it is not used for commercial purposes, and c) the same license terms are maintained.

How to Cite

Fiorino, S., & Riera, P. (2025). Fine-Tuning Diffusion Models for Audio Generation. JAIIO, Jornadas Argentinas De Informática, 11(1), 304-310. https://revistas.unlp.edu.ar/JAIIO/article/view/19827
