Fine-Tuning Diffusion Models for Audio Generation

Authors

  • Santiago Fiorino, Universidad de Buenos Aires, Argentina
  • Pablo Riera, Universidad de Buenos Aires and Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina

Keywords:

music, synthesis, diffusion, transformers

Abstract

Music has evolved alongside technological advancements, from primitive percussion to modern digital synthesis tools. Today, artificial intelligence plays an important role in music generation, using state-of-the-art architectures such as transformers and diffusion models to generate complete songs from natural language prompts. Proprietary models from Udio and Suno AI demonstrate great potential but limit scientific research due to their closed nature. In June 2024, Stability AI released Stable Audio Open (SAO), an open-source diffusion-based audio synthesis model, democratizing research in this field. While SAO excels at sound effect generation, its musical capabilities are limited by scarce open-license training data: it cannot generate certain instruments, struggles with specified musical elements, and is inconsistent in tempo and tonality. Our research enhances SAO's musical generation capabilities by fine-tuning it on a specialized dataset that addresses these shortcomings. We developed a custom dataset-creation pipeline that synthesizes audio from MIDI files, enriches metadata using APIs such as Spotify and Last.fm, and generates natural language prompts with large language models. This pipeline produced a 9-hour (538-minute) music dataset of 1023 audio clips, composed in equal parts of monophonic, polyphonic, and instrumental YouTube audio subsets spanning various genres, tempos, and tonalities. Results show significant improvements of the fine-tuned model ("Instrumental Finetune") over the original SAO, particularly in sound quality, instrument reproduction accuracy, genre adherence, and tempo adherence (95.3% vs. 77.6% accuracy). Although tone and scale accuracy remain challenging, embedding-based metrics (KL-PaSST, CLAP score) indicate that our model matches or surpasses both SAO and Meta's MusicGen, maintaining generalization despite domain-specific optimization. Auditory examples illustrating these improvements and confirming the absence of memorization are available on the project web page.
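Note on the pipeline: the abstract describes the dataset-creation steps only at a high level. The Python sketch below is a minimal, hypothetical illustration of those three steps (MIDI-to-audio synthesis, metadata enrichment via a public API, and prompt assembly); it assumes the pretty_midi, soundfile, and requests libraries plus a FluidSynth SoundFont, and every function name and the prompt template are illustrative choices of ours, not the paper's implementation.

# Illustrative sketch of the dataset-creation pipeline (assumptions, not the
# authors' code): pretty_midi + FluidSynth for MIDI rendering, the Last.fm
# REST API for tags, and a simple template standing in for the LLM prompt step.
import pretty_midi
import soundfile as sf
import requests

def render_midi(midi_path: str, wav_path: str, sr: int = 44100) -> None:
    """Synthesize a MIDI file to audio (requires pyfluidsynth and a SoundFont)."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    audio = midi.fluidsynth(fs=sr)
    sf.write(wav_path, audio, sr)

def lastfm_tags(artist: str, track: str, api_key: str) -> list[str]:
    """Fetch top genre/style tags for a track from the Last.fm API."""
    resp = requests.get(
        "https://ws.audioscrobbler.com/2.0/",
        params={"method": "track.getTopTags", "artist": artist,
                "track": track, "api_key": api_key, "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()
    tags = resp.json().get("toptags", {}).get("tag", [])
    return [t["name"] for t in tags[:5]]

def build_prompt(tags: list[str], tempo_bpm: float, key: str) -> str:
    """Assemble enriched metadata into a text prompt; in the paper this step
    is delegated to a large language model rather than a fixed template."""
    return f"{', '.join(tags)} instrumental piece at {tempo_bpm:.0f} BPM in {key}"

A run over one item would then look like render_midi("song.mid", "song.wav") followed by build_prompt(lastfm_tags("Artist", "Title", API_KEY), 120, "C major"), yielding a (audio, prompt) training pair.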

Published

2025-10-15

Section

ASAID - Argentine Symposium on Artificial Intelligence and Data Science

How to Cite

Fiorino, S., & Riera, P. (2025). Fine-Tuning Diffusion Models for Audio Generation. JAIIO, Jornadas Argentinas de Informática, 11(1), 304-310. https://revistas.unlp.edu.ar/JAIIO/article/view/19827