Fine-Tuning Diffusion Models for Audio Generation
Keywords:
music, synthesis, diffusion, transformers
Abstract
Music has evolved alongside technological advancements, from primitive percussion to modern digital synthesis tools. Today, artificial intelligence plays an important role in music generation, using state-of-the-art architectures such as transformers and diffusion models to generate complete songs from natural language prompts. Proprietary models by Udio and Suno AI demonstrate great potential, but their closed nature limits scientific research. In June 2024, Stability AI released Stable Audio Open (SAO), an open-source diffusion-based audio synthesis model, democratizing research in this field. While SAO excels at sound effect generation, its musical capabilities are limited by the scarcity of open-license training data. Our research enhances SAO’s music generation capabilities through fine-tuning on a specialized dataset, addressing its inability to generate certain instruments, its difficulties with specified musical elements, and its inconsistencies in tempo and tonality. We developed a custom dataset-creation pipeline that synthesizes audio from MIDI files, enriches metadata using APIs such as Spotify and LastFM, and generates natural language prompts with large language models. This pipeline produced a 9-hour (538 minutes) music dataset of 1,023 audio clips, comprising monophonic, polyphonic, and instrumental YouTube audio subsets in equal parts and spanning various genres, tempos, and tonalities. Results show significant improvements in the fine-tuned model (“Instrumental Finetune”) over the original SAO, particularly in sound quality, instrument reproduction accuracy, genre adherence, and tempo adherence (95.3% accuracy vs. 77.6%). Although tone and scale accuracy remain challenging, embedding-based metrics (KL-Passt, CLAP Score) indicate that our model matches or surpasses both SAO and the commercial MusicGen, maintaining generalization despite domain-specific optimization.
Audio examples illustrating these improvements and confirming the absence of memorization are available on the project website.
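To make the tempo-adherence figure more concrete, the following is a minimal sketch of how such an accuracy could be computed. It is not the authors' evaluation code. It assumes each generated clip has a prompted BPM and a BPM estimated by some tempo estimator (the values below are hypothetical), and it counts a clip as correct when the estimate lies within a relative tolerance of the target, optionally accepting half or double tempo. The function name, the 5% tolerance, and the half/double rule are all assumptions made for illustration.

```python
def tempo_adherence(target_bpms, estimated_bpms, tolerance=0.05, allow_octave=True):
    """Fraction of clips whose estimated tempo matches the prompted tempo.

    The tolerance and the half/double-tempo rule are illustrative
    assumptions, not the protocol used in the paper.
    """
    if len(target_bpms) != len(estimated_bpms):
        raise ValueError("target and estimated lists must have equal length")
    if not target_bpms:
        raise ValueError("at least one clip is required")

    # Tempo estimators often report half or double the perceived tempo,
    # so those multiples can optionally be accepted as matches.
    factors = (1.0, 0.5, 2.0) if allow_octave else (1.0,)

    hits = 0
    for target, estimate in zip(target_bpms, estimated_bpms):
        # A clip is a hit if the estimate is within the relative tolerance
        # of the target tempo or of one of its accepted multiples.
        if any(abs(estimate - target * f) <= tolerance * target * f for f in factors):
            hits += 1
    return hits / len(target_bpms)


# Hypothetical prompted and estimated tempos for four generated clips.
targets = [120, 90, 140, 100]
estimates = [121.0, 180.0, 128.0, 99.0]
accuracy = tempo_adherence(targets, estimates)
strict_accuracy = tempo_adherence(targets, estimates, allow_octave=False)
```

Whether half- and double-tempo estimates count as correct changes the reported number noticeably, so any comparison between models should fix that choice in advance.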
License
Copyright (c) 2025 Santiago Fiorino, Pablo Riera

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Under these terms, the material may be shared (copied and redistributed in any medium or format) and adapted (remixed, transformed, and built upon to create another work), provided that a) the authorship and the original source of publication (journal and URL of the work) are credited, b) it is not used for commercial purposes, and c) the same license terms are maintained.











