Note: Stable Audio Open 1.0 weights are released under the Stable Audio Community License and cannot be used for commercial purposes. Please read the license to confirm whether your use case is permitted.
Model Report
stabilityai / Stable Audio Open 1.0
Stable Audio Open 1.0 is an open-weight text-to-audio synthesis model developed by Stability AI with approximately 1.21 billion parameters. Built on latent diffusion architecture with transformer components and T5-based text conditioning, the model generates up to 47 seconds of stereo audio at 44.1 kHz. Trained exclusively on Creative Commons-licensed data totaling 7,300 hours, it demonstrates strong performance for sound effects and field recordings while showing modest capabilities for instrumental music generation.
Overview
Stable Audio Open 1.0 is an open-weight generative AI model developed by Stability AI for text-to-audio synthesis. Designed as a research foundation, the model enables researchers and artists to fine-tune, experiment with, and advance text-guided audio generation. Stable Audio Open 1.0 serves as an accessible baseline for developing new generative audio models, emphasizing transparency through the release of weights, training code, and detailed evaluation metrics. The model and its capabilities are documented in the research publication available as arXiv:2407.14358.
The Stable Audio Open 1.0 logo, representing Stability AI's open-source text-to-audio synthesis model.
Stable Audio Open 1.0 is based on a latent diffusion architecture with integrated transformer components, similar to its predecessor Stable Audio 2.0, but notably adopts T5-based text conditioning instead of CLAP-based conditioning. The model has approximately 1.21 billion parameters and integrates three central modules: a variational autoencoder, a T5 text encoder, and a transformer diffusion model, as documented in the research publication.
The autoencoder compresses stereo waveforms sampled at 44.1 kHz into a 64-channel latent representation, using five convolutional blocks for downsampling and upsampling. Reconstruction loss is calculated with a perceptually weighted, multi-resolution STFT objective that balances mid-side and left-right stereo representations. The text encoder, based on the T5-base architecture, processes natural language prompts to condition audio output. The diffusion transformer (DiT) operates in the autoencoder's latent space through stacked attention blocks and gated multilayer perceptrons with rotary positional embeddings, supporting variable-length and timing-conditioned generation. Conditioning on both timing and text is introduced via cross-attention mechanisms.
Technical enhancements such as block-wise attention and gradient checkpointing are implemented to manage computational and memory demands. Generation length is controlled through timing conditioning within a fixed window of roughly 47 seconds of stereo audio; shorter outputs are padded with silence that users can trim in post-processing (details available on the Hugging Face model page).
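As a usage sketch, generation with the Hugging Face diffusers integration might look like the following (the StableAudioPipeline class and the stabilityai/stable-audio-open-1.0 checkpoint are assumed to be available; argument names can vary between library versions):

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the pipeline (VAE + T5 text encoder + diffusion transformer).
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The text prompt conditions the DiT via cross-attention; the requested
# duration is passed as timing conditioning.
audio = pipe(
    "Rain falling on a tin roof, distant thunder",
    negative_prompt="low quality, distortion",
    num_inference_steps=100,
    audio_end_in_s=20.0,  # anything up to roughly 47 seconds
    generator=torch.Generator("cuda").manual_seed(0),
).audios

# Output is stereo at 44.1 kHz; transpose to (samples, channels) and save.
waveform = audio[0].T.float().cpu().numpy()
sf.write("rain.wav", waveform, pipe.vae.sampling_rate)
```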
Training Data and Methodology
Training of Stable Audio Open 1.0 was conducted exclusively on audio licensed under Creative Commons, prioritizing scientific transparency and avoiding proprietary datasets, as detailed in the research paper. The core dataset comprises 486,492 audio recordings totaling approximately 7,300 hours. Of these, 472,618 samples are sourced from the Freesound database, while 13,874 are drawn from the Free Music Archive (FMA). Comprehensive filtering was undertaken to eliminate copyrighted material using the PANNs music classifier and Audible Magic's content identification technology, alongside metadata cross-references with the Spotify tracks dataset, with manual review of flagged items.
Training text prompts were constructed from descriptive metadata. For Freesound audio, prompts utilized natural language descriptions, titles, and tags. For FMA, metadata such as year, genre, album, artist, and title contributed to prompt generation, with further random transformations for diversity.
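The exact caption templates are not reproduced here, but a hypothetical sketch of metadata-to-prompt assembly for FMA-style records could look like this (the field names mirror those listed above, while the drop and shuffle probabilities are illustrative rather than the values used in training):

```python
import random

def build_music_prompt(meta: dict) -> str:
    """Hypothetical sketch: assemble a training caption from FMA-style
    metadata (year, genre, album, artist, title), dropping and shuffling
    fields at random to diversify prompts."""
    fields = [
        ("genre", meta.get("genre")),
        ("artist", meta.get("artist")),
        ("album", meta.get("album")),
        ("title", meta.get("title")),
        ("year", meta.get("year")),
    ]
    # Keep only populated fields, drop some at random, shuffle the rest.
    parts = [f"{name}: {value}" for name, value in fields if value]
    parts = [p for p in parts if random.random() > 0.2]
    random.shuffle(parts)
    return ", ".join(parts)

print(build_music_prompt({
    "genre": "ambient", "artist": "Example Artist",
    "title": "Night Drive", "year": 2019,
}))
```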
The variational autoencoder was trained on 5-second snippets of high-fidelity audio, including 48 kHz and 44.1 kHz files. Training employed AdamW optimization, with distinct batch sizes for encoder and decoder and learning rates adapted for each module. The DiT was trained on sequences of 1,024 latent tokens, corresponding to nearly 47 seconds of audio. Regularization strategies and exponential learning-rate scheduling were incorporated to improve training stability and performance, as outlined in arXiv:2407.14358.
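Working backward from these figures, 1,024 latent tokens spanning roughly 47 seconds at 44.1 kHz implies on the order of 2,000 audio samples per latent token; the short calculation below makes the relationship explicit (the 2,048x ratio is a derived estimate, not a figure quoted in this report):

```python
sample_rate = 44_100    # Hz, output sampling rate
latent_frames = 1_024   # DiT training sequence length
duration_s = 47.0       # approximate maximum generation length

# Samples per latent frame implied by the numbers above.
downsample = duration_s * sample_rate / latent_frames
print(f"~{downsample:.0f} samples per latent frame")  # ~2024

# Conversely, an assumed 2048x downsampling ratio gives:
print(f"{latent_frames * 2048 / sample_rate:.1f} s")  # ~47.6 s
```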
Performance and Evaluation
Model performance was evaluated using multiple metrics (FD_openl3, KL_passt, and CLAP score) that address both fidelity and prompt relevance. On the AudioCaps dataset for general sounds and field recordings, Stable Audio Open achieved an FD_openl3 of 78.24 (lower is better), a KL_passt of 2.14 (lower is better), and a CLAP score of 0.29 (higher is better). These results indicate strong performance for realistic sound and field-recording synthesis, surpassing contemporary open models such as AudioLDM2 and AudioGen.
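Conceptually, the CLAP score embeds the prompt and the generated audio with a pretrained CLAP model and reports their cosine similarity; the sketch below shows only that final similarity step with placeholder embeddings (the actual evaluation uses a specific CLAP checkpoint and protocol described in the paper):

```python
import numpy as np

def clap_score(text_embedding: np.ndarray, audio_embedding: np.ndarray) -> float:
    """Cosine similarity between CLAP text and audio embeddings.
    Higher values mean the audio better matches the prompt."""
    t = text_embedding / np.linalg.norm(text_embedding)
    a = audio_embedding / np.linalg.norm(audio_embedding)
    return float(np.dot(t, a))

# Placeholder 512-dimensional vectors standing in for real CLAP embeddings.
rng = np.random.default_rng(0)
print(clap_score(rng.normal(size=512), rng.normal(size=512)))
```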
On the Song Describer dataset, which evaluates instrumental music generation, the model performed modestly compared to proprietary and previous Stable Audio models (FD_openl3: 96.51, KL_passt: 0.55, CLAP: 0.41) but slightly exceeded MusicGen, the leading open alternative at that time.
Autoencoder reconstruction was assessed using STFT distance, MEL distance, and SI-SDR. Results showed parity with Stable Audio 2.0 on general sounds and slightly lower music reconstruction quality, attributable to the model’s exclusive use of Creative Commons data. Memorization analyses found no evidence of unauthorized reproduction of training data.
In terms of efficiency, inference speed for diffusion on common hardware ranged from 8 to 20 steps per second, depending on GPU memory. Decoding from latent space to waveform is memory-intensive, but chunked decoding substantially reduces peak memory usage (see arXiv:2407.14358).
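A minimal sketch of the chunked-decoding idea, assuming a generic decode callable standing in for the VAE decoder (the hypothetical chunk size and any overlap handling would need to match the real decoder's receptive field):

```python
import torch

def decode_in_chunks(decode, latents: torch.Tensor, chunk: int = 128) -> torch.Tensor:
    """Decode a (batch, channels, frames) latent tensor in slices along the
    time axis to keep peak memory low, then concatenate the waveforms.
    `decode` is a stand-in for the VAE decoder."""
    pieces = []
    for start in range(0, latents.shape[-1], chunk):
        with torch.no_grad():
            pieces.append(decode(latents[..., start:start + chunk]))
    return torch.cat(pieces, dim=-1)
```

In practice, chunks are typically overlapped and cross-faded so that no seams are audible at chunk boundaries.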
Use Cases and Model Limitations
The open availability and reproducibility of Stable Audio Open 1.0 make it applicable for AI research in audio synthesis, academic studies on generative models, and technical exploration by practitioners interested in fine-tuning or benchmarking generative audio capabilities (further details on the Hugging Face model page). The model demonstrates particular strength for high-quality, text-guided sound effect and field recording generation.
Notable limitations remain. Stable Audio Open 1.0 does not generate realistic vocals or intelligible speech, and performance is primarily tuned to English-language prompts due to the composition of the training metadata. Music generation capabilities are less robust than those of certain proprietary or non-CC-trained models, reflecting the dataset’s constraints. Audio quality and stylistic breadth may also be uneven for some musical genres or cultural forms. For complex prompts, especially those using conjunctions, careful prompt design may be required to achieve optimal outcomes.
Comparison to Other Models
Stable Audio Open 1.0 represents a research branch that diverges from previous Stable Audio models in both dataset curation and text conditioning. While Stable Audio 1.0 and 2.0 support longer audio generation and were trained on broader data that includes non-CC content, Stable Audio Open 1.0 relies strictly on Creative Commons sources and T5-based conditioning for open research utility. In side-by-side evaluations, Stable Audio Open 1.0 generally outperformed open baselines such as AudioLDM2 and AudioGen for non-musical sound synthesis and achieved comparable or slightly better results than MusicGen on instrumental music, albeit with lower music fidelity than the non-open Stable Audio models.