Note: Stable Audio Open 1.0 weights are released under the Stable Audio Community License and cannot be used for commercial purposes. Please read the license to confirm whether your use case is permitted.
The simplest way to self-host Stable Audio Open 1.0. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not at all, depending on your system resources, particularly your GPU(s) and available VRAM.
Stable Audio Open 1.0 is a text-to-audio generation model that creates stereo audio up to 47 seconds long. It combines a 156M-parameter autoencoder, a T5-based text embedding model, and an approximately 1.1B-parameter diffusion transformer. The model excels at sound effects but has limitations with vocal and speech generation.
Stable Audio Open 1.0 is a text-to-audio generative AI model capable of producing variable-length stereo audio up to 47 seconds at a 44.1 kHz sampling rate. The model's architecture consists of three main components totaling 1.21 billion parameters: a 156M-parameter autoencoder that compresses waveforms using convolutional blocks with ResNet-like layers and Snake activation functions, a 109M-parameter T5-based text embedding system for text conditioning, and a 1057M-parameter diffusion transformer (DiT) operating in the autoencoder's latent space. This architecture is detailed in the research paper.
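To make the three-component breakdown concrete, the sketch below loads the released weights through the diffusers integration and prints a per-component parameter count. It assumes a diffusers release that exposes StableAudioPipeline with vae, text_encoder, and transformer attributes (roughly diffusers 0.30 or newer) and that you have access to the weights on Hugging Face.

```python
import torch
from diffusers import StableAudioPipeline

# Load the released weights; half precision keeps the download-time memory footprint small.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
)

def n_params(module: torch.nn.Module) -> str:
    """Return a module's parameter count in millions."""
    return f"{sum(p.numel() for p in module.parameters()) / 1e6:.0f}M"

print("autoencoder (VAE):     ", n_params(pipe.vae))           # ~156M
print("T5 text encoder:       ", n_params(pipe.text_encoder))  # ~109M
print("diffusion transformer: ", n_params(pipe.transformer))   # ~1057M
```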
The model is a variant of Stable Audio 2.0, with the primary difference being its use of T5 text conditioning instead of CLAP. It was designed to be accessible to artists and researchers, supporting both usage and fine-tuning applications. The model's weights and implementation are publicly available through the Hugging Face repository.
The model was trained on a carefully curated dataset of 486,492 audio recordings, representing approximately 7,300 hours of audio. The majority of the training data comes from Freesound and the Free Music Archive (FMA), with all audio licensed under Creative Commons licenses (CC0, CC BY, or CC Sampling+). To ensure copyright compliance, the training data underwent rigorous verification using multiple tools including the PANNs music classifier and Audible Magic's content detection services.
The training process involved multiple stages. The autoencoder was trained on 5-second chunks of diverse audio, including a high-fidelity subset, taking 456 hours across multiple A100 GPUs. The DiT's training, which took 338 hours, utilized audio paired with text metadata derived from Freesound and FMA descriptions and tags. The training employed the AdamW optimizer with a learning rate scheduler.
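The description above names the optimizer and scheduler but not their hyperparameters, so the following is only an illustrative sketch of that kind of setup: the learning rate, warmup length, dummy model, and placeholder objective are all stand-ins rather than the actual training configuration.

```python
import torch

# Stand-in module in place of the 1057M-parameter DiT.
model = torch.nn.Linear(64, 64)

# AdamW with placeholder hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)

# A simple warmup schedule of the kind commonly paired with AdamW:
# linear ramp over the first 1,000 steps, then a constant rate.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 1000)
)

for step in range(10):  # dummy loop with random tensors in place of real audio latents
    latents = torch.randn(8, 64)
    loss = model(latents).pow(2).mean()  # placeholder objective, not the diffusion loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```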
Stable Audio Open 1.0 demonstrates competitive performance against state-of-the-art models, particularly in sound generation. On the AudioCaps dataset for sound generation, evaluation metrics including FD_openl3, KL_passt, and CLAP score show it outperforming other open-source models. In music generation, it performs slightly better than MusicGen on the Song Describer dataset, though not as well as other models in the Stable Audio family.
The model's inference speed varies by hardware, ranging from 8 steps/second on an RTX 3090 to 20 steps/second on an H100. The autoencoder achieves reconstruction quality comparable to Stable Audio 2.0, despite being trained exclusively on Creative Commons data. Notably, memorization analysis showed no evidence of the model reproducing its training data.
Several important limitations affect the model's performance. It struggles with generating realistic vocals and has difficulty with prompts containing connectors like "and" or "while." The model's performance in languages other than English is suboptimal, and its effectiveness varies across different musical styles and cultures. It generally performs better with sound effects and field recordings than with music generation.
The model can be used through either the stable-audio-tools library or the diffusers library, with prompt engineering often necessary for optimal results. Users should be aware that the model's training data may reflect biases present in the source datasets, potentially leading to uneven performance across different musical genres and cultural representations.
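As a concrete starting point, here is a minimal generation sketch for the diffusers route. It assumes a recent diffusers release with StableAudioPipeline, a CUDA GPU with enough VRAM, and the soundfile package for writing the result; the prompt, seed, and step count are only illustrative.

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the pipeline in half precision and move it to the GPU.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "The sound of a hammer hitting a wooden surface"  # illustrative prompt
generator = torch.Generator("cuda").manual_seed(0)          # fixed seed for repeatability

audio = pipe(
    prompt,
    negative_prompt="Low quality.",
    num_inference_steps=100,    # more steps generally trade speed for quality
    audio_end_in_s=10.0,        # clip length in seconds (up to roughly 47 s)
    num_waveforms_per_prompt=1,
    generator=generator,
).audios

# The pipeline returns (channels, samples); transpose for soundfile and save.
waveform = audio[0].T.float().cpu().numpy()
sf.write("output.wav", waveform, pipe.vae.sampling_rate)
```

The stable-audio-tools library offers an equivalent path using its own generation utilities; whichever route you take, prompt wording and the number of inference steps are the main levers on output quality.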