Note: AudioLDM weights are released under a CC-BY-NC 4.0 License, and cannot be utilized for commercial purposes. Please read the license to verify if your use case is permitted.
Model Report
audio-ldm / AudioLDM
AudioLDM is a text-to-audio generative model that creates speech, sound effects, and music from textual descriptions using latent diffusion techniques. The model employs Contrastive Language-Audio Pretraining (CLAP) embeddings and a variational autoencoder operating on mel-spectrogram representations. Trained on diverse datasets including AudioSet and AudioCaps, AudioLDM supports audio-to-audio generation, style transfer, super-resolution, and inpainting for creative and technical applications.
AudioLDM (Audio Latent Diffusion Model) is a generative artificial intelligence system that enables the creation of speech, sound effects, and music directly from textual descriptions. As a text-to-audio model, AudioLDM also supports an array of text-guided audio manipulations, including audio-to-audio generation, style transfer, super-resolution, and inpainting. Introduced at the 40th International Conference on Machine Learning (ICML) in 2023, AudioLDM represents a text-conditioned approach to audio synthesis, building upon latent diffusion models and leveraging large-scale audio representations.
Technical Capabilities
AudioLDM is designed to interpret descriptive text prompts and produce corresponding audio samples, encompassing non-speech sound effects, music fragments, and speech synthesis. The model extends its generative abilities to audio-to-audio tasks, in which it synthesizes new audio based on the content or style of a reference audio file, optionally guided by textual instructions.
A notable feature of AudioLDM is its support for text-guided style transfer, allowing users to impose characteristics described in natural language onto an input audio file. The model also enables zero-shot audio manipulation tasks such as super-resolution, in which low-quality audio is upsampled to higher fidelity, and inpainting, which fills in missing parts of a recording through semantic inference from the provided prompt.
Control over various sound qualities can be achieved, including the acoustic environment, source material, pitch, genre, and temporal structure, making AudioLDM suitable for a range of creative and technical audio applications.
Audio sample: AudioLDM audio-to-audio output, demonstrating a trumpet sound as processed by AudioLDM from the input file trumpet.wav. [Source]
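For orientation, the following minimal sketch shows how text-to-audio generation with AudioLDM can be invoked through the Hugging Face diffusers library; the AudioLDMPipeline class and the cvssp/audioldm-s-full-v2 checkpoint reflect the diffusers integration, while the prompt and parameter values are illustrative choices.

```python
# Minimal text-to-audio sketch using the diffusers AudioLDMPipeline.
# Checkpoint name and parameter values are illustrative choices.
import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

prompt = "A hammer is hitting a wooden surface"
audio = pipe(
    prompt,
    num_inference_steps=200,   # DDIM sampling steps
    audio_length_in_s=10.0,    # AudioLDM generates 10-second clips by default
    guidance_scale=2.5,        # classifier-free guidance strength
).audios[0]

# AudioLDM outputs 16 kHz mono waveforms.
scipy.io.wavfile.write("hammer.wav", rate=16000, data=audio)
```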
Model Architecture
AudioLDM operates within the framework of latent diffusion models, utilizing Contrastive Language-Audio Pretraining (CLAP) embeddings to bridge the gap between natural language and acoustic data. The architecture encodes audio into a compressed latent space using a variational autoencoder (VAE) trained on mel-spectrogram representations, substantially reducing the complexity of the downstream generative process.
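As a rough illustration of the representation the VAE compresses, the sketch below computes a log-mel spectrogram from a 16 kHz waveform; the STFT and filterbank parameters are assumptions chosen for illustration rather than AudioLDM's released configuration.

```python
# Sketch of the mel-spectrogram front end a spectrogram VAE operates on.
# n_fft, hop_length, and n_mels below are illustrative assumptions.
import torch
import torchaudio

SAMPLE_RATE = 16_000  # AudioLDM works at 16 kHz mono

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=160,   # 10 ms hop, about 100 frames per second (assumed)
    n_mels=64,        # number of mel bins (assumed)
)

waveform = torch.randn(1, SAMPLE_RATE * 10)      # stand-in for a 10-second clip
mel = mel_transform(waveform)                     # (1, n_mels, time_frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))   # log compression

# A VAE encoder maps a spectrogram of this shape to a much smaller latent
# tensor, which is what the diffusion model actually denoises.
print(log_mel.shape)
```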
During training, CLAP provides embeddings computed from the audio itself to condition the diffusion process, while at generation time, text embeddings take their place. This decouples AudioLDM from the need for paired text-audio data, allowing training on large audio-only datasets and giving greater flexibility in model development.
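The sketch below illustrates this asymmetry using the CLAP implementation in the transformers library; the laion/clap-htsat-unfused checkpoint is a stand-in for AudioLDM's own CLAP encoder, used here only to show that text and audio map into one shared embedding space.

```python
# CLAP provides a shared text-audio embedding space: audio embeddings can
# condition training while text embeddings condition generation.
# The checkpoint below is a stand-in, not AudioLDM's exact CLAP model.
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Generation-time conditioning: embed a text prompt.
text_inputs = processor(text=["a trumpet playing a melody"], return_tensors="pt")
text_emb = model.get_text_features(**text_inputs)

# Training-time conditioning: embed the audio clip itself.
audio = torch.randn(48_000 * 10).numpy()  # stand-in for a 10-second waveform
audio_inputs = processor(audios=audio, sampling_rate=48_000, return_tensors="pt")
audio_emb = model.get_audio_features(**audio_inputs)

# Both embeddings live in the same space, so one can substitute for the other.
print(text_emb.shape, audio_emb.shape)
```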
The latent diffusion process involves iterative denoising, where the model learns to reconstruct plausible audio representations from random noise in its latent space. Classifier-free guidance enables controllable generation by randomly omitting conditioning information during training, thereby supporting both conditional and unconditional audio synthesis scenarios. The final audio output is produced by decoding the latent representation through the VAE and then employing HiFi-GAN for high-fidelity waveform synthesis.
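The guided denoising step amounts to a weighted combination of conditional and unconditional noise predictions, as in the sketch below; the unet call and tensor names are illustrative placeholders rather than AudioLDM's actual interfaces.

```python
# One classifier-free-guidance denoising step (illustrative placeholders).
import torch

def guided_noise_prediction(unet, latents: torch.Tensor, timestep: int,
                            text_emb: torch.Tensor, null_emb: torch.Tensor,
                            guidance_scale: float = 2.5) -> torch.Tensor:
    # Predict noise with and without the text condition.
    noise_cond = unet(latents, timestep, encoder_hidden_states=text_emb)
    noise_uncond = unet(latents, timestep, encoder_hidden_states=null_emb)
    # Push the prediction toward the conditional direction.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```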
AudioLDM uses a U-Net backbone analogous to that found in prominent text-to-image diffusion models such as Stable Diffusion, incorporating feature-wise linear modulation to inject conditioning information throughout the generative network.
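A generic feature-wise linear modulation (FiLM) layer can be expressed compactly, as in the sketch below; this illustrates the mechanism of scaling and shifting feature maps from a conditioning embedding, not AudioLDM's exact module.

```python
# Generic FiLM layer: a conditioning embedding produces a per-channel
# scale and shift applied to U-Net feature maps (illustrative sketch).
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, height, width); cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return features * (1 + scale) + shift

# Example: modulate a (2, 64, 16, 16) feature map with a 512-dim embedding.
film = FiLM(cond_dim=512, num_channels=64)
out = film(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
```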
Training Data and Methodology
To enable generalizable and robust audio generation, AudioLDM was trained on a diverse assortment of publicly available datasets, including AudioSet, which comprises over 5,000 hours of annotated audio, AudioCaps for paired clips and captions, the Freesound database, and the BBC Sound Effect Library. The CLAP models powering the conditioning mechanisms were trained on LAION-Audio-630K, AudioSet, AudioCaps, and Clotho datasets, with data augmentation strategies employed to enhance caption diversity.
Audio data was standardized to a duration of 10 seconds and resampled to 16 kHz mono. For longer recordings in the larger datasets, only the initial 30 seconds were used and segmented into training examples. Training of the underlying VAE and LDM networks employed established stochastic optimization methods, such as Adam, with learning rates and batch sizes tuned for both model convergence and computational feasibility.
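A preprocessing step along these lines might be sketched as follows; the helper itself is illustrative, while the 16 kHz, mono, 10-second targets follow the description above.

```python
# Illustrative preprocessing: convert a clip to 16 kHz mono and pad or
# trim it to exactly 10 seconds.
import torch
import torchaudio

TARGET_SR = 16_000
TARGET_LEN = TARGET_SR * 10  # 10-second training examples

def load_training_clip(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    if waveform.shape[1] < TARGET_LEN:                    # pad short clips
        waveform = torch.nn.functional.pad(
            waveform, (0, TARGET_LEN - waveform.shape[1]))
    return waveform[:, :TARGET_LEN]                       # trim long clips
```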
A key methodological choice was the use of a mixup augmentation strategy, blending audio samples during training to expand the diversity of condition-label pairs seen by the model. The diffusion process itself uses deterministic sampling with DDIM, typically set at 200 steps for generation.
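Mixup can be sketched as a convex blend of two training examples; the Beta-distributed weight below is the standard formulation, and the hyperparameter value is an assumption.

```python
# Mixup augmentation sketch: blend two waveforms (and their conditioning
# embeddings) with a Beta-distributed weight. alpha=5.0 is an assumed value.
import torch

def mixup(x1: torch.Tensor, x2: torch.Tensor,
          c1: torch.Tensor, c2: torch.Tensor, alpha: float = 5.0):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_audio = lam * x1 + (1 - lam) * x2
    mixed_cond = lam * c1 + (1 - lam) * c2
    return mixed_audio, mixed_cond
```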
Evaluation and Performance
AudioLDM has been evaluated using both quantitative and qualitative metrics to benchmark its performance against other open-source text-to-audio models. Core evaluation metrics include the Fréchet Distance (FD), Inception Score (IS), Kullback–Leibler (KL) divergence, and Fréchet Audio Distance (FAD). In subjective evaluations, audio professionals rated overall quality and the relevance of each output to its input prompt.
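Both Fréchet-style metrics compare Gaussian statistics of embeddings computed from generated and reference audio; the function below sketches that computation, leaving the choice of embedding network (commonly PANNs for FD and VGGish for FAD) outside its scope.

```python
# Frechet-style distance sketch: fit a Gaussian to embeddings of real and
# generated audio, then compare the two Gaussians.
import numpy as np
from scipy import linalg

def frechet_distance(emb_real: np.ndarray, emb_fake: np.ndarray) -> float:
    mu_r, mu_f = emb_real.mean(axis=0), emb_fake.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_f = np.cov(emb_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real  # drop small imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))
```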
On the AudioCaps benchmark, the large-scale AudioLDM-L-Full variant achieved an FD of 23.31 and an Inception Score of 8.13. These results surpass published scores for baseline models such as DiffSound and AudioGen on both objective and subjective measures, indicating higher quality and closer semantic alignment between prompt and generated audio.
Model variants tested include AudioLDM-S (small, 181M parameters), AudioLDM-L (large, 739M parameters), and variants fine-tuned on specific datasets. Performance improvements were observed with larger models and with increased training corpus size. Guidance scaling and sampling steps were systematically tuned to further enhance output quality, with diminishing returns noted above 100 sampling steps.
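In practice, guidance scale and the number of sampling steps are generation-time parameters; the short sweep below, which assumes the diffusers pipeline object (pipe) from the earlier sketch is already loaded, illustrates how such tuning might be scripted.

```python
# Illustrative sweep over sampling steps and guidance scale, reusing the
# `pipe` object from the earlier diffusers sketch.
import itertools
import scipy.io.wavfile

prompt = "Ambient rain falling on a tin roof"
for steps, scale in itertools.product([50, 100, 200], [1.0, 2.5, 4.0]):
    audio = pipe(prompt, num_inference_steps=steps,
                 guidance_scale=scale, audio_length_in_s=10.0).audios[0]
    scipy.io.wavfile.write(f"rain_{steps}steps_cfg{scale}.wav",
                           rate=16000, data=audio)
```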
Applications and Limitations
AudioLDM is suited for a range of creative, research, and production applications. In content creation, it supports the generation of bespoke sound effects, background music, and speech for domains such as video editing, game development, and virtual or augmented reality. For creative professionals, features like style transfer and detailed prompt control can enrich music composition, sound design, and rapid audio prototyping.
The model also offers tools for audio repair, including super-resolution to enhance audio clarity and inpainting to reconstruct missing segments using context-aware artificial intelligence.
Despite these capabilities, AudioLDM faces several limitations. The 16 kHz output sampling rate restricts the upper bound of audio fidelity for demanding music or sound design applications. Because components such as the VAE and the latent diffusion model are trained separately, mismatches between them can introduce reconstruction errors, suggesting potential benefits from future end-to-end fine-tuning. As with all generative models, ethical considerations, such as the potential misuse for generating misleading or deceptive audio, remain important in its deployment and study. The computational resources required for training are significant, though inference and smaller model variants reduce operational barriers.