Note: AudioLDM weights are released under a CC-BY-NC 4.0 License and cannot be used for commercial purposes. Please read the license to verify whether your use case is permitted.
The simplest way to self-host AudioLDM. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
AudioLDM generates audio from text descriptions using latent diffusion models and CLAP embeddings. It converts text prompts into mel-spectrograms, then to audio waveforms via HiFi-GAN. Multiple variants exist, with fine-tuned versions optimized for music and general audio generation using MusicCaps and AudioCaps datasets.
AudioLDM represents a significant advancement in text-to-audio generation, utilizing latent diffusion models (LDMs) to create high-quality audio from text descriptions. The system, detailed in an ICML 2023 paper, introduces a novel approach that leverages contrastive language-audio pretraining (CLAP) to learn continuous audio representations.
The model's architecture centers around a mel-spectrogram-based variational autoencoder (VAE) that encodes audio into a latent space where the LDMs operate. It employs CLAP embeddings to create a shared audio-text embedding space, eliminating the need for paired audio-text data during LDM training. The system uses classifier-free guidance (CFG) during sampling, with text embeddings steering the generation process. A HiFi-GAN vocoder transforms the reconstructed mel-spectrograms into the final audio waveform.
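The classifier-free guidance step described above combines conditional and unconditional noise estimates at each denoising step. The Python sketch below illustrates the general technique rather than AudioLDM's actual implementation; the function signature, tensor names, and default guidance scale are assumptions.

```python
from typing import Callable
import torch

def cfg_noise_estimate(
    unet: Callable[..., torch.Tensor],
    z_t: torch.Tensor,
    t: torch.Tensor,
    text_emb: torch.Tensor,
    uncond_emb: torch.Tensor,
    guidance_scale: float = 2.5,  # assumed default, not a value taken from the paper
) -> torch.Tensor:
    """Classifier-free guidance in the VAE latent space (illustrative sketch)."""
    # Noise prediction with the CLAP-derived text condition...
    eps_cond = unet(z_t, t, encoder_hidden_states=text_emb)
    # ...and with a null/empty condition.
    eps_uncond = unet(z_t, t, encoder_hidden_states=uncond_emb)
    # Steer the denoising direction toward the text condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```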
The development team found that a compression level of r=4 in the latent space provided the optimal balance between computational efficiency and generation quality. The training process incorporates a mixup strategy for audio data augmentation, enhancing the model's robustness.
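The mixup idea itself is simple: interpolate pairs of training examples with a Beta-sampled weight. The sketch below applies it to mel-spectrogram batches as a generic illustration; the alpha value and tensor shapes are assumptions, not values taken from the paper.

```python
import torch

def mixup_batch(mel_a: torch.Tensor, mel_b: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend two batches of mel-spectrograms with a single Beta(alpha, alpha) weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * mel_a + (1.0 - lam) * mel_b

# Example: mix two batches of 8 spectrograms with 64 mel bins and 256 frames each.
batch_a = torch.randn(8, 1, 64, 256)
batch_b = torch.randn(8, 1, 64, 256)
mixed = mixup_batch(batch_a, batch_b)
```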
AudioLDM comes in several variants, each optimized for different use cases:
- audioldm-m-full: The recommended starting point for most users
- audioldm-s-full and audioldm-s-full-v2: Smaller model variants
- audioldm-l-full: The largest model variant
- audioldm-s-text-ft and audioldm-m-text-ft: Variants fine-tuned with the MusicCaps and AudioCaps datasets

The model demonstrates versatile capabilities beyond basic text-to-audio generation, including zero-shot text-guided style transfer, audio super-resolution, and audio inpainting.
AudioLDM was trained on a comprehensive dataset combining AudioSet, AudioCaps, Freesound, and the BBC Sound Effects library. As detailed on the project's GitHub page, the training leveraged the UK copyright exception for academic research.
The model achieves state-of-the-art performance in text-to-audio generation, outperforming baseline models like DiffSound and AudioGen. Performance evaluation encompasses both objective metrics (Frechet distance, Inception score, Kullback-Leibler divergence, Frechet audio distance) and subjective assessments (overall quality, relevance to text).
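As a point of reference, the Frechet-distance family of metrics compares the Gaussian statistics of embeddings extracted from real and generated audio. The sketch below shows only that final computation; the embedding extractor (e.g., a pretrained audio classifier) is left out and assumed.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Frechet distance between two embedding sets modeled as Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2))."""
    mu1, mu2 = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    s1 = np.cov(emb_real, rowvar=False)
    s2 = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts introduced by sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

# Example with random 64-d embeddings standing in for real extracted features.
real = np.random.randn(1000, 64)
fake = np.random.randn(1000, 64)
print(frechet_distance(real, fake))
```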
AudioLDM is available through multiple interfaces, including the original GitHub repository and the Hugging Face Diffusers library.
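With the Diffusers interface, basic text-to-audio generation can look like the minimal sketch below; the checkpoint name, device, step count, and output path are illustrative assumptions.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

# Any of the audioldm-* checkpoints can be substituted here.
pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-m-full", torch_dtype=torch.float16
).to("cuda")

prompt = "A hammer is hitting a wooden surface"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# AudioLDM produces 16 kHz mono audio as a NumPy array.
scipy.io.wavfile.write("output.wav", rate=16000, data=audio)
```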
For optimal results, users should consider:
- The transfer_strength parameter for style transfer operations
- The guidance_scale, ddim_steps, and duration settings for text-to-audio generation (see the sketch below)
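As a rough illustration, the sketch below maps those settings onto AudioLDMPipeline arguments. The mapping of ddim_steps to num_inference_steps and duration to audio_length_in_s is an assumption based on the Diffusers API, the numeric values are illustrative, and transfer_strength (used by the original repository's style-transfer mode) is not shown.

```python
import torch
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-m-full", torch_dtype=torch.float16
).to("cuda")

audio = pipe(
    "Rain falling on a tin roof with distant thunder",
    guidance_scale=2.5,       # higher values follow the text prompt more closely
    num_inference_steps=200,  # plays the role of ddim_steps; more steps are slower but cleaner
    audio_length_in_s=10.0,   # plays the role of duration, in seconds
).audios[0]
```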
The model is licensed under cc-by-nc-nd-4.0, as noted in the model repository.