Note: MusicGen weights are released under a CC-BY-NC 4.0 License, and cannot be utilized for commercial purposes. Please read the license to verify if your use case is permitted.
The simplest way to self-host MusicGen. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
MusicGen is Meta AI's single-pass music generation model that creates music from text or audio prompts using an autoregressive transformer architecture. Available in sizes from 300M to 3.3B parameters, it generates audio at 32kHz using parallel prediction and can incorporate melody conditioning. It was trained on 20,000 hours of licensed instrumental music.
MusicGen is a single-stage autoregressive transformer model developed by Meta AI's FAIR team, designed for high-quality music generation. The model can generate music based on text descriptions or audio prompts, offering a simpler approach compared to previous cascading or hierarchical models. Published in the paper "Simple and Controllable Music Generation", MusicGen represents a significant advancement in AI-powered music creation.
The model operates using a 32kHz EnCodec tokenizer with four codebooks sampled at 50 Hz, generating all codebooks in a single pass. Unlike MusicLM, MusicGen does not require a self-supervised semantic representation, achieving 50 autoregressive steps per second of audio through parallel prediction with a small delay between codebooks. The architecture includes codebook projection, positional embedding, and a causal self-attention block with cross-attention for conditioning.
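The delay trick can be illustrated with a short sketch: codebook k is shifted right by k steps, so at every autoregressive step the model emits one token per codebook in parallel while each codebook still conditions on the previously generated ones. This is a minimal illustration of the idea only, not Audiocraft's actual pattern code, and the `pad_token` argument is a hypothetical placeholder.

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_token: int) -> torch.Tensor:
    """Offset each codebook by its index so all K codebooks can be
    predicted in parallel at each autoregressive step.

    codes: (K, T) integer token ids from the EnCodec quantizer.
    Returns a (K, T + K - 1) tensor padded with `pad_token`.
    """
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # codebook k is delayed by k steps
    return out
```

With four codebooks at 50 Hz, this adds only three extra steps of latency per sequence while keeping generation to roughly 50 steps per second of audio.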
Text conditioning is implemented using a T5 text encoder, which can be optionally combined with unsupervised melody conditioning through chromagram analysis. The model employs classifier-free guidance during sampling and uses Flash Attention for improved efficiency. For stereo audio generation, the model applies codebook interleaving patterns to left and right channels independently, achieving high-quality stereo output with minimal computational overhead.
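The pieces above come together in the Audiocraft Python API. The sketch below follows the function names documented in the facebookresearch/audiocraft README (verify against your installed version); the prompt text and the `reference.wav` file are placeholders.

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-conditioned checkpoint and set sampling parameters.
# cfg_coef is the classifier-free guidance coefficient discussed above.
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=10, use_sampling=True, top_k=250, cfg_coef=3.0)

# Condition on both a text description and a reference melody (chromagram-based).
melody, sr = torchaudio.load('reference.wav')  # placeholder reference audio
wav = model.generate_with_chroma(
    descriptions=['lo-fi hip hop with warm piano'],
    melody_wavs=melody[None],   # batch of reference melodies, shape (B, C, T)
    melody_sample_rate=sr,
)

for i, one_wav in enumerate(wav):
    audio_write(f'out_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```

Text-only generation works the same way via `model.generate(descriptions)`, without the melody arguments.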
Several MusicGen variants exist in the model family:
- musicgen-small (300M parameters): text-to-music only
- musicgen-medium (1.5B parameters): text-to-music only
- musicgen-large (3.3B parameters): text-to-music only
- musicgen-melody (1.5B parameters): text-to-music with optional melody conditioning
- musicgen-melody-large (3.3B parameters): text-to-music with optional melody conditioning
- Stereo versions of the above (e.g., musicgen-stereo-small), fine-tuned to produce two-channel output
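For a quick check of which variant fits your hardware, the checkpoints can also be loaded through the Hugging Face transformers integration. A rough sketch, assuming the `facebook/musicgen-small` checkpoint and the `MusicgenForConditionalGeneration` class available in recent transformers releases; any of the checkpoints listed above can be swapped in.

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["80s synth-pop with a driving bassline"],
    padding=True,
    return_tensors="pt",
)

# 256 new tokens at the 50 Hz frame rate is roughly five seconds of audio.
audio = model.generate(**inputs, do_sample=True, guidance_scale=3.0, max_new_tokens=256)
```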
The model was trained on an extensive dataset of 20,000 hours of licensed music from Meta's Sound Collection, Shutterstock, and Pond5. The training data was preprocessed to remove vocals using source separation via HT-Demucs. All audio was sampled at 32 kHz and included metadata such as genre, BPM, and tags.
MusicGen's performance has been extensively evaluated using both automatic metrics and human assessments. The 3.3B parameter variant achieved a subjective quality rating of 84.8/100 on the MusicCaps benchmark, surpassing previous models like MusicLM (80.5/100) and Mousai (76.11/100).
Evaluation metrics included:
- Fréchet Audio Distance (FAD) computed over audio embeddings
- Kullback-Leibler divergence (KLD) between label distributions from a pretrained audio classifier
- CLAP score measuring text-audio alignment
- Human ratings of overall quality (OVL) and relevance to the text prompt (REL)
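For reference, FAD reduces to the Fréchet distance between two Gaussians fitted to embedding sets from reference and generated audio. Below is a minimal numpy sketch assuming the embeddings have already been extracted (the paper's setup uses VGGish features); it is not the official evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between two embedding sets (rows = audio clips):
        ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    sigma_a = np.cov(emb_a, rowvar=False)
    sigma_b = np.cov(emb_b, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_a @ sigma_b, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))
```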
While the model shows promising results, it does have some limitations, particularly in:
- Generating realistic vocals (the training data was instrumental only)
- Handling text descriptions in languages other than English
- Covering all musical styles and cultures equally well
- Offering fine-grained control over how closely the output follows the conditioning, sometimes requiring prompt engineering to get satisfying results
The code for MusicGen is released under the MIT license, while the model weights are licensed under CC-BY-NC 4.0. The model and its variants are available through the Audiocraft GitHub repository, and the project was officially released in October 2023.