Note: MAGNeT weights are released under a CC-BY-NC 4.0 license and cannot be used for commercial purposes. Please read the license to verify that your use case is permitted.
The simplest way to self-host MAGNeT is to launch a dedicated cloud GPU server running Lab Station OS, then download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
MAGNeT is Meta AI's text-to-audio model that generates both music and sound effects using a non-autoregressive transformer architecture. It processes multiple audio token streams simultaneously at 32kHz, enabling up to 30-second outputs. Key innovations include span masking, restricted context, and classifier-free guidance annealing.
MAGNeT (Masked Audio Generation using Non-autoregressive Transformers) is a groundbreaking text-to-music and text-to-sound model developed by Meta AI's FAIR team between November 2023 and January 2024. As detailed in the original research paper, MAGNeT's key innovation lies in its single-stage, non-autoregressive transformer architecture that operates directly on multiple streams of audio tokens.
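For hands-on use, MAGNeT ships with Meta's Audiocraft library. The snippet below is a minimal sketch following Audiocraft's documented MAGNeT usage; the checkpoint name, the prompt text, and the output file names are illustrative, so verify the details against the version of the library you have installed.

```python
# Minimal sketch: text-to-music generation with MAGNeT via Audiocraft.
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

model = MAGNeT.get_pretrained("facebook/magnet-medium-30secs")

descriptions = ["80s electronic track with melodic synthesizers and a driving beat"]
wavs = model.generate(descriptions)  # one waveform tensor per description

for idx, wav in enumerate(wavs):
    # Saves magnet_{idx}.wav at the model's native 32 kHz rate with loudness normalization.
    audio_write(f"magnet_{idx}", wav.cpu(), model.sample_rate, strategy="loudness")
```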
The model works with a 32kHz EnCodec tokenizer utilizing 4 codebooks sampled at 50 Hz. Unlike previous approaches, MAGNeT generates all four codebooks using a single non-autoregressive Transformer, eliminating the need for semantic token conditioning or model cascading. During inference, the output sequence is constructed gradually using several decoding steps, enhanced by a novel rescoring method that employs an external pre-trained model to rerank predictions.
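To make the decoding procedure concrete, here is an illustrative toy sketch of masked, non-autoregressive decoding with confidence-based re-masking and optional rescoring. It is not MAGNeT's actual implementation: `predict_logits` and `rescore` are hypothetical callables, and the real model masks spans of tokens per codebook rather than individual positions.

```python
import math
import torch

def masked_iterative_decode(predict_logits, seq_len, steps=20, mask_id=-1, rescore=None):
    """Toy sketch of masked non-autoregressive decoding.

    predict_logits: hypothetical callable, (seq_len,) token tensor -> (seq_len, vocab) logits.
    rescore:        optional hypothetical callable giving a (seq_len,) score per candidate
                    token, standing in for MAGNeT's external rescoring model.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = predict_logits(tokens)        # predict every position in parallel
        probs = logits.softmax(dim=-1)
        conf, cand = probs.max(dim=-1)         # per-position confidence and best token
        if rescore is not None:
            conf = 0.5 * conf + 0.5 * rescore(cand)
        # Commit the current best candidate at every masked position.
        tokens = torch.where(still_masked, cand, tokens)
        # Cosine schedule: how many positions should remain masked after this step.
        n_remask = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        if n_remask > 0:
            # Never re-mask positions that were committed in earlier steps.
            conf = conf.masked_fill(~still_masked, float("inf"))
            tokens[conf.argsort()[:n_remask]] = mask_id
    return tokens
```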
MAGNeT was trained on an extensive dataset of 20,000 hours of licensed music from various sources, including the Meta Music Initiative Sound Collection, Shutterstock, and Pond5. The training process utilized 30-second audio crops sampled at 32kHz, with text preprocessing handled by a pre-trained T5 model for semantic representation extraction.
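The text-conditioning step can be reproduced with any frozen T5 encoder. The sketch below uses Hugging Face transformers, with `t5-base` assumed as the checkpoint; the actual conditioner configuration may differ.

```python
# Sketch: extracting T5 encoder representations for text conditioning.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

texts = ["80s electronic track with melodic synthesizers"]
inputs = tokenizer(texts, return_tensors="pt", padding=True)
with torch.no_grad():
    # Hidden states of shape (batch, seq_len, d_model), used as cross-attention conditioning.
    hidden = encoder(**inputs).last_hidden_state
```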
The model family includes several variants:

- facebook/magnet-small-10secs and facebook/magnet-medium-10secs: ~300M- and ~1.5B-parameter models for 10-second music generation
- facebook/magnet-small-30secs and facebook/magnet-medium-30secs: ~300M- and ~1.5B-parameter models for 30-second music generation
- facebook/audio-magnet-small and facebook/audio-magnet-medium: variants trained for text-to-sound-effect generation
For sound effect generation, Audio-MAGNeT variants were trained on a diverse collection of public datasets including AudioSet, BBC sound effects, AudioCaps, Clotho v2, VGG-Sound, and several professional sound effect libraries.
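Sound-effect generation uses the same Audiocraft interface, only with an Audio-MAGNeT checkpoint; the snippet below is a sketch under that assumption, with an illustrative prompt.

```python
# Sketch: text-to-sound-effect generation with an Audio-MAGNeT checkpoint via Audiocraft.
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

model = MAGNeT.get_pretrained("facebook/audio-magnet-medium")
wavs = model.generate(["dog barking in the distance, birds chirping"])
audio_write("sfx_0", wavs[0].cpu(), model.sample_rate, strategy="loudness")
```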
MAGNeT achieves strong performance, with the facebook/magnet-medium-30secs model scoring a FAD of 4.63, a KLD of 1.20, and a text consistency of 0.28 on the MusicCaps benchmark. Notably, the model operates up to 7 times faster than autoregressive baselines while maintaining comparable quality in both objective metrics and human evaluations.
The model employs several key technical innovations:

- Span masking, which masks contiguous spans of audio tokens rather than individual positions during training
- Restricted context, limiting how much of the sequence later codebooks attend to
- Classifier-free guidance annealing, which varies the guidance strength over the decoding steps (see the sketch below)
- Rescoring, in which an external pre-trained model reranks candidate predictions during inference
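As one concrete illustration, classifier-free guidance annealing can be sketched as interpolating the guidance scale across decoding steps. The start and end values below are illustrative assumptions, not the paper's exact settings.

```python
def annealed_cfg_logits(cond_logits, uncond_logits, step, total_steps,
                        guidance_start=10.0, guidance_end=1.0):
    """Toy sketch of classifier-free guidance annealing (scale values are assumptions).

    The guidance scale is interpolated from a strong value early in decoding, when most
    tokens are still masked, down to a weaker value as the sequence fills in.
    """
    t = step / max(total_steps - 1, 1)
    scale = (1 - t) * guidance_start + t * guidance_end
    return uncond_logits + scale * (cond_logits - uncond_logits)
```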
MAGNeT's code is released under the MIT license, while the model weights are available under a CC-BY-NC 4.0 license. The model is primarily intended for research purposes in AI-based music generation, targeting researchers and enthusiasts in audio, machine learning, and AI fields.
Users should be aware of certain limitations, including potential biases in vocal generation, cross-lingual performance, and genre representation. The model may also experience occasional generation collapse, requiring careful consideration in downstream applications.