MAGNeT (Masked Audio Generation using a Single Non-Autoregressive Transformer) is a generative artificial intelligence model introduced by Meta AI's FAIR team for synthesizing music and sound directly from textual descriptions. The MAGNeT framework is engineered to deliver high-quality, text-conditioned audio samples using a single-stage, non-autoregressive Transformer. MAGNeT is designed for both research on generative music systems and practical low-latency audio applications, supporting text-to-music as well as text-to-audio (sound effects) synthesis.
Model Architecture and Technical Foundations
At the core of MAGNeT is a non-autoregressive Transformer that generates all four audio codebooks in parallel rather than sequentially. This stands in contrast to earlier autoregressive models such as MusicGen, which produce output tokens one after another. For audio tokenization, MAGNeT relies on a 32 kHz EnCodec model that maps audio to four discrete codebooks at a 50 Hz frame rate, each with a vocabulary of 2048 tokens. This tokenization lets the model reconstruct high-fidelity audio from compact, discrete representations, enabling efficient learning and rapid inference as described in the MAGNeT paper.
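The token-grid arithmetic implied by this setup can be made concrete with a minimal sketch; the constants mirror the figures above, while the function name and structure are purely illustrative and not taken from the EnCodec or Audiocraft implementations.

```python
# Illustrative sketch of the EnCodec token grid used by MAGNeT (not the
# actual tokenizer): a 32 kHz waveform is mapped to 4 RVQ codebooks at a
# 50 Hz frame rate, each frame drawing from a 2048-entry vocabulary.
SAMPLE_RATE = 32_000   # Hz, input audio
FRAME_RATE = 50        # Hz, token frames per second
NUM_CODEBOOKS = 4      # residual vector-quantization levels
VOCAB_SIZE = 2048      # entries per codebook

def token_grid_shape(duration_sec: float) -> tuple[int, int]:
    """Return (codebooks, timesteps) for a clip of the given duration."""
    timesteps = int(duration_sec * FRAME_RATE)
    return NUM_CODEBOOKS, timesteps

if __name__ == "__main__":
    # A 10-second clip becomes a 4 x 500 grid of discrete tokens,
    # i.e. 2,000 tokens instead of 320,000 raw audio samples.
    print(token_grid_shape(10.0))  # -> (4, 500)
```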
The masking scheme in MAGNeT departs from conventional token-level infilling by operating on spans of tokens, roughly 60 milliseconds in duration (three tokens at the 50 Hz frame rate), to better capture the temporal dependencies inherent in audio. The self-attention mechanism for codebooks beyond the first is temporally restricted, attending only to tokens within an approximately 200 millisecond window, which keeps computation tractable while exploiting the local structure of the residual vector quantization (RVQ) hierarchy.
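The following is a minimal sketch of span-level masking under the assumptions above (3-token spans at 50 Hz); the function name, span-placement strategy, and the simplification that overlapping spans may push the realized masking rate slightly below the target are all illustrative choices, not details of the released implementation.

```python
import torch

# Minimal sketch of span-level masking: instead of hiding isolated tokens,
# whole spans of SPAN_LEN consecutive tokens (3 tokens ~= 60 ms at 50 Hz)
# are masked. Overlapping spans can make the realized rate fall slightly
# below mask_rate; the paper accounts for this, but it is ignored here.
SPAN_LEN = 3

def mask_spans(seq_len: int, mask_rate: float, generator=None) -> torch.Tensor:
    """Return a boolean mask of shape (seq_len,), True = masked."""
    num_spans = max(1, int(round(mask_rate * seq_len / SPAN_LEN)))
    starts = torch.randperm(max(seq_len - SPAN_LEN + 1, 1), generator=generator)[:num_spans]
    mask = torch.zeros(seq_len, dtype=torch.bool)
    for s in starts.tolist():
        mask[s : s + SPAN_LEN] = True
    return mask

# Example: mask roughly 50% of a 500-step (10-second) token sequence.
m = mask_spans(500, 0.5)
print(m.sum().item(), "of", m.numel(), "positions masked")
```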
Conditioning on natural language descriptions is achieved using semantic representations from a pre-trained T5 text encoder. At inference time, a classifier-free guidance (CFG) annealing strategy is applied: guidance strength is gradually reduced as fewer tokens remain masked, shifting the model's focus from textual alignment to context-aware sequence completion. For optimization, MAGNeT employs the AdamW optimizer with a cosine learning rate schedule and uses Flash Attention for improved computational performance.
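A short sketch of what CFG annealing amounts to is given below: the guidance scale is interpolated toward its low end as the fraction of still-masked tokens shrinks. The scale endpoints and the linear interpolation are assumptions made for illustration and may differ from the values used in the released models.

```python
import torch

# Illustrative sketch of classifier-free-guidance annealing: early decoding
# iterations (high mask rate) lean heavily on the text prompt, later ones
# lean on the already-decoded audio context. scale_hi/scale_lo are assumed
# endpoints for illustration only.
def annealed_guidance(cond_logits: torch.Tensor,
                      uncond_logits: torch.Tensor,
                      mask_rate: float,
                      scale_hi: float = 10.0,
                      scale_lo: float = 1.0) -> torch.Tensor:
    """Blend conditional/unconditional logits with a mask-rate-dependent scale."""
    scale = scale_lo + mask_rate * (scale_hi - scale_lo)  # mask_rate in [0, 1]
    return uncond_logits + scale * (cond_logits - uncond_logits)
```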
Training Regimen and Data Sources
MAGNeT models were trained between November 2023 and January 2024. For text-to-music generation, the training data consist of 20,000 hours of licensed audio, primarily instrumental tracks, including material from the Meta Music Initiative Sound Collection, Shutterstock music collections, and Pond5, all sampled at 32 kHz. To ensure the focus remains on instrumental synthesis, vocals were systematically removed using tags and the HT-Demucs source separation tool.
For the sound effect variant (Audio-MAGNeT), the dataset comprises curated clips from AudioSet, BBC Sound Effects, AudioCaps, Clotho v2, VGG-Sound, FSD50K, Free To Use Sounds, Sonniss Game Effects, WeSoundEffects, and Paramount Motion - Odeon Cinematic Sound Effects. Training for each model was conducted for up to 1 million steps to ensure robust generalization in both music and non-music audio domains, as outlined in the Audiocraft documentation.
Performance Metrics and Benchmarking
MAGNeT's performance has been rigorously assessed using both objective and subjective evaluation protocols. For objective measures in text-to-music generation, the model is benchmarked on the MusicCaps dataset, while its sound effect synthesis is evaluated via the AudioCaps benchmark.
Key metrics include Fréchet Audio Distance (FAD), which assesses the statistical plausibility of generated audio, and Kullback-Leibler Divergence (KLD) computed on label distributions predicted by pre-trained audio classifiers. Additionally, alignment between generated audio and textual input is quantified with CLAP scores, leveraging the Contrastive Language-Audio Pretraining (CLAP) framework.
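The KLD metric can be illustrated with a minimal sketch: a pre-trained audio classifier is run on the reference and the generated clip, and the two predicted label distributions are compared. The classifier is abstracted away here; the logit inputs and function name are assumptions of this sketch, not part of the official evaluation code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the KLD metric idea: compare class-probability
# distributions predicted by a pre-trained audio classifier for the
# reference clip versus the generated clip.
def label_kld(ref_logits: torch.Tensor, gen_logits: torch.Tensor) -> float:
    """KL(ref || gen) over class probabilities, averaged over a batch."""
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    gen_logp = F.log_softmax(gen_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and, with log_target=True,
    # log-probabilities as target; this computes KL(ref || gen).
    return F.kl_div(gen_logp, ref_logp, reduction="batchmean", log_target=True).item()
```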
On the MusicCaps benchmark, MAGNeT-medium models typically achieve FAD scores around 4.6, KLD values around 1.2, and text-audio CLAP alignment near 0.28, competitive with or approaching results from autoregressive models such as MusicGen. In text-to-audio benchmarking, Audio-MAGNeT variants report FAD scores as low as 2.3 and KLD values around 1.6.
MAGNeT distinguishes itself through inference speed: the non-autoregressive architecture enables synthesis up to seven times faster than comparable autoregressive models, reaching interactive-level latency for many use cases. For example, the MAGNeT-small model generates ten-second samples in approximately four seconds, as detailed in the original study.
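A back-of-the-envelope comparison makes the source of this speedup concrete: parallel masked decoding needs a fixed, small number of forward passes regardless of sequence length, whereas an autoregressive decoder needs one pass per token frame. The MAGNeT iteration counts below are illustrative assumptions, not the exact released decoding defaults.

```python
# Rough comparison of decoder forward passes for a 10-second clip
# (500 token frames at 50 Hz). Iteration counts are illustrative.
FRAMES = 500                # 10 s * 50 Hz
AR_STEPS = FRAMES           # autoregressive: one forward pass per frame
MAGNET_STEPS = 20 + 3 * 10  # e.g. 20 iterations for codebook 1, 10 for each of the other 3

print(f"autoregressive: ~{AR_STEPS} passes, MAGNeT-style: ~{MAGNET_STEPS} passes "
      f"(~{AR_STEPS / MAGNET_STEPS:.0f}x fewer)")
```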
Model Variants and Release
MAGNeT is distributed in several configurations to suit different research and prototyping needs. Model scales include 300-million and 1.5-billion parameter variants, and separate checkpoints are provided for music and general audio synthesis. Notable released checkpoints include "magnet-small-10secs," "magnet-medium-10secs," "magnet-small-30secs," and "magnet-medium-30secs," as well as their audio-oriented counterparts ("audio-magnet-small" and "audio-magnet-medium"), documented collectively on the Hugging Face model hub.
The model is released under the MIT License for code, with model weights distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC-BY-NC 4.0) license. The reference implementation is part of the Audiocraft library, which integrates the entire MAGNeT family alongside models such as MusicGen.
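A sketch of loading a released checkpoint through the Audiocraft library is shown below, assuming the MAGNeT class mirrors Audiocraft's MusicGen-style interface (get_pretrained, generate, audio_write); consult the Audiocraft documentation for the exact API and generation parameters.

```python
import torch
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Sketch: load a released checkpoint and generate from a text prompt,
# assuming the API mirrors Audiocraft's MusicGen interface.
model = MAGNeT.get_pretrained("facebook/magnet-small-10secs")

descriptions = ["80s electronic track with melodic synthesizers"]
with torch.no_grad():
    wav = model.generate(descriptions)  # (batch, channels, samples) at 32 kHz

for i, one_wav in enumerate(wav):
    # Writes e.g. sample_0.wav with loudness normalization.
    audio_write(f"sample_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```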
Applications, Limitations, and Responsible Usage
MAGNeT's open research release framework is intended to advance scientific understanding in the domain of text-to-audio generation. Applications include researching the boundaries of generative music and sound, enabling interactive tools for music generation within digital audio workstations, and serving as a pedagogical instrument for those entering the field of deep generative models.
However, the model presents several limitations. MAGNeT cannot synthesize realistic vocals, since vocal content was removed during training. It is optimized for prompts and descriptions in English and may underperform with non-English inputs. Furthermore, as with all generative models, synthesis quality and stylistic breadth are determined by the diversity and representativeness of the training data, which may introduce stylistic and cultural biases; the model may also produce unwanted artifacts or abrupt silences, particularly at sequence boundaries.
The creators caution that the model should not be used directly in downstream, high-stakes audio applications without thorough risk analysis, particularly in contexts where output quality, cultural neutrality, or content appropriateness are mission-critical. The model is distributed for non-commercial research and development, in keeping with its licensing terms, and was presented at ICLR 2024 as "version 1."