Model Report
adefossez / Demucs
Demucs is an audio source separation model that decomposes music tracks into constituent stems such as vocals, drums, and bass. The latest version (v4) features the Hybrid Transformer Demucs architecture, combining dual U-Nets operating in the time and frequency domains with cross-domain transformer attention. Released under the MIT license, it achieves competitive performance on MUSDB HQ benchmarks and is used in music production and research applications.
Demucs is a deep learning model developed for music source separation: the process of decomposing audio tracks into their constituent stems, such as vocals, drums, bass, and other instruments. The latest iteration, Demucs v4, introduces the Hybrid Transformer Demucs (HT Demucs) architecture, which combines time-domain and frequency-domain signal processing with transformer-based attention mechanisms. The model builds on a hybrid dual U-Net design and demonstrates strong empirical performance on established evaluation datasets, including MUSDB HQ. Released under an MIT license, Demucs has become a reference system for both academic research and practical music production applications.
Diagram of the Hybrid Transformer Demucs architecture, showcasing dual U-Nets for time and frequency domains, interconnected by a cross-domain transformer encoder.
The HT Demucs architecture builds upon the foundation established by previous versions of Demucs. At its core, the model consists of two distinct U-Net structures operating in parallel: one dedicated to the time domain (processing raw waveforms through temporal convolutions) and one to the frequency domain (processing spectrogram representations through convolutions across frequency bins). This dual-branch approach enables the model to exploit the complementary strengths of both representations.
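The dual-branch idea can be illustrated with a short PyTorch-style sketch: one stack of temporal convolutions processes the raw stereo waveform while a second stack of 2-D convolutions processes its magnitude spectrogram. The channel counts, kernel sizes, and STFT settings below are illustrative assumptions, not the actual Demucs hyperparameters.

```python
import torch
import torch.nn as nn


class DualBranchEncoder(nn.Module):
    """Illustrative dual-branch encoder: temporal convolutions over the raw
    waveform in parallel with 2-D convolutions over a magnitude spectrogram.
    Channel counts and STFT settings are placeholders, not Demucs' real values."""

    def __init__(self, n_fft: int = 4096, hop: int = 1024, channels: int = 48):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Time branch: strided 1-D convolutions over the stereo waveform.
        self.time_branch = nn.Sequential(
            nn.Conv1d(2, channels, kernel_size=8, stride=4), nn.GELU(),
            nn.Conv1d(channels, 2 * channels, kernel_size=8, stride=4), nn.GELU(),
        )
        # Frequency branch: 2-D convolutions over (frequency, time) bins.
        self.freq_branch = nn.Sequential(
            nn.Conv2d(2, channels, 3, stride=(2, 1), padding=1), nn.GELU(),
            nn.Conv2d(channels, 2 * channels, 3, stride=(2, 1), padding=1), nn.GELU(),
        )

    def forward(self, wav: torch.Tensor):
        # wav: (batch, 2, samples) stereo audio.
        zt = self.time_branch(wav)
        spec = torch.stft(
            wav.reshape(-1, wav.shape[-1]), self.n_fft, self.hop,
            window=torch.hann_window(self.n_fft, device=wav.device),
            return_complex=True,
        )
        mag = spec.abs().reshape(wav.shape[0], 2, spec.shape[-2], spec.shape[-1])
        zf = self.freq_branch(mag)
        return zt, zf  # high-level features handed on to the decoders / transformer


# Encode two seconds of stereo audio at 44.1 kHz.
zt, zf = DualBranchEncoder()(torch.randn(1, 2, 2 * 44100))
print(zt.shape, zf.shape)
```

In the full model, each branch is a complete U-Net with a matching decoder, and the two branch outputs are combined back into separated waveforms.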
A central innovation in HT Demucs is the introduction of a cross-domain Transformer encoder, situated at the innermost layers where high-level features from both branches converge. The Transformer encoder employs self-attention within each domain and cross-attention between domains, allowing the model to dynamically integrate temporal and spectral cues. This facilitates more flexible handling of diverse musical elements and supports longer temporal context windows compared to earlier models. The architecture also incorporates sparse attention kernels based on Locality-Sensitive Hashing (LSH), which enable the model to extend its receptive field while mitigating the memory demands typically associated with attention mechanisms. With this architecture, HT Demucs can process input segments of up to 12.2 seconds during training, a notable improvement over models restricted to much shorter contextual spans.
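A minimal sketch of the cross-domain attention step, assuming both branches have already been projected to a shared embedding dimension and flattened into token sequences, might look as follows. It uses standard dense multi-head attention and omits the normalization, feed-forward blocks, positional embeddings, and sparse LSH kernels of the actual model.

```python
import torch
import torch.nn as nn


class CrossDomainLayer(nn.Module):
    """Illustrative cross-domain layer: self-attention inside each branch, then
    cross-attention in which each branch queries the other. The real HT Demucs
    layers add normalization, feed-forward blocks, positional embeddings, and
    sparse (LSH-based) attention variants that are omitted here."""

    def __init__(self, dim: int = 384, heads: int = 8):
        super().__init__()
        self.self_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_f = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # time queries freq
        self.cross_f = nn.MultiheadAttention(dim, heads, batch_first=True)  # freq queries time

    def forward(self, zt: torch.Tensor, zf: torch.Tensor):
        # zt: (batch, time_tokens, dim); zf: (batch, freq_tokens, dim)
        zt = zt + self.self_t(zt, zt, zt, need_weights=False)[0]
        zf = zf + self.self_f(zf, zf, zf, need_weights=False)[0]
        zt = zt + self.cross_t(zt, zf, zf, need_weights=False)[0]
        zf = zf + self.cross_f(zf, zt, zt, need_weights=False)[0]
        return zt, zf


layer = CrossDomainLayer()
zt, zf = layer(torch.randn(1, 512, 384), torch.randn(1, 640, 384))
print(zt.shape, zf.shape)
```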
Training Procedure and Datasets
HT Demucs has been trained using a combination of benchmark and proprietary datasets. The principal public dataset employed is MUSDB18-HQ, a widely-used corpus comprising 150 professionally mixed tracks with corresponding source stems. To supplement this, developers curated an internal dataset of 800 songs, carefully selected to ensure high stem quality and accurate source labeling. Dataset curation involved requirements such as a minimum active time threshold for stems and automated checks for stem accuracy using pre-trained models.
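A minimum-active-time check of the kind described above could be implemented roughly as follows; the frame length, dB threshold, and required activity fraction are illustrative assumptions rather than the criteria actually used to curate the internal dataset.

```python
import torch


def active_fraction(stem: torch.Tensor, frame: int = 4410, db_threshold: float = -40.0) -> float:
    """Fraction of frames in which a stem is 'active', i.e. its RMS level exceeds
    a dB threshold. Frame length and threshold are illustrative, not the exact
    criteria used to curate the internal Demucs training set."""
    # stem: (channels, samples); fold into non-overlapping frames of `frame` samples.
    usable = stem.shape[-1] - stem.shape[-1] % frame
    frames = stem[..., :usable].reshape(stem.shape[0], -1, frame)
    rms = frames.pow(2).mean(dim=(0, 2)).sqrt()          # per-frame RMS, averaged over channels
    level_db = 20 * torch.log10(rms.clamp(min=1e-8))
    return (level_db > db_threshold).float().mean().item()


# A stem that is silent for 9 of its 10 seconds would fail a 50% activity requirement.
stem = torch.zeros(2, 44100 * 10)
stem[:, :44100] = 0.1 * torch.randn(2, 44100)
print(active_fraction(stem))  # ~0.1
```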
Model training utilized data augmentations tailored for music, including pitch shifting, tempo stretching, and remixing of stems within individual batches. Data augmentation, particularly remixing, proved critical for model robustness: ablation studies indicate a drop of up to 0.7 dB in separation performance without it. Training used the Adam optimizer with an L1 loss on the output waveforms, large batches, and extended training schedules to ensure convergence. Fine-tuning on individual sources, such as vocals or drums, was also explored to boost per-stem separation quality.
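The remixing augmentation and the L1 waveform objective can be sketched as below; the stand-in model, learning rate, and tensor shapes are placeholders rather than the published training configuration.

```python
import torch
import torch.nn.functional as F


def remix(stems: torch.Tensor) -> torch.Tensor:
    """Shuffle each source independently across the batch, creating mixtures from
    stems that never co-occurred. stems: (batch, sources, channels, samples)."""
    batch, sources = stems.shape[:2]
    return torch.stack(
        [stems[torch.randperm(batch), s] for s in range(sources)], dim=1
    )


def training_step(model, stems, optimizer):
    """One illustrative step: remix the stems, rebuild the mixture, and minimize
    the L1 distance between estimated and reference waveforms."""
    stems = remix(stems)
    mix = stems.sum(dim=1)            # (batch, channels, samples)
    estimate = model(mix)             # expected: (batch, sources, channels, samples)
    loss = F.l1_loss(estimate, stems)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example with a trivial stand-in model that scales copies of the mixture.
class Dummy(torch.nn.Module):
    def __init__(self, sources: int = 4):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(sources, 1, 1))

    def forward(self, mix):
        return mix.unsqueeze(1) * self.scale


model = Dummy()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # placeholder learning rate
stems = torch.randn(8, 4, 2, 44100)
print(training_step(model, stems, optimizer))
```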
Performance and Benchmarking
HT Demucs demonstrates competitive results on standard music source separation benchmarks. On the MUSDB HQ test set, the model achieves a Signal-to-Distortion Ratio (SDR) of up to 9.20 dB when using sparse attention kernels and per-source fine-tuning. Trained on MUSDB18-HQ together with the 800 additional curated songs, HT Demucs consistently outperforms earlier Demucs variants and competing systems such as Band-Split RNN, Spleeter, and D3Net.
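In its simplest form, SDR compares the energy of the reference stem with the energy of the estimation error. The sketch below uses that plain definition; the MUSDB HQ figures quoted above follow the standard BSSEval/museval protocol, which is more involved.

```python
import torch


def sdr(reference: torch.Tensor, estimate: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2). Benchmark figures
    such as the 9.20 dB above use the BSSEval (museval) protocol, which applies
    additional per-window filtering of the reference before this ratio is taken."""
    num = reference.pow(2).sum()
    den = (reference - estimate).pow(2).sum()
    return 10 * torch.log10((num + eps) / (den + eps))


# An estimate equal to the reference plus low-level noise scores around 40 dB.
ref = torch.randn(2, 44100)
est = ref + 0.01 * torch.randn_like(ref)
print(float(sdr(ref, est)))
```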
Separation quality is further supported by subjective evaluation via Mean Opinion Score (MOS), which assesses both perceived audio quality (absence of artifacts) and contamination (leakage between sources). Previous versions such as Hybrid Demucs (v3) achieved high scores in these categories, as reported in the original research publication.
Demucs Output: test.mp3
Example of output audio from Demucs. The model separated this track into distinct stems. [Source]
Applications and Use Cases
Demucs is principally used in music production, research, and educational contexts. Its separation of full tracks into constituent stems enables a variety of downstream workflows such as remixing, sampling, karaoke backing track creation, and detailed musicological analysis. The model serves as an experimental baseline for continued research in audio source separation, supports algorithmic evaluation frameworks, and is integrated into both real-time and batch processing tools for end users. Specialized modes, such as isolating only vocal stems or removing vocals for karaoke, further extend its utility.
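As a usage sketch, these workflows map onto the packaged command-line tool. The flags below (`-n` to select a model, `--two-stems` for a vocals/accompaniment split, `-o` for the output directory) reflect the documented CLI, but option names can change between releases, so verify them against `demucs --help` for the installed version.

```python
import subprocess


def separate_vocals(track: str, out_dir: str = "separated") -> None:
    """Run the htdemucs model and keep only a vocals stem plus the remainder,
    the typical setup for karaoke-style backing tracks. Flag names follow the
    documented CLI but should be checked against `demucs --help`."""
    subprocess.run(
        ["demucs", "-n", "htdemucs", "--two-stems", "vocals", "-o", out_dir, track],
        check=True,
    )


separate_vocals("test.mp3")
```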
Limitations and Known Issues
Despite its performance, Demucs and its transformer-based design have inherent limitations. Transformers require substantial amounts of labeled data to achieve optimal results; the improvement over pre-transformer architectures depends on the extra curated training material beyond public datasets. The model's attention mechanisms are memory-intensive, especially as input segment lengths grow, necessitating optimizations such as sparse attention. Moreover, the experimental 6-source models, which add piano and guitar stems, currently exhibit artifacts and stem bleeding, particularly in the piano source. Fine-tuned models can require considerably more processing time than the base models, and deployment on limited hardware may require reducing batch sizes or input segment lengths.
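One practical mitigation on memory-constrained hardware is to shorten the processed segments. The `--segment` option used below exists in recent Demucs releases, though the permissible range depends on the selected model; treat the exact value as an assumption to verify locally.

```python
import subprocess

# Shorter segments trade a little separation quality for a smaller memory
# footprint. The `--segment` value here is an assumption; the allowed range
# depends on the selected model, so confirm it with `demucs --help`.
subprocess.run(
    ["demucs", "-n", "htdemucs", "--segment", "7", "-o", "separated", "test.mp3"],
    check=True,
)
```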
Model Evolution and Family
Demucs v1 and v2 used time-domain U-Nets with bidirectional LSTM bottlenecks to perform source separation. Hybrid Demucs (v3) introduced a dual U-Net operating jointly in the time and frequency domains, laying the groundwork for transformer integration. Each subsequent generation incorporated enhancements in architecture, training techniques, and available datasets. Hybrid Transformer Demucs, presented in Demucs v4, combines the multi-branch convolutional design with cross-domain transformer attention, leading to improved separation metrics on varied audio material. Documentation and the codebase, including implementation and training details, are available in the official GitHub repository.
Licensing
Demucs is made available under MIT licensing terms, allowing both academic and commercial use with minimal restrictions.