The simplest way to self-host Demucs. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Demucs v4 (HT Demucs) is an audio source separation model using a bi-U-Net architecture that processes sound in both time and frequency domains. Its cross-domain Transformer encoder with sparse attention enables effective separation of music into drums, bass, vocals, and other instruments, achieving 9.20 dB SDR on MUSDB18.
Demucs is a state-of-the-art music source separation model, designed to split a track into individual stems such as drums, bass, and vocals, separated from the rest of the accompaniment. The latest version, Demucs v4 (also known as Hybrid Transformer Demucs, or HT Demucs), represents a significant architectural advance, building on previous versions by incorporating a cross-domain Transformer encoder into the existing Hybrid Demucs framework.
The model's architecture is based on a bi-U-Net structure that operates in both the time and spectral domains. The key innovation in HT Demucs is the replacement of the innermost layers with a cross-domain Transformer encoder that applies self-attention within each domain and cross-attention between them, as detailed in the HT Demucs research paper. This encoder has a depth of 5, with interleaved self-attention and cross-attention layers. The Transformer component has an input/output dimension of 384, employs 8 attention heads, and uses a feed-forward network whose hidden size is four times the transformer dimension.
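As a rough illustration of the interleaving, here is a single-head NumPy sketch of one self-attention/cross-attention round between the time-branch and spectral-branch token sequences. It deliberately omits the learned projections, residual connections, normalization, feed-forward networks, and positional encodings of the real encoder, and the sequence lengths are made up:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
dim = 384                              # transformer width used by HT Demucs
t = rng.standard_normal((100, dim))    # time-branch tokens (length made up)
z = rng.standard_normal((80, dim))     # spectral-branch tokens (length made up)

# One interleaved round: self-attention within each domain...
t, z = attention(t, t, t), attention(z, z, z)
# ...then cross-attention between domains (queries from one branch,
# keys/values from the other), so each branch can read the other's state.
t, z = attention(t, z, z), attention(z, t, t)

print(t.shape, z.shape)  # each branch keeps its own length and width
```

Note that cross-attention leaves each branch's sequence length unchanged: only the keys and values come from the other domain.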
To handle longer audio sequences efficiently, the architecture implements sparse attention kernels using Locality-Sensitive Hashing (LSH), resulting in a variant called Sparse HT Demucs. The model also adds 1D and 2D sinusoidal encodings to its inputs to give the attention layers positional information.
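The idea behind LSH-based sparse attention can be sketched in a few lines: hash tokens with random hyperplanes so that similar tokens tend to land in the same bucket, then restrict attention to within-bucket pairs. This is a toy illustration of the principle, not the actual kernels used by Sparse HT Demucs:

```python
import numpy as np

def lsh_buckets(x, n_bits, rng):
    # Hash each vector by the sign pattern of random projections:
    # nearby vectors tend to agree on signs and share a bucket.
    planes = rng.standard_normal((x.shape[-1], n_bits))
    bits = (x @ planes) > 0
    return bits.astype(int) @ (1 << np.arange(n_bits))

rng = np.random.default_rng(0)
tokens = rng.standard_normal((512, 64))
buckets = lsh_buckets(tokens, n_bits=4, rng=rng)   # 2**4 = 16 buckets

# Sparse attention: each token attends only to tokens in its own bucket,
# instead of computing all 512 x 512 pairwise scores.
pairs = sum(int((buckets == b).sum()) ** 2 for b in np.unique(buckets))
print(f"attention pairs: {pairs} of {512 * 512}")
```

With roughly balanced buckets, the number of attended pairs drops by about a factor of the bucket count, which is what makes longer sequences tractable.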
HT Demucs was primarily trained on the MUSDB18 dataset, supplemented with an additional 800 curated songs. The training process employed various data augmentation techniques, including repitching, tempo stretching, and remixing. The model uses L1 loss on waveforms and is optimized using the Adam optimizer.
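Of these augmentations, remixing is the easiest to illustrate: the individual sources of different songs are shuffled against each other and re-summed, creating new training mixtures that never occur in the original data. A toy NumPy sketch (array shapes are made up, and real repitching and tempo stretching require resampling, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy batch: 4 songs x 4 sources (drums, bass, other, vocals) x samples.
stems = rng.standard_normal((4, 4, 44100))

# Remixing augmentation: permute each source across songs independently,
# e.g. song 0's drums may be paired with song 2's bass and song 3's vocals.
remixed = np.stack([stems[rng.permutation(4), s] for s in range(4)], axis=1)
mixture = remixed.sum(axis=1)   # the separation model's input
print(mixture.shape)
```

Since each source is only reshuffled, the remixed batch contains exactly the same stems as the original, just in new combinations, so the separation targets remain valid.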
Performance-wise, HT Demucs has achieved remarkable results. When trained with additional data, it surpasses the original Hybrid Demucs by 0.45 dB in Signal-to-Distortion Ratio (SDR). The Sparse HT Demucs variant, combined with per-source fine-tuning, achieves a state-of-the-art SDR of 9.20 dB on the MUSDB dataset. This performance significantly exceeds other models in the field, including Wave-U-Net, Open-Unmix, D3Net, and Band-Split RNN (which achieves 8.9 dB SDR with additional unsupervised data).
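SDR measures how much of the reference signal's power survives relative to the estimation error, in decibels. A basic implementation of the formula follows; note that published MUSDB18 figures come from standardized evaluation protocols, so values from this simple function are illustrative rather than directly comparable:

```python
import numpy as np

def sdr(reference, estimate):
    # Signal-to-Distortion Ratio in dB: power of the reference signal
    # divided by the power of the residual error.
    err = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / np.sum(err**2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)                 # one second of "clean" source
est = ref + 0.1 * rng.standard_normal(44100)     # estimate with 10% noise
print(f"{sdr(ref, est):.2f} dB")                 # about 20 dB for 10% noise
```

Every extra decibel reflects roughly 26% less residual error power, so the 0.45 dB gain over Hybrid Demucs is a meaningful improvement at this level of performance.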
The Demucs family includes several variants, selectable by name from the command line (this list reflects the models shipped with the official demucs package and may change between releases): htdemucs, the default v4 Hybrid Transformer model; htdemucs_ft, a fine-tuned version that is slower but slightly more accurate; htdemucs_6s, an experimental six-source model that adds piano and guitar stems; hdemucs_mmi, a retrained Hybrid Demucs v3; and the older mdx, mdx_extra, mdx_q, and mdx_extra_q models (the _extra variants were trained with extra data, the _q variants are quantized).
Installation is straightforward via pip: python3 -m pip install -U demucs. The command-line tool then offers various options, including the -n flag to choose which pretrained model to run, --two-stems to isolate a single source (for example, vocals) against the rest of the mix, and --mp3 to write MP3 files instead of WAV.
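Assuming the standard demucs command-line interface (flag names and model identifiers reflect the official package and may vary across versions), typical invocations look like:

```shell
# Install the package (model weights are downloaded on first separation)
python3 -m pip install -U demucs

# Separate a track into drums, bass, other, and vocals with the default model
demucs song.mp3

# Choose a specific pretrained model with -n
demucs -n htdemucs_ft song.mp3

# Karaoke-style split: vocals versus everything else, saved as MP3
demucs --two-stems=vocals --mp3 song.mp3
```

Separated stems are written to a per-model subdirectory of the output folder, one audio file per source.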