Vocos is a neural vocoder that generates Fourier spectral coefficients directly, rather than synthesizing waveforms in the time domain. It matches state-of-the-art vocoders in quality while running roughly 13x faster than HiFi-GAN, using ConvNeXt blocks and a novel unit-circle activation function. It is available in 13.5M- and 7.9M-parameter variants for mel-spectrogram and EnCodec-token inputs.
Vocos is a fast neural vocoder that synthesizes audio waveforms from acoustic features. Unlike traditional GAN-based vocoders that operate in the time domain, Vocos directly generates Fourier spectral coefficients, which are then converted to a waveform via the inverse Fourier transform, as detailed in the original research paper. This architectural choice leverages the inductive bias of time-frequency representations, aligning more closely with human auditory perception while benefiting from computationally efficient fast Fourier transform algorithms.
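To make the idea concrete, the sketch below shows how a waveform can be recovered once a network has predicted STFT magnitudes and phases, using PyTorch's `torch.istft`. The shapes and random tensors here are placeholders for illustration, not Vocos internals:

```python
import torch

# Hypothetical model outputs: STFT magnitude and phase for one utterance,
# shaped (batch, n_fft // 2 + 1, frames) as torch.stft/istft expect.
n_fft, hop_length, frames = 1024, 256, 200
magnitude = torch.rand(1, n_fft // 2 + 1, frames)
phase = (torch.rand(1, n_fft // 2 + 1, frames) * 2 - 1) * torch.pi

# Combine into complex spectral coefficients: magnitude * exp(i * phase).
spec = torch.polar(magnitude, phase)

# Reconstruct the time-domain waveform with the inverse STFT.
waveform = torch.istft(
    spec, n_fft=n_fft, hop_length=hop_length,
    window=torch.hann_window(n_fft),
)
print(waveform.shape)  # (1, samples)
```

Because the inverse transform is a fixed, cheap operation, all of the learning capacity goes into predicting the spectral coefficients themselves.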
The model's architecture employs ConvNeXt blocks instead of dilated convolutions, maintaining the same temporal resolution across all layers. A key innovation is a novel activation function defined on the unit circle, which implicitly handles phase wrapping for accurate phase-angle estimation. During training, the model uses both a multi-period discriminator (MPD) and a multi-resolution discriminator (MRD), with a hinge formulation of the adversarial loss.
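The following is an illustrative sketch of the unit-circle idea, not the paper's exact parameterization: the network emits two unconstrained values per time-frequency bin, and `atan2` maps that point in the plane to an angle, so phase wrapping never has to be handled explicitly. The projection heads and dimensions are invented for the example:

```python
import torch

def unit_circle_phase(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Map two unconstrained outputs to a phase angle in [-pi, pi].

    atan2 depends only on the direction of the point (x, y), so any
    real-valued outputs yield a valid, implicitly wrapped phase.
    """
    return torch.atan2(y, x)

# Invented projection heads over hidden features: (batch, channels, frames).
hidden = torch.randn(1, 512, 200)
proj_x = torch.nn.Conv1d(512, 513, kernel_size=1)  # 513 = 1024 // 2 + 1 bins
proj_y = torch.nn.Conv1d(512, 513, kernel_size=1)
phase = unit_circle_phase(proj_x(hidden), proj_y(hidden))
print(phase.min().item(), phase.max().item())  # stays within [-pi, pi]
```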
Vocos was primarily trained on the LibriTTS dataset, using both the 'train-clean' and 'train-other' subsets at a 24 kHz sampling rate. Input features were mel-scaled spectrograms (n_fft = 1024, hop length = 256, 100 mel bins), and training ran for 2 million iterations, split equally between generator and discriminator steps, using the AdamW optimizer with a cosine learning rate schedule.
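For reference, features matching that configuration can be computed with torchaudio. This is a sketch: the normalization details (magnitude power, log compression, clamping) are assumptions, not the published recipe:

```python
import torch
import torchaudio

# Mel-spectrogram settings matching the reported training configuration.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=24_000,
    n_fft=1024,
    hop_length=256,
    n_mels=100,
    power=1.0,  # magnitude spectrogram; an assumption, not stated above
)

waveform = torch.randn(1, 24_000)  # one second of placeholder audio
mel = torch.log(mel_transform(waveform).clamp(min=1e-7))  # assumed log compression
print(mel.shape)  # (1, 100, frames)
```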
Performance benchmarks demonstrate that Vocos achieves state-of-the-art audio quality comparable to BigVGAN, with superior results in VISQOL and PESQ metrics. The model shows exceptional generalization capabilities, particularly with out-of-distribution audio from the MUSDB18 dataset, where it outperforms competitors like HiFi-GAN, iSTFTNet, and BigVGAN. Perhaps most notably, Vocos demonstrates remarkable speed improvements, operating approximately 13 times faster than HiFi-GAN and 70 times faster than BigVGAN, with the speed advantage being particularly pronounced in scenarios without GPU acceleration.
Two pre-trained variants of Vocos are available:
- `charactr/vocos-mel-24khz`: trained on LibriTTS for 1 million iterations; 13.5 million parameters
- `charactr/vocos-encodec-24khz`: trained on the DNS Challenge dataset for 2 million iterations; 7.9 million parameters; reconstructs audio from EnCodec tokens

The model can be easily installed via pip, with two installation options:
- `pip install vocos` (inference only)
- `pip install vocos[train]` (includes training dependencies)
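Once installed, loading a pretrained variant and vocoding mel features follows the pattern from the project's documentation; the random mel tensor below is a stand-in for real acoustic features:

```python
import torch
from vocos import Vocos

# Load the mel-spectrogram variant from the Hugging Face Hub.
vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

# Placeholder batch of 100-bin mel features: (batch, n_mels, frames).
mel = torch.randn(1, 100, 256)

# Vocode to a 24 kHz waveform of shape (batch, samples).
audio = vocos.decode(mel)
```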
Vocos has demonstrated versatility in various applications, including serving as a neural audio codec (outperforming EnCodec across various bandwidths) and functioning as a drop-in replacement vocoder for the Bark text-to-speech model. The model is released under the MIT license, and pretrained weights are hosted on Hugging Face and can be loaded via the library's `from_pretrained` method.
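As an example of the codec use case, the EnCodec variant reconstructs a waveform from discrete tokens. The snippet below follows the interface shown in the project's documentation; the random tokens are placeholders:

```python
import torch
from vocos import Vocos

# Load the EnCodec-token variant.
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

# Placeholder tokens: 8 codebooks x 200 frames, codebook size 1024.
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))
features = vocos.codes_to_features(audio_tokens)

# bandwidth_id selects the EnCodec bitrate (here, index 2 = 6 kbps).
bandwidth_id = torch.tensor([2])
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```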