Vocos is a neural vocoder designed for fast, high-fidelity audio synthesis. Rather than modeling time-domain waveforms directly, it generates spectral coefficients in the Fourier domain and reconstructs the audio signal with a single forward pass through its network. The model is open-sourced under the MIT license and accepts both mel-spectrogram and neural audio codec token inputs, making it applicable across a range of speech and music generation tasks, as detailed in the arXiv publication and GitHub repository.
Model Architecture and Methodology
Vocos adopts a Fourier-based architecture that distinguishes it from time-domain generative models. Instead of producing audio samples directly, the model predicts Short-Time Fourier Transform (STFT) spectral coefficients and synthesizes the waveform with the fast inverse STFT. This design aligns the reconstruction process with auditory perception and avoids certain artifacts found in time-domain approaches.
A notable feature is the absence of transposed convolutions, which are traditionally used for upsampling but can introduce aliasing artifacts. Upsampling is instead performed by the computationally efficient inverse STFT. As a result, Vocos maintains a constant (isotropic) temporal resolution throughout its network, which simplifies phase estimation and streamlines the signal path.
The model's generator backbone is built from ConvNeXt blocks, which provide strong modeling capacity and stable training. The generator's output tensor is split into magnitude and phase components, with the phase estimated via a custom projection onto the unit circle to ensure consistent phase unwrapping. Adversarial training uses both multi-period and multi-resolution discriminators, following the source literature.
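The magnitude-phase split and single-shot inverse-STFT synthesis can be sketched as follows; the tensor shapes, the random stand-in activations, and the projection details are illustrative assumptions, not the published implementation:

```python
import numpy as np
from scipy.signal import istft

N_FFT, HOP, SR = 1024, 256, 24000
n_bins, n_frames = N_FFT // 2 + 1, 20

# Stand-ins for the generator's output heads (hypothetical shapes): one
# tensor for log-magnitude, two for the phase's unit-circle coordinates.
rng = np.random.default_rng(0)
log_mag = rng.standard_normal((n_bins, n_frames)) * 0.1
px, py = rng.standard_normal((2, n_bins, n_frames))

# Normalize (px, py) onto the unit circle and take the angle as the phase;
# atan2 yields a consistent angle regardless of the activations' scale.
norm = np.hypot(px, py) + 1e-9
phase = np.arctan2(py / norm, px / norm)

# Combine into complex STFT coefficients and invert in one shot: the
# inverse STFT performs the upsampling that transposed convolutions
# would otherwise do in a time-domain vocoder.
coeffs = np.exp(log_mag) * np.exp(1j * phase)
_, audio = istft(coeffs, fs=SR, nperseg=N_FFT, noverlap=N_FFT - HOP)
print(audio.shape)  # 1-D waveform
```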
Training Data and Optimization
For reconstructing audio from mel-spectrograms, Vocos is trained on the LibriTTS dataset, using both the train-clean and train-other subsets at a 24 kHz sampling rate. Input spectrograms are computed with a 1024-point FFT, a 256-sample hop, and 100 Mel bins. Random gain augmentation between -1 and -6 dBFS enhances robustness to input level variation.
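With the stated parameters, the feature extraction and gain augmentation can be sketched in plain NumPy/SciPy; the triangular mel filterbank construction and the normalization here are simplified assumptions, not the repository's implementation:

```python
import numpy as np
from scipy.signal import stft

SR, N_FFT, HOP, N_MELS = 24000, 1024, 256, 100

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters centered at points evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (ctr - lo)
        down = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def random_gain(x, rng, lo_db=-6.0, hi_db=-1.0):
    # Scale the waveform so its peak sits at a random level in [-6, -1] dBFS.
    target_db = rng.uniform(lo_db, hi_db)
    peak = np.max(np.abs(x)) + 1e-9
    return x * (10.0 ** (target_db / 20.0)) / peak

def log_mel(x):
    # 1024-point FFT with a 256-sample hop, then 100 mel bins in log scale.
    _, _, Z = stft(x, fs=SR, nperseg=N_FFT, noverlap=N_FFT - HOP)
    return np.log(np.clip(mel_filterbank() @ np.abs(Z), 1e-7, None))

rng = np.random.default_rng(0)
x = random_gain(rng.standard_normal(SR), rng)  # 1 s of noise as a stand-in
M = log_mel(x)
print(M.shape)  # (100, num_frames)
```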
When adapted for neural audio codec reconstruction from EnCodec tokens, Vocos is trained on clean speech segments from the DNS Challenge, supporting a range of bandwidth settings.
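As a rough illustration of where these bandwidth settings come from (assuming EnCodec's 24 kHz configuration of 75 token frames per second and 1024-entry codebooks, i.e. 10 bits per token):

```python
# EnCodec at 24 kHz emits 75 token frames per second; each quantizer
# (codebook) holds 1024 entries, i.e. 10 bits per token.
FRAME_RATE = 75          # token frames per second
BITS_PER_TOKEN = 10      # log2(1024)

def bitrate_kbps(n_quantizers: int) -> float:
    # Total bitrate scales linearly with the number of codebooks used.
    return FRAME_RATE * BITS_PER_TOKEN * n_quantizers / 1000.0

# Different bandwidth settings correspond to different codebook counts:
for n_q in (2, 4, 8, 16):
    print(n_q, bitrate_kbps(n_q))  # 2->1.5, 4->3.0, 8->6.0, 16->12.0 kbps
```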
The training regime spans up to 2 million iterations, employing AdamW optimization with cosine-decayed learning rates. Audio samples are randomly cropped, and feature matching and adversarial losses (using a hinge formulation) complement the mel-spectrogram reconstruction objective.
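A cosine-decayed schedule over the 2 million iterations can be sketched as below; the base learning rate is a placeholder, not the published training value:

```python
import math

def cosine_lr(step: int, total_steps: int = 2_000_000,
              base_lr: float = 2e-4, min_lr: float = 0.0) -> float:
    # Cosine decay from base_lr at step 0 down to min_lr at total_steps.
    # base_lr here is an illustrative placeholder value.
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0))          # 2e-4
print(cosine_lr(1_000_000))  # 1e-4 at the halfway point
print(cosine_lr(2_000_000))  # 0.0
```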
Performance and Benchmarking
Vocos performs competitively on standard objective and subjective evaluations, for both in-distribution and out-of-distribution data. Reconstructing speech from mel-spectrograms, it achieves a UTMOS mean opinion score of 3.734, close to BigVGAN's 3.749, alongside PESQ (3.70) and VISQOL (4.66) scores. The model also attains a high periodicity F1 score and high reconstruction fidelity for harmonics.
Subjective evaluations report a naturalness MOS of 3.62 ± 0.15 and a similarity MOS of 4.55 ± 0.15, both on par with established GAN-based vocoders. On the MUSDB18 music dataset, Vocos achieves high VISQOL scores across diverse audio content categories, demonstrating generalization beyond speech.
Compared to neural audio codecs, Vocos as a replacement vocoder shows improved perceptual quality when reconstructing from EnCodec tokens, particularly at higher bandwidths. For example, at 12 kbps, Vocos reports a UTMOS of 3.882 versus EnCodec's 3.765, and a subjective MOS of 4.00 ± 0.16 compared to 3.08 ± 0.19.
Computational efficiency is a key strength of the model. By leveraging the inverse STFT rather than transposed convolutions, Vocos synthesizes audio up to 13 times faster than HiFi-GAN and approximately 70 times faster than BigVGAN, reaching 6696.52 times real-time throughput on GPU and 169.63 times real-time on CPU. The model contains 13.5 million parameters in its mel-spectrogram variant and 7.9 million in its EnCodec variant, in line with other vocoders of its class. Preliminary trials with a Modified Discrete Cosine Transform (MDCT) variant yielded lower perceptual quality scores, suggesting the STFT representation is better suited to this generative setting.
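The "times real-time" figures translate directly into compute time per second of audio; a quick worked example:

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    # "N times real time" means N seconds of audio per second of compute.
    return audio_seconds / wall_seconds

# Inverting the reported GPU throughput gives compute time per audio second:
ms_per_audio_second = 1000.0 / 6696.52
print(round(ms_per_audio_second, 3))  # about 0.149 ms per second of audio
```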
Applications and Integration
Vocos serves primarily as a vocoder, synthesizing audio waveforms from acoustic features. In speech synthesis pipelines it reconstructs waveforms from mel-spectrograms, acting as the final stage of end-to-end speech and music generation systems. Its compatibility with EnCodec tokens enables direct reconstruction from quantized neural audio representations, providing a drop-in replacement vocoder for models such as Bark that operate on discrete audio tokens.
Through its open-source release, Vocos supports a range of research and production applications in text-to-speech, neural audio coding, music information retrieval, and other fields where waveform generation efficiency and fidelity are relevant.
Variants, Limitations, and Release History
The primary variants of Vocos are a mel-spectrogram-trained model and an EnCodec-trained model, each optimized for a different feature domain. Both have a moderate parameter footprint, balancing model capacity against efficiency. The mel-spectrogram variant is trained on extensive speech data, while the EnCodec variant decodes neural audio codec tokens at multiple bitrates.
In preliminary experiments, the MDCT-based variant underperformed the STFT-based design in waveform synthesis quality.
Vocos was initially released as version 0.1.0 in October 2023. Its research paper was first made available on arXiv in June 2023 and subsequently published at ICLR 2024.
External Resources