Vocos is a groundbreaking family of neural vocoders developed by Gemelo AI, first released in 2023. The model family represents a significant advance in audio synthesis, introducing a novel approach to generating audio waveforms from acoustic features. Unlike traditional GAN-based vocoders that operate directly in the time domain, Vocos generates Fourier spectral coefficients, which are converted back into a waveform through the inverse short-time Fourier transform, as detailed in the original research paper.
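To make that reconstruction step concrete, the sketch below shows how a complex STFT maps back to a waveform with a single inverse transform. The random signal is purely a placeholder, and the parameters mirror those quoted later in this article:

```python
import torch

# Placeholder signal: 1 second at 24 kHz. n_fft=1024 yields 513 frequency
# bins; hop_length=256 matches the parameters cited later in this article.
n_fft, hop_length = 1024, 256
wave = torch.randn(24000)
window = torch.hann_window(n_fft)

# A complex time-frequency representation of the signal...
spec = torch.stft(wave, n_fft, hop_length=hop_length, window=window,
                  return_complex=True)

# ...is recovered with a single inverse STFT. Vocos predicts coefficients
# shaped like `spec` with its network instead of computing them from audio.
recon = torch.istft(spec, n_fft, hop_length=hop_length, window=window,
                    length=wave.numel())
print(torch.allclose(wave, recon, atol=1e-5))  # True
```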
The Vocos family's architectural design marks a departure from conventional approaches to neural audio synthesis. At its core, the model leverages the inductive bias of time-frequency representations, which align more closely with human auditory perception while benefiting from the computational efficiency of fast Fourier transform algorithms. This approach yields gains in both synthesis quality and inference speed.
A distinguishing feature of the Vocos architecture is its use of ConvNeXt blocks in place of the dilated convolutions common in earlier vocoders, maintaining a consistent temporal resolution across all layers rather than progressively upsampling. The model family also introduces an activation function defined in terms of the unit circle, which handles phase wrapping implicitly and enables accurate phase angle estimation. This innovation has proven crucial for high-quality audio synthesis.
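A minimal sketch of such an output head is shown below. It is illustrative rather than the project's exact implementation (the layer names and dimensions are assumptions), but it captures the key idea: expressing phase through cosine and sine places every prediction on the unit circle, so values that differ by 2π collapse to the same point and wrapping is handled implicitly.

```python
import torch
import torch.nn as nn

class ISTFTHeadSketch(nn.Module):
    """Illustrative head: project features to magnitude and phase, invert the STFT."""

    def __init__(self, dim: int, n_fft: int = 1024, hop_length: int = 256):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        n_bins = n_fft // 2 + 1
        self.proj = nn.Linear(dim, 2 * n_bins)  # magnitude + phase per frequency bin
        self.register_buffer("window", torch.hann_window(n_fft))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) hidden features from the ConvNeXt backbone
        mag, phase = self.proj(x).transpose(1, 2).chunk(2, dim=1)
        mag = torch.exp(mag).clamp(max=1e2)  # predicted log-magnitude -> magnitude
        # cos/sin put the phase on the unit circle; wrapping is implicit.
        spec = mag * (torch.cos(phase) + 1j * torch.sin(phase))
        return torch.istft(spec, self.n_fft, hop_length=self.hop_length,
                           window=self.window)
```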
The training architecture incorporates both a multi-period discriminator (MPD) and a multi-resolution discriminator (MRD), implementing a hinge loss formulation for the adversarial loss. This dual-discriminator approach contributes to the model's ability to generate highly realistic audio outputs while maintaining computational efficiency.
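As a sketch, assuming the standard hinge-GAN objectives (in practice these per-branch losses would be summed over every sub-discriminator of the MPD and MRD):

```python
import torch

def discriminator_hinge_loss(real_logits: torch.Tensor,
                             fake_logits: torch.Tensor) -> torch.Tensor:
    # Push scores for real audio above +1 and for generated audio below -1.
    return (torch.clamp(1 - real_logits, min=0).mean()
            + torch.clamp(1 + fake_logits, min=0).mean())

def generator_hinge_loss(fake_logits: torch.Tensor) -> torch.Tensor:
    # The generator is penalized until the discriminator scores its output above +1.
    return torch.clamp(1 - fake_logits, min=0).mean()
```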
The Vocos family currently consists of two primary variants, each optimized for specific use cases while sharing the core architectural innovations:
Vocos-mel-24khz is the standard variant, trained on the LibriTTS dataset for 1 million iterations. This model contains 13.5 million parameters and is optimized for general-purpose audio synthesis, converting mel spectrograms into high-quality audio waveforms.
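Per the project README, converting a mel spectrogram to audio with this variant takes only a few lines; the random tensor below is a stand-in for real features:

```python
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)  # (batch, 100 mel bins, frames) -- placeholder
audio = vocos.decode(mel)       # waveform tensor at 24 kHz
```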
Vocos-encodec-24khz is a specialized variant trained on the DNS Challenge dataset for 2 million iterations. Containing 7.9 million parameters, this model is designed to reconstruct audio from EnCodec tokens, making it particularly suitable for audio compression and reconstruction applications.
The development of the Vocos family has been characterized by rigorous training methodology and careful optimization. The primary training process used the LibriTTS dataset, incorporating both the 'train-clean' and 'train-other' subsets at a 24 kHz sampling rate. The training protocol involved generating mel-scaled spectrograms with the parameters n_fft = 1024, hop_length = 256, and 100 mel bins.
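A sketch of feature extraction with those parameters, assuming torchaudio's MelSpectrogram transform and a log-compression step similar in spirit to the training code (the power setting and clamp epsilon here are illustrative choices, not the published configuration):

```python
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000,
    n_fft=1024,
    hop_length=256,
    n_mels=100,
    power=1.0,  # magnitude spectrogram; an assumption for this sketch
)

waveform = torch.randn(1, 24000)  # placeholder 1 s clip at 24 kHz
log_mel = torch.log(torch.clamp(mel_transform(waveform), min=1e-5))
print(log_mel.shape)  # (1, 100, frames)
```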
The training process ran for 2 million iterations, with updates split evenly between the generator and discriminator components. The models use the AdamW optimizer with a cosine learning rate schedule, which has proven effective for stable convergence.
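In code, that optimizer setup might look like the following sketch. The modules are stand-ins, and the learning rate, betas, and step count are illustrative placeholders rather than the published configuration:

```python
import torch
import torch.nn as nn

generator = nn.Linear(100, 1026)    # stand-in for the Vocos generator
discriminator = nn.Linear(1024, 1)  # stand-in for the MPD/MRD stack

opt_g = torch.optim.AdamW(generator.parameters(), lr=2e-4, betas=(0.8, 0.9))
opt_d = torch.optim.AdamW(discriminator.parameters(), lr=2e-4, betas=(0.8, 0.9))

# Cosine decay of the learning rate over the course of training.
sched_g = torch.optim.lr_scheduler.CosineAnnealingLR(opt_g, T_max=1_000_000)
sched_d = torch.optim.lr_scheduler.CosineAnnealingLR(opt_d, T_max=1_000_000)
```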
The Vocos family has demonstrated exceptional performance across various metrics and use cases. In benchmark testing, the models achieve state-of-the-art audio quality comparable to BigVGAN, with superior results in VISQOL and PESQ metrics. A particularly notable achievement is the models' generalization capabilities, especially when handling out-of-distribution audio from the MUSDB18 dataset, where they consistently outperform competitors including HiFi-GAN, iSTFTNet, and BigVGAN.
Perhaps the most significant advantage of the Vocos family is its computational efficiency. The models operate approximately 13 times faster than HiFi-GAN and 70 times faster than BigVGAN, with this speed advantage becoming even more pronounced in scenarios without GPU acceleration. This exceptional performance makes the Vocos family particularly suitable for real-time applications and resource-constrained environments.
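A rough way to check this on your own hardware is to time synthesis against the duration of the audio produced. This sketch assumes the vocos package is installed and uses a random mel tensor as input:

```python
import time
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz").eval()
mel = torch.randn(1, 100, 1000)  # ~10.7 s of audio at hop 256 / 24 kHz

with torch.inference_mode():
    vocos.decode(mel)  # warm-up run
    start = time.perf_counter()
    audio = vocos.decode(mel)
    elapsed = time.perf_counter() - start

print(f"synthesis speed: {audio.shape[-1] / 24000 / elapsed:.1f}x real time")
```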
The Vocos model family has demonstrated remarkable versatility across various audio processing applications. One notable application is its use as a neural audio codec, where it has shown superior performance compared to EnCodec across various bandwidths. The models have also proven effective as drop-in replacement vocoders for the Bark text-to-speech model, demonstrating their adaptability to different audio synthesis pipelines.
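Per the project README, reconstructing audio from EnCodec tokens follows the same pattern as the mel variant, with a bandwidth_id selecting the target bitrate; the tokens below are random placeholders:

```python
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

codes = torch.randint(0, 1024, (8, 200))  # 8 EnCodec codebooks x 200 frames
features = vocos.codes_to_features(codes)

bandwidth_id = torch.tensor([2])  # index into the supported bitrates (2 -> 6 kbps)
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
```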
The models' accessibility is enhanced through straightforward installation via pip, with separate options for basic inference (pip install vocos) and training (pip install vocos[train]). This flexibility in deployment has contributed to the models' widespread adoption in both research and practical applications.
Since its release, the Vocos family has made a significant impact on the field of audio synthesis and processing. The models' combination of high-quality output, computational efficiency, and versatility has set a new standard for neural vocoders. The open availability of the models under the MIT license and their distribution through the Hugging Face Hub have further contributed to their adoption and influence.
The success of the initial Vocos variants suggests potential for future development and expansion of the model family. Areas for possible advancement include adaptation to different sampling rates, optimization for specific audio domains, and further improvements in computational efficiency.
For those interested in exploring the Vocos family further, comprehensive documentation and resources are available through several channels. The original research paper provides detailed technical information about the model architecture and methodology. Audio samples demonstrating the models' capabilities can be found on the official demo page, while implementation details and source code are available in the GitHub repository. Pre-trained models can be accessed through the Hugging Face model hub.