The simplest way to self-host Tortoise TTS. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Tortoise TTS is a multi-component text-to-speech system combining autoregressive decoding, contrastive ranking, and diffusion models to generate natural speech. Trained on 50,000 hours of audio data, it can clone voices from reference clips and supports emotional tone control through prompts.
Tortoise TTS is a multi-voice text-to-speech system that prioritizes high-quality audio generation with realistic prosody and intonation. Its architecture adapts techniques from image generation, combining an autoregressive transformer with a diffusion model for speech synthesis. As detailed in the architectural design document, the system consists of five distinct models working in concert.
The core architecture comprises four main components: an autoregressive transformer that converts text into speech tokens, a contrastive model (CLVP) that ranks candidate outputs against the input text, a diffusion decoder that turns the selected tokens into a mel spectrogram, and a vocoder that renders the final waveform.
A key innovation, dubbed the "Tortoise Trick," involves fine-tuning the DDPM on the autoregressive latent space to enhance both efficiency and output quality, as described in the research paper.
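The generate-then-rerank flow described above can be sketched in a few lines. The stubbed functions below are hypothetical stand-ins for Tortoise's actual neural networks (which are large pretrained models); only the control flow — sample several autoregressive candidates, keep the one the contrastive model scores highest, then decode — reflects the system described here.

```python
import random

def autoregressive_sample(text, seed):
    """Produce one candidate sequence of speech tokens (stubbed stand-in)."""
    rng = random.Random(seed)
    # The 16-token length and 8192-entry codebook are illustrative, not the real values.
    return [rng.randrange(8192) for _ in range(16)]

def clvp_score(text, tokens):
    """Contrastive text/speech alignment score (stubbed stand-in)."""
    return (sum(tokens) % 100) / 100.0

def synthesize(text, num_candidates=4):
    # 1. Sample several candidate token sequences from the autoregressive model.
    candidates = [autoregressive_sample(text, seed) for seed in range(num_candidates)]
    # 2. Rank the candidates with the contrastive (CLVP) model and keep the best.
    best = max(candidates, key=lambda toks: clvp_score(text, toks))
    # 3. In the real system, the diffusion decoder turns the winning tokens into a
    #    mel spectrogram and a vocoder renders audio; here we just return tokens.
    return best

tokens = synthesize("Hello world")
```

The "Tortoise Trick" fits into step 3: because the diffusion model is fine-tuned on the autoregressive latents, the decoder starts from a richer representation than raw tokens alone.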
Tortoise TTS was trained on an extensive dataset of approximately 50,000 hours of speech data.
The model's generation speed was initially slow, taking about 2 minutes per medium-sized sentence on a K80 GPU. However, significant performance improvements have been achieved, reaching a real-time factor (RTF) of 0.25-0.3 on 4GB VRAM and sub-500ms latency with streaming capabilities.
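The real-time factor cited above is simply generation time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time synthesis. A minimal helper makes the reported numbers concrete:

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / audio duration; RTF < 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds

# e.g. 3 s of compute for a 10 s clip gives the reported ~0.3 RTF
rtf = real_time_factor(3.0, 10.0)
```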
Version 2.1, released on May 2, 2022, introduced several new features.
The model offers multiple usage methods:
- do_tts.py for single-phrase generation
- read.py and read_fast.py for processing larger text files
- socket_server.py for socket streaming

Voice customization is achieved through reference audio clips, and the model includes prompt engineering capabilities for controlling emotional tone. Users can manipulate the voice latent space, enabling techniques like voice averaging.
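Voice averaging amounts to blending the conditioning latents computed from different speakers' reference clips. The sketch below illustrates the idea with plain Python lists; Tortoise's real latents are tensors produced by its conditioning encoder, so this is an assumption-laden simplification of the technique, not the library's API.

```python
def average_voice_latents(latents):
    """Element-wise mean of equal-length voice conditioning vectors.

    `latents` is a list of vectors (plain lists of floats here); averaging
    them yields a latent for a voice "between" the source speakers.
    """
    if not latents:
        raise ValueError("need at least one latent")
    length = len(latents[0])
    if any(len(v) != length for v in latents):
        raise ValueError("all latents must share the same dimension")
    return [sum(vals) / len(latents) for vals in zip(*latents)]

# Blending two hypothetical 3-dimensional voice latents:
blended = average_voice_latents([[0.0, 2.0, 4.0], [2.0, 0.0, 0.0]])
```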
The model's ability to generate highly realistic voices has raised ethical concerns regarding potential misuse. To address this, the developers provide a classifier model, tortoise-detect, specifically designed to identify Tortoise-generated audio. The project encourages community feedback and collaboration to address these ethical considerations.
The model is licensed under the Apache 2.0 license, making it freely available for both research and commercial applications.