Tortoise TTS is an open-source text-to-speech (TTS) system designed to generate realistic, expressive, and context-sensitive speech from text. Developed primarily in Python, Tortoise supports multi-voice generation, natural prosody, and the cloning or synthesis of a wide range of vocal characteristics. The model combines autoregressive and diffusion-based generation architectures, drawing on generative modeling principles applied to the speech domain. Its development, methodology, and performance are documented in the paper "Better speech synthesis through scaling" and in its GitHub repository.
Model Architecture and Technical Features
Tortoise TTS is constructed from five neural networks, the most prominent of which are an autoregressive decoder and a diffusion decoder. The system's core parallels architectures used in generative image modeling, adapted for speech generation.
The autoregressive prior employs a GPT-2-inspired transformer architecture, conditioned on both text prompts and “speech conditioning inputs,” which are typically one or more audio clips of a target speaker. These reference clips are converted to mel spectrograms, encoded, and averaged to capture speaker-specific characteristics such as pitch, tone, cadence, and environmental qualities. For final waveform synthesis, Tortoise uses a neural vocoder based on a UnivNet implementation to convert mel spectrograms into audio.
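A minimal sketch of this conditioning step is shown below, assuming a hypothetical `mel_encoder` module; it is illustrative only and not the exact Tortoise implementation.

```python
# Illustrative sketch: derive a speaker-conditioning latent from reference clips.
# `mel_encoder` is a hypothetical module, not part of the actual Tortoise API.
import torch
import torchaudio

def compute_conditioning_latent(clip_paths, mel_encoder, sample_rate=22050):
    """Encode reference clips as mel spectrograms and average the embeddings."""
    to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)
    latents = []
    for path in clip_paths:
        audio, sr = torchaudio.load(path)
        if sr != sample_rate:
            audio = torchaudio.functional.resample(audio, sr, sample_rate)
        spec = to_mel(audio)                # (channels, n_mels, frames)
        latents.append(mel_encoder(spec))   # one speaker embedding per clip
    # Averaging pools pitch, tone, cadence, and recording-condition cues
    # across all of the supplied references.
    return torch.stack(latents).mean(dim=0)
```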
Tortoise introduces a “Tortoise Trick,” in which the diffusion decoder is first trained to map discrete speech codes to mel spectrograms and then fine-tuned on the latent space of the autoregressive model. Because these autoregressive latents are more semantically rich than discrete codes, the fine-tuning improves both the efficiency and the quality of the downstream decoding.
A contrastive language-voice pretrained transformer (CLVP), serving a function analogous to CLIP for images, ranks candidate outputs from the autoregressive model. This improves selection quality without running the computationally expensive diffusion decoder on every candidate.
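A hedged sketch of this re-ranking step follows, with a placeholder `clvp_model.score` interface standing in for the real contrastive model.

```python
# Sketch of CLVP-style re-ranking: score each autoregressive candidate against
# the input text and keep only the best for the diffusion stage.
# `clvp_model` and its `score` method are placeholders, not Tortoise's exact API.
import torch

def rerank_candidates(text_tokens, candidate_codes, clvp_model, keep=1):
    """Return the top-`keep` candidates by contrastive text-speech similarity."""
    scores = torch.stack([clvp_model.score(text_tokens, codes)
                          for codes in candidate_codes])
    best = torch.topk(scores, k=keep).indices
    return [candidate_codes[i] for i in best]
```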
Capabilities, Conditioning, and Unique Features
Tortoise supports multi-voice synthesis, enabling the generation of speech in arbitrary or cloned voices. By ingesting short reference audio segments, the system can mimic the target’s vocal traits. The model also supports the generation of synthetic voices through random sampling in its voice-conditioning latent space.
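A usage sketch based on the project's published Python interface is shown below; exact module paths, function names, and defaults may differ between Tortoise versions.

```python
# Voice cloning and random-voice generation, per the project's documented API
# (treat names and behavior as version-dependent).
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# Clone a voice from bundled or user-supplied reference clips.
voice_samples, conditioning_latents = load_voice("tom")
speech = tts.tts_with_preset(
    "Thanks for reading this article.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("cloned.wav", speech.squeeze(0).cpu(), 24000)

# Supplying no reference audio samples a new synthetic voice from the
# voice-conditioning latent space.
random_speech = tts.tts_with_preset("A randomly sampled voice.", preset="fast")
torchaudio.save("random.wav", random_speech.squeeze(0).cpu(), 24000)
```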
The model enables “prompt engineering”: by enclosing editorial directions in brackets within the input text (such as “[I am really sad,] Please feed me”), Tortoise interprets the bracketed context to alter emotion and tone without speaking those words. It can blend reference clips to create composite or “average” voices, and accepts conditioning latents directly as serialized files.
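A sketch of prompt engineering and voice blending under the same assumed interface; bracket handling and the behavior of `load_voices` should be verified against the installed version.

```python
# Prompt engineering and composite voices (illustrative; version-dependent API).
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()

# Reference clips from two speakers are blended into a composite voice.
voice_samples, conditioning_latents = load_voices(["pat", "william"])

# The bracketed clause steers emotion and tone but is not spoken aloud.
speech = tts.tts_with_preset(
    "[I am really sad,] Please feed me.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="standard",
)
torchaudio.save("sad_composite.wav", speech.squeeze(0).cpu(), 24000)
```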
Both the autoregressive and diffusion decoders are conditioned on reference audio, enabling the transfer of vocal qualities ranging from emotional inflection to environmental recording conditions. Candidate outputs are evaluated and re-ranked by the CLVP model, so only those that best match the input text proceed to final decoding.
Training Data and Methodology
Tortoise was trained on a private “homelab” cluster using eight RTX 3090 GPUs for almost a year. Its primary dataset comprised approximately 50,000 hours of filtered speech, largely drawn from audiobooks and podcasts. Source material was automatically transcribed with a fine-tuned wav2vec2-large model to generate text-audio pairs. Initial experiments also utilized established datasets such as LibriTTS and HiFiTTS, totaling an additional 896 hours.
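An illustrative transcription pipeline of the kind described is sketched below, using a public wav2vec2-large checkpoint from Hugging Face rather than the author's fine-tuned model.

```python
# Sketch of automatic transcription to build text-audio training pairs.
# Uses a public checkpoint; the author's actual fine-tuned model differs.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

def transcribe(path):
    """Transcribe one clip to text so it can be paired with its audio."""
    audio, sr = torchaudio.load(path)
    audio = torchaudio.functional.resample(audio, sr, 16000).mean(dim=0)
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```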
The training regime separated the learning process across the autoregressive decoder, the CLVP re-ranking model, and the diffusion decoder. Text was tokenized with a custom byte-pair encoding and speech was represented as discrete mel tokens; each component used a sampling strategy suited to it, nucleus sampling for the autoregressive decoder and DDIM for the diffusion decoder, to balance diversity and coherence in the output. At inference time the trained components, together with the vocoder, are composed into a complete text-to-waveform pipeline.
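As an illustration, here is a minimal nucleus (top-p) sampling routine of the kind used for autoregressive decoding; the threshold is arbitrary rather than Tortoise's actual setting.

```python
# Minimal nucleus (top-p) sampling over a 1-D logits vector.
import torch

def nucleus_sample(logits, top_p=0.8):
    """Sample one token from the smallest set whose cumulative mass exceeds top_p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds top_p,
    # which always keeps at least the most probable token.
    mask = cumulative - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]
```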
Performance and Benchmarking
Early versions of Tortoise were notably slow at inference, a limitation reflected in the project's name. Following successive optimizations, the model achieves a real-time factor of approximately 0.25–0.3 and streaming latency under 500 milliseconds on certain hardware configurations.
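For context, the real-time factor (RTF) is synthesis time divided by the duration of the generated audio, so an RTF of 0.25 means one second of speech is produced in roughly a quarter of a second. A minimal measurement sketch, with a hypothetical `synthesize` callable:

```python
# Measure RTF for an arbitrary synthesis function (hypothetical interface).
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    start = time.perf_counter()
    audio = synthesize(text)            # assumed to return a 1-D array of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return elapsed / duration           # < 1.0 means faster than real time
```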
In terms of audio quality, Tortoise prioritizes natural intonation and prosody. Intelligibility and realism are shaped by its architectural design, the scale of its training resources, and its multimodal conditioning. Whether a clip was generated by the system can be checked with a bundled classifier (tortoise-detect), which estimates the likelihood that an audio sample was synthesized by Tortoise.
Limitations and Considerations
Tortoise is subject to several constraints. The model’s dependence on GPUs and its history of compute-intensive inference may affect accessibility. Furthermore, the training data was not curated for demographic balance, potentially leading to underrepresentation of minority voices or strong regional accents. The developer notes that larger-scale hardware and datasets could enhance model capability.
Additionally, several architectural and empirical issues persist. For instance, the autoregressive model’s use of fixed positional encodings limits the maximum speech length; the diffusion decoder may benefit from new feedforward components; and sample rate discrepancies across model components present consistency challenges. Ethical concerns relating to voice cloning and synthetic voice misuse are also recognized by the development team.
Applications
Tortoise TTS is intended for a broad spectrum of use cases, including audiobook narration, poetry reading, personalized TTS for accessibility, content creation, and research into voice cloning. Its prompt engineering and voice composition features enable control over the tone, style, and identity of generated speech. The model also supports experimentation in random or composite voice generation for synthetic vocal identities.
Release and Licensing
The current lineage of Tortoise TTS was first released publicly in May 2022 as version 2.1, which introduced architectural improvements, random voice generation, and support for user-provided conditioning latents. The system is released under the Apache-2.0 license.