Whisper is OpenAI's multilingual speech recognition model that converts audio to text across multiple languages. Built on a transformer architecture and trained on 680,000 hours of data, it offers variants from 39M to 1.55B parameters. Notable for strong zero-shot performance and direct speech-to-English translation capabilities.
Whisper is a general-purpose speech recognition model introduced by OpenAI that employs a Transformer-based encoder-decoder (sequence-to-sequence) architecture. The model was trained on 680,000 hours of multilingual and multitask supervised data collected from the web, enabling robust performance across diverse audio conditions without requiring fine-tuning. Released in September 2022, Whisper represents a significant advancement in automatic speech recognition (ASR) technology, as detailed in the original research paper.
The model processes audio in 30-second chunks, converting them into log-Mel spectrograms before feeding them through the encoder. The decoder then predicts the corresponding text, utilizing special tokens to manage various tasks including language identification, timestamping, multilingual transcription, and translation to English. This multitask approach effectively replaces multiple stages of traditional speech processing pipelines with a single unified model.
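As a concrete illustration of this flow, here is a minimal sketch using the reference openai-whisper Python package; the checkpoint size and the audio.mp3 filename are arbitrary placeholders, and the package (plus ffmpeg) is assumed to be installed.

```python
import whisper

# Load a multilingual checkpoint (the size is an illustrative choice).
model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window Whisper expects.
audio = whisper.load_audio("audio.mp3")  # placeholder filename
audio = whisper.pad_or_trim(audio)

# Convert the waveform to a log-Mel spectrogram on the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder's special tokens handle language identification and task selection.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode in the source language; task="translate" would request English output instead.
options = whisper.DecodingOptions(task="transcribe")
result = whisper.decode(model, mel, options)
print(result.text)
```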
The Whisper family includes several model sizes, each offering a different trade-off between speed and accuracy: tiny (39M parameters), base (74M), small (244M), medium (769M), and large (1.55B).
The four smaller sizes (tiny through medium) are available in both English-only and multilingual versions, while the large models are multilingual only. The English-only variants generally perform better, particularly at the smaller sizes.
Whisper supports multiple languages, including English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, and many others. The model can perform both transcription (output in the same language as input) and translation (output in a different language) tasks. Performance across languages correlates strongly with the amount of training data available for each language, with approximately one-third of the training data being non-English.
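A short sketch of the two tasks with the same openai-whisper package, assuming a French recording saved as french_audio.mp3 (a placeholder filename); setting task="translate" asks the decoder for English output rather than a same-language transcript.

```python
import whisper

model = whisper.load_model("medium")  # illustrative size choice

# Transcription: output stays in the source language (here forced to French).
transcript = model.transcribe("french_audio.mp3", language="fr", task="transcribe")
print(transcript["text"])

# Translation: same audio, but the decoder's task token requests English output.
translation = model.transcribe("french_audio.mp3", language="fr", task="translate")
print(translation["text"])
```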
For implementation, Whisper uses a WhisperProcessor for audio pre- and post-processing, while the WhisperFeatureExtractor handles the extraction of mel filter bank features from raw audio. The model also supports several optimizations, including PyTorch Scaled Dot Product Attention (SDPA), Flash Attention 2, and torch.compile, for improved inference speed.
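The snippet below is a rough sketch of that pipeline using the Hugging Face Transformers classes; the openai/whisper-small checkpoint, the dummy LibriSpeech sample, and the choice of SDPA are illustrative assumptions rather than requirements.

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# WhisperProcessor wraps the feature extractor (log-Mel features) and the tokenizer.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# attn_implementation="sdpa" opts into PyTorch Scaled Dot Product Attention;
# "flash_attention_2" would additionally require the flash-attn package and a supported GPU.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small", attn_implementation="sdpa"
).to(device)

# Tiny LibriSpeech sample (16 kHz audio) commonly used in the Transformers docs.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# The feature extractor converts the raw waveform into log-Mel input features.
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
predicted_ids = model.generate(inputs.input_features.to(device), language="en", task="transcribe")

# The tokenizer side of the processor handles post-processing back to text.
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```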
In terms of performance, Whisper's large model achieves roughly a 3% Word Error Rate (WER) on the LibriSpeech test-clean benchmark. While it may not surpass models specifically optimized for individual benchmarks, its zero-shot performance across diverse datasets is markedly more robust, making roughly 50% fewer errors than supervised models trained on LibriSpeech alone. It particularly excels at speech-to-text translation, outperforming the supervised state of the art on CoVoST2 X-to-English translation in a zero-shot setting.
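For reference, WER is the word-level edit distance between a hypothesis and its reference, divided by the reference length. A toy computation with the jiwer library (placeholder strings, not benchmark data) looks like this:

```python
from jiwer import wer

# Placeholder strings; in practice these come from a benchmark's ground truth
# and the model's transcriptions.
reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # 11.11% here: 1 substitution over 9 reference words
```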
The model does have limitations, including a tendency to hallucinate (generate text not actually present in the audio) and uneven performance across languages and accents. Its creators caution against transcribing recordings made without the speakers' consent and against using it for subjective classification, especially in high-risk domains.