Model Report
openai / Whisper
Whisper is an open-source automatic speech recognition model developed by OpenAI, built on a Transformer encoder-decoder architecture. Trained on 680,000 hours of multilingual audio data, it performs transcription, translation to English, and language identification across 98 languages. The model demonstrates robustness to accents and background noise, with multiple size variants available under MIT licensing.
Whisper is an open-source, pre-trained automatic speech recognition (ASR) and speech translation model developed by OpenAI, designed to handle a broad range of tasks in many languages and acoustic conditions without the need for extensive fine-tuning. Leveraging a large dataset of supervised and weakly supervised audio, Whisper can perform multilingual transcription, translation, and language identification. The model exhibits robustness to accents, background noise, and technical vocabulary, and supports both short-form and long-form audio transcription by employing chunking strategies. Whisper is used in research and practical speech applications due to its flexibility and generalization across diverse data domains, as documented in the original technical report.
A detailed diagram of the Whisper model, illustrating its sequence-to-sequence Transformer architecture and multitask training format for transcription, translation, and language identification.
Whisper is built on a Transformer-based encoder-decoder architecture, also known as a sequence-to-sequence model. Audio inputs are first converted into log-Mel spectrograms via a feature extractor. The encoder processes these spectrograms, passing representations to the decoder, which generates output tokens. This architecture provides the foundation for supporting multiple tasks—transcription, translation, and language identification—by conditioning the decoder with specialized context tokens. These tokens determine the output task, such as whether the model should transcribe in the same language or translate the audio to English. Inference is typically performed using the generate() method, while the WhisperProcessor manages audio input preprocessing and output text decoding. The model is implemented using PyTorch, and tokenization is provided by a fast tiktoken implementation.
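The following minimal sketch shows this flow with the Hugging Face Transformers interface; the checkpoint name and the sample dataset are assumptions, and any Whisper checkpoint with 16 kHz audio works the same way.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# Assumed checkpoint; any size from tiny to large-v2 follows the same interface.
checkpoint = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Small public 16 kHz sample used here purely for illustration.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]

# Feature extractor: raw waveform -> log-Mel spectrogram features.
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")

# Encoder-decoder generation: spectrogram features in, token IDs out.
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

# Tokenizer: token IDs -> text, dropping the special context tokens.
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```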
Training Data and Multitask Learning
Whisper’s capabilities are derived from its exposure to a large and diverse dataset of 680,000 hours of multilingual and multitask labeled audio sourced from web-scale data. This collection encompasses English transcription, non-English transcription, many-to-English speech translation, and voice activity detection, among other tasks. Approximately 65% of the data is English audio aligned with English transcripts, 18% consists of non-English audio paired with English text (for translation), and 17% consists of non-English audio with corresponding non-English transcriptions, spanning 98 languages. The model employs weakly supervised learning, as much of the data originated from noisy web sources, and multitask learning, training jointly on all tasks using a shared set of tokens. This approach enables Whisper to perform zero-shot translation and robust language identification by leveraging task tokens as contextual signals in the output sequence.
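As an illustration of this task conditioning, the sketch below (assuming the Hugging Face Transformers processor and the whisper-small checkpoint) prints the special context tokens that steer the decoder toward same-language transcription versus translation into English.

```python
from transformers import WhisperProcessor

# Assumed checkpoint; the context-token scheme is shared across all Whisper sizes.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# After <|startoftranscript|>, the decoder is conditioned on a language token
# followed by a task token, which together select the model's behavior.
for task in ("transcribe", "translate"):
    forced_ids = processor.get_decoder_prompt_ids(language="french", task=task)
    tokens = processor.tokenizer.convert_ids_to_tokens([tok for _, tok in forced_ids])
    print(task, tokens)

# Illustrative output:
#   transcribe ['<|fr|>', '<|transcribe|>', '<|notimestamps|>']
#   translate  ['<|fr|>', '<|translate|>', '<|notimestamps|>']
```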
A variant, large-v2, was trained for additional epochs with increased regularization, resulting in improved performance over the original ‘large’ model configuration, without changes to the model’s architecture.
Performance and Evaluation
Whisper achieves competitive accuracy on standard ASR and speech translation benchmarks. On the LibriSpeech test-clean dataset, the ‘whisper-large’ model achieves a word error rate (WER) of 3.0%, and on the test-other split, a WER of 5.4%. Performance is not uniform across languages, however: word and character error rates track closely with the amount of training data available for each language. On multilingual benchmarks such as Common Voice 15 and FLEURS, higher-resource languages exhibit lower error rates, while lower-resource languages show greater variance.
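For reference, word error rate is the number of substitutions, deletions, and insertions divided by the number of reference words; the short sketch below computes it with the jiwer package (an assumption, and any WER implementation behaves equivalently).

```python
# WER = (substitutions + deletions + insertions) / number of reference words
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions against nine reference words -> WER of roughly 0.22.
print(jiwer.wer(reference, hypothesis))
```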
Bar charts of word/character error rates show Whisper 'large-v3' and 'large-v2' performance across languages on the Common Voice 15 and FLEURS datasets. Lower WER/CER indicates better accuracy in supported languages.
The model is robust against accents, background noise, and technical jargon, but accuracy may degrade for lower-resource or underrepresented languages, as well as for distinctive dialects and speaker demographics. Repetition in generated text can occur, especially in languages with less training data, though this effect can be partially mitigated by decoding strategies.
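The sketch below illustrates decoding settings commonly used to curb repetition, using the reference openai-whisper package; the file path and model size are placeholders, and the Transformers interface exposes comparable controls.

```python
import whisper

model = whisper.load_model("small")  # assumed model size

result = model.transcribe(
    "speech.mp3",                      # placeholder path to an audio file
    temperature=(0.0, 0.2, 0.4, 0.6),  # fall back to higher temperatures if decoding degenerates
    condition_on_previous_text=False,  # reduces runaway repetition on long recordings
)
print(result["text"])
```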
Applications
Whisper’s core function is automatic speech recognition, transcribing spoken audio to text. It also performs speech translation from many source languages into English and can detect the language of spoken audio without explicit user input. The model lends itself to batch and long-form transcription tasks by leveraging chunking techniques in the Transformers ASR pipeline, supporting use cases such as captioning, archiving, and language accessibility tools. Researchers use Whisper to investigate the robustness, generalization, and limitations of contemporary speech models, as it provides a single model for tasks traditionally requiring specialized components.
mlk.flac
An example audio sample used to demonstrate Whisper's automatic speech recognition pipeline.
With fine-tuning, potential extensions include voice activity detection, speaker classification, and speaker diarization, though these applications have not been robustly validated in the base model.
Model Variants and Usage
Whisper is available in several pre-trained model sizes and variants, ranging from compact ‘tiny’ and ‘base’ models to ‘large’ and ‘large-v2’ configurations. Multilingual models support a wide array of languages, while .en-suffix models are tailored for English-only tasks. For high-accuracy English recognition, the .en models are recommended, especially in the smaller size ranges. Task and language behavior can be steered by setting decoder prompt IDs during inference, enabling forced or automatic language and task selection.
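A sketch of this steering with the Transformers interface appears below; the checkpoint, target language, and placeholder waveform are assumptions.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

checkpoint = "openai/whisper-base"  # assumed multilingual checkpoint
processor = WhisperProcessor.from_pretrained(checkpoint)
model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

# Placeholder one-second 16 kHz waveform; substitute a real recording in practice.
waveform = torch.zeros(16000).numpy()
input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features

# Force German transcription rather than relying on automatic language detection;
# recent Transformers versions also accept language= and task= arguments to generate().
forced_ids = processor.get_decoder_prompt_ids(language="german", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```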
For audio exceeding the model’s default 30-second input window, chunking and batched inference techniques allow scalable long-form transcription with optional timestamp predictions. Fine-tuning guides are available for customizing performance on specific languages or domains, often requiring only small amounts of new labeled data.
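The sketch below shows chunked, batched long-form transcription with the Transformers ASR pipeline; the checkpoint, batch size, and file path are assumptions to adjust per deployment.

```python
from transformers import pipeline

# Chunked, batched long-form transcription via the ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # assumed checkpoint
    chunk_length_s=30,             # Whisper's native 30-second input window
    batch_size=8,                  # assumed batch size; tune to available VRAM
)

# "interview.wav" is a placeholder path to a recording longer than 30 seconds.
result = asr("interview.wav", return_timestamps=True)
print(result["text"])
for segment in result["chunks"]:
    print(segment["timestamp"], segment["text"])
```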
Limitations and Ethical Considerations
Whisper exhibits certain known limitations, including occasional hallucination—predicting content not present in the audio—particularly due to its joint language modeling and speech transcription approach. Accuracy varies widely between languages and is lower for those with sparse training data. The model can be sensitive to speaker accent and regional dialect, and may generate repetitive segments under certain conditions. Furthermore, Whisper is not designed for real-time use out of the box, but it can serve as a building block in near-real-time systems.
Use in classification tasks such as speaker identification has not been rigorously tested and may require additional fine-tuning. Ethical issues have been raised surrounding the dual-use potential of ASR systems, particularly in surveillance scenarios, and users are cautioned against deploying Whisper for recordings without consent or for inferring subjective or demographic attributes.
Licensing
Whisper’s source code and pretrained model weights are distributed under the MIT License, permitting research, modification, and deployment with few restrictions.