Model Report
Mistral AI / Mixtral 8x7B
Mixtral 8x7B is a sparse Mixture of Experts language model developed by Mistral AI and released under the Apache 2.0 license in December 2023. The model uses a decoder-only transformer architecture with eight expert networks per layer, activating only two experts per token, so that 12.9 billion of its 46.7 billion total parameters are active for any given token. It demonstrates competitive performance on benchmarks including MMLU, offers multilingual capabilities across English, French, German, Spanish, and Italian, and maintains efficient inference speeds.
Mixtral 8x7B is a large language model (LLM) created by Mistral AI and released on December 11, 2023. It is designed as a generative Sparse Mixture of Experts (SMoE) model, making use of a transformer-based architecture that enables efficient scaling and performance across multiple domains. Mixtral 8x7B is distributed under the Apache 2.0 license, supporting both academic and commercial applications.
A table showing how Mixtral 8x7B compares on key benchmarks with Llama 2 70B and GPT-3.5 across reasoning, coding, and general understanding tasks.
Model Architecture
Mixtral 8x7B adopts a decoder-only transformer structure. Its distinguishing feature is the integration of Mixture-of-Experts (MoE) layers: each transformer layer contains eight distinct feedforward blocks, or “experts.” For every token processed, a router network dynamically selects two of these experts, whose outputs are computed and combined additively. This yields a total model capacity of 46.7 billion parameters with only 12.9 billion parameters active per token during inference, a configuration that manages cost and latency while enhancing representational power. The combination of high total parameter count and low per-token activation builds on the architecture established in Mistral 7B, but leverages the SMoE design for greater efficiency.
Specialized execution kernels, such as Megablocks, enable these sparse computations to be efficient even on commodity hardware by expressing MoE operations as sparse matrix multiplications. Mixtral layers utilize the SwiGLU activation function for their expert functions, with the number of selected experts per token (top_k) set to two.
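The routing logic described above can be sketched in a few lines of PyTorch. The snippet below is illustrative rather than Mistral's implementation: it loops over experts for readability instead of using sparse Megablocks-style kernels, and its default dimensions follow the published Mixtral configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block (Mixtral-style dimensions)."""
    def __init__(self, dim: int = 4096, hidden_dim: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoELayer(nn.Module):
    """Routes each token to 2 of 8 experts and sums their weighted outputs."""
    def __init__(self, dim: int = 4096, hidden_dim: int = 14336,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [SwiGLUExpert(dim, hidden_dim) for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        logits = self.router(x)                           # (num_tokens, num_experts)
        weights, selected = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # normalize over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = torch.where(selected == i)  # tokens routed to expert i
            if token_idx.numel() == 0:
                continue                                  # expert received no tokens
            gate = weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] += gate * expert(x[token_idx])
        return out

layer = SparseMoELayer()
tokens = torch.randn(5, 4096)         # five token embeddings
print(layer(tokens).shape)            # torch.Size([5, 4096])
```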
Key model hyperparameters include a model dimension of 4096, 32 layers, a head dimension of 128, a hidden (feed-forward) dimension of 14,336, and support for context windows of up to 32,768 tokens. The model’s vocabulary consists of 32,000 tokens, which also helps support its multilingual coverage.
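These hyperparameters are enough to roughly reconstruct the parameter counts quoted above. The back-of-the-envelope estimate below assumes grouped-query attention with 8 key/value heads (as in Mistral 7B) and omits normalization and router weights, which contribute comparatively little; the figures are approximate.

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B from its published
# hyperparameters. Assumes grouped-query attention with 8 key/value heads
# (as in Mistral 7B); norms and router weights are omitted as negligible.
dim, n_layers, head_dim = 4096, 32, 128
hidden_dim, vocab_size = 14_336, 32_000
n_heads = dim // head_dim          # 32 query heads
n_kv_heads = 8                     # assumption: GQA, as in Mistral 7B
n_experts, top_k = 8, 2

attn = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)  # Wq, Wo, Wk, Wv
expert = 3 * dim * hidden_dim                       # SwiGLU: w1, w2, w3
embeddings = 2 * vocab_size * dim                   # input embeddings + output head

total_params  = n_layers * (attn + n_experts * expert) + embeddings
active_params = n_layers * (attn + top_k * expert) + embeddings

print(f"total  ~ {total_params / 1e9:.1f}B parameters")   # ~46.7B
print(f"active ~ {active_params / 1e9:.1f}B per token")   # ~12.9B
```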
Training Data and Methodology
Mixtral 8x7B is pre-trained on a diverse dataset derived from large-scale web corpora, with increased emphasis on multilingual content relative to its predecessor Mistral 7B, in order to bolster performance across English, French, Italian, German, and Spanish. During pre-training, both the experts and routers are trained in tandem, optimizing for dynamic expert allocation as described in Mistral's technical release.
Instruction-tuned variants, such as Mixtral 8x7B Instruct, are created using supervised fine-tuning (SFT) on curated instruction datasets, followed by Direct Preference Optimization (DPO) on paired human feedback. Analysis of internal routing dynamics reveals that early and late layers assign tokens to experts primarily on the basis of syntactic features, while divergences in the middle layers reflect more specialized behavior.
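Mistral AI has not published the full details of this preference-tuning stage; for reference, the standard DPO objective (Rafailov et al., 2023) on which it is based takes the following form, where y_w and y_l are the preferred and rejected responses, pi_ref is the supervised fine-tuned reference model, sigma is the logistic function, and beta controls how far the tuned policy may drift from the reference:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```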
Performance Benchmarks
This table presents Mixtral 8x7B's benchmark results across multiple evaluation domains, comparing them to other leading models.
Mixtral 8x7B demonstrates competitive performance on a wide range of standard natural language processing benchmarks, often matching or exceeding much larger models such as Llama 2 70B and GPT-3.5. For example, it reports an MMLU (Massive Multitask Language Understanding) score of 70.6%, surpassing both Llama 2 70B and GPT-3.5. On reasoning and knowledge-intensive benchmarks such as HellaSwag and ARC Challenge, Mixtral delivers performance on par with, or superior to, contemporaries with higher parameter counts.
The architecture’s efficiency is reflected in both speed and throughput: by restricting active parameters per token, Mixtral 8x7B achieves inference speeds approximately six times faster than Llama 2 70B under similar settings. This makes it suitable for both low-latency and high-throughput scenarios.
Math and code generation benchmarks also highlight Mixtral's capabilities: it outperforms Llama 2 70B on MBPP (60.7% vs. 49.8%) and GSM8K (58.4% vs. 53.6%). In instruction-following tasks, the Mixtral 8x7B Instruct model attains a score of 8.3 on MT-Bench, ranking above GPT-3.5 Turbo and comparable models on the LMSys Leaderboard as of late 2023.
Charts visualizing quality versus inference budget for Mixtral 8x7B and peer models in MMLU, Knowledge, Reasoning, Math, and Code benchmarks.
Multilingual proficiency is a defining characteristic of Mixtral 8x7B. On French, German, Spanish, and Italian evaluations using ARC-c, HellaSwag, and MMLU, Mixtral consistently matches or outperforms Llama 2 70B and other major peers, indicating utility for international communication and content generation.
Comparative results across French, German, Spanish, and Italian benchmarks highlight Mixtral 8x7B's strong multilingual abilities.
Mixtral 8x7B’s bias profile has also been evaluated systematically. On the BBQ (Bias Benchmark for QA) and BOLD (Bias in Open-Ended Language Generation Dataset) benchmarks, Mixtral exhibits measurable improvements over Llama 2 70B, showing lower bias and more balanced sentiment. For example, Mixtral achieves a 56.0% accuracy on BBQ compared to Llama 2 70B's 51.5%, and similar or better performance on multiple BOLD subcategories.
Detailed bias performance comparison between Mixtral 8x7B and Llama 2 70B on BBQ and BOLD metrics.
Applications and Limitations
Mixtral 8x7B is well-suited to a variety of language-related applications, including text content generation, code completion in development workflows, instruction following for chat-based agents, and multilingual tasks in international communication. The base model is pretrained and can be further fine-tuned for specialized applications, such as moderation or task automation, using established protocols.
Despite its efficiency, Mixtral's total memory footprint remains substantial relative to its sparse active-parameter count, because all expert parameters must be kept in memory. The MoE routing mechanism also introduces overhead, particularly with small batches, and is therefore most efficient at larger batch sizes. The pretrained base model does not include built-in moderation, so applications requiring output safeguards benefit from fine-tuning or prompt-level guardrails, as described in Mistral AI’s moderation guidelines.
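To make the footprint concrete, the rough estimate below shows the weight memory implied by the 46.7-billion-parameter total at common precisions; it ignores the KV cache, activations, and runtime overhead, so real-world usage is higher.

```python
# Rough weight-memory requirements for Mixtral 8x7B at common precisions.
# All 46.7B parameters must be resident even though only 12.9B are active
# per token; KV cache, activations, and framework overhead are ignored.
TOTAL_PARAMS = 46.7e9

for precision, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    gib = TOTAL_PARAMS * bits / 8 / 2**30
    print(f"{precision:>9}: ~{gib:.0f} GiB of weights")
# fp16/bf16: ~87 GiB, int8: ~43 GiB, int4: ~22 GiB
```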
Mixtral 8x7B forms part of a model family alongside Mistral 7B, which shares a similar transformer architecture but does not implement MoE layers. In benchmark comparisons, Mixtral’s SMoE configuration results in distinct advantages in scalability and multilingual performance.
Release Information and Licensing
Mixtral 8x7B was officially released on December 11, 2023. Shortly afterward, in January 2024, Mistral AI published a preprint describing Mixtral's technical approach for the broader research community. The model architecture, weights, and code are available under the Apache 2.0 license, permitting broad adoption across both open research and commercial environments.