The simplest way to self-host Mixtral 8x7B. Launch a dedicated cloud GPU server running Lab Station OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Mixtral 8x7B is a 46.7B-parameter language model built on a Sparse Mixture of Experts architecture: each layer contains 8 expert networks, and a router directs every input token to the 2 most relevant experts. It matches or exceeds GPT-3.5 on many standard benchmarks while using far fewer active parameters, and it excels at reasoning, math, and code tasks.
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model that builds upon the architecture of Mistral 7B. The model employs a decoder-only design featuring eight feedforward blocks (experts) per layer, with a router network that dynamically selects two experts to process each token. This architecture results in a total of 46.7 billion parameters, though only 12.9 billion are actively used during inference for any given token, as detailed in the Mistral AI release announcement.
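To make the routing concrete, here is a minimal sketch of a top-2 mixture-of-experts feedforward block in PyTorch. The class and module names are illustrative rather than Mistral's implementation, and each expert is a plain two-layer MLP instead of Mixtral's SwiGLU feedforward to keep the example short.

```python
# Minimal top-2 MoE feedforward block, loosely following the description above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, dim=4096, hidden_dim=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one linear layer producing a score per expert for each token.
        self.router = nn.Linear(dim, num_experts, bias=False)
        # Each expert is an independent feedforward network (simplified here).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, dim)
        scores = self.router(x)                # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the 2 selected experts
        out = torch.zeros_like(x)
        # Each token's output is the weighted sum of its two selected experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out
```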
The model's technical specifications include a model dimension of 4096, 32 layers, a head dimension of 128, a hidden (feedforward) dimension of 14336, 32 attention heads, 8 key-value heads, and a context length of 32,768 tokens. It uses a vocabulary of 32,000 tokens and routes each token to the top 2 of its 8 experts, as outlined in the research paper.
The architecture's efficiency is one of its standout features: while the total parameter count is a substantial 46.7B, only about 12.9B parameters are used for any given token during inference, making the model more resource-efficient than comparable dense models. This design allows Mixtral 8x7B to achieve faster inference at low batch sizes and higher throughput at large batch sizes than similarly sized models.
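These headline figures can be sanity-checked from the specifications above. The short calculation below is only a back-of-the-envelope estimate, assuming a SwiGLU feedforward (three weight matrices per expert), grouped-query attention projections, a per-layer router, and untied input/output embeddings; under those assumptions it recovers roughly 46.7B total and 12.9B active parameters.

```python
# Rough parameter count for Mixtral 8x7B from the stated specifications.
dim, hidden, layers = 4096, 14336, 32
heads, head_dim, kv_heads = 32, 128, 8
vocab, n_experts, top_k = 32_000, 8, 2

expert_ffn = 3 * dim * hidden                  # w1, w2, w3 of one SwiGLU expert
attention = 2 * dim * heads * head_dim \
          + 2 * dim * kv_heads * head_dim      # Wq, Wo + Wk, Wv per layer
router = dim * n_experts                       # routing logits per layer
embeddings = 2 * vocab * dim                   # input embeddings + output head

total = layers * (n_experts * expert_ffn + attention + router) + embeddings
active = layers * (top_k * expert_ffn + attention + router) + embeddings
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")  # ≈ 46.7B / 12.9B
```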
Mixtral 8x7B was pre-trained on data from the open web, with experts and routers trained simultaneously. The model demonstrates strong multilingual capabilities, particularly in English, French, Italian, German, and Spanish. It supports an extensive context window of 32k tokens, making it suitable for processing longer documents and conversations.
The model shows particular strength in several key areas, including reasoning, mathematics, code generation, and multilingual tasks.
Analysis of the expert selection process reveals some syntactic patterns, though no clear domain specialization has been identified among the experts, according to the technical documentation.
Mixtral 8x7B has demonstrated impressive performance across various benchmarks, notably outperforming Llama 2 70B and achieving comparable or better results than GPT-3.5 on many standard evaluations. The fine-tuned instruction-following variant, Mixtral 8x7B - Instruct, has shown particularly strong results, scoring 8.3 on MT-Bench and exceeding the performance of several leading models including GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B's chat model in human evaluations.
The model exhibits reduced bias on the BBQ benchmark and shows more positive sentiment on BOLD compared to Llama 2. Its performance is particularly noteworthy given its efficient parameter usage, achieving these results while using significantly fewer active parameters than models like Llama 2 70B (13B vs 70B).
The model is distributed in safetensors format and can be deployed using various frameworks. It is compatible with the vLLM serving project and the Hugging Face Transformers library; loading it through Transformers requires a release that includes Mixtral support. The model can be accelerated with Flash Attention 2 and supports several quantization options, including 8-bit and 4-bit quantization with bitsandbytes for reduced memory usage.
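As one illustration, the snippet below loads the instruction-tuned weights with Hugging Face Transformers and 4-bit bitsandbytes quantization. It assumes a recent transformers release with Mixtral support plus the bitsandbytes and flash-attn packages installed; the repository id, prompt, and generation settings are examples only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",                        # spread layers across available GPUs
    attn_implementation="flash_attention_2",  # optional; requires flash-attn
)

inputs = tokenizer("Mixtral 8x7B is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0]))
```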
For deployment, the model leverages Megablocks CUDA kernels for efficient inference and can be launched on cloud instances with SkyPilot. Note that, as a base model, it has no built-in moderation mechanisms.
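For serving, a minimal vLLM sketch might look like the following; the tensor-parallel degree and sampling settings are assumptions that depend on your GPUs and memory budget, not recommended values.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,   # split the weights across two GPUs (assumed setup)
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain sparse mixture-of-experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```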