The simplest way to self-host Llama 3.2 3B. Launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
Llama 3.2 3B is a 3.21B-parameter multilingual model released by Meta in 2024. Notable for its Grouped-Query Attention architecture and SpinQuant optimization for mobile devices. Supports 8+ languages, with knowledge through December 2023. Shows improved performance in reasoning, QA, and multilingual tasks versus previous Llama versions.
Llama 3.2 3B is an auto-regressive multilingual large language model built on an optimized transformer architecture. With 3.21 billion parameters, it officially supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai, while retaining capability in additional languages present in its training data.
The model employs Grouped-Query Attention (GQA), which improves inference scalability across different deployment scenarios. Both pretrained and instruction-tuned variants are available; the latter benefits from supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to improve helpfulness and safety. The model's knowledge cutoff is December 2023, and it was released on September 25, 2024.
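GQA's main practical benefit is a smaller key/value cache during inference, since several query heads share each KV head. The back-of-the-envelope sketch below illustrates this; the layer and head counts (28 layers, 24 query heads sharing 8 KV heads, head dimension 128) are taken from the model's published configuration as best understood here, and the 8,192-token context is just an example:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: one K and one V tensor per layer, in bf16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

SEQ_LEN = 8192  # example context length

# Assumed Llama 3.2 3B config: 28 layers, head_dim 128, 8 KV heads (GQA)
# versus a hypothetical full multi-head variant with 24 KV heads (MHA).
gqa = kv_cache_bytes(num_layers=28, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
mha = kv_cache_bytes(num_layers=28, num_kv_heads=24, head_dim=128, seq_len=SEQ_LEN)

print(f"GQA KV cache: {gqa / 2**20:.0f} MiB")  # 896 MiB
print(f"MHA KV cache: {mha / 2**20:.0f} MiB")  # 2688 MiB
print(f"Reduction:    {mha / gqa:.1f}x")       # 3.0x
```

Sharing 8 KV heads among 24 query heads cuts the cache threefold at this configuration, which is what makes longer contexts and larger batches feasible on the same hardware.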
Particularly noteworthy is the model's optimization for on-device use through quantization. Applying SpinQuant and quantization-aware training (QAT) with LoRA has yielded significant improvements in both inference speed and memory efficiency, especially on ARM CPUs. For instance, testing on a OnePlus 12 Android device demonstrated a 2.6x speedup in decoding compared to the BF16 baseline, while achieving a 60.3% reduction in model size. More details about these optimizations can be found in the SpinQuant paper.
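To put the reported 60.3% size reduction in concrete terms, the quick calculation below estimates on-disk footprint, assuming 2 bytes per parameter for the BF16 baseline (weights only, excluding activation and cache memory):

```python
PARAMS = 3.21e9     # parameter count from the model card
BF16_BYTES = 2      # bytes per parameter in bfloat16
REDUCTION = 0.603   # reported size reduction from SpinQuant / QAT+LoRA

bf16_gb = PARAMS * BF16_BYTES / 1e9
quant_gb = bf16_gb * (1 - REDUCTION)

print(f"BF16 checkpoint:      ~{bf16_gb:.1f} GB")   # ~6.4 GB
print(f"Quantized checkpoint: ~{quant_gb:.1f} GB")  # ~2.5 GB
```

At roughly 2.5 GB, the quantized weights fit comfortably within the RAM budget of a modern flagship phone, which is the point of the on-device optimizations.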
Llama 3.2 3B has demonstrated strong performance across a range of benchmarks, outperforming many open-source and closed-source chat models. Relative to previous Llama iterations, the results show significant improvements in general knowledge, reasoning, question answering, and multilingual tasks, underscoring the model's versatility across applications.
Meta has placed significant emphasis on responsible deployment, providing comprehensive resources through its Responsible Use Guide. The model is released under the Llama 3.2 Community License, a custom commercial license agreement that governs its use in both commercial and research applications.
The development process included extensive safety measures, such as the safety-focused fine-tuning described above, and Meta actively encourages community engagement and feedback through multiple channels.
The model is well-suited for applications including chatbots, knowledge retrieval, summarization, and mobile AI assistants, with particular attention paid to efficient on-device performance.
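For chatbot use, the instruction-tuned variant expects its input in the Llama 3 chat template. In practice a tokenizer's `apply_chat_template` handles this, but the sketch below assembles a single-turn prompt by hand to show the structure; the special-token names follow the published Llama 3 instruct format and should be double-checked against the official documentation:

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt in the Llama 3 chat template.

    Normally a tokenizer's `apply_chat_template` produces this string;
    the special tokens below follow the Llama 3 instruct format.
    """
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    system="You are a concise assistant.",
    user="Summarize grouped-query attention in one sentence.",
)
print(prompt)
```

The trailing assistant header leaves the prompt open for the model to generate its reply, which continues until it emits an `<|eot_id|>` token.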