Model Report
Google / Gemma 3 12B
Gemma 3 12B is a multimodal, instruction-tuned language model developed by Google DeepMind that processes both text and images to generate text outputs. The model features a decoder-only transformer architecture with a 400-million-parameter vision encoder and supports context windows up to 128,000 tokens. Trained on 12 trillion tokens across over 140 languages using knowledge distillation and reinforcement learning techniques, it demonstrates capabilities in mathematics, coding, and vision-language tasks while offering quantized variants for resource-efficient deployment.
Gemma 3 12B is a multimodal, instruction-tuned large language model developed by Google DeepMind as part of the Gemma family of open, efficient models. Designed to process both text and images as inputs while generating text outputs, Gemma 3 12B incorporates vision-language capability, long-context handling, and multilingual understanding. The model architecture, training methodologies, and benchmark evaluations are documented in detail in the Gemma 3 technical report.
Gemma 3 12B is capable of processing images and text jointly, as illustrated in this image of candies marked with a turtle—used in example prompts to test visual understanding.
Gemma 3 12B uses a decoder-only transformer architecture closely aligned with prior Gemma generations, extended in several important dimensions. The model incorporates a vision encoder, a 400 million parameter SigLIP variant of the Vision Transformer, designed for efficient soft-token generation from images. This vision encoder is kept frozen during language-model training and is shared across multiple Gemma 3 variants for architectural consistency.
To support flexible image resolutions and aspect ratios, Gemma 3 employs a Pan & Scan method inspired by LLaVA, segmenting images into non-overlapping crops that are resized to the encoder's native 896x896 resolution. At inference, each encoded image or crop is condensed into a fixed set of 256 vision tokens, reducing the computational and memory cost of downstream processing. This workflow improves the model's ability to read text embedded in images and accommodates a wide range of visual inputs.
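For intuition, the sketch below estimates how many vision tokens an image contributes to the sequence under this scheme; the crop-selection logic is a simplification and the helper function is hypothetical, not part of any released Gemma 3 code.

```python
# Hypothetical sketch: estimate the vision tokens contributed by one image.
# Gemma 3 encodes each 896x896 crop into 256 soft tokens; Pan & Scan may
# split a large or non-square image into several non-overlapping crops.
# The real crop-selection logic differs from this simplification.
import math

CROP_SIZE = 896          # native SigLIP input resolution
TOKENS_PER_CROP = 256    # soft tokens per encoded crop

def estimate_vision_tokens(width: int, height: int, max_crops: int = 4) -> int:
    """Rough token estimate for an image of the given pixel dimensions."""
    crops_w = math.ceil(width / CROP_SIZE)
    crops_h = math.ceil(height / CROP_SIZE)
    num_crops = min(crops_w * crops_h, max_crops)
    return num_crops * TOKENS_PER_CROP

print(estimate_vision_tokens(896, 896))    # 256 tokens for a single square crop
print(estimate_vision_tokens(1792, 896))   # 512 tokens for a wide image
```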
The model supports context windows of up to 128,000 tokens. This is achieved through an architectural modification that interleaves five local self-attention layers (each restricted to a 1,024-token sliding window) with every global self-attention layer, a design that confines full-context attention, and the key-value cache it requires, to the global layers and thereby reduces memory overhead at long context lengths.
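A simple way to picture the layout is as a repeating block of layer types; the snippet below sketches a 5:1 local-to-global schedule for illustration only and is not drawn from the released implementation.

```python
# Illustrative 5:1 interleaving of local (sliding-window) and global attention
# layers, following the description in the Gemma 3 technical report.
# Layer counts and the position of the global layer within each block are
# for demonstration only.
LOCAL_WINDOW = 1024   # tokens visible to each local layer

def attention_schedule(num_layers: int) -> list[str]:
    """Every sixth layer is global; the rest use a 1,024-token sliding window."""
    return ["global" if (i + 1) % 6 == 0 else "local" for i in range(num_layers)]

print(attention_schedule(12))
# ['local', 'local', 'local', 'local', 'local', 'global', 'local', ...]
```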
Quantization is incorporated via Quantization Aware Training (QAT), with official releases including raw and quantized checkpoints—such as int4 and fp8 formats—supporting deployment in environments with constrained resources. For example, the 12B model's memory footprint may be reduced from 24.0 GB (bfloat16) to 6.6 GB (int4) as described in the technical documentation.
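Those figures can be sanity-checked with back-of-the-envelope arithmetic that counts only parameter storage; the reported checkpoint sizes also include embedding tables and other overhead, so the estimates below are approximate.

```python
# Rough weight-memory estimate for a ~12B-parameter model at different precisions.
# Excludes KV cache, activations, and any layers kept at higher precision.
PARAMS = 12e9
BYTES_PER_PARAM = {"bfloat16": 2.0, "fp8": 1.0, "int4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{fmt:>9}: ~{gb:.1f} GB of weights")
# bfloat16: ~24.0 GB, fp8: ~12.0 GB, int4: ~6.0 GB
```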
Training Data, Multilingual Coverage, and Distillation
Gemma 3 12B is trained on a mixture of 12 trillion tokens, encompassing multilingual text, code, mathematics, and image-text pairs. The training dataset leverages both monolingual and parallel corpora, with specific emphasis on increasing the volume of non-English data to expand language support to over 140 languages. An improved SentencePiece tokenizer—shared with the Gemini 2.0 models—features 262,000 subword tokens, optimizing vocabulary distribution across both English and non-English text.
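For reference, the published tokenizer can be inspected with the Hugging Face transformers library; this assumes a transformers release with Gemma 3 support and acceptance of the Gemma license on the Hugging Face Hub, and the checkpoint name below follows the Hub listing.

```python
# Load the Gemma 3 tokenizer and inspect its vocabulary (requires accepting the
# Gemma license on the Hugging Face Hub and a transformers release with Gemma 3
# support).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
print(tokenizer.vocab_size)   # on the order of 262k entries
print(tokenizer.tokenize("Gemma 3 supports over 140 languages."))
```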
Quality control is prioritized through stringent data filtering, utilizing systems such as Google Cloud Sensitive Data Protection to remove personally identifiable and unsafe content. Evaluation sets are carefully decontaminated to prevent data leakage.
All Gemma 3 models are trained using knowledge distillation, in which the model—serving as the student—learns from the distribution of a more capable teacher model's logits. Post-training refinement uses advanced reinforcement learning techniques, such as improved versions of BOND, WARM, and WARP, incorporating reward functions crafted to amplify helpfulness, reasoning, math skills, code generation, and multilingual competence, while minimizing potential for harmful outputs. Quantization Aware Training further prepares models for resource-efficient inference.
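The distillation objective can be illustrated with a generic soft-target loss; the PyTorch sketch below is a textbook formulation rather than Google's training code, and the temperature value is arbitrary.

```python
# Generic knowledge-distillation loss: the student is trained to match the
# teacher's softened token distribution (illustrative, not the Gemma recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student distributions over the vocabulary."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 2 positions over a 10-token vocabulary.
student = torch.randn(2, 10)
teacher = torch.randn(2, 10)
print(distillation_loss(student, teacher, temperature=2.0).item())
```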
Gemma 3 12B was trained on TPUv4 infrastructure using the JAX framework, with large-scale model and optimizer state sharding handled via a ZeRO-3-style approach.
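As a loose illustration of state sharding in JAX (conceptually related to, but much simpler than, ZeRO-3 partitioning), the snippet below splits a weight matrix across whatever devices are available; it is not taken from the Gemma training stack.

```python
# Minimal JAX sharding illustration: partition a parameter array across devices.
# This only demonstrates the sharding API, not a full ZeRO-3 implementation.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Split the first axis of a weight matrix across all devices in the mesh.
weights = jnp.zeros((4096, 4096))
sharded = jax.device_put(weights, NamedSharding(mesh, P("data", None)))
print(sharded.sharding)   # shows how the array is partitioned across devices
```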
Performance and Benchmark Evaluation
Gemma 3 12B has been evaluated across a range of standardized benchmarks, with results summarized in the Gemma 3 technical report. In mathematics, the instruction-tuned variant scores 83.8 on the MATH benchmark, improving on earlier models such as Gemma 2 27B. Reasoning and coding abilities are likewise reported on datasets such as HiddenMath and LiveCodeBench.
Performance on vision-language tasks has also been measured. On the MMMU (Massive Multi-discipline Multimodal Understanding) validation set, Gemma 3 12B attains a score of 59.6. The model's long-context support also underlies reported results on long-document question answering and reasoning over large contexts.
For pre-trained models, Gemma 3 12B reports accuracy on benchmarks such as HellaSwag, BoolQ, PIQA, SocialIQA, and TriviaQA, with multilingual capability assessed on MGSM, Global MMLU-Lite, WMT24++, and FLoRes. Vision-language evaluation covers tasks such as COCOcap, DocVQA, and TextVQA.
A cosmos flower with a bumblebee, used in Gemma 3 code samples to demonstrate visual question answering. Such image-text inputs are supported natively by Gemma 3 12B.
Gemma 3 12B is designed to address a diverse array of natural language and vision-language tasks, including open-ended text generation, conversational AI, text summarization, and detailed image-based information extraction. Its vision-language architecture supports answering questions about images, describing visual scenes, and extracting embedded textual information, making it suitable for research, education, language learning, and interactive applications that require multimodal inputs. The compact size and quantized variants facilitate deployment on various platforms.
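A minimal inference sketch using the Hugging Face transformers library is shown below; it assumes a transformers release with Gemma 3 support, accepted license terms on the Hugging Face Hub, and sufficient GPU memory, and it follows the message format published on the model card (the image URL is a placeholder).

```python
# Minimal vision-language inference sketch for Gemma 3 12B (instruction-tuned).
# Requires a transformers release with Gemma 3 support and acceptance of the
# Gemma license on the Hugging Face Hub.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/flower.jpg"},  # placeholder URL
        {"type": "text", "text": "What insect is on the flower?"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```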
The ability to process long context windows enables document analysis, code synthesis from extended instructions, and contextually aware dialogue systems. Wide coverage of over 140 languages further broadens utility in global contexts.
Model Family and Comparison
Gemma 3 encompasses a spectrum of model sizes spanning from 1B to 27B parameters, with the 12B variant representing a mid-scale balance between capability and efficiency. The architecture is co-designed with the Gemini frontier models, and the largest Gemma 3 variant (27B) achieves results comparable to Gemini-1.5-Pro on the LMSYS Chatbot Arena.
Compared to prior generations, Gemma 3 models generally improve on their Gemma 2 predecessors across reported benchmarks, benefiting from larger and more diverse training data, architectural changes for efficient long-context inference, and the addition of multimodal input support.
Limitations
Despite its capabilities, Gemma 3 12B shares commonly recognized limitations of large language models. It relies on statistical patterns from its training data and may produce inaccurate or outdated factual statements, struggle with nuanced language phenomena (such as sarcasm or implied meaning), and reproduce biases or artifacts from its data sources. While filtering and reinforcement learning mitigate some risks, the potential for harmful content, lapses in common-sense reasoning, or contextually inappropriate outputs persists.
Internal multilingual evaluation focuses predominantly on English, and performance in other languages may vary depending on the data available for each. Performance also degrades when inputs exceed the supported 128,000-token context window.
Licensing and Accessibility
Gemma 3 12B is distributed under Google's Gemma license terms, which users must review and accept. An accompanying usage policy specifies user responsibilities and restricted use cases. Model checkpoints, quantized formats, and technical documentation are available primarily for research and development purposes.