Model Report
Google / Gemma 3 4B
Gemma 3 4B is a multimodal instruction-tuned model developed by Google DeepMind that processes text and image inputs to generate text outputs. The model features a decoder-only transformer architecture with approximately 4.3 billion parameters, supports context windows up to 128,000 tokens, and operates across over 140 languages. It incorporates a SigLIP vision encoder for image processing and utilizes grouped-query attention with interleaved local and global attention layers for efficient long-context handling.
Gemma 3 4B is a multimodal, instruction-tuned generative AI model developed by Google DeepMind, forming part of the Gemma 3 family of lightweight, open models. Designed to process both text and image inputs while producing text outputs, Gemma 3 4B focuses on efficient computation, making it suitable for deployment even on standard consumer devices such as laptops and mobile phones. The model introduces innovations in long-context handling, multimodal reasoning, and memory efficiency, and is trained to function across over 140 languages. Gemma 3 4B is released under an open usage license, with specific terms outlined by Google.
Gemma 3 4B can process images such as this, enabling open-ended question answering on real-world photos. Prompt: 'What animal is on the candy?'
Gemma 3 4B achieves multimodal functionality through a purpose-built SigLIP vision encoder. The encoder processes images at 896 x 896 resolution and condenses visual information into token embeddings that the language model consumes. During inference, an adaptive algorithm called Pan and Scan (P&S) lets the model handle non-square images and small details by generating non-overlapping crop regions, each resized to the encoder's required input format. This improves recognition of text and fine detail when images with varying aspect ratios are provided.
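The exact Pan and Scan heuristics are described in the Gemma 3 Technical Report; the sketch below only illustrates the general idea, splitting a non-square image into non-overlapping crops and resizing each to the encoder's 896 x 896 input. The crop-count rule and the max_crops parameter are assumptions for demonstration, not the report's exact algorithm.

```python
from PIL import Image

CROP_SIZE = 896  # native SigLIP input resolution used by Gemma 3

def pan_and_scan(image: Image.Image, max_crops: int = 4) -> list[Image.Image]:
    """Split a non-square image into non-overlapping crops along its longer
    axis and resize each crop to 896 x 896 for the vision encoder.
    The crop-count heuristic here is illustrative, not Google's exact rule."""
    w, h = image.size
    if w >= h:
        n = min(max_crops, max(1, round(w / h)))
        boxes = [(i * w // n, 0, (i + 1) * w // n, h) for i in range(n)]
    else:
        n = min(max_crops, max(1, round(h / w)))
        boxes = [(0, i * h // n, w, (i + 1) * h // n) for i in range(n)]
    return [image.crop(box).resize((CROP_SIZE, CROP_SIZE)) for box in boxes]
```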
The model supports long context windows of up to 128,000 tokens in a single sequence. This is accomplished by an architecture that interleaves local and global attention layers in a 5:1 ratio of local to global, with local attention restricted to a 1024-token sliding window. The Rotary Position Embedding (RoPE) base frequency for global layers is increased from 10,000 to 1,000,000, enabling efficient modeling of long-range token dependencies. These modifications also reduce the memory footprint of long-context attention, a benefit reflected in measured reductions in KV-cache memory usage during evaluation.
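As a rough sketch of how this interleaving could be expressed in a model configuration, the snippet below assigns every sixth layer global attention with the raised RoPE base, and gives the remaining layers a 1024-token sliding window. The helper name and layer indexing are illustrative assumptions, not Gemma 3 4B's actual configuration code.

```python
LOCAL_WINDOW = 1024            # sliding-window size for local attention layers
ROPE_BASE_LOCAL = 10_000       # unchanged RoPE base for local layers
ROPE_BASE_GLOBAL = 1_000_000   # raised RoPE base so global layers span long contexts

def layer_config(layer_idx: int) -> dict:
    """Every sixth layer attends globally; the five in between use a
    1024-token sliding window, which keeps the KV cache small."""
    is_global = (layer_idx + 1) % 6 == 0
    return {
        "attention": "global" if is_global else "local",
        "window": None if is_global else LOCAL_WINDOW,
        "rope_base": ROPE_BASE_GLOBAL if is_global else ROPE_BASE_LOCAL,
    }

# First twelve layers: five local, one global, repeated.
print([layer_config(i)["attention"] for i in range(12)])
```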
Gemma 3 4B's multilingual capabilities are enhanced through a revised data mixture, supporting over 140 languages and balancing performance across linguistic groups. Official quantized checkpoint variants, including per-channel int4 and switched fp8 (SFP8) formats, are produced via quantization-aware training, further improving usability across diverse hardware platforms.
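As one hedged example of running the model in reduced precision, the snippet below loads the published Hugging Face checkpoint with on-the-fly 4-bit quantization via bitsandbytes. This is not the same as Google's quantization-aware-trained checkpoints, but it serves the same practical goal of fitting the model into limited VRAM; the exact transformers API may vary by version.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig

model_id = "google/gemma-3-4b-it"

# Quantize weights to 4 bits at load time; compute still runs in bfloat16.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",
)
```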
Gemma 3 4B's vision-language integration supports image-to-text tasks such as detailed scene description. Prompt: 'Describe this image in detail.'
Gemma 3 4B uses a decoder-only transformer architecture, iteratively refined from earlier Gemma and Gemini models. Primary innovations include Grouped-Query Attention (GQA) combined with RMSNorm, and a distinctive interleaving of attention layers that improves scalability to longer context windows. The model replaces Gemma 2's "soft-capping" of attention logits with QK-norm to better control attention scaling.
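A minimal PyTorch sketch of grouped-query attention with QK-norm is shown below: queries and keys are RMS-normalized per head before the dot product, rather than soft-capping the attention logits as in Gemma 2. The dimensions and module layout are illustrative assumptions, not Gemma 3 4B's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GQAWithQKNorm(nn.Module):
    def __init__(self, dim=2560, n_heads=8, n_kv_heads=4, head_dim=256):
        super().__init__()
        self.n_heads, self.n_kv_heads, self.head_dim = n_heads, n_kv_heads, head_dim
        self.q_proj = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, dim, bias=False)
        # QK-norm: RMS-normalize queries and keys per head instead of
        # soft-capping the attention logits.
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_norm(self.q_proj(x).view(b, t, self.n_heads, self.head_dim)).transpose(1, 2)
        k = self.k_norm(self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim)).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Grouped-query attention: each key/value head serves a group of query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 16, 2560)
print(GQAWithQKNorm()(x).shape)  # torch.Size([1, 16, 2560])
```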
The SigLIP-based vision encoder, containing 417 million parameters, is shared across the 4B, 12B, and 27B Gemma 3 models and kept frozen during training. Model parameters are distributed as follows: 3.2 billion non-embedding parameters, 675 million embedding parameters, and the 417-million-parameter vision encoder, summing to approximately 4.3 billion parameters as reported in the Hugging Face model card.
For training, Gemma 3 4B was pre-trained on 4 trillion tokens comprising web documents, code, mathematical content, and images. The model uses a SentencePiece tokenizer with a 262,000-entry vocabulary, improving tokenization of non-English scripts. All Gemma 3 models are trained with knowledge distillation, in which the student model learns from teacher-generated logits over sampled token distributions, enabling efficient knowledge transfer from a larger teacher model. Post-training further applies reinforcement learning objectives based on reward models and feedback mechanisms targeting helpfulness, mathematical and coding accuracy, reasoning, and the reduction of harmful outputs.
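The distillation objective can be sketched roughly as follows: the student is trained to match the teacher's distribution over a sampled subset of the vocabulary at each position. The sampling scheme and renormalization below are simplifying assumptions for illustration, not the exact procedure from the technical report.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, num_sampled=256):
    """KL divergence between teacher and student over sampled vocabulary entries.
    Both logit tensors have shape (batch, seq_len, vocab_size)."""
    # Sample a subset of vocabulary ids from the teacher's distribution per position.
    teacher_probs = teacher_logits.softmax(dim=-1)
    sampled = torch.multinomial(
        teacher_probs.flatten(0, 1), num_sampled, replacement=True
    ).view(*teacher_logits.shape[:2], num_sampled)

    # Renormalize both models over the sampled subset, then match distributions.
    t = torch.gather(teacher_logits, -1, sampled).log_softmax(dim=-1)
    s = torch.gather(student_logits, -1, sampled).log_softmax(dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")
```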
Training infrastructure consisted of TPUv5e hardware and JAX-based software stacks, utilizing distributed optimization methods such as ZeRO-3 to manage scaling and memory requirements.
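To illustrate the ZeRO-3-style idea of partitioning large arrays across accelerators, the snippet below shards a single weight matrix over a one-dimensional device mesh using JAX's sharding API. The mesh layout and array shapes are placeholders, not the actual Gemma 3 training configuration.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a one-dimensional mesh over all available devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a large weight matrix along its first axis so each device holds
# only a slice of the parameters (the ZeRO-3-style partitioning idea).
weights = jnp.zeros((8192, 2560))
sharded = jax.device_put(weights, NamedSharding(mesh, P("data", None)))
print(sharded.sharding)
```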
Evaluation and Performance
Gemma 3 4B demonstrates improvements over prior Gemma models across standard NLP and VLM benchmarks. On the MMLU-Pro dataset, the model achieves a score of 43.6, significantly higher than Gemma 2 2B and comparable to larger previous models. For multimodal evaluation, instruction-tuned Gemma 3 4B achieves notable results on benchmarks such as DocVQA (75.8) and MMMU (48.8), both with Pan & Scan enabled during inference.
In terms of pre-trained model performance, Gemma 3 4B achieves a HellaSwag score of 77.2, and on the HumanEval code-generation benchmark it attains a score of 36.0. Multilingual evaluation shows strong generalization, with a Global-MMLU-Lite score of 57.0.
Long-context evaluations using the RULER and MRCR benchmarks indicate that Gemma 3 4B maintains coherence and recall at context sizes up to 128,000 tokens, with only gradual degradation. Safety evaluations indicate a low memorization rate for long-form text, with detected instances most often being approximate paraphrases rather than verbatim recitations. No personal information was identified in the memorized content examined during these assessments.
Applications and Usage
Due to its emphasis on efficiency and modality flexibility, Gemma 3 4B is well-suited for a range of research and application scenarios involving text and image understanding. Common applications include content generation, chat-based interfaces, code synthesis, mathematical reasoning, multilingual translation, and visual question answering. The model's relatively compact size supports deployment in resource-constrained environments, while its architecture allows for efficient operation even with long input sequences or large batch sizes.
Typical use cases demonstrated in official documentation include interactive chat, detailed image captioning, image-grounded question answering, and language learning. Content safety and factuality have been improved through a combination of pre-training and post-training filtering, as well as reinforcement learning on diverse data, including human feedback.
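As a hedged usage example, the snippet below continues from the loading sketch shown earlier (reusing model and processor) and runs image-grounded question answering through the Hugging Face chat template; the image URL and question are placeholders.

```python
# Reuses `model` and `processor` from the 4-bit loading example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```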
Limitations
Despite advances in architecture and data processing, Gemma 3 4B reflects certain inherent limitations associated with current generative AI technology. Risks associated with evaluation set contamination are mitigated but remain possible. While memorization is lower compared to predecessors, model outputs can still include approximations of training content. The diversity and quality of pre-training data affect both breadth and depth of knowledge, potentially resulting in gaps or biases.
The model's ability to handle complex, ambiguous, or highly context-dependent natural language remains challenging, as does absolute factual accuracy, since its outputs are based on learned statistical patterns rather than curated knowledge. Safety evaluations were conducted primarily in English, so behavior in other languages is less thoroughly characterized. Furthermore, general issues such as common-sense reasoning, the risk of generating plausible but incorrect statements, and the perpetuation of biases present in training data remain areas warranting further research.
Comparison within the Gemma Family
Gemma 3 4B sits between smaller (Gemma 3 1B) and larger (Gemma 3 12B, Gemma 3 27B) Gemma 3 models, all designed around a similar multimodal transformer backbone and sharing a frozen vision encoder. The 4B, 12B, and 27B variants support the maximum 128K token context window, while the 1B variant supports 32K tokens. Increased parameter scale is associated with improved performance, particularly in complex benchmarks and multilingual capabilities, as documented in the Gemma 3 Technical Report.
Empirical results show the 4B model performing comparably to the 27-billion-parameter models of the previous Gemma 2 generation, and this generational improvement holds across the other model sizes.
Licensing and Responsible Use
Gemma 3 4B is released as an open model, with distribution and usage governed by Google's terms and a prohibited use policy. Potential users are required to review and agree to these terms before downloading and deploying the model.