Gemma 3 1B is a lightweight, open-weight generative artificial intelligence model developed by Google DeepMind. The smallest member of the Gemma 3 family, it is designed for both research and practical applications, building upon the architecture and methodologies established in the Gemini models. While the larger Gemma 3 variants add multimodal (vision-language) capabilities, the 1B model is text-only. Gemma 3 1B offers open access to its pre-trained and instruction-tuned weights, facilitating transparency and broadening its reach for a variety of language tasks, as described in the technical report.
Model Architecture and Technical Innovations
Gemma 3 1B is built upon a decoder-only transformer backbone, inheriting architectural influences from the Gemini model series as documented in the Gemini family research paper. The model incorporates a range of innovations designed for efficiency and scalability. Notably, it interleaves local and global attention layers at a ratio of five local layers to one global layer, enabling a context window of up to 32,000 tokens. Local layers use sliding-window self-attention, while the global layers use rotary positional embeddings (RoPE) with an increased base frequency to accommodate long-range dependencies.
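To make the interleaving concrete, the sketch below lays out a hypothetical per-layer plan for such an architecture. The 5:1 ratio and 32K context come from the text above; the 1,024-token sliding window and the RoPE base frequencies follow the technical report's description of the family, while the layer count is an illustrative assumption, not a reproduction of the actual implementation.

```python
# Illustrative sketch (not the official implementation): lay out the
# attention pattern of a Gemma-3-style decoder with a 5:1 ratio of
# local (sliding-window) layers to global layers.

NUM_LAYERS = 26                # hypothetical depth for a ~1B-parameter model
LOCAL_TO_GLOBAL_RATIO = 5      # five local layers per global layer
SLIDING_WINDOW = 1024          # tokens visible to each local attention layer
LOCAL_ROPE_BASE = 10_000       # standard RoPE base kept for local layers
GLOBAL_ROPE_BASE = 1_000_000   # raised base so global layers span 32K tokens

def layer_plan(num_layers: int) -> list[dict]:
    """Return a per-layer config: every 6th layer is global, the rest local."""
    plan = []
    for i in range(num_layers):
        is_global = (i + 1) % (LOCAL_TO_GLOBAL_RATIO + 1) == 0
        plan.append({
            "layer": i,
            "attention": "global" if is_global else "local",
            "window": None if is_global else SLIDING_WINDOW,
            "rope_base": GLOBAL_ROPE_BASE if is_global else LOCAL_ROPE_BASE,
        })
    return plan

for cfg in layer_plan(NUM_LAYERS)[:7]:
    print(cfg)
```

Interleaving keeps the key-value cache of most layers bounded by the window size, which is what makes the long context affordable on small hardware.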
Within the Gemma 3 family, vision capabilities are realized through a 400 million parameter variant of the SigLIP encoder, which is kept frozen during language model training; the 1B model omits this encoder and is text-only. In the vision-equipped variants (4B, 12B, and 27B), image inputs are processed as 896 × 896 pixel squares and encoded as sequences of 256 tokens, and a "Pan and Scan" (P&S) method flexibly segments larger or non-square images into non-overlapping crops, ensuring efficiency and adaptability for various visual data. Vision-language fusion combines these visual tokens with text inside the transformer architecture.
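The crop geometry can be sketched as follows. This is a minimal, hypothetical rendering of pan-and-scan style tiling using the 896 × 896 crop size quoted above; it omits the heuristics the actual P&S algorithm uses to choose crop counts and handle extreme aspect ratios.

```python
# Minimal sketch of pan-and-scan style tiling, assuming each crop must be
# a fixed 896x896 square (the encoder's native resolution). The real P&S
# algorithm applies additional crop-selection heuristics not shown here.
from PIL import Image

CROP_SIZE = 896  # native input resolution of the SigLIP encoder

def pan_and_scan(image: Image.Image) -> list[Image.Image]:
    """Tile an image into non-overlapping CROP_SIZE squares, resizing so
    both dimensions become exact multiples of CROP_SIZE first."""
    w, h = image.size
    cols = max(1, round(w / CROP_SIZE))
    rows = max(1, round(h / CROP_SIZE))
    image = image.resize((cols * CROP_SIZE, rows * CROP_SIZE))
    crops = []
    for r in range(rows):
        for c in range(cols):
            box = (c * CROP_SIZE, r * CROP_SIZE,
                   (c + 1) * CROP_SIZE, (r + 1) * CROP_SIZE)
            crops.append(image.crop(box))
    return crops  # each crop is later encoded into 256 image tokens
```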
Further technical features include Grouped-Query Attention (GQA) with RMSNorm normalization, and the adoption of QK-norm to stabilize attention activations. Quantization-Aware Training (QAT) complements deployment flexibility by providing optimized checkpoint formats, such as per-channel int4, per-block int4, and switched fp8, to minimize the model's inference footprint, as detailed in the technical report's coverage of architecture and optimization techniques.
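To illustrate what a per-channel int4 format means in practice, here is a minimal NumPy sketch of symmetric per-channel quantization. Note that true QAT simulates this rounding during training, whereas the sketch quantizes after the fact.

```python
# Illustrative sketch of per-channel int4 weight quantization (symmetric).
# This shows the storage arithmetic only; QAT checkpoints are produced by
# simulating this rounding during fine-tuning, not by post-hoc conversion.
import numpy as np

def quantize_int4_per_channel(w: np.ndarray):
    """Quantize a 2-D weight matrix to int4 with one scale per output channel."""
    # Symmetric int4 range is [-8, 7]; divide by 7 so the max magnitude fits.
    scales = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # stored as int4
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int4_per_channel(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())
```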
Training Data and Methodology
Gemma 3 1B was trained on a corpus of 2 trillion tokens spanning more than 140 languages, drawn from web documents, code, and mathematics; the larger, vision-equipped Gemma 3 variants additionally train on image data. The dataset is deliberately balanced to improve the representation of non-English languages. A rigorous data filtering process mitigates the presence of sensitive material, personal data, and low-quality content, with additional safeguards to decontaminate benchmark evaluation sets.
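The exact decontamination procedure is not spelled out here; a common approach, sketched below under that caveat, is to drop training documents that share long n-grams with benchmark test sets.

```python
# Hypothetical sketch of n-gram decontamination, a common technique for
# removing benchmark leakage from training corpora. The method actually
# used for Gemma 3 is not specified in the text above.
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc: str, benchmark_ngrams: set, n: int) -> bool:
    """Flag a training document if any n-gram also appears in a benchmark."""
    return not ngrams(doc, n).isdisjoint(benchmark_ngrams)

N = 8  # n-gram length; production pipelines often use longer spans
test_set = ["the quick brown fox jumps over the lazy dog near the river bank"]
bench = set()
for ex in test_set:
    bench |= ngrams(ex, N)

doc = "students memorized the quick brown fox jumps over the lazy dog verbatim"
print(is_contaminated(doc, bench, N))  # True: an 8-gram overlaps the test set
```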
The model's tokenizer is the same SentencePiece tokenizer used by Gemini 2.0, supporting a vocabulary of 262,000 entries, and is specifically optimized for multilingual coverage. All model variants underwent knowledge distillation, in which a smaller "student" model is trained to replicate the token-level output distributions of a more capable "teacher" model. After the initial pre-training phase, instruction-tuned versions are further refined using a combination of curated instruction-following datasets, reinforcement learning objectives (including learning from human and automated feedback), and advanced data filtering strategies to promote safe and factually grounded model outputs.
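A minimal sketch of the distillation objective follows, assuming a standard KL-divergence formulation between teacher and student next-token distributions; the actual Gemma 3 training recipe may differ in details such as temperature and how teacher targets are sampled.

```python
# Minimal sketch of token-level knowledge distillation: the student is
# trained to match the teacher's next-token distribution via KL divergence.
# The temperature value and tensor shapes here are illustrative, not the
# actual Gemma 3 training setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary.

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probs for the input and probs for the target.
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

# Example with random logits over a toy vocabulary:
student = torch.randn(2, 16, 32)
teacher = torch.randn(2, 16, 32)
print(distillation_loss(student, teacher))
```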
Multimodal and Multilingual Capabilities
Within the Gemma 3 family, multimodality means processing both text and image inputs to generate text outputs; this applies to the 4B, 12B, and 27B models, while Gemma 3 1B itself accepts text only. In the vision-equipped variants, images undergo normalization and are represented as compact token sequences for efficient integration within the transformer architecture, and at inference the P&S algorithm segments high-resolution or non-square images for encoding. These capabilities support applications such as visual question answering, image captioning, and extraction of structured data from images.
The multilingual design of the model is reflected in its broad language coverage, supporting text understanding and generation in over 140 languages. Language data are explicitly balanced during training to enhance coverage for low-resource and non-English languages, as reported in the Gemma documentation. The instruction-tuning process further incorporates multilingual objectives to enhance generalization across diverse linguistic tasks.
Performance and Evaluation
Extensive benchmarking characterizes the performance of Gemma 3 1B across reasoning, factuality, and multilingual tasks. For reasoning and factuality, the pre-trained model scores 62.3% on HellaSwag (10-shot), 63.2% on BoolQ (0-shot), and 73.8% on PIQA (0-shot), as detailed in the Gemma 3 technical report. On multilingual benchmarks, it achieves 24.9% on Global-MMLU-Lite and 43.9% on XQuAD.
Instruction-tuned versions demonstrate improvements on tasks that emphasize instruction-following and code generation: on the MBPP (code generation) and GSM8K (mathematical reasoning) benchmarks, Gemma 3 1B scores 35.2% and 62.8%, respectively. Multimodal benchmarks are reported for the larger, vision-equipped Gemma 3 variants; the 1B model is evaluated on text-only tasks.
Model performance generally scales with size across the Gemma 3 family: the larger models (Gemma 3 4B, 12B, and 27B) consistently outperform their smaller counterparts but require more computational resources. Gemma 3 1B, by contrast, is designed for deployment in resource-constrained environments while retaining strong capabilities within the 32,000-token context window supported by its optimized attention architecture.
Use Cases, Limitations, and Responsible Development
Typical applications for Gemma 3 1B include content creation, conversational AI (chatbots), knowledge extraction from text and images, summarization, educational research, and language learning. Its size makes it suitable for deployment on personal hardware and in constrained computational environments.
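As an illustration of small-scale deployment, the sketch below loads the instruction-tuned checkpoint through the Hugging Face transformers library. The model id google/gemma-3-1b-it and the pipeline setup are assumptions about the hosting environment rather than details from this article, and they require a transformers release recent enough to include Gemma 3 support.

```python
# Hypothetical quickstart, assuming the instruction-tuned weights are
# published on Hugging Face under the id "google/gemma-3-1b-it".
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device_map="auto",   # falls back to CPU if no GPU is available
)

messages = [{"role": "user", "content": "Summarize why sliding-window "
             "attention reduces memory use in long-context models."}]
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"])
```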
The model's limitations stem primarily from constraints in data coverage and output reliability, as is typical of generative large language models. Potential shortcomings include sensitivity to ambiguous instructions, difficulty with nuanced or figurative language, lapses in factual consistency, and an inherent risk of reproducing memorized training content. Extensive evaluations suggest that, despite an emphasis on safety, certain limitations persist, such as incomplete coverage of specific domains and low performance on specialized knowledge (e.g., Chemical, Biological, Radiological, and Nuclear risks), as discussed in the Gemma 3 report's sections on limitations and evaluation. Mitigations include advanced data filtering, decontamination of benchmark sets, reinforcement learning from human and automated feedback, and continual monitoring for harmful or unsafe outputs.
Responsible use of Gemma models is governed by dedicated usage terms and a prohibited use policy. Users are encouraged to reference the Responsible Generative AI Toolkit for best practices in safety evaluation and deployment.
Model Availability and Resources
Gemma 3 1B is released under a custom Google license, with both pre-trained and instruction-tuned weights accessible through official distribution channels. The official Gemma documentation page provides comprehensive technical details, while the Gemma 3 technical report offers in-depth analysis and benchmarking. Quantized versions accompany the model for efficient deployment, and users are encouraged to consult the technical and legal resources for details on model training, architecture, and responsible use.
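For a rough sense of what quantized deployment can look like, the sketch below uses generic post-training 4-bit loading via the bitsandbytes integration in transformers. Note this is not the same as the official QAT checkpoints, which ship in their own per-channel/per-block int4 and switched-fp8 formats; it is simply one common way to shrink the inference footprint of the standard weights.

```python
# Generic 4-bit loading sketch using bitsandbytes post-training quantization.
# This is NOT Gemma's QAT format; it is a common alternative for reducing
# memory use when serving the standard checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "google/gemma-3-1b-it"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```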
Helpful Links