The simplest way to self-host QwQ-32B: launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
QwQ-32B is a 32.5B-parameter reasoning-focused model that competes with much larger alternatives through specialized reinforcement learning. Built on Qwen2.5-32B, it was trained with outcome-based reward RL scaling, initially targeting math and coding. With a 131K-token context length and strong performance on reasoning tasks, it works best with temperature 0.6 and TopP 0.95.
QwQ-32B is a medium-sized reasoning model in the Qwen series that delivers exceptional reasoning capabilities despite its relatively modest parameter count. Developed as part of the Qwen family of models, QwQ-32B achieves performance comparable to much larger models such as DeepSeek-R1, which has 671 billion parameters (with 37 billion activated). This impressive feat demonstrates the effectiveness of reinforcement learning (RL) when applied to robust foundation models.
The model is designed specifically for enhanced performance in downstream tasks, especially complex problems requiring sophisticated reasoning. It effectively competes with state-of-the-art reasoning models like DeepSeek-R1 and o1-mini. What makes QwQ-32B particularly noteworthy is the integration of agent-related capabilities into the reasoning model, enabling it to think critically, utilize tools, and adapt its reasoning based on environmental feedback.
QwQ-32B is open-weight and available on Hugging Face and ModelScope under the Apache 2.0 license, making it accessible for research and development purposes. It's also available for interactive use via Qwen Chat.
QwQ-32B is a causal language model based on Qwen2.5-32B, featuring a transformer-based architecture with several key technical components: Rotary Position Embedding (RoPE), SwiGLU activations, RMSNorm, and attention QKV bias. The model comprises 32.5 billion parameters (31.0 billion non-embedding) across 64 layers, and uses grouped-query attention (GQA) with 40 query heads and 8 key-value heads.
The training process for QwQ-32B included multiple stages: pretraining, supervised fine-tuning, and reinforcement learning (RL).
The reinforcement learning approach used for QwQ-32B is particularly noteworthy and differentiates it from many other models. QwQ-32B uses an RL scaling approach driven by outcome-based rewards. The initial stage focuses on math and coding tasks, using an accuracy verifier for math problems and a code execution server for coding tasks instead of traditional reward models. A second stage of RL is applied for general capabilities, trained with rewards from a general reward model and rule-based verifiers.
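To make the outcome-based reward idea concrete, the sketch below illustrates the general pattern. It is not the actual Qwen training code: verify_math_answer and run_test_suite are hypothetical placeholders standing in for the accuracy verifier and code execution server described above, and the binary reward is a simplification.

# Illustrative sketch of outcome-based rewards (not the actual QwQ-32B training code).
# The verifier functions are hypothetical placeholders for the accuracy verifier
# and code-execution server described in the text.

def verify_math_answer(model_output: str, reference_answer: str) -> bool:
    # Placeholder: a real verifier would extract the final \boxed{} answer and
    # compare it numerically or symbolically against the reference.
    return model_output.strip().endswith(reference_answer)

def run_test_suite(generated_code: str, test_cases: list) -> bool:
    # Placeholder: a real setup would execute the code in a sandbox
    # and check every predefined test case.
    return False

def outcome_reward(task: dict, model_output: str) -> float:
    """Reward depends only on the verified outcome, not on a learned reward model."""
    if task["type"] == "math":
        return 1.0 if verify_math_answer(model_output, task["reference_answer"]) else 0.0
    if task["type"] == "code":
        return 1.0 if run_test_suite(model_output, task["test_cases"]) else 0.0
    # Stage-two general tasks instead use a general reward model plus rule-based verifiers.
    return 0.0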
This RL-enhanced training significantly improved several capabilities, including mathematical reasoning, coding proficiency, instruction following, alignment with human preferences, and agent performance.
Due to its architecture and extensive context window of 131,072 tokens, QwQ-32B can handle very long inputs, making it suitable for complex reasoning tasks that require consideration of extensive context.
QwQ-32B's performance has been evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results highlight QwQ-32B's impressive performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
Notably, despite having significantly fewer parameters than models like DeepSeek-R1 (671B), QwQ-32B achieves comparable or superior performance on many benchmarks. This efficiency demonstrates the effectiveness of the reinforcement learning approach used in training QwQ-32B.
Detailed evaluation results and performance metrics can be found in the QwQ-32B blog post and the Qwen2 Technical Report.
To get the most out of QwQ-32B, the following usage guidelines are recommended:
To encourage the model to think through problems carefully before responding, ensure that prompts instruct the model to start its output with "<think>\n". This encourages the model to display its reasoning process.
For consistent results, use standardized prompts. For example, when working with mathematical problems, include instructions like "Please reason step by step, and put your final answer within \boxed{}."
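As a concrete illustration of these guidelines, the following is a minimal sketch using the standard Hugging Face transformers chat-template workflow with the recommended sampling settings (temperature 0.6, TopP 0.95). The example prompt and generation length are arbitrary choices, not requirements.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Example math prompt using the standardized instruction from the guidelines above.
prompt = "Solve x^2 - 5x + 6 = 0. Please reason step by step, and put your final answer within \\boxed{}."
messages = [{"role": "user", "content": prompt}]

# apply_chat_template with add_generation_prompt=True opens the assistant turn,
# so the model begins its reply with its reasoning block as described above.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Recommended sampling settings: temperature 0.6, TopP 0.95.
output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)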
For inputs exceeding 32,768 tokens, consider enabling YaRN (Yet Another RoPE extensioN method) to improve long-sequence information capture. This can be done by adding the following configuration to config.json:
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
Note that using vLLM for deployment is recommended. vLLM supports static YaRN, but this may impact performance on shorter texts; it is therefore advised to add the rope_scaling configuration only when processing long contexts is required. The YaRN method is described in detail in this paper.
QwQ-32B requires a recent version of Hugging Face's transformers library because it is based on Qwen2.5; with versions below 4.37.0, loading the model fails with KeyError: 'qwen2'. Users should ensure they have the most recent version installed before using the model.
For efficient deployment, vLLM is recommended. Detailed guidance on deploying Qwen models using vLLM can be found in the Qwen documentation. Speed and throughput benchmarks are also available in the speed benchmark documentation.
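As an illustrative sketch of querying a vLLM deployment, the snippet below assumes the model has been served locally with vLLM's OpenAI-compatible server (for example via "vllm serve Qwen/QwQ-32B"); the port, model name, and prompt are assumptions based on vLLM defaults, not requirements.

# Assumes a local vLLM OpenAI-compatible server, e.g. started with:
#   vllm serve Qwen/QwQ-32B
# Port 8000 is the vLLM default; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[{"role": "user", "content": "What is 17 * 24? Please reason step by step, and put your final answer within \\boxed{}."}],
    temperature=0.6,  # recommended sampling settings from the usage guidelines
    top_p=0.95,
)
print(response.choices[0].message.content)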