The simplest way to self-host QwQ-32B: launch a dedicated cloud GPU server running Laboratory OS to download and serve the model using any compatible app or framework.
Download model weights for local inference. Must be used with a compatible app, notebook, or codebase. May run slowly, or not work at all, depending on your system resources, particularly GPU(s) and available VRAM.
QwQ-32B is a 32.5B-parameter reasoning-focused model that competes with much larger alternatives through specialized reinforcement learning. Built on Qwen2.5-32B, it was trained with outcome-based reward RL scaling, initially targeting math and coding. With a 131K-token context length and strong performance on reasoning tasks, it works best with temperature 0.6 and TopP 0.95.
QwQ-32B is a medium-sized reasoning model in the Qwen series that delivers exceptional reasoning capabilities despite its relatively modest parameter count. Developed as part of the Qwen family of models, QwQ-32B achieves performance comparable to much larger models such as DeepSeek-R1, which has 671 billion parameters (with 37 billion activated). This impressive feat demonstrates the effectiveness of reinforcement learning (RL) when applied to robust foundation models.
The model is designed specifically for enhanced performance in downstream tasks, especially complex problems requiring sophisticated reasoning. It effectively competes with state-of-the-art reasoning models like DeepSeek-R1 and o1-mini. What makes QwQ-32B particularly noteworthy is the integration of agent-related capabilities into the reasoning model, enabling it to think critically, utilize tools, and adapt its reasoning based on environmental feedback.
QwQ-32B is open-weight and available on Hugging Face and ModelScope under the Apache 2.0 license, making it accessible for research and development purposes. It's also available for interactive use via Qwen Chat.
QwQ-32B is a causal language model based on Qwen2.5-32B, featuring a transformer-based architecture with several key technical components: Rotary Position Embedding (RoPE), SwiGLU activations, RMSNorm, and attention QKV bias. The model comprises 32.5 billion parameters (31.0 billion non-embedding) across 64 layers, and uses grouped-query attention (GQA) with 40 query heads and 8 key-value heads.
The training process for QwQ-32B included multiple stages: pretraining, supervised fine-tuning, and reinforcement learning (RL).
The reinforcement learning approach used for QwQ-32B is particularly noteworthy and differentiates it from many other models. QwQ-32B uses an RL scaling approach driven by outcome-based rewards. The initial stage focuses on math and coding tasks, using an accuracy verifier for math problems and a code execution server for coding tasks instead of traditional reward models. A second stage of RL is applied for general capabilities, trained with rewards from a general reward model and rule-based verifiers.
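To make the outcome-based reward idea concrete, the sketch below illustrates the general pattern. It is not the actual Qwen training code: verify_math_answer and run_test_suite are hypothetical placeholders standing in for the accuracy verifier and code execution server described above, and the binary reward is a simplification.

# Illustrative sketch of outcome-based rewards (not the actual QwQ-32B training code).
# The verifier functions are hypothetical placeholders for the accuracy verifier
# and code-execution server described in the text.

def verify_math_answer(model_output: str, reference_answer: str) -> bool:
    # Placeholder: a real verifier would extract the final \boxed{} answer and
    # compare it numerically or symbolically against the reference.
    return model_output.strip().endswith(reference_answer)

def run_test_suite(generated_code: str, test_cases: list) -> bool:
    # Placeholder: a real setup would execute the code in a sandbox
    # and check every predefined test case.
    return False

def outcome_reward(task: dict, model_output: str) -> float:
    """Reward depends only on the verified outcome, not on a learned reward model."""
    if task["type"] == "math":
        return 1.0 if verify_math_answer(model_output, task["reference_answer"]) else 0.0
    if task["type"] == "code":
        return 1.0 if run_test_suite(model_output, task["test_cases"]) else 0.0
    # Stage-two general tasks instead use a general reward model plus rule-based verifiers.
    return 0.0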
This RL-enhanced training significantly improved several capabilities, including mathematical reasoning, coding proficiency, instruction following, alignment with human preferences, and agent performance.
Due to its architecture and extensive context window of 131,072 tokens, QwQ-32B can handle very long inputs, making it suitable for complex reasoning tasks that require consideration of extensive context.
QwQ-32B's performance has been evaluated across a range of benchmarks designed to assess its mathematical reasoning, coding proficiency, and general problem-solving capabilities. The results highlight QwQ-32B's impressive performance in comparison to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.
Notably, despite having significantly fewer parameters than models like DeepSeek-R1 (671B), QwQ-32B achieves comparable or superior performance on many benchmarks. This efficiency demonstrates the effectiveness of the reinforcement learning approach used in training QwQ-32B.
Detailed evaluation results and performance metrics can be found in the QwQ-32B blog post and the Qwen2 Technical Report.
To get the most out of QwQ-32B, the following usage guidelines are recommended:
To encourage the model to think through problems carefully before responding, ensure that prompts instruct the model to start its output with "<think>\n". This encourages the model to display its reasoning process.
For consistent results, use standardized prompts. For example, when working with mathematical problems, include instructions like "Please reason step by step, and put your final answer within \boxed{}."
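As a concrete illustration of these guidelines, the following is a minimal sketch using the standard Hugging Face transformers chat-template workflow with the recommended sampling settings (temperature 0.6, TopP 0.95). The example prompt and generation length are arbitrary choices, not requirements.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Example math prompt using the standardized instruction from the guidelines above.
prompt = "Solve x^2 - 5x + 6 = 0. Please reason step by step, and put your final answer within \\boxed{}."
messages = [{"role": "user", "content": prompt}]

# apply_chat_template with add_generation_prompt=True opens the assistant turn,
# so the model begins its reply with its reasoning block as described above.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Recommended sampling settings: temperature 0.6, TopP 0.95.
output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)
response = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)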
For inputs exceeding 32,768 tokens, consider enabling YaRN (Yet Another RoPE extensioN method) to improve long-sequence information capture. This can be done by adding the following configuration to config.json:
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
Note that using vLLM for deployment is recommended. vLLM supports static YaRN, but this may impact performance on shorter texts; it is therefore advised to add the rope_scaling configuration only when processing long contexts is required. The YaRN method is described in detail in this paper.
QwQ-32B requires a recent version of Hugging Face's transformers library because it is based on Qwen2.5; with versions below 4.37.0, loading the model fails with KeyError: 'qwen2'. Users should ensure they have the most recent version installed before using the model.
For efficient deployment, vLLM is recommended. Detailed guidance on deploying Qwen models using vLLM can be found in the Qwen documentation. Speed and throughput benchmarks are also available in the speed benchmark documentation.
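As an illustrative sketch of querying a vLLM deployment, the snippet below assumes the model has been served locally with vLLM's OpenAI-compatible server (for example via "vllm serve Qwen/QwQ-32B"); the port, model name, and prompt are assumptions based on vLLM defaults, not requirements.

# Assumes a local vLLM OpenAI-compatible server, e.g. started with:
#   vllm serve Qwen/QwQ-32B
# Port 8000 is the vLLM default; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[{"role": "user", "content": "What is 17 * 24? Please reason step by step, and put your final answer within \\boxed{}."}],
    temperature=0.6,  # recommended sampling settings from the usage guidelines
    top_p=0.95,
)
print(response.choices[0].message.content)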