Model Report
Alibaba Cloud / Qwen 2.5 Math 7B
Qwen 2.5 Math 7B is a 7.62-billion parameter language model developed by Alibaba Cloud that specializes in mathematical reasoning tasks in English and Chinese. The model employs chain-of-thought reasoning and tool-integrated approaches using Python interpreters for computational tasks. It demonstrates improved performance over its predecessor on mathematical benchmarks including MATH, GSM8K, and Chinese mathematics evaluations, achieving 83.6 on MATH using chain-of-thought methods.
Qwen 2.5 Math 7B is a large language model (LLM) developed to address complex mathematical reasoning tasks in both English and Chinese. As part of the Qwen2.5-Math series, this 7.62-billion parameter model is engineered for enhanced accuracy in mathematical problem-solving, leveraging advanced reasoning techniques and integrating external computational tools. The Qwen2.5-Math family comprises multiple model sizes, instruction-tuned variants, and a specialized mathematical reward model. The series reflects iterative advances over its predecessor, Qwen2-Math, offering improvements in data scale, bilingual capabilities, and benchmark performance.
Performance trajectory of mathematical LLMs, with Qwen2.5-Math-72B-Instruct demonstrating high performance on zero-shot MATH accuracy.
Qwen2.5-Math-7B builds upon the Qwen2.5 base model architecture, inheriting its general language understanding along with code and text reasoning capabilities. The model's mathematical proficiency is developed through a comprehensive training pipeline spanning pre-training, supervised fine-tuning (SFT), reward modeling, and reinforcement learning.
A distinguishing aspect of the Qwen2.5-Math series is its expanded and refined mathematical corpus. The pre-training data combines synthetic material generated with the earlier Qwen2-Math-72B-Instruct model and a large volume of domain-specific content in both English and Chinese, including web-sourced datasets, academic texts, and mathematical code repositories. The result is the Qwen Math Corpus v2, which exceeds 1 trillion tokens and supports a 4K context length.
The specialization pipeline includes instruction tuning on conversational data, training a 72B-parameter mathematical reward model used to select supervised fine-tuning data via rejection sampling, and post-training enhancements driven by both tool-integrated reasoning (TIR) and chain-of-thought (CoT) data generation.
Architectural overview of the Qwen2.5-Math development pipeline, detailing pre-training, supervised fine-tuning, reward modeling, and instruction-tuning workflows.
Qwen2.5-Math-7B employs two primary approaches for mathematical problem solving. The first, chain-of-thought (CoT), structures solutions via step-by-step logical reasoning, which can be applied to complex, multi-step problems. The second approach, tool-integrated reasoning (TIR), augments the model's symbolic and algorithmic computation abilities by embedding external tools within its workflow, notably leveraging a Python interpreter for code-based calculations.
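As a concrete illustration of how CoT prompting is typically invoked, the sketch below assembles a ChatML-style prompt by hand. The special tokens and the step-by-step system prompt follow the convention documented for Qwen chat models, but this is an assumption for illustration; in practice the tokenizer's `apply_chat_template` method is the authoritative source of the template.

```python
def build_qwen_prompt(system: str, user: str) -> str:
    """Assemble a ChatML-style prompt as used by Qwen chat models.
    Illustrative only: prefer tokenizer.apply_chat_template in real use."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_qwen_prompt(
    "Please reason step by step, and put your final answer within \\boxed{}.",
    "Find the value of $3^4 - 2^5$.",
)
print(prompt)
```

The trailing `<|im_start|>assistant\n` leaves the turn open so generation continues as the assistant's reply.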
TIR has resulted in performance gains, particularly in benchmarks requiring precise computation or symbolic manipulation. In addition, the instruction-tuned models provide conversational formats suited to interactive educational or tutoring settings, while base models are optimized for prompt completion and as foundations for further fine-tuning.
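A minimal sketch of one step of the TIR loop, assuming the model emits fenced `python` blocks whose printed output is fed back as an observation; the function name and response format here are illustrative, not the official interface.

```python
import re
import io
import contextlib

def run_tool_integrated_step(model_output: str) -> str:
    """Extract the first fenced python block from a model response,
    execute it, and return captured stdout as the tool observation."""
    match = re.search(r"```python\n(.*?)```", model_output, re.DOTALL)
    if match is None:
        return ""  # no code to run; treat the response as final
    code = match.group(1)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # NOTE: sandbox this in any real deployment
    return buffer.getvalue().strip()

# Simulated model turn for "What is 23 * 47?"
fake_response = "Let me compute this.\n```python\nprint(23 * 47)\n```"
observation = run_tool_integrated_step(fake_response)
print(observation)  # → 1081
```

In a full TIR session the observation would be appended to the conversation and the model prompted again until it produces a final answer.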
Example of Tool-Integrated Reasoning: The model outlines a solution plan, generates Python code, and executes it to solve a math problem.
Qwen2.5-Math-7B demonstrates improved performance over earlier iterations across multiple benchmarks in both English and Chinese, assessed on datasets such as GSM8K, MATH, MMLU STEM, CMATH, and GaoKao Math QA. In CoT settings, it outperforms its Qwen2-Math-7B predecessor by 5.0 points on MATH and by 12.2 points on Chinese high school math QA.
Instruction-tuned Qwen2.5-Math-Instruct variants have been evaluated on problems from OlympiadBench, AIME 2024, and AMC 2023. The 7B Instruct model registered MATH benchmark scores of 83.6 (CoT) and 85.3 (TIR). The 72B model, Qwen2.5-Math-72B-Instruct, reached a score of 92.9 on MATH (TIR RM@8), demonstrating performance comparable to or exceeding that of several closed-source models.
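The RM@8 setting can be read as best-of-n selection: sample n = 8 candidate solutions per problem and keep the one the reward model scores highest. A toy sketch of that selection step, with a hypothetical reward function standing in for the 72B reward model:

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate the reward model scores highest,
    as in the RM@8 evaluation setting (n = 8 samples per problem)."""
    return max(candidates, key=reward_fn)

# Toy reward: prefer answers that present a boxed result (hypothetical heuristic)
candidates = ["The answer is 4.", "Thus \\boxed{4}.", "Maybe 5?"]
toy_reward = lambda text: 1.0 if "\\boxed{" in text else 0.0
print(best_of_n(candidates, toy_reward))
```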
Comparison of MATH accuracy by model size, with Qwen2.5-Math models noted for their performance-to-size ratio.
Chinese-language benchmarks also reflect favorable results, with Qwen2.5-Math-7B demonstrating high accuracy across GaoKao, CMATH, and CN Middle School 24 tasks.
On competition benchmarks such as AIME 2024 and AMC 2023, the Qwen2.5-Math-Instruct models likewise demonstrate effective problem-solving capabilities.
Performance of instruction-tuned Qwen2.5-Math-Instruct models on English mathematical benchmarks.
Qwen2.5-Math-7B comprises roughly 7.62 billion parameters and stores its weights in the bfloat16 (BF16) tensor type for efficient computation. Built on the Qwen/Qwen2.5-7B foundation, it is offered as a base model alongside an instruction-tuned variant designed for dialogue-based tutoring and interactive use. The Qwen2.5-Math series was publicly released in September 2024, following the initial launch of Qwen2-Math in August 2024, and is distributed openly for research and development.
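A back-of-envelope sketch of why the BF16 weights matter for deployment: at 2 bytes per parameter, the weights alone occupy roughly 14 GiB of VRAM, before accounting for activations or the KV cache.

```python
# Rough VRAM needed just for the model weights at different precisions
params = 7.62e9  # parameter count from the model card
bytes_per_param = {"fp32": 4, "bf16": 2, "int8": 1}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: {params * nbytes / 2**30:.1f} GiB")
```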
The series also includes a 72B-parameter reward model (Qwen2.5-Math-RM-72B), employed for supervised data selection and reinforcement learning via Group Relative Policy Optimization (GRPO).
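The core of GRPO is a group-relative advantage: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group, removing the need for a separately learned value baseline. A minimal sketch of just that normalization (the reward values are toy numbers):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Normalize each response's reward against its own group's
    mean and standard deviation (sketch of the core GRPO idea only)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Four sampled solutions to one problem, scored by the reward model
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Responses scoring above their group mean receive positive advantage and are reinforced; those below are suppressed.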
Limitations and Decontamination
Qwen2.5-Math-7B is specialized for mathematical reasoning in English and Chinese using chain-of-thought and tool-integrated approaches. Its application to general text tasks beyond mathematics is not recommended. While CoT enhances stepwise reasoning, it remains limited in direct computation and some algorithmic scenarios, which TIR partially addresses.
To ensure the integrity of performance benchmarks, comprehensive decontamination procedures were implemented throughout the data pipeline. Potentially overlapping training and test samples were identified and excluded using 13-gram matching and longest common subsequence ratios, reducing bias especially for benchmarks like GSM8K, MATH, Minerva Math, Olympiad Bench, and national mathematics exams. Specialized filtering was also conducted for post-training and supervised datasets, excluding not only matched sample problems but also problems with similar concepts.
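The 13-gram check can be sketched as a set intersection over token n-grams; the actual pipeline also applies longest-common-subsequence ratio thresholds to catch near-duplicates that exact n-gram matching misses, which this simplified version omits.

```python
def ngrams(tokens, n=13):
    """All contiguous n-token windows of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_text, test_text, n=13):
    """Flag a training sample that shares any 13-gram with a test sample."""
    return bool(ngrams(train_text.split(), n) & ngrams(test_text.split(), n))

test_problem = ("If x plus y equals ten and x minus y equals two "
                "find x and y values now")
exact_copy = "Problem: " + test_problem
unrelated = ("Compute the derivative of sine of x with respect to x "
             "using the limit definition today")
print(is_contaminated(exact_copy, test_problem))  # → True
print(is_contaminated(unrelated, test_problem))   # → False
```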
References and Further Reading
For further technical details, model code, benchmarks, and demonstration environments, consult the following resources:
For citation, please refer to the official publication:
Yang, A. et al., "Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement", arXiv preprint arXiv:2409.12122 (2024).