WizardLM 70B is a large language model developed through a collaboration between Microsoft and Peking University, part of the broader WizardLM family of models. Released in August 2023, it is designed to follow complex and nuanced instructions, an ability it develops through a distinctive data-construction and fine-tuning methodology built around an automated instruction-evolution pipeline.
Data Generation and Evol-Instruct Methodology
A defining aspect of WizardLM 70B is its data generation pipeline, known as Evol-Instruct. Unlike conventional manual curation of instruction datasets, Evol-Instruct utilizes a large language model to autonomously and iteratively rewrite a base set of instructions into increasingly complex forms. This process consists of several components: the Instruction Evolver produces new prompts by making instructions more challenging or covering a broader array of topics, the Response Generation module outputs answers to these evolved prompts, and the Elimination Evolving step removes ineffective or problematic instructions.
The evolutionary process begins with an initial collection of instructions, such as the 52k dataset from Alpaca. Through multiple rounds (typically four), the Instruction Evolver transforms these seed instructions, introducing additional constraints, deeper reasoning steps, or new conceptual challenges. The resulting expanded dataset is subjected to quality filters to ensure that only meaningful and answerable instructions remain. Ultimately, the instruction set grows in both size and diversity, providing a robust foundation for fine-tuning.
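The loop described above can be sketched in a few lines. The template wording, the `call_llm` stub, and the validity heuristics are illustrative assumptions; the real Evol-Instruct prompts and elimination criteria are considerably more elaborate.

```python
import random

# Hypothetical evolution prompts; the real in-depth and in-breadth templates
# add constraints, deepen reasoning, or shift to rarer topics.
IN_DEPTH = "Add one more constraint to the following instruction:\n{inst}"
IN_BREADTH = "Write a new instruction on a rarer topic, inspired by:\n{inst}"

def call_llm(prompt):
    # Stand-in for a real LLM API call; it merely tags the prompt so the
    # control flow can be demonstrated end to end.
    return "EVOLVED: " + prompt.splitlines()[-1]

def is_valid(instruction, response):
    # Elimination Evolving (sketch): drop evolutions that degenerate,
    # e.g. yield an empty answer or merely echo the instruction.
    return bool(response.strip()) and instruction != response

def evol_instruct(seed_instructions, rounds=4):
    pool = list(seed_instructions)
    frontier = list(seed_instructions)
    for _ in range(rounds):
        evolved = []
        for inst in frontier:
            template = random.choice([IN_DEPTH, IN_BREADTH])
            new_inst = call_llm(template.format(inst=inst))  # Instruction Evolver
            response = call_llm(new_inst)                    # Response Generation
            if is_valid(new_inst, response):                 # Elimination Evolving
                evolved.append(new_inst)
        pool.extend(evolved)
        frontier = evolved  # only the newest generation is evolved further
    return pool

corpus = evol_instruct(["Explain photosynthesis."], rounds=4)
```

Note that each round evolves only the previous round's output, so a seed instruction accumulates complexity across the four generations while every intermediate form is kept in the pool.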
Model Architecture and Training
WizardLM 70B leverages the Llama 2 70B architecture as its foundational base. On top of this, the model undergoes instruction fine-tuning on the curated dataset produced by the Evol-Instruct pipeline. To balance prompt complexity and diversity, the final fine-tuning corpus combines both initial and evolved instructions, with randomized sampling across difficulty levels.
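The mixing of initial and evolved instructions might be sketched as below. The equal-per-level draw is an assumption, since the exact sampling scheme is not specified; the idea is only that each evolution round contributes roughly one difficulty level.

```python
import random

def build_finetune_corpus(initial, evolved_by_round, target_size, seed=0):
    """Draw an equal number of instructions from the seed set and from each
    evolution round (round index roughly tracks difficulty), then shuffle.
    The equal allocation per level is an illustrative assumption."""
    rng = random.Random(seed)
    levels = [initial] + evolved_by_round
    per_level = target_size // len(levels)
    corpus = []
    for level in levels:
        corpus.extend(rng.sample(level, min(per_level, len(level))))
    rng.shuffle(corpus)  # randomize difficulty ordering across the corpus
    return corpus

seeds = [f"seed_{i}" for i in range(100)]
rounds = [[f"round{r}_{i}" for i in range(100)] for r in range(4)]
mixed = build_finetune_corpus(seeds, rounds, target_size=50)
```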
The conversational prompt format adopted for WizardLM 70B closely mirrors that used by Vicuna, structuring interactions as extended dialogs between a user and an AI assistant. For example: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions. USER: Hi ASSISTANT: Hello.</s> USER: Who are you? ASSISTANT: I am WizardLM.</s>..."
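The template quoted above can be rendered programmatically. This is a minimal sketch of the Vicuna-style format; the helper name and turn representation are assumptions, but the system preamble, role labels, and `</s>` terminators follow the example in the text.

```python
def format_vicuna_prompt(turns):
    """Render a multi-turn conversation in the Vicuna-style format used by
    WizardLM 70B. `turns` is a list of (user, assistant) pairs; pass None as
    the final assistant string to leave the prompt open for generation."""
    system = ("A chat between a curious user and an artificial intelligence "
              "assistant. The assistant gives helpful, detailed, and polite "
              "answers to the user's questions.")
    parts = [system]
    for user, assistant in turns:
        parts.append(f" USER: {user} ASSISTANT:")
        if assistant is not None:
            # Completed assistant turns end with the </s> end-of-sequence token.
            parts.append(f" {assistant}</s>")
    return "".join(parts)

prompt = format_vicuna_prompt([("Hi", "Hello."), ("Who are you?", None)])
```

Ending the string with a bare `ASSISTANT:` cues the model to produce the next assistant turn.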
Training took place over three epochs, employing DeepSpeed ZeRO-3 distributed training strategies to efficiently manage model parameters, gradients, and optimizer states at scale.
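A minimal ZeRO-3 configuration illustrating this setup might look as follows. WizardLM's exact training configuration is not published, so the batch sizes and precision settings below are illustrative assumptions; only the ZeRO stage reflects the text.

```python
import json

# Sketch of a DeepSpeed ZeRO-3 config. Stage 3 partitions parameters,
# gradients, and optimizer states across workers, which is what makes
# fine-tuning a 70B model tractable.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,   # assumed value
    "gradient_accumulation_steps": 8,      # assumed value
    "bf16": {"enabled": True},             # assumed precision
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,              # overlap communication with compute
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

print(json.dumps(ds_config, indent=2))
```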
Datasets and Experimental Design
The dataset construction for WizardLM 70B begins with the Alpaca 52k instruction dataset. The Evol-Instruct pipeline applies four rounds of evolution to each original prompt, generating approximately 250,000 unique instructions—each a product of randomly chosen in-depth or in-breadth evolution prompts. To enable comparison with other models such as Vicuna, a subset of 70,000 instructions is selected for the final fine-tuning process.
Executing the full evolution required an estimated 624,000 API calls, underscoring the computational scale underlying the construction of such a diverse and challenging dataset. The diversity and complexity resulting from this approach have been shown to improve downstream instruction-following capabilities.
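One back-of-envelope accounting consistent with that figure is sketched below. The three-calls-per-step breakdown is an assumption, not a published breakdown; it simply notes that 52,000 seeds evolved for four rounds, with one call each to evolve, respond, and filter, totals 624,000 calls.

```python
SEED_INSTRUCTIONS = 52_000   # Alpaca seed set
ROUNDS = 4                   # rounds of evolution

# Assumed three API calls per evolution step: one to evolve the instruction,
# one to generate its response, and one for the elimination/quality check.
CALLS_PER_STEP = 3

evolution_steps = SEED_INSTRUCTIONS * ROUNDS       # 208,000 evolved prompts
total_calls = evolution_steps * CALLS_PER_STEP     # 624,000 API calls
print(f"{total_calls:,} API calls")
```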
Performance and Evaluation
WizardLM 70B has been extensively benchmarked on evaluation tasks spanning language reasoning, mathematics, and code generation. In automated benchmark testing it scored 63.32 on MMLU, 64.52 on ARC, and 83.21 on HellaSwag, along with 70.61 on GSM8k and 42.1 pass@1 on HumanEval.
The model's Hugging Face repository reports alternative figures: 7.78 on MT-Bench, 92.91% on AlpacaEval, 77.6% on GSM8k, and 50.6 pass@1 on HumanEval. Ablation studies confirm the effectiveness of the Evol-Instruct approach, showing improved performance over models fine-tuned on non-evolved or less diverse instruction sets.
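The HumanEval figures above are pass@1 scores. pass@k is conventionally computed with the unbiased estimator introduced alongside HumanEval, which can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (of which c pass the unit tests)
    solves the problem. With n = k = 1 this reduces to the fraction of
    problems whose single sample passes."""
    if n - c < k:
        return 1.0  # too few failures left for k draws to all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 5 correct -> pass@1 estimate of 0.5
estimate = pass_at_k(n=10, c=5, k=1)
```

A model's reported score is this quantity averaged over all problems in the benchmark.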
Limitations
Despite its advancements, WizardLM 70B faces certain limitations. Both automatic evaluation with models such as GPT-4 and human assessment raise concerns about scalability and reliability, and evaluation coverage may not fully reflect the breadth of real-world language understanding. Additionally, because some test sets, such as WizardEval, are proprietary in composition, certain domains or application areas may not be thoroughly represented in the standard evaluation procedures.
Licensing and Family Variants
WizardLM 70B is released under the terms of the Llama 2 License. The WizardLM family encompasses a variety of specialized models, including WizardCoder for code generation and WizardMath for advanced mathematical reasoning. These models, available in multiple sizes and with instruction sets targeting domain-specific challenges, demonstrate the adaptability of the Evol-Instruct methodology across diverse large language model architectures.