Model Report
microsoft / Phi-2 2.7B
Phi-2 is a 2.7 billion parameter Transformer-based language model developed by Microsoft Research and released in December 2023. The model was trained on approximately 1.4 trillion tokens using a "textbook-quality" data approach, incorporating synthetic data from GPT-3.5 and filtered web sources. Phi-2 demonstrates competitive performance in reasoning, language understanding, and code generation tasks compared to larger models in its parameter class.
Phi-2 is a Transformer-based language model with 2.7 billion parameters, developed by Microsoft Research and released in December 2023. As part of the Phi series of small language models (SLMs), Phi-2 emphasizes efficiency in reasoning, language understanding, and code generation, delivering competitive results within its parameter class.
Satya Nadella announces Phi-2 at Microsoft Ignite 2023, highlighting its characteristics in the small language model domain.
Phi-2 adopts a Transformer architecture with a maximum input context of 2048 tokens. Training used PyTorch as the primary deep learning framework, with DeepSpeed for efficient distributed training and Flash-Attention to accelerate attention computation.
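As an illustration of how such a model is typically loaded for local inference, the following minimal sketch uses the Hugging Face Transformers API; the microsoft/phi-2 checkpoint identifier, the fp16 memory estimate, and the generation settings are assumptions for this example rather than details taken from the report.

```python
# Minimal inference sketch using the Hugging Face Transformers API.
# Assumes the "microsoft/phi-2" checkpoint on the Hugging Face Hub and a recent
# transformers release with native Phi support; older releases may require
# trust_remote_code=True when loading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly 5-6 GB of VRAM at fp16 for 2.7B parameters
    device_map="auto",
)

# Phi-2's maximum input context is 2048 tokens, so long prompts must be truncated.
prompt = "Instruct: Explain why the sky is blue in one paragraph.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```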
A distinguishing aspect of Phi-2's development is the emphasis on "textbook-quality" data for pretraining, a strategy informed by the research direction outlined in "Textbooks Are All You Need" and its sequel "Textbooks Are All You Need II: phi-1.5 technical report". Building on the approach used for its predecessor Phi-1.5, the Phi-2 model leverages knowledge transfer methods to incorporate previously learned capabilities, thereby accelerating convergence and improving performance across evaluation benchmarks.
Phi-2 was trained on approximately 1.4 trillion tokens, drawn over multiple passes from a curated corpus of 250 billion "textbook-quality" tokens. This corpus includes synthetic natural language data generated with Azure OpenAI GPT-3.5 as well as filtered web data from the Falcon RefinedWeb and SlimPajama projects. Quality selection and filtering of this data were further guided by evaluations using Azure OpenAI GPT-4.
Technical Performance and Benchmarks
Phi-2 demonstrates strong performance across core areas such as commonsense reasoning, language understanding, and multi-step tasks in coding and mathematics. Compared to other open-source models under 13 billion parameters, it matches or exceeds them on several benchmarks.
Phi-2 (green) compared to its predecessor Phi-1.5 (blue) on commonsense reasoning, language understanding, math/coding, and BigBench-Hard benchmarks. The charts illustrate Phi-2’s improvements across all evaluated domains.
On aggregate grouped benchmarks, Phi-2 achieves a score of 59.2 on BigBench-Hard, 68.8 on commonsense reasoning, 62.0 on language understanding, 61.1 on math, and 53.7 on coding, demonstrating performance exceeding that of Llama-2 7B, Llama-2 13B, and Mistral-7B in several categories. Additionally, Phi-2 demonstrates comparable or superior performance to the Gemini Nano 2 model in domains such as question answering and code generation, despite its smaller parameter count.
Practical examples from evaluation show Phi-2’s ability to handle complex problem-solving and error correction tasks. For instance, Phi-2 is capable of both deriving step-by-step solutions to physics problems and diagnosing errors in student-submitted calculations.
Phi-2 provides a detailed, step-by-step solution to a physics problem, demonstrating its reasoning and calculation skills.
Phi-2 identifies and corrects a student's mistake in a physics calculation in a chat setting, showcasing error correction and educational assistance capabilities.
Phi-2 is designed as a general-purpose language model, with strengths in question answering, multi-turn chat, and code generation. Its underlying training corpus and architecture facilitate high performance in educational settings, research exploration, and development scenarios requiring reasoning or code synthesis. Typical applications include providing concise answers to direct questions, participating in conversational dialogue, and generating code—most notably in Python, as Python was predominant in the coding subset of the training data.
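To illustrate these usage patterns, the sketch below shows three prompt styles commonly used with Phi-2 as a base (non-instruction-tuned) model: a QA-style "Instruct:/Output:" format, a named-speaker chat format, and plain code completion. The specific strings, checkpoint identifier, and generation settings are illustrative assumptions, not prescribed templates from the report.

```python
# Illustrative prompt patterns for Phi-2 as a base (non-instruction-tuned) model.
# The QA, chat, and code-completion styles below are examples of common usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

prompts = {
    # QA style: pose an instruction and let the model complete the "Output:" line.
    "qa": "Instruct: Write a one-sentence summary of photosynthesis.\nOutput:",
    # Chat style: alternate named speakers and let the model continue the dialogue.
    "chat": (
        "Alice: I'm stuck converting 30 degrees to radians. Any hint?\n"
        "Bob: Remember that 180 degrees equals pi radians.\n"
        "Alice: So 30 degrees is pi/6 radians?\n"
        "Bob:"
    ),
    # Code style: start a Python function and let the model complete the body.
    "code": 'def is_prime(n: int) -> bool:\n    """Return True if n is prime."""\n',
}

for name, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=120, do_sample=False)
    print(f"== {name} ==")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```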
The model is also intended as a platform for studying methods for safe deployment of language models, especially with respect to minimizing toxicity, reducing societal biases, and addressing controllability. As an open research tool, Phi-2 supports investigation into the technical and ethical challenges of deploying SLMs with robust and responsible behaviors.
Limitations and Safety Considerations
Despite its capabilities, Phi-2 carries important limitations common to base language models. It may generate inaccurate factual statements or code, and because it has not been instruction-tuned, it can mishandle nuanced or complex prompts. The model's code generation is predominantly in Python, with limited reliability outside the standard library and common packages; verification is recommended when using outputs for less common tasks or languages.
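As one simple way to act on that recommendation, the sketch below performs a syntax-level sanity check on generated Python; it is an illustration of basic verification under the assumptions above, not a full validation workflow, and it does not establish that the code is actually correct.

```python
# Minimal sketch of a sanity check for model-generated Python code.
# A syntax check catches only the most basic failures; correctness still
# requires review and tests.
import ast

def looks_like_valid_python(generated: str) -> bool:
    """Return True if the generated text parses as Python source."""
    try:
        ast.parse(generated)
        return True
    except SyntaxError:
        return False

snippet = "def add(a, b):\n    return a + b\n"
print(looks_like_valid_python(snippet))  # True for syntactically valid code
```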
Phi-2’s responses are primarily optimized for standard English. Informal English, slang, and other languages may result in misunderstanding or lower quality outputs. While careful filtering and "textbook-quality" bias mitigation strategies were applied during training, the model can still exhibit societal biases or generate toxic content if explicitly prompted, as evidenced by ToxiGen benchmark evaluations.
Phi-2’s safety evaluation on the ToxiGen benchmark (green) versus Phi-1.5 (blue) and Llama-2 7B (gray) across demographic and condition-based categories.
As a research release, Phi-2 is not recommended for production use without additional safety and accuracy evaluations, as outlined in the model documentation.
Development History and Model Family
Phi-2 is the third major release in the Phi series, following Phi-1 and Phi-1.5. Phi-1 established a strong baseline in Python coding tasks, while Phi-1.5 expanded coverage to general reasoning and language comprehension, leveraging "textbook-quality" data. Phi-2 builds upon its predecessor by integrating scaled knowledge transfer and enhancing benchmark performance through improvements in data quality, filtering, and training efficiency. The culmination of these developments was formally announced at Microsoft Ignite 2023.
The Phi-2 model is licensed under the MIT license, promoting broad research and development use.