Model Report
MosaicML / MPT-7B
MPT-7B is a 6.7 billion parameter decoder-only transformer model developed by MosaicML and trained on 1 trillion tokens of English text and code. The model uses FlashAttention for efficient attention computation and ALiBi for extended context handling, enabling variants such as MPT-7B-StoryWriter-65k+ to process contexts of 65,000 tokens and beyond. Released under the Apache 2.0 license, it serves as a foundation for further fine-tuning across various applications.
The Mosaic Pretrained Transformer 7B (MPT-7B) is a large-scale, decoder-style transformer model introduced by MosaicML on May 5, 2023. Distinguished by its open-source, commercially usable license and by optimizations for efficient training and inference, MPT-7B is the foundational member of the MPT model family. It was pretrained on 1 trillion tokens of English text and code and is designed to serve as a base for further fine-tuning across varied downstream tasks.
Promotional illustration for MPT-7B, a foundational model series developed by MosaicML.
MPT-7B employs a decoder-only transformer architecture, tailored for stability and high performance. Central to its design are two major innovations: the use of FlashAttention for accelerated attention computations and Attention with Linear Biases (ALiBi) for efficient handling of both short and very long sequences. Unlike traditional transformers that rely on positional embeddings, ALiBi allows the model to extrapolate to much longer contexts than those seen during training while maintaining memory efficiency.
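The mechanism behind ALiBi is compact enough to sketch directly. The following is a simplified, self-contained illustration of the idea, per-head slopes scaling a distance-proportional penalty added to raw attention scores, rather than MPT-7B's actual implementation; the slope formula shown assumes a power-of-two head count.

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper,
    # simplified here for power-of-two head counts.
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Signed distance between key position j and query position i (negative for past keys).
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]        # shape (seq_len, seq_len)
    slopes = alibi_slopes(n_heads).view(n_heads, 1, 1)
    # The bias is added to the attention scores; no learned positional embeddings are needed,
    # which is what lets the model extrapolate to longer sequences at inference time.
    return slopes * distance                                   # shape (n_heads, seq_len, seq_len)

# Usage (with a causal mask applied as usual):
#   scores = q @ k.transpose(-1, -2) / head_dim ** 0.5 + alibi_bias(n_heads, seq_len)
```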
The model comprises approximately 6.7 billion parameters, organized in 32 layers with 32 attention heads per layer and a model dimension of 4096. Bias terms are omitted throughout, and an optional LayerNorm on the query and key projections (QK LayerNorm) improves training stability. MPT-7B was trained with the LION optimizer, selected for its stable update magnitudes and smaller optimizer state memory footprint compared to the more commonly used AdamW.
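For reference, the reported hyperparameters can be collected into a small configuration sketch. The field names below are descriptive placeholders rather than the exact keys of MosaicML's released configuration files.

```python
# Approximate architectural hyperparameters reported for MPT-7B (illustrative only).
mpt_7b_config = {
    "d_model": 4096,        # model (hidden) dimension
    "n_layers": 32,         # transformer blocks
    "n_heads": 32,          # attention heads per layer
    "max_seq_len": 2048,    # pretraining sequence length
    "vocab_size": 50432,    # padded to a multiple of 128
    "no_bias": True,        # bias terms omitted throughout
    "alibi": True,          # ALiBi positional biases instead of learned embeddings
    "attn_impl": "flash",   # FlashAttention kernel
    "qk_ln": True,          # optional QK LayerNorm; shown enabled purely for illustration
}
```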
Training Data and Methods
MPT-7B was trained on a mixture of large-scale public datasets totaling 1 trillion tokens, encompassing both English natural language and code. This training set size is comparable to that used for other leading open-source models. Key data sources include mC4, C4 with semantic deduplication, the RedPajama CommonCrawl and Wikipedia splits, The Stack code corpus, S2ORC scientific papers, and several others.
Proportional breakdown of the data sources used in pretraining MPT-7B.
Efficient training was made possible through the StreamingDataset library, which streams data during training and enables seamless resumption after interruptions. Tokenization uses the EleutherAI GPT-NeoX-20B tokenizer, chosen for its robust handling of code and text, consistent space delimitation, and improved compression of whitespace sequences. The vocabulary size of 50,432 is a multiple of 128, which yields marginal improvements in model FLOP utilization (MFU), as highlighted by Karpathy.
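As a quick illustration, the same tokenizer can be loaded through Hugging Face Transformers, assuming the hub ID EleutherAI/gpt-neox-20b and an installed transformers package; note that the vocabulary the tokenizer reports is padded up to 50,432 in the model's embedding table.

```python
from transformers import AutoTokenizer

# MPT-7B reuses the GPT-NeoX-20B tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

print(len(tokenizer))  # base vocabulary; padded to 50,432 inside MPT-7B itself
# Whitespace runs in code compress into dedicated tokens:
print(tokenizer.tokenize("def add(a, b):\n    return a + b"))
```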
Training Procedure and Infrastructure
The pretraining of MPT-7B was conducted on a cluster of 440 NVIDIA A100-40GB GPUs. The run lasted approximately 9.5 days and used a batch size of 1,760 samples, each with a sequence length of 2,048 tokens. Training employed sharded data parallelism via FullyShardedDataParallel (FSDP) and was managed on the MosaicML platform, which provided automated fault tolerance. The pretraining run cost an estimated 200,000 USD.
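As a rough illustration of the sharded data parallelism strategy, the sketch below wraps a small stand-in module in PyTorch's FullyShardedDataParallel. It is not MosaicML's training code: the single encoder layer stands in for the full 6.7B-parameter model, AdamW replaces LION for brevity, and the script is assumed to be launched with torchrun.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets the environment variables init_process_group needs.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Tiny stand-in for the real transformer; FSDP shards parameters,
    # gradients, and optimizer state across ranks.
    model = torch.nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
    model = FSDP(model.cuda())

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # MPT-7B itself used LION
    x = torch.randn(2, 256, 4096, device="cuda")                # dummy batch
    loss = model(x).pow(2).mean()                               # dummy objective
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```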
A summary of training costs, system configuration, and runtime for MPT-7B and its variants.
The platform's fault-tolerant design handled hardware interruptions automatically, without manual intervention, as documented in the training logbook and reflected in the smooth recovery visible on the loss curve.
Excerpt from the MPT-7B training log, showing the system's automated response to hardware failures.
MPT-7B was evaluated across 11 standard academic tasks using the LLM Foundry evaluation framework, ensuring that prompt strings or prompt tuning did not bias results. The model's zero-shot performance is competitive with other open-source models of similar scale, including LLaMA-7B, Pythia, and StableLM.
Example of MPT-7B generating a tweet and responding to user queries, demonstrating its text generation and conversation abilities.
Through the integration of FlashAttention and infrastructure adaptations like NVIDIA's FasterTransformer, MPT-7B delivers fast inference—reported as 1.5x to 2x faster than comparable models out-of-the-box.
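One way to try the released checkpoint is through Hugging Face Transformers, following the pattern published in the model card. This is a hedged sketch rather than the only supported path: FasterTransformer deployment is configured separately, and the optional Triton attention line is commented out because it requires a compatible GPU software stack.

```python
import torch
import transformers

name = "mosaicml/mpt-7b"
# The custom MPT model code lives in the repository, hence trust_remote_code=True.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
# config.attn_config["attn_impl"] = "triton"  # optional faster attention kernel

model = transformers.AutoModelForCausalLM.from_pretrained(
    name, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

inputs = tokenizer("MosaicML is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)  # a GPU is strongly recommended
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```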
Variants in the MPT Family
The foundational MPT-7B model serves as a base for fine-tuning toward targeted applications. A notable variant is MPT-7B-StoryWriter-65k+, finetuned for long-form fiction, which demonstrates extended context handling of 65,000 tokens and beyond using ALiBi. This capacity is illustrated by generating entirely new story endings from substantial input context.
Comparison of maximum input lengths for major LLMs, highlighting MPT-7B-StoryWriter-65k+ support for extremely long inputs.
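Because ALiBi uses no learned positional embeddings, the usable context window can be raised at load time by overriding the configured maximum sequence length. The sketch below follows the pattern shown in the public MPT-7B-StoryWriter-65k+ model card; the hub ID and the specific max_seq_len value are assumptions of that pattern rather than guaranteed settings.

```python
import transformers

name = "mosaicml/mpt-7b-storywriter"
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 83968  # extrapolate beyond the 65k tokens seen during finetuning

model = transformers.AutoModelForCausalLM.from_pretrained(
    name, config=config, trust_remote_code=True
)
```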
Each variant is released under a specific license, detailed in their respective documentation, with the base MPT-7B under Apache 2.0 to allow broad commercial usage.
Applications and Limitations
MPT-7B is intended as a base model for further finetuning rather than for direct deployment in human-facing applications. Its strengths include serving as a starting point for developing custom language models in research and industry settings, and enabling tasks that require long-context handling, such as summarizing large documents or analyzing legal texts.
Users should be aware that, as with all generative large language models, MPT-7B can produce factually incorrect or undesired outputs and is not suitable for unsupervised deployment without appropriate finetuning or safety mechanisms. Its outputs may reflect biases present in public data sources. For any production or commercial use, due diligence with respect to the licensing terms and further finetuning is recommended.