The simplest way to self-host Phi-3.5 Vision Instruct is to launch a dedicated cloud GPU server running Lab Station OS, which downloads and serves the model for use with any compatible app or framework.
Alternatively, download the model weights for local inference. The weights must be used with a compatible app, notebook, or codebase, and may run slowly, or not at all, depending on your system resources, particularly your GPU(s) and available VRAM.
Phi-3.5-vision-instruct is a 4.2B parameter multimodal model that processes both images and text. It excels at OCR, chart analysis, and multi-image comparison tasks. The model combines an image encoder with the Phi-3 Mini language model and was trained on 500B tokens including filtered web data and image-text pairs.
The Phi-3.5-vision-instruct model represents a significant advancement in lightweight multimodal AI, combining powerful visual understanding capabilities with efficient architecture. As part of Microsoft's Phi-3.5 model family released in August 2024, this 4.2B parameter model demonstrates competitive performance against larger models while maintaining a smaller footprint.
The model's architecture integrates several key components: an image encoder, a connector, a projector, and the Phi-3 Mini language model.
With a 128K-token context length, the model uses flash attention by default, which requires GPUs that support it, such as the NVIDIA A100, A6000, or H100. Training covered 500B tokens of data and took 6 days on 256 A100-80G GPUs.
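As a rough illustration, the sketch below loads the model with the Hugging Face Transformers library under those assumptions (a flash-attention-capable GPU, plus the transformers, torch, and flash-attn packages installed); exact arguments may vary by environment.

```python
# Minimal loading sketch for microsoft/Phi-3.5-vision-instruct.
# Assumes transformers, torch, and flash-attn are installed and a
# flash-attention-capable GPU (e.g. A100, A6000, H100) is available.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# trust_remote_code is required because the checkpoint ships custom modeling code;
# on GPUs without flash attention, _attn_implementation can be set to "eager".
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
)

# num_crops=4 is the suggested setting for multi-frame inputs (16 for single-frame).
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4,
)
```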
The model's training incorporated diverse data sources: filtered publicly available web documents, high-quality educational and code data, image-text interleaved data, and newly created synthetic "textbook-like" data.
Privacy considerations were paramount during training, with personal data being removed or scrubbed. The model excels at several key tasks: general image understanding, optical character recognition (OCR), chart and table understanding, multiple-image comparison, and multi-image or video clip summarization.
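To illustrate the multi-image comparison use case, here is a hedged sketch of a two-image prompt. It assumes the model and processor objects from the loading sketch above; the image file names are placeholders for your own inputs.

```python
# Two-image comparison sketch. Assumes `model` and `processor` were created
# as in the loading sketch above; chart_q1.png and chart_q2.png are
# placeholder file names for your own images.
from PIL import Image

images = [Image.open("chart_q1.png"), Image.open("chart_q2.png")]

# Each image is referenced in the prompt by a numbered placeholder token.
messages = [
    {
        "role": "user",
        "content": "<|image_1|>\n<|image_2|>\nCompare these two charts and summarize the differences.",
    }
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs,
    max_new_tokens=500,
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Drop the prompt tokens before decoding so only the answer is returned.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```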
When compared to other models such as LLaVA-Interleave-Qwen-7B, InternVL-2-4B, and larger models, Phi-3.5-vision-instruct shows strong performance across benchmarks including MMMU, MMBench, TextVQA, BLINK, and Video-MME. While GPT-4 variants sometimes outperform it, the model demonstrates particular strength in multi-frame reasoning and video summarization tasks.
For optimal performance, users should set num_crops=4 for multi-frame inputs and num_crops=16 for single-frame inputs, as in the processor sketch below.
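A small sketch of those two settings follows; the variable names are purely illustrative, and only the num_crops argument changes between the two configurations.

```python
# Sketch of the two suggested processor configurations; the variable names
# are illustrative, and the model id is the official checkpoint.
from transformers import AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

# Multi-frame inputs (several images or sampled video frames): num_crops=4.
multi_frame_processor = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True, num_crops=4
)

# Single-frame inputs (one higher-resolution image): num_crops=16.
single_frame_processor = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True, num_crops=16
)
```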
The Phi-3.5 family includes several variants: Phi-3.5-mini-instruct, Phi-3.5-MoE-instruct, and Phi-3.5-vision-instruct. While Phi-3.5-mini excels at multilingual support with capabilities in over 20 languages, Phi-3.5-vision focuses specifically on visual understanding tasks. The MoE variant uses 16 experts of 3.8B parameters each, activating 6.6B parameters during inference for enhanced performance.