The simplest way to self-host Playground v2 Aesthetic is to launch a dedicated cloud GPU server running Lab Station OS, then download and serve the model using any compatible app or framework.
Alternatively, download the model weights for local inference. The weights must be used with a compatible app, notebook, or codebase, and may run slowly or not at all depending on your system resources, particularly GPU(s) and available VRAM.
Playground v2 Aesthetic generates 1024x1024 images using a dual text encoder system (OpenCLIP-ViT/G and CLIP-ViT/L). The model outperformed SDXL in user studies and achieved an FID score of 7.07 on the MJHQ-30K benchmark. The model family is available in 256px, 512px, and 1024px variants.
Playground v2 Aesthetic is a text-to-image diffusion model that generates highly aesthetic 1024x1024 pixel images. The model was trained from scratch by the Playground research team and shares architectural similarities with Stable Diffusion XL (SDXL). It employs two fixed, pre-trained text encoders, OpenCLIP-ViT/G and CLIP-ViT/L, to process text inputs.
The model is part of a family that includes intermediate base models at 256px and 512px resolutions, which were released to support research in compute-limited environments. These base models serve as stepping stones to the full 1024px aesthetic model, providing researchers with access to different stages of the training process.
Playground v2 was evaluated against existing models, particularly Stable Diffusion XL. In user studies involving over 2,600 prompts and thousands of participants, Playground v2's outputs were preferred 2.5 times more frequently than SDXL's.
To evaluate aesthetic quality systematically, the team introduced a new benchmark called MJHQ-30K. This benchmark uses FID (Fréchet Inception Distance) scores calculated on a high-quality dataset of 30,000 Midjourney images across 10 categories. Lower FID indicates that generated images are statistically closer to the reference set. On this benchmark, Playground v2 achieved an overall FID of 7.07, compared with 9.55 for SDXL-1-0-refiner.
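As a rough illustration of what an FID score measures, the sketch below computes the Fréchet distance between two sets of feature vectors using NumPy. It assumes the features have already been extracted (FID conventionally uses Inception-v3 activations), and the function name frechet_distance is illustrative. This is not the evaluation code used for MJHQ-30K.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between two (n_samples, n_features) feature arrays.

    Each set is modelled as a Gaussian with mean mu and covariance cov:
        d = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 (cov_r cov_g)^(1/2))
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. Those eigenvalues are real and
    # non-negative in exact arithmetic; clip small numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of approximately zero, and the value grows as the means or covariances of the two sets drift apart, which is why a lower score is better.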
When using Playground v2 with Hugging Face Diffusers, a guidance_scale parameter value of 3.0 is recommended for optimal results. The model is available under the Playground v2 Community License, which allows for commercial use.
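A minimal sketch of that Diffusers usage follows. The repository id in MODEL_ID, the fp16 variant, and the helper name generate are assumptions not stated in the text above, so check them against the model's Hugging Face page before relying on them. The imports are deferred into the function so that the helper can be defined on a machine without torch or diffusers installed.

```python
# Assumed Hugging Face repository id for the 1024px aesthetic model.
MODEL_ID = "playgroundai/playground-v2-1024px-aesthetic"
# Guidance scale recommended for this model.
GUIDANCE_SCALE = 3.0

def generate(prompt, device="cuda"):
    """Load the pipeline and return one PIL image for the given prompt."""
    # Heavy dependencies are imported lazily, only when an image is requested.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # half precision to reduce VRAM use
        use_safetensors=True,
        variant="fp16",             # assumes fp16 weights are published
    )
    pipe.to(device)
    result = pipe(prompt=prompt, guidance_scale=GUIDANCE_SCALE)
    return result.images[0]
```

Calling generate("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").save("astronaut.png") would download the weights on first use and write a single image to disk. It reloads the pipeline on every call, so for repeated generation the pipeline object should be created once and reused.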
The model weights are available in the safetensors format, making them compatible with popular interfaces like Automatic1111 or ComfyUI. For researchers and developers working with limited computational resources, the intermediate base models (256px and 512px variants) provide lighter alternatives while maintaining core capabilities.
The release of Playground v2 represents a significant advancement in text-to-image generation, particularly in terms of aesthetic quality and image-text alignment. The model's superior performance on both human preference studies and the MJHQ-30K benchmark establishes new standards for image generation quality.
The public release of intermediate checkpoints from different training stages, along with the MJHQ-30K benchmark dataset, provides valuable resources for the research community to advance the field further.