LLaMA 65B is a large language model created by Meta AI and released in February 2023 as part of the LLaMA (Large Language Model Meta AI) model suite. With 65.2 billion parameters, it is the largest model in the LLaMA family, which also comprises variants with 7B, 13B, and 33B parameters. Developed to promote open research and scientific transparency, LLaMA models are intended primarily for research use, broadening access to advanced natural language processing technology for academic and scientific communities. The design and training of LLaMA 65B emphasize efficiency, versatility, and exclusive reliance on publicly available data sources, distinguishing the model from many proprietary systems.
Architecture and Training Techniques
LLaMA 65B is built on the transformer architecture, a widely adopted foundation for large-scale language models, and incorporates a range of architectural optimizations to improve efficiency and training stability. Notably, it uses RMSNorm pre-normalization: following GPT-3, the input of each transformer sub-layer is normalized rather than its output, using the RMSNorm function introduced by Zhang and Sennrich (2019), which helps stabilize training.
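The idea can be illustrated with a minimal PyTorch sketch; the class below is a generic RMSNorm implementation rather than LLaMA's actual code, and the pre-normalization pattern is indicated only in comments:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (Zhang & Sennrich, 2019)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain, no bias term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS of the features instead of subtracting
        # a mean and dividing by a variance as in standard LayerNorm.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

# Pre-normalization: the norm is applied to the *input* of each sub-layer,
# while the residual connection bypasses it, e.g.
#   h   = x + attention(rms_norm(x))
#   out = h + feed_forward(rms_norm(h))
```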
The activation function is SwiGLU, which replaces the standard ReLU to improve non-linear representational capacity, following its use in PaLM. In addition, LLaMA 65B removes absolute positional encodings in favor of rotary positional embeddings (RoPE), introduced by Su et al. (2021) and popularized by projects such as GPT-Neo, to improve the model's handling of token positions.
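A hedged sketch of a SwiGLU feed-forward block in PyTorch is shown below, assuming the gated formulation from Shazeer (2020); the layer names and hidden-dimension choice are illustrative rather than taken from LLaMA's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block using the SwiGLU activation."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        # Three projections, no biases: a gate branch, a linear branch,
        # and a down-projection back to the model dimension.
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: a SiLU (Swish) gate multiplied elementwise with a linear branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```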
Key hyperparameters include a model dimension of 8192, 64 attention heads per layer, and 80 transformer layers, giving 65.2 billion trainable parameters in total. The implementation leverages the xformers library and numerous systems-level optimizations, including memory-efficient attention, activation checkpointing, and model/sequence parallelism, to enable scalable and efficient distributed training.
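For orientation, the reported hyperparameters can be gathered into a small, purely illustrative configuration object (the class and field names are hypothetical, not part of any released code):

```python
from dataclasses import dataclass

@dataclass
class LLaMA65BConfig:
    """Illustrative container for the hyperparameters reported above."""
    dim: int = 8192             # hidden (model) dimension
    n_layers: int = 80          # transformer layers
    n_heads: int = 64           # attention heads per layer
    head_dim: int = 8192 // 64  # 128 dimensions per head
    n_params: float = 65.2e9    # total trainable parameters (approximate)
```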
Training Corpus and Methodology
A defining aspect of LLaMA 65B is its training data strategy, which emphasizes the exclusive use of publicly available datasets. The model was trained on a diverse corpus totaling 1.4 trillion tokens, spanning numerous sources to ensure wide-ranging knowledge and linguistic coverage. The primary components of the training mixture include English CommonCrawl, the C4 dataset, publicly available code from GitHub, Wikipedia articles, book corpora from Gutenberg and Books3, scientific documents from arXiv, and textual content from StackExchange.
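For illustration, the approximate sampling proportions reported for this mixture can be written out as a simple table in code; the percentages below follow those reported in the LLaMA paper (rounded), and the variable name is hypothetical:

```python
# Approximate sampling proportions of the LLaMA pre-training mixture
# (as reported in the LLaMA paper; values are rounded fractions of the corpus).
llama_pretraining_mixture = {
    "CommonCrawl (English)":      0.670,
    "C4":                         0.150,
    "GitHub":                     0.045,
    "Wikipedia":                  0.045,
    "Books (Gutenberg + Books3)": 0.045,
    "arXiv":                      0.025,
    "StackExchange":              0.020,
}
assert abs(sum(llama_pretraining_mixture.values()) - 1.0) < 1e-9
```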
During data curation, each source underwent dedicated preprocessing steps to filter out non-English content, remove low-quality data, and eliminate redundancies. For example, code datasets were filtered for permissible licenses, and books were deduplicated at the volume level. Most data was used for a single epoch, while significant sources such as Wikipedia and Books were included for approximately two epochs to increase coverage.
Tokenization uses the byte-pair encoding (BPE) algorithm, with numbers split into individual digits and unknown characters decomposed into bytes for improved generality. The model was trained with the AdamW optimizer and a cosine learning rate schedule, together with gradient clipping and a reduced amount of activation recomputation during the backward pass. Training on the full 1.4 trillion-token corpus took approximately 21 days, with the implementation processing around 380 tokens per second per GPU on a cluster of 2048 A100 GPUs.
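A minimal sketch of this optimizer setup in PyTorch is shown below; the specific values (betas, weight decay, warmup length, peak learning rate, clipping norm) follow those reported in the LLaMA paper, while the surrounding code is illustrative only:

```python
import math
import torch

def cosine_lr(step: int, max_steps: int, warmup_steps: int = 2000,
              min_ratio: float = 0.1) -> float:
    """Cosine decay with linear warmup; the final LR is min_ratio * peak LR."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine

model = torch.nn.Linear(1024, 1024)  # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: cosine_lr(s, max_steps=100_000))

# Inside the training loop, gradients are clipped before each optimizer step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```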
Performance and Evaluation
LLaMA 65B demonstrates competitive results across a range of standard natural language understanding benchmarks. In common sense reasoning, it outperforms Chinchilla-70B on all reported benchmarks except BoolQ, and surpasses PaLM-540B on all except BoolQ and WinoGrande. On closed-book question answering datasets such as NaturalQuestions and TriviaQA, LLaMA 65B achieves high accuracy in both zero-shot and few-shot evaluation settings.
The model also exhibits notable capabilities in reading comprehension, mathematical reasoning (GSM8k and MATH), and code generation, even without domain-specific fine-tuning. On HumanEval and MBPP, benchmarks commonly used for program synthesis evaluation, LLaMA 65B reaches pass@1 scores of 23.7 and 37.7, respectively, outperforming general-purpose language models such as LaMDA and PaLM that were not fine-tuned for code.
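The pass@1 figures quoted above follow the standard unbiased pass@k estimator used with such benchmarks; a small sketch, where n candidate programs are sampled per problem and c of them pass the test cases:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k): probability that at least one of k draws passes.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct out of 10 generated programs, evaluated at k = 1.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```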
On the Massive Multitask Language Understanding (MMLU) benchmark, the model reaches an average accuracy of 63.4% in the 5-shot setting. Its MMLU performance nonetheless trails models such as Chinchilla-70B and PaLM-540B, which may reflect the comparatively limited amount of books and academic papers in its pre-training data. On metrics aimed at evaluating truthfulness and factuality, such as TruthfulQA, LLaMA 65B gives more truthful responses than GPT-3, though a notable rate of hallucinated or incorrect answers remains.
Limitations and Considerations
LLaMA 65B shares many of the inherent challenges observed in other large language models. The generation of outputs reflecting bias, toxicity, or misinformation remains an unsolved problem, as the model can inadvertently learn and propagate patterns present in the training data. For instance, evaluations indicate that toxicity levels tend to increase with model size, as shown by higher RealToxicityPrompts scores in LLaMA 65B compared to its smaller counterparts.
Regarding societal biases, LLaMA 65B demonstrates evidence of gender bias in pronoun disambiguation tasks, performing less accurately on cases where grammatical or occupational gender associations defy common stereotypes. The tendency for hallucinations or fabrication in factual queries is also observable, impacting areas such as closed-book QA and benchmarks demanding precise truthfulness.
Another consideration is the model's exclusive reliance on openly accessible data, which, while enhancing transparency and reproducibility, may limit performance on certain academic or specialized benchmarks. These limitations are documented further in the model card accompanying the release.
Environmental Footprint and Release
The training of large models like LLaMA 65B entails a considerable computational and environmental cost. For LLaMA 65B, the estimated energy consumption was 449 megawatt-hours, corresponding to approximately 173 tCO2eq of carbon emissions. These estimates are based on the power consumption of the A100 GPUs used for training, a typical data center power usage effectiveness, and the US national average carbon intensity of the electricity grid.
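A rough back-of-the-envelope check, assuming the US national average grid carbon intensity of about 0.385 tCO2eq per MWh used in the LLaMA paper's methodology, shows that the two reported figures are consistent:

```python
energy_mwh = 449    # reported training energy for LLaMA 65B (MWh)
intensity = 0.385   # assumed carbon intensity, tCO2eq per MWh (US national average)
emissions = energy_mwh * intensity
print(f"{emissions:.0f} tCO2eq")  # ~173 tCO2eq, matching the reported estimate
```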
LLaMA 65B was officially released on February 24, 2023, under a non-commercial research license. Model access is considered on a case-by-case basis for academic, governmental, and civil society researchers. This release strategy reflects a commitment to responsible dissemination and use in line with ethical and scientific guidelines.
Family Models and Future Directions
LLaMA 65B is accompanied by smaller, more computationally accessible variants with 7B, 13B, and 33B parameters. Each model in the LLaMA suite is trained on at least 1 trillion tokens, ensuring substantial linguistic and topical coverage across the family. Notably, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being more than ten times smaller, illustrating the benefits of efficient training on large datasets together with contemporary architectural refinements.
For ongoing developments and successor models, users may reference the Llama 2 release page, which introduces further advances in model scaling and capabilities.