Everything We Know About GPT-4

by Stephen M. Walker II, Co-Founder / CEO

Top tip

Q3 2023: This model card aggregates the publicly available GPT-4 research – we stand on the shoulders of giants (and their data points) in this analysis.

GPT-4 – 2023's State of the Art LLM

GPT-4 represents a major leap forward in large language model capabilities. Developed by OpenAI, it builds on the architecture and strengths of GPT-3 while achieving new levels of scale and performance.

With GPT-4, OpenAI's objective was to create a model over 10x larger than GPT-3. This required not just greater training compute, but entirely new approaches to model architecture and inference serving.

Some key facts about GPT-4:

  • Total parameters: ~1.8 trillion (over 10x more than GPT-3)
  • Architecture: Uses a mixture of experts (MoE) model to improve scalability
  • Training compute: Trained on ~25,000 Nvidia A100 GPUs over 90-100 days
  • Training data: Trained on a dataset of ~13 trillion tokens
  • Inference compute: Runs on clusters of 128 A100 GPUs for efficient deployment
  • Context length: Supports up to 32,000 tokens of context

Top tip

Review GPT-4 outputs carefully before use, as the model can generate harmful, biased, or factually incorrect text without proper oversight.

GPT-4 Model Card

Model Details

Parameter | Detail
Organization | OpenAI
Model name | GPT-4
Model type | Transformer with Mixture-of-Experts
Parameters | ~1.8 trillion
Context window | 8,000 to 32,000 tokens
Launch date | March 2023
Current version | 1.1 (Release 06.13)
Training dataset | ~13 trillion tokens (web text, books, other)

Compute

Compute | Detail
Training | 90 days on ~25,000 Nvidia A100 GPUs
Inference | Clusters of 128 A100 GPUs

Training Data

Parameter | Detail
Data sources | CommonCrawl, WebText2, books, Wikipedia, Reddit, Amazon reviews
Data volume | ~13 trillion tokens
Data prep | Deduplication, cleaning, filtering
Potential biases | Language, gender, race representation

API and Data Format

  • Chat Completion API
  • Multi-turn message types: System, Function, User, Assistant
  • JSONL fine-tuning format with message arrays
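
A minimal sketch of a multi-turn Chat Completion request using the 2023-era openai Python package (pre-1.0 interface). The role values mirror the message types listed above; the API key placeholder and prompt text are purely illustrative.

```python
import openai  # pip install openai (pre-1.0 interface shown here)

openai.api_key = "YOUR_API_KEY"  # placeholder key

# Multi-turn conversation using the role-based message format listed above.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain mixture-of-experts routing in one sentence."},
    ],
    temperature=0.7,
    max_tokens=128,
)

print(response["choices"][0]["message"]["content"])
```

For fine-tuning, each JSONL line carries a messages array in the same role-based format, though GPT-4 fine-tuning was not yet generally available at the time of writing (see Limitations below).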

Intended Use

  • Text generation
  • Question answering
  • Classification
  • Conversational agents

Factors

  • Language: English
  • Capabilities: Text generation, question answering, text classification
  • Modalities: Text
  • Ethical considerations: Potential for bias, harmful outputs, misuse

Metrics

  • Perplexity: Unknown
  • F1: Unknown
  • Accuracy: Unknown

Limitations

  • Fine-tuning not yet available (GA release targeted for October 2023)
  • Potential for harmful or biased outputs
  • Lack of grounded reasoning
  • Factually incorrect outputs
  • Presents mistakes with the same confidence as correct answers

Performance Controls

  • Temperature
  • Top-k sampling
  • Top-p sampling
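
Here is an illustrative sketch of how these controls act on a next-token distribution. Note that the hosted Chat Completion API exposes temperature and top_p directly; top-k is included only to illustrate the listed control, since it is not a parameter of the OpenAI API.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Illustrative decoding controls: temperature, top-k, and top-p (nucleus) filtering."""
    logits = np.asarray(logits, dtype=np.float64)

    # Temperature rescales the logits; lower values sharpen the distribution.
    logits = logits / max(temperature, 1e-8)

    # Top-k keeps only the k highest-scoring tokens.
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    # Convert to probabilities.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()

    # Top-p keeps the smallest set of tokens whose cumulative probability reaches p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()

    return int(np.random.choice(len(probs), p=probs))

# Example: a sharpened, nucleus-filtered draw over a toy 4-token vocabulary.
print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_p=0.9))
```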

Language Support

GPT-4 was tested on a translated version of the MMLU benchmark in 26 different languages. It outperformed GPT-3.5 and other LLMs in 24 of the 26 languages tested, including low-resource languages like Latvian, Welsh, and Swahili.

The Datacamp and MakeUseOf articles also note GPT-4's multilingual capabilities, with support for translation between English, French, German, Spanish, Chinese, Japanese, Korean, and more. Translated Labs points out that GPT-4's performance is uneven between English and other languages due to the predominance of English in its training data. Their T-LM product helps address this by translating prompts to extend GPT-4's capabilities to 200 languages.

Ethical Considerations

GPT-4 has the potential for misuse and harmful societal impacts. Review outputs carefully before use and do not treat them as statements of fact. For questions or concerns, contact safety@openai.com.

Model Architecture

GPT-4's architecture moves away from a single dense transformer. Instead, it uses a mixture-of-experts (MoE) design.

In the MoE architecture, there are separate expert neural networks that specialize in certain tasks or data types. For each inference query, the appropriate expert models are selected to handle that specific input.

This provides two major advantages:

  1. The overall model can scale up in size significantly, while only routing inference through a small subset of expert parameters for any given query. This keeps inference costs practical.

  2. The mixture of experts can develop specialized knowledge, improving overall capabilities.

Specifically, GPT-4 consists of:

  • 16 expert models, each with ~111B parameters
  • 2 experts are activated per inference query
  • 55B shared parameters for attention
  • Results in ~280B parameters used per inference pass

Top tip

It is likely that this architecture prevents a true temperature-0 setting: inference variance arises both from sampling and from routing across the mixture of experts. Additionally, GPU floating-point operations are non-associative, so results can depend on execution order. This theory was confirmed in a 1:1 discussion with Stephen Wolfram in September 2023.

This architecture allows GPT-4 to reach over 1.8 trillion parameters in total, while only utilizing several hundred billion per query.
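
The sketch below illustrates top-2 expert routing in a sparse MoE layer. The expert count and activation count match the figures above, but the layer sizes are tiny placeholders and the gating details are assumptions for illustration, since OpenAI has not published the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 16   # 16 expert models, as described above
TOP_K = 2          # 2 experts activated per token
D_MODEL = 64       # toy hidden size for illustration only

# Each "expert" stands in for a feed-forward block; here it is a single weight matrix.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D_MODEL, NUM_EXPERTS)) * 0.02   # gating network

def moe_layer(tokens):
    """Route each token to its top-2 experts and blend their outputs by gate weight."""
    gate_logits = tokens @ router                          # (n_tokens, NUM_EXPERTS)
    top2 = np.argsort(gate_logits, axis=-1)[:, -TOP_K:]    # indices of the 2 best experts
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        chosen = top2[i]
        weights = np.exp(gate_logits[i, chosen])
        weights /= weights.sum()                            # softmax over the selected experts
        for w, e in zip(weights, chosen):
            out[i] += w * (token @ experts[e])
    return out

batch = rng.standard_normal((8, D_MODEL))
print(moe_layer(batch).shape)   # (8, 64): each token touched only 2 of the 16 experts
```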

Training

Training a model as large as GPT-4 requires extensive computational resources. It pushed the limits of existing infrastructure.

Key facts about the GPT-4 training process:

  • Trained on ~25,000 Nvidia A100 GPUs simultaneously
  • The batch size increased over time, eventually reaching 60 million tokens
  • Trained for a total of 90-100 days continuously
  • Required 2.15e25 floating point operations (FLOPs) in total
  • Trained on a dataset of ~13 trillion tokens

To make this feasible, extensive parallelism techniques were used:

  • 8-way tensor parallelism to split each layer across GPUs
  • 15-way pipeline parallelism to split the model into sequential stages
  • Cluster topologies chosen to maximize inter-GPU bandwidth

The result was one of the largest compute jobs ever for an AI model.
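
As a back-of-the-envelope check, the reported figures are roughly self-consistent under the common ~6·N·D approximation for transformer training FLOPs, if N is taken as the ~280B parameters active per token (MoE) rather than the full 1.8T. This is an estimate, not an official figure.

```python
# Back-of-the-envelope consistency check using the common ~6 * N * D estimate
# for transformer training FLOPs, with N taken as the parameters active per
# token (MoE) rather than the full 1.8T.
active_params = 280e9   # ~280B parameters used per forward pass (from above)
train_tokens = 13e12    # ~13 trillion training tokens (from above)

estimated_flops = 6 * active_params * train_tokens
print(f"{estimated_flops:.2e} FLOPs")   # ~2.2e25, in line with the reported 2.15e25
```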

Inference

Deploying GPT-4 for inference at scale is a significant challenge due to its size and mixture of experts architecture. Efficient inference directly impacts costs.

Key facts about GPT-4 inference:

  • Runs on clusters of 128 A100 GPUs
  • Leverages 8-way tensor parallelism and 16-way pipeline parallelism
  • Carefully balances latency, throughput, and utilization
  • May use speculative decoding to improve throughput by 2-3x
  • Multi-query attention reduces memory needs for long contexts

Inference clusters are designed to maximize throughput and hardware utilization. This keeps costs lower per query.

There are still challenges in consistently batching queries across the different expert models, but overall the infrastructure can deploy GPT-4 effectively without pricing becoming prohibitive.
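
The list above mentions multi-query attention; the rough sizing below shows why it matters for long contexts: the K/V cache is shared across query heads, so per-token cache memory shrinks by roughly the number of heads. All dimensions here are illustrative assumptions, not published GPT-4 values.

```python
# Rough KV-cache sizing for one 32k-token context: standard multi-head
# attention (MHA) vs multi-query attention (MQA). All dimensions are
# illustrative assumptions, not published GPT-4 values.
layers     = 120
heads      = 96
head_dim   = 128
seq_len    = 32_000
bytes_each = 2          # fp16

def kv_cache_bytes(kv_heads):
    # 2 tensors (K and V) per layer, each seq_len x kv_heads x head_dim.
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_each

mha = kv_cache_bytes(heads)   # every query head keeps its own K/V head
mqa = kv_cache_bytes(1)       # one shared K/V head across all query heads

print(f"MHA cache: {mha / 1e9:.0f} GB, MQA cache: {mqa / 1e9:.1f} GB")
```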

Understanding Token Dropping in GPT-4

The mixture-of-experts (MoE) architecture used in GPT-4 relies on a token routing mechanism to determine which experts process each token. This can lead to certain tokens being "dropped" or unprocessed.

GPT-4 uses a simple top-2 token routing approach, where each token is sent to the 2 most likely experts according to the router. The experts themselves have a set capacity limit on how many tokens they can process per batch.

When aggregated across long input sequences and large batch sizes, the expert capacity is often exceeded, resulting in tokens being dropped. Counterintuitively, some level of dropping is actually beneficial for model performance and efficiency, as it prevents overloading experts.

The drops are non-deterministic: running the same prompt twice can lead to different drops each time, because tokens are batched with different traffic and dropped according to capacity. The model weights themselves remain deterministic.

While OpenAI could tweak expert capacity and reduce drops, this would substantially increase inference time and cost. The current tradeoff enables inexpensive deployment at scale. Dropping is inherent to sparse MoE designs.

Understanding how routing leads to drops provides insight into observations of randomness in GPT-4. The drops vary across usages, but the model logic itself does not.
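
The small simulation below shows how a fixed expert capacity leads to dropped tokens. The capacity-factor formulation follows published sparse-MoE work (e.g. the Switch Transformer); whether GPT-4 uses this exact scheme is an assumption. The point is that which tokens overflow depends on batch composition, so drops vary run to run even though the weights are fixed.

```python
import numpy as np

rng = np.random.default_rng()   # deliberately unseeded: drops vary from run to run

num_experts = 16
tokens_in_batch = 4096
capacity_factor = 1.25          # assumed value; common in sparse-MoE implementations

# Each expert can only hold a bounded share of the batch.
expert_capacity = int(capacity_factor * tokens_in_batch / num_experts)   # 320 slots

# Simulate a skewed router: some experts receive far more traffic than others.
popularity = rng.dirichlet(np.full(num_experts, 0.5))
assignments = rng.choice(num_experts, size=tokens_in_batch, p=popularity)

counts = np.bincount(assignments, minlength=num_experts)
dropped = int(np.maximum(counts - expert_capacity, 0).sum())

print(f"capacity per expert: {expert_capacity}, tokens dropped this batch: {dropped}")
```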

The Future

GPT-4 demonstrates impressive progress in language model foundations. However, future models will likely need to expand beyond a purely text-based approach.

Some areas of focus moving forward:

  • Architectures that natively support vision, audio, speech, and text together
  • Training models end-to-end across different data modalities
  • Expanding beyond mixtures of experts for greater scalability
  • Increasing training data diversity and size by orders of magnitude
  • Advancing multi-modal capabilities for complex reasoning
  • Optimizing model designs for real-world task performance

With each generation, OpenAI is pushing closer towards artificial general intelligence. While they are further along than any other LLM/AI research company, we are still far from true general intelligence, lacking key attributes such as volition, decision making, memory, and real-time knowledge synthesis.

GPT-4 shows they have the technical capabilities to make massive leaps forward with each iteration.

The future capabilities of these models remain incredibly exciting.
