Alan D. Thompson
February 2023
Summary: Chinchilla showed that we should be using around 11× more data during training than was used for GPT-3 and similar models. This means we need to source, clean, and filter around 33TB of text data for a 1T-parameter model.
How much text data should we use when training a text-based large language model (LLM)?
Over the three years to 2023, there have been a few discoveries, made through a process of trial and error…
(Note: There is a complementary scaling law for compute built into these findings, but this is outside the scope of my current focus.)
In May/2020, OpenAI (GPT-3 paper) tacitly announced their data scaling laws (also called the Kaplan scaling laws) for LLMs:
In plain English, GPT-3/Kaplan scaling laws said that…
300B tokens can be used to train an LLM of size 175B parameters
So, we need around 1.7 text tokens per parameter
In Mar/2022, DeepMind (Chinchilla paper) found new data scaling laws (also called the Chinchilla or Hoffmann scaling laws) for ‘data optimal’ LLMs:
In plain English, Chinchilla/Hoffmann scaling laws say that…
1,400B (1.4T) tokens should be used to train a data-optimal LLM of size 70B parameters
So, we need around 20 text tokens per parameter
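As a quick sanity check, here is a minimal sketch (in Python, using only the headline figures quoted above) of the tokens-per-parameter ratio implied by each set of scaling laws:

```python
# Tokens-per-parameter ratio implied by each set of scaling laws.
# Figures are the headline numbers quoted above; illustration only.

def tokens_per_parameter(training_tokens: float, parameters: float) -> float:
    """Ratio of training tokens to model parameters."""
    return training_tokens / parameters

# GPT-3 / Kaplan: 300B tokens for a 175B-parameter model
print(tokens_per_parameter(300e9, 175e9))   # ≈1.7 tokens per parameter

# Chinchilla / Hoffmann: 1.4T tokens for a 70B-parameter model
print(tokens_per_parameter(1.4e12, 70e9))   # 20 tokens per parameter
```

The gap between these two ratios (roughly 1.7 vs 20) is where the ‘11× more data’ figure in the summary comes from.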
Therefore, to make GPT-3 data optimal, and…
Keeping the original 300B tokens, GPT-3 should have been only 15B parameters (300B tokens ÷ 20).
This is around 11× smaller in terms of model size.
OR
To get to the original 175B parameters, GPT-3 should have used 3,500B (3.5T) tokens (175B parameters × 20). 3.5T tokens is about 4–6TB of data, depending on tokenization and bytes per token.
This is around 11× larger in terms of data needed.
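Both options follow directly from the ~20-tokens-per-parameter rule. Here is a minimal sketch of the arithmetic, assuming roughly 1.6 bytes of raw text per token (an illustrative figure only; the real value depends on the tokenizer and corpus):

```python
# Making GPT-3 'data optimal' under the Chinchilla ratio of ~20 tokens per parameter.
# BYTES_PER_TOKEN is an assumption for illustration; it varies with tokenizer and corpus.

TOKENS_PER_PARAM = 20
BYTES_PER_TOKEN = 1.6

gpt3_params = 175e9   # 175B parameters
gpt3_tokens = 300e9   # 300B training tokens

# Option 1: keep the original 300B tokens and shrink the model
optimal_params = gpt3_tokens / TOKENS_PER_PARAM
print(f"{optimal_params / 1e9:.0f}B parameters")               # 15B (≈11× smaller)

# Option 2: keep the original 175B parameters and grow the dataset
optimal_tokens = gpt3_params * TOKENS_PER_PARAM
print(f"{optimal_tokens / 1e12:.1f}T tokens")                  # 3.5T (≈11× more data)
print(f"≈{optimal_tokens * BYTES_PER_TOKEN / 1e12:.1f}TB")     # ≈5.6TB of raw text
```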
The data optimization scale continues for model sizes measured in trillions of parameters, and training data measured in quadrillions of text tokens or petabytes of text data. The table and explanation below originally appeared in the Jun/2022 report, The sky is bigger than we imagine.
| Model size (params) | Training tokens (rounded) | Training data used (estimate) | How much data is that? (if 1 book is about 500KB of text) |
|---|---|---|---|
| Chinchilla / 70B | 1.4 Trillion | 2.3TB | More books than in… the Kindle store on Amazon US (6.4M). |
| 250B | 5 Trillion | 8.3TB | All 30 libraries at Yale University (16.6M). |
| 500B | 10 Trillion | 16.6TB | The Google Books collection (33.2M). |
| 1T | 20 Trillion | 33.3TB | The US Library of Congress (66.6M). |
| 10T | 200 Trillion | 333TB | All US public libraries combined (666M). |
| 100T | 2 Quadrillion | 3.3PB | All bibles ever sold worldwide (6.6B). |
| 250T | 5 Quadrillion | 8.3PB | A stack all the way to the Moon (16.6B). |
| 500T | 10 Quadrillion | 16.6PB | 4 books about every living human (33.2B). |
Table: Dataset sizes needed to align models with Chinchilla data optimization.
Note: Text estimates only, multimodal data not shown. Jun/2022. LifeArchitect.ai
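For readers who want to reproduce the rough magnitudes above, here is a minimal sketch of the arithmetic behind the table, using conversion factors inferred from its rounding (about 1.67 bytes of text per token and 500KB of text per book, both assumptions rather than values from the original report); the named real-world collections in the last column are separate comparisons and will not match exactly:

```python
# Rough arithmetic behind the table above. The conversion constants are
# assumptions inferred from the table's rounding, not values from the report.

TOKENS_PER_PARAM = 20        # Chinchilla ratio
BYTES_PER_TOKEN = 5 / 3      # ≈1.67 bytes of text per token (assumption)
BYTES_PER_BOOK = 500e3       # ≈500KB of text per book (assumption)

def dataset_estimate(params: float) -> tuple[float, float, float]:
    """Return (training tokens, dataset size in bytes, book equivalents)."""
    tokens = params * TOKENS_PER_PARAM
    data_bytes = tokens * BYTES_PER_TOKEN
    books = data_bytes / BYTES_PER_BOOK
    return tokens, data_bytes, books

for params in [70e9, 250e9, 500e9, 1e12, 10e12, 100e12, 250e12, 500e12]:
    tokens, data_bytes, books = dataset_estimate(params)
    print(f"{params / 1e9:>9,.0f}B params: {tokens / 1e12:>9,.1f}T tokens, "
          f"{data_bytes / 1e12:>10,.1f}TB, ≈{books / 1e6:>9,.1f}M book equivalents")
```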
There are a few caveats to my approximate numbers in the table above. First, the ‘More books than in…’ examples cover text-based book data only (no pictures), and assume that books are about 500KB each without images. We are now, of course, exploring training AI with multimodal data: images, music, control signals (robots, button presses), and anything else we can get our hands on. The sizes above are also simplified and rounded estimates, based on the new findings about scaling models with more data (measured by number of tokens, which are roughly equivalent to words).
In 2010, Google estimated that there are only 130M unique published books in existence, so past 1T parameters (20T tokens), training data collection would naturally have to rely on alternative text-based and multimodal content. At brain-scale parameter counts of 500T (10Q tokens), the estimated book count would be over 250 times the number of books published, or more than four new books written about each living human on Earth!
Fundamentally, collecting petabytes of high-quality, filtered multimodal data (converted to text) should not be an overly onerous process, though no AI lab has accomplished that task to date (Jun/2022).
Viz of selected models showing tokens:parameters ratio
Table of current models showing tokens:parameters ratio
Summary of current models: View the full data (Google sheets)
It is expected that 2023 large language models will continue to follow the Chinchilla scaling laws, though there will be new discoveries about data optimization and data use during training. For example, there is some research on whether data can ‘repeat’ (be seen more than once) during training, which may reduce the amount of new data that needs to be sourced.
DeepMind models to Dec/2022
Videos on scaling and Chinchilla models
Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 2.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. He is open to consulting and advisory on major AI projects with intergovernmental organizations and enterprise.
This page last updated: 4/Feb/2023. https://lifearchitect.ai/chinchilla/