The Fundamental Quantities of LLMs: Part Two - 🖥️ Compute
or: how I learned to stop worrying and love Nvidia.
At a high level
Why LLMs are compute-starved.
How to measure compute: FLOPs.
Video games birthed AI’s hardware.
We’re going to need more GPUs.
“The compute costs are eye-watering.”
This is post 2/5 on the fundamental quantities of LLMs. You can read the first post here.
Why did OpenAI need to raise an eye-popping $10 billion from Microsoft?
Today, we’ll walk through one of the foundational concepts for understanding the answer to this question: compute.
Why LLMs are compute-starved
A large language model uses billions of parameters to predict the next word. Learning to do this is akin to learning to cook a dish with billions of ingredients without knowing the ideal amounts.
In the first post of this series, we walked through a hypothetical example of learning to cook a 9-parameter curry. Let’s revisit this example with a twist: you don’t have the ingredient amounts:
█ lb chicken breast, cut into bite-sized pieces (protein parameter)
█ tablespoons vegetable oil (oil parameter)
█ tablespoons Thai yellow curry paste (curry paste parameter)
█ can (13.5 oz) coconut milk (coconut milk parameter)
█ cup diced bell pepper (bell pepper parameter)
█ cup diced onion (onion parameter)
█ teaspoons fish sauce (fish sauce parameter)
█ teaspoon sugar (sugar parameter)
█ red Thai chilies, finely chopped (spiciness parameter)
If you started with random values for these ingredient amounts and adjusted them as you went, how many times would you need to cook the dish to learn the ideal amounts? Maybe ten? Twenty? Probably under a hundred before you were satisfied with the flavor.
What about a recipe with thirty or so ingredients, like Chicken Biryani? Or a recipe with one hundred ingredients?
For the Biryani, you might get away with cooking it only a hundred times, but for a recipe with a hundred ingredients? You’re looking at cooking it a thousand times or more.1
Now imagine you’re in the LLM’s position, which is equivalent to cooking a recipe with billions of ingredients without knowing the ideal quantities. You’d have to cook it billions, if not trillions, of times before you became even halfway decent.
This is the magnitude of the problem that training a large language model has to deal with.
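To make that intuition concrete, here is a toy sketch. Everything in it is invented for illustration: the hidden “ideal” amounts, the tolerance, the adjust-one-ingredient-at-a-time strategy. Real LLMs use gradient descent rather than random tweaking, but the trend is the same: more parameters means many more rounds of practice.

```python
import random

def cooks_needed(num_ingredients, tolerance=0.05, seed=0):
    """Count how many 'cooks' a naive taste-and-adjust search needs to get
    every ingredient amount within `tolerance` of a hidden ideal recipe."""
    rng = random.Random(seed)
    ideal = [rng.uniform(0, 1) for _ in range(num_ingredients)]  # unknown ideal amounts
    guess = [rng.uniform(0, 1) for _ in range(num_ingredients)]  # start from random amounts
    cooks = 0
    while any(abs(g - t) > tolerance for g, t in zip(guess, ideal)):
        i = rng.randrange(num_ingredients)       # adjust one ingredient per cook...
        guess[i] += (ideal[i] - guess[i]) * 0.5  # ...nudging it halfway toward better flavor
        cooks += 1
    return cooks

for n in (9, 30, 100):
    print(f"{n} ingredients -> ~{cooks_needed(n)} cooks")
```

The exact counts depend on the random seed, but the pattern holds: the number of cooks climbs steeply as the ingredient count grows.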
How do large language models manage to practice so many times and so quickly? The answer is we run them using an enormous number of computers, or, more abstractly, we say they use a ton of “compute.”
What does this mean, and how can we even begin to think about it?
Enter FLOPs.
How to measure compute: FLOPs
Let’s start with a small thought experiment.
Our friend, Lisa
Suppose you have a friend. Call her “Lisa”2. Lisa has been bragging about how quickly she can add numbers in her head, and you’re curious about how fast she is.
So you tell her, “Ok, let’s sit down. I’ll give you a pen and paper and some numbers, and for the next fifteen minutes, I’ll time you on how fast you can add two numbers together.”
Lisa obliges. She sits down, grabs the pen, and you start the timer.
After fifteen minutes, the timer stops. You look over at Lisa’s paper, and she’s completed nine hundred additions over the fifteen minutes, or, as you formalize it, “900 ADDs”.
Since you want to know how fast Lisa is, you do the math: nine hundred additions over fifteen minutes (900 seconds) works out to one ADD per second. That’s pretty fast.
You formalize this, too, as “ADD/s=1” and write it down.
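The rate calculation, written out with the numbers from the thought experiment:

```python
additions = 900        # Lisa's total ADDs
seconds = 15 * 60      # fifteen minutes, in seconds
print(additions / seconds, "ADD/s")  # 1.0 ADD/s
```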
Our friend, the computer
We use a metric similar to ADDs when discussing how much work computers do: FLOPs. FLOPs are like ADDs but slightly more complex.
FLOP: floating point operation
“FLOP” stands for “floating point operation.” You can think of a “floating point” number in a computer as a simple decimal number and “operation” as addition, subtraction, division, or multiplication.
Some examples of a single FLOP are:
2.7 + 3.1
2.1 / 3.2
2.0 * 3.2
We can use FLOPs to quantify not only the amount of work computers do but also how fast they are. We do this by counting how many FLOPs a computer can do in a second and denote this with “FLOP/s.”
FLOP: an addition, subtraction, multiplication, or division of two decimal numbers.
FLOP/s: the number of FLOPs a computer can perform per second.
Most people don’t know the FLOP/s of their computer or phone because measuring your computer's speed in FLOP/s is like monitoring your heart rate on a casual run: it isn’t something you need to do.
But if you start training for a marathon, it suddenly becomes useful. Training an LLM is like training for a marathon: suddenly heart rate, i.e., compute, matters.
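If you want a rough number for your own machine, here is one way to ballpark it with NumPy (an estimate, not an official benchmark): time a large matrix multiplication, which performs roughly 2·n³ floating point operations, and divide by the elapsed time. The result depends on your libraries, data types, and whatever else is running, so treat it as a loose reading, like checking your pulse with two fingers.

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
_ = a @ b                               # a matrix multiply does ~2 * n^3 FLOPs
elapsed = time.perf_counter() - start

print(f"~{2 * n**3 / elapsed / 1e9:.0f} GFLOP/s")
```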
What’s a good FLOP/s? To develop our intuition, let’s explore the FLOP/s of some standard devices.
Setting the FLOP baseline
GFLOP: a gigaFLOP, or one billion FLOPs
Apple Watch Series 6: The paper airplane of devices. It's lightweight, goes a short distance, and isn't built for speed — but it accomplishes the simple task it's designed for. 2.03 GFLOP/s.
Samsung Galaxy S21 (Qualcomm Snapdragon 888): A well-thrown football to the Apple Watch’s paper airplane. It carries more power and covers more distance at a faster speed. Roughly 70 Apple Watches’ worth of compute. 144 GFLOP/s.
iPhone 14 Pro Max (Apple A16 Bionic): An arrow zipping through the air with remarkable speed and precision. Significantly more potent and faster than the football. 2 Samsung Galaxies’ worth of compute. 279.8 GFLOP/s.
Dell XPS 13: A race car zooming down the track, engine roaring as it covers long distances quickly. 3 iPhones’ worth of compute. 840 GFLOP/s.
PlayStation 5: A Boeing 747. Roughly 12 Dell laptops’ worth of compute. 10,280 GFLOP/s.
NVIDIA RTX 3080 (a GPU): A SpaceX Falcon 9 rocket. It uses rocket-grade kerosene and liquid oxygen to light its exhaust, which throws it into space at velocities unimaginable to the previous objects. Roughly 15,000 Apple Watches’ worth of compute, or 35 Dell laptops, or 3 PlayStations. 29,770 GFLOP/s.
As you can see, there’s a wide chasm of compute performance as we move across device types. Phones are roughly a hundred times faster than smartwatches, laptops are several times faster than phones, and GPUs are roughly 35 times faster than laptops.
What are GPUs, and why are they so fast?
Video games birthed AI’s hardware.
Nvidia debuted the GeForce 256 in 1999. The GeForce 256 was arguably the first modern graphics processing unit (GPU), built to serve the escalating demands of high-resolution video games, such as Quake II and Unreal, whose high pixel counts and intricate lighting effects pushed the hardware limits of Clinton-era CPUs.3
But what exactly is a GPU? Essentially, it's a specialized processing unit that can handle a vast amount of data simultaneously, thanks to its numerous cores. Now, you might ask yourself, “What exactly is a core?” Great question!
What exactly is a core?
A core is the smallest component that can independently process instructions. To help you visualize this, imagine a room filled with people (like Lisa), each adding numbers on separate sheets of paper. Here, each person is a core, and the room is a GPU. This setup allows for parallel processing. Intuitively, the more people you have in the room, the more numbers that can be added simultaneously.
This is why GPUs have such massive FLOP/s.
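Here is a small sketch of the “room full of Lisas” idea (a toy illustration, not how GPUs are actually programmed): split a big pile of additions across several worker processes so they run at the same time. Each worker plays the role of a core; more workers means more additions per second, up to the limits of your machine.

```python
from multiprocessing import Pool

def add_chunk(pairs):
    """One 'Lisa' in the room: add up her share of the number pairs."""
    return [x + y for x, y in pairs]

if __name__ == "__main__":
    pairs = [(i, i + 1) for i in range(1_000_000)]
    num_workers = 4                                     # people in the room
    size = len(pairs) // num_workers
    chunks = [pairs[i * size:(i + 1) * size] for i in range(num_workers)]
    with Pool(num_workers) as pool:
        results = pool.map(add_chunk, chunks)           # everyone adds at once
    print(sum(len(r) for r in results), "additions done by", num_workers, "workers")
```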
We’re going to need more GPUs
Now that we’re familiar with FLOPs and GPUs, let’s examine how many FLOPs LLMs require. Remember, large language models have billions of parameters to tune, so they need tons of compute.
Grains of sand — the other silicon
Training GPT-3, the model underlying ChatGPT, took 3.14 × 10²³ FLOPs.
Estimates suggest that the total number of grains of sand on Earth is around 10¹⁸. The number of FLOPs required for GPT-3 is more than 100,000 times larger than this.
This is a gigantic number. It means that if every grain of sand on Earth were to morph into a tiny computer, each one would have to perform over 100,000 arithmetic operations to train GPT-3.
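The arithmetic behind the comparison:

```python
gpt3_flops = 3.14e23            # total FLOPs to train GPT-3
sand_grains = 1e18              # rough estimate of grains of sand on Earth
print(f"{gpt3_flops / sand_grains:,.0f} operations per grain")  # ~314,000
```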
Parallelize!
What kind of machine can handle this colossal scale? Our friend, the GPU. Except, we can’t use only one.
If we were to assign this herculean task of training GPT-3 to a solitary Nvidia RTX 3080,4 it would finish after approximately 10¹⁰ seconds, or over 300 years.
We’d like to finish training sometime in the next century, so we use more GPUs. Roughly a thousand times more, and beefier, datacenter-grade ones at that. OpenAI hasn’t publicly disclosed the training setup it used for GPT-3, but reasonable estimates peg it at ~1,000 A100 GPUs running for about 30 days.
In other words, if a GPU is a SpaceX Falcon 9 rocket that can typically carry 50,000 lbs, training GPT-3 is the equivalent of launching a small town into space. The Falcon 9 can't handle this monumental task by itself. So, what do we do? We obtain 1,000 rockets and launch pads, divvy up our payload, and launch our town to the stars.
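And here is the back-of-the-envelope math behind the “one GPU versus a thousand” comparison. Per the footnote, this assumes the GPUs run at full tilt; the effective A100 throughput below is my own rough assumption, since real utilization sits well below peak.

```python
gpt3_flops = 3.14e23                          # total training FLOPs for GPT-3

# One consumer GPU (RTX 3080, ~29,770 GFLOP/s)
rtx_3080 = 29_770e9
seconds = gpt3_flops / rtx_3080
print(f"one RTX 3080: ~{seconds:.1e} s (~{seconds / 3.15e7:.0f} years)")

# ~1,000 datacenter GPUs (assume ~1e14 useful FLOP/s per A100, below its peak)
cluster = 1_000 * 1e14
print(f"1,000 A100s:  ~{gpt3_flops / cluster / 86_400:.0f} days")
```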
Is GPT-3 unique?
You’re probably wondering whether GPT-3 is unique in requiring massive amounts of compute. It isn’t. Other companies, such as Google and Anthropic, have trained or intend to train models that use similar amounts of compute.
And that amount of compute has been sharply increasing over the last decade.
Training GPT-3 required roughly 300 million petaFLOPs (a petaFLOP is 10¹⁵ FLOPs).
PaLM, Google’s answer to GPT-3 (which you can see in the graph above), took over 2 billion petaFLOPs.5
Anthropic plans to train a behemoth of a model, dubbed “Claude-Next,” using 10²⁵ FLOPs.
That’s absurd. That’s massive. Comparing orders of magnitude (10²⁵ versus roughly 10²³), that’s about one hundred times as many FLOPs as GPT-3.
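For reference, here is how the petaFLOP figures above come out of the raw FLOP counts (a petaFLOP is 10¹⁵ FLOPs; the PaLM number is the commonly cited estimate of roughly 2.5 × 10²⁴ training FLOPs):

```python
PETA = 1e15   # FLOPs per petaFLOP

print(f"{3.14e23 / PETA / 1e6:.0f} million petaFLOPs for GPT-3")  # ~314 million
print(f"{2.5e24 / PETA / 1e9:.1f} billion petaFLOPs for PaLM")    # ~2.5 billion
```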
Anthropic is willing to take on this massive bet because they believe that “the best 2025/26 models will be too far ahead for anyone to catch up in subsequent cycles”.
You’re probably wondering at this point how much all of these FLOPs cost.
“The compute costs are eye-watering.”
As we covered in the first post of this series, machine learning models do two things, training and inference, and both incur costs.
Training costs
Estimates currently place a single training run of GPT-3 at $4.6 million. Meta’s LLaMA model, another highly performant LLM, costs $5 million per training run.
How much might Anthropic spend on the 10²⁵ FLOPs to train Claude-Next?
~$460 million…
Here’s the math, step by step. Since Claude-Next will use 100x the FLOPs of GPT-3, we’re looking at roughly 100x the GPU-hours and 100x the price of GPT-3’s $4.6 million, or roughly $460 million.6
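Spelled out in code, the estimate is just linear scaling: assume cost grows in proportion to training FLOPs at a fixed price per FLOP, round GPT-3’s compute down to its order of magnitude (10²³) so the ratio comes out to a clean 100x, and multiply.

```python
gpt3_cost = 4.6e6              # estimated $ for one GPT-3 training run
gpt3_flops = 1e23              # GPT-3's ~3.14e23 FLOPs, rounded down to its order of magnitude
claude_next_flops = 1e25       # Anthropic's stated target

scale = claude_next_flops / gpt3_flops          # 100x
print(f"{scale:.0f}x the FLOPs -> ~${scale * gpt3_cost / 1e6:.0f} million")  # ~$460 million
```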
This is wild. And inaccurate.
This estimate assumes that Anthropic trains their model on consumer hardware at retail prices, but Anthropic, luckily, inked a deal with Google, which will lower their costs significantly. More thorough estimates that take this deal into account place costs somewhere between $10-150 million per training run.
That’s still an astronomical number. Assuming Claude-Next will require multiple training runs, Anthropic is about to compute some of the most expensive bits that have ever been computed. They’ll likely spend at least $100 million solely on compute. And compute isn’t the only cost. They’ll also have to pay for training data and for engineers. Once you add in these other costs, the total skyrockets to $1 billion just to train the model. One billion dollars. For bits.
And those bits don’t do anything. Not until you load them onto servers and have them serve user requests.
Enter inference costs.
Inference costs
Whenever you ask ChatGPT a question, a request is sent to an OpenAI server that holds a copy of the trained model. The server computes the model’s response and returns it. This is inference.
Inference scales with the number of queries. Every question is a request OpenAI has to serve, and each server can only handle so many, so once a given server is saturated, OpenAI has to stand up more.
Spinning up new servers is like opening new McDonald’s locations. You take the blueprint (the model) the architects drew up and replicate it by standing up a new location with new workers.
Because you’re reusing the same blueprint, you don’t have to pay the architects to draw up a new one (retrain the model), but you do have to hire construction workers, buy building materials, and pay employees to operate the store (run the inference server).
How expensive is inference? The exact number is unknown. Estimates put OpenAI’s inference hardware costs at roughly $700k per day as of February 2023, and given the growth in ChatGPT usage since then, these costs have only increased.
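To get a feel for why the inference bill scales with usage, here is a toy cost model in which every input is a placeholder I made up, not a reported OpenAI figure: daily cost ≈ (queries per day × FLOPs per query) ÷ (useful FLOP/s per GPU) × cost per GPU-hour.

```python
# All inputs below are illustrative assumptions, not OpenAI's real numbers.
queries_per_day = 50_000_000        # hypothetical daily ChatGPT requests
flops_per_query = 2 * 175e9 * 1000  # ~2 FLOPs per parameter per token, 175B params, ~1,000 tokens
gpu_flops = 3e13                    # useful FLOP/s per GPU (inference runs well below peak)
gpu_hour_cost = 3.00                # $ per GPU-hour

gpu_hours = queries_per_day * flops_per_query / gpu_flops / 3600
print(f"~${gpu_hours * gpu_hour_cost:,.0f} per day")
```

With these made-up inputs the model lands in the same general ballpark as the reported estimate, and it makes the key point plain: double the queries and you double the bill.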
This is why OpenAI needed to raise that eye-popping $10 billion from Microsoft. Their employees are expensive (Ilya Sutskever, OpenAI’s chief scientist, made $1.9 million back in 2016), but on top of that they must pay for compute, for both training and inference. As Sam Altman, OpenAI’s CEO, has said, “The compute costs are eye-watering.”
Part Two - 🖥️ Compute ✅
Part Four - 🧠 Model Size
Part Five - 📚 Data
In the next post of this series, we’ll look at how to measure model performance.
Generally speaking, you get exponentially more parameter value configurations as parameter counts increase. This makes it significantly harder to find optimal parameter values.
Back in the 80s, Apple released a computer called “Lisa.”
When the first GPUs were introduced, CPUs, or central processing units, had only one core. With only one core, the CPU has to perform arithmetic operations sequentially. CPUs nowadays are multicore but still have nowhere near as many cores as GPUs.
We can’t do this, but for the sake of the thought experiment, suppose we can. GPT-3 is too large to fit on a single GPU, and this also assumes full GPU utilization.
For the sake of simplicity, I’m assuming training costs are stable. In reality, they are rapidly decreasing over time: from November 2020 to September 2022, the price per GFLOP/s dropped from $0.047 to $0.019.