Abrupt skill emergence in Large Language Models

Will Seltzer

Aug 31, 2023

This post assumes you’re familiar with the Neural Scaling Laws we discussed last week. If you’re not, read more here:

How much better can Large Language Models get? (Will Seltzer, August 23, 2023)

We can predict general Large Language Model performance as a function of compute, dataset size, and parameter count.

The model’s error (“test loss”) decreases as we increase the compute, dataset size, and parameter count in tandem. From **Predictability and Surprise in Large Generative Models.**
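To make "predict performance as a function of compute" concrete, here is a minimal sketch of fitting a power law to small-scale training runs and extrapolating it to a larger one. The data points, the constants (10 and 0.05), and the function names are all made up for illustration; they are not taken from any paper. The sketch also assumes a pure power law and ignores any irreducible-loss floor that real scaling-law fits often include.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** (-b)

# Hypothetical (training FLOPs, test loss) pairs, generated from
# loss = 10 * C**-0.05 purely for illustration.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
predicted = predict_loss(a, b, 1e23)  # extrapolate 100x past the largest run
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e23 FLOPs={predicted:.3f}")
```

The smooth curve is what makes this extrapolation work: a straight line in log-log space can be extended past the data you have.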

However, if we shift our focus from general performance (next-token prediction) to specific tasks like adding numbers or writing code, the picture changes.

We do not have scaling laws for task-specific performance. Instead, graphs of specific model capabilities versus parameter count show abrupt, emergent jumps in proficiency.[1]

For example, below 10¹⁰ parameters, GPT-3 fails to add two 3-digit numbers.

From Emergent Abilities of Large Language Models

Yet the moment GPT-3 grows to 10¹⁰ parameters, accuracy jumps from 0% to 20%, and upon growing to 10¹¹ parameters, the model adds the numbers correctly 80% of the time.

GPT-3 begins to learn 3-digit addition at ~10¹⁰ parameters

Other model skills exhibit similar abruptness. Consider the “Word Unscramble” task, which is what you call getting the model to play Scrabble if you’re looking to publish.

I give the model the prompt:

The word hte is a scrambled version of the English word ＿

And the model — if it’s capable — says “the.”
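Here is a rough sketch of how you might score a model on this task. It builds the prompt shown above and counts a prediction as correct only if it exactly matches the target word. The exact-match rule, the helper names, and the sample model outputs are my assumptions for illustration; the actual benchmark may score answers differently.

```python
def build_prompt(scrambled):
    """Build the unscramble prompt; the model is expected to fill in the word."""
    return f"The word {scrambled} is a scrambled version of the English word"

def is_valid_scramble(scrambled, answer):
    """Check that the scrambled string uses exactly the letters of the answer."""
    return sorted(scrambled) == sorted(answer)

def exact_match_accuracy(predictions, answers):
    """Fraction of predictions equal to the target, ignoring case and padding."""
    correct = sum(
        p.strip().lower() == a.lower() for p, a in zip(predictions, answers)
    )
    return correct / len(answers)

# (scrambled, target) pairs; the model outputs below are hypothetical.
examples = [("hte", "the"), ("tac", "cat"), ("odg", "dog"), ("nus", "sun")]
model_outputs = ["the", "act", "dog", " Sun"]

prompts = [build_prompt(scrambled) for scrambled, _ in examples]
accuracy = exact_match_accuracy(model_outputs, [answer for _, answer in examples])
print(prompts[0])
print(accuracy)
```

Note that "act" is a legitimate English unscrambling of "tac" but still counts as wrong under exact match. All-or-nothing scoring like this is one reason task accuracy can look flat for a long time before it moves.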

However, models aren’t capable Word Unscramblers at any training compute below ~10²² FLOPs. This is the critical threshold, much like 32 degrees Fahrenheit is the critical threshold for melting ice. If you try to melt an ice cube by heating it from 20 degrees Fahrenheit to 25 degrees, you might as well do nothing at all.

Similarly, to increase Word Unscramble performance, we need to add compute until we hit ~10²² FLOPs, and once we do, models suddenly and unpredictably begin learning how to unscramble words.[2]

These abrupt jumps in capability are not unique to Word Unscramble. Look at the Persian QA graph above. At 10¹⁸ FLOPs, the model doesn’t speak Persian. At 10²⁰ FLOPs, it still doesn’t speak Persian. But at ~10²⁴ FLOPs, it speaks Persian.[3]
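One way to pin down "where the jump happens" is to compare each model's accuracy to the random-guessing baseline and find the smallest compute budget that clearly beats it. The sketch below does this for a four-choice task, where chance is 25%. The accuracy numbers, the 5-point margin, and the function names are hypothetical, chosen only to mirror the flat-then-jump shape described above.

```python
def chance_accuracy(num_choices):
    """Expected accuracy from guessing uniformly among the answer choices."""
    return 1.0 / num_choices

def first_above_chance(results, num_choices, margin=0.05):
    """Return the smallest compute whose accuracy beats chance by the margin.

    results is a list of (training_flops, accuracy) pairs. Returns None if
    no model clears the bar.
    """
    baseline = chance_accuracy(num_choices)
    for flops, accuracy in sorted(results):
        if accuracy > baseline + margin:
            return flops
    return None

# Hypothetical (training FLOPs, accuracy) pairs for a four-choice QA task.
persian_qa = [(1e18, 0.25), (1e20, 0.24), (1e22, 0.27), (1e24, 0.45)]

threshold = first_above_chance(persian_qa, num_choices=4)
print(threshold)
```

The margin matters: without it, a model scoring 27% from lucky guesses would look like it had started learning the task.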

What other capabilities might emerge as we train larger models? PhD-level cancer research abilities? Superhuman manipulation? Given the increasingly large compute budgets for training models — Inflection recently announced their intention to build one of the world’s largest clusters with 22,000 Nvidia H100s — we’re about to find out.

[1] In most cases, that is. OpenAI was able to predict specific coding capabilities for GPT-4. From the GPT-4 Technical Report: “In addition to predicting final loss, we developed methodology to predict more interpretable metrics of capability. One such metric is pass rate on the HumanEval dataset [43], which measures the ability to synthesize Python functions of varying complexity. We successfully predicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained with at most 1,000× less compute (Figure 2).”

[2] This is specific to the model architectures and datasets in the paper. It isn’t known whether learning Word Unscramble at this FLOP count is a general feature of transformers.

[3] You might be wondering why I claim the model doesn’t speak Persian at lower FLOPs even though it gets 25% on the task. 25% is no better than random guessing, since the model must choose the correct answer from four choices.

Intuitive AI
