The Fundamental Quantities of LLMs: Part Three - Model Performance
Google might have a moat
At a high level
The open-source large language model that "challenged" ChatGPT, Vicuna-13B.
Model IQ: measuring general LLM performance.
ChatGPT is my doctor: measuring specific LLM performance.
This is post 3/5 on the fundamental quantities of LLMs. You can read the first post in the series below:
Back in early May, SemiAnalysis, a semiconductor blog, published a leaked internal Google document that proclaimed, "We have no moat, and neither does OpenAI," and went on to paint a bleak portrait of Google's future:
While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly. Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months. - leaked internal Google doc
Sounds bad. Sundar is in trouble. This, at least, was the general sentiment echoed by The Information, TechCrunch, and Fortune, who covered the memo.
But that sentiment is misguided. In this post, we'll dive into why and scrutinize the linked, purportedly highly-performant and open-source model, Vicuna-13B1. Then, we'll use our Vicuna-13B discussion as a springboard to talk about large language model performance more generally.
Vicuna-13B
The vicuna is a relative of the camel (part of the biological family Camelidae) that lives in the high alpine areas of the Andes. Vicuna-13B, in contrast, is a model trained by LMSYS, a research collaboration between students and faculty at UC Berkeley, UCSD, and CMU that reportedly achieves near-ChatGPT performance.
Let's hear about Vicuna-13B's performance from LMSYS themselves.
That's astounding. 90% of the quality of ChatGPT and Google Bard, and all for $300. For comparison, estimates place the cost of training ChatGPT at $4.6 million. Pray for Sundar.
Or maybe not. That 90% has an asterisk. An asterisk is a warning sign, a siren. If you ever see a "*" on a food product, it's usually because the asterisked claim is misleading at worst and unverified at best.
Consider Pop-Tarts, which aren't, to my surprise, made with real fruit.
Sadly, Vicuna-13B has a similar moment if you read the fine print.
With this additional context, how should we begin to interpret the original 90% number? How did LMSYS do their analysis? What does "achieve 90% quality of OpenAI ChatGPT" even mean?
The Art of Model Comparison
Common approaches for evaluating large language models include head-to-head metrics, general capability metrics, and specific capability metrics.
Head-to-head metrics score models by comparing their relative performance.
General capability metrics give models a single score. This score is an estimate of general ability and is akin, conceptually, to IQ2 or an SAT score.
Specific capability metrics measure specific skills, like logical reasoning, historical knowledge, mathematical ability, medical knowledge, and programming ability.
Vicuna-13B achieved 90% of ChatGPT's performance as measured by the Elo rating system, one type of head-to-head rating system.
In the above leaderboard, you can see Vicuna-13B scoring 1,061 under the Arena Elo rating. GPT-3.5-turbo, i.e., ChatGPT, comes in at 1,130.
Let's walk through how these scores are computed and what they mean.
What's Elo?
Hungarian-American physics professor Arpad Elo developed his eponymous system to rank chess players. The idea is pretty simple. Your score increases when you win and decreases when you lose, but the amount it changes is scaled by your opponent's skill. If you sign up for chess.com, your Elo score will start at 400.
Let's suppose you start playing chess on chess.com, and your opponent is a chess grandmaster with a rating of 2500.
Somehow, miraculously, you win. Perhaps you wore your lucky shirt or made a sacrificial offering to Magnus Carlsen. However you did it, your score goes up by around 30-40 points.
If you had, instead, won against an Elo-400 player, your score would have increased by around 15 points.
That's all Elo is, stripped to its conceptual bones.3
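To make the mechanics concrete, here's a minimal sketch of the standard Elo update rule. The exact point swings depend on the K-factor a site chooses; this sketch assumes K = 32, a common default, which roughly reproduces the numbers above.

```python
def elo_update(rating: float, opponent_rating: float, score: float, k: float = 32) -> float:
    """Return a player's new Elo rating after one game.

    `score` is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    """
    # Expected score implied by the rating gap (a logistic curve with base 10).
    expected = 1 / (1 + 10 ** ((opponent_rating - rating) / 400))
    return rating + k * (score - expected)

# A 400-rated newcomer beats a 2500-rated grandmaster: a huge upset, a big gain.
print(round(elo_update(400, 2500, 1.0)))  # ~432, i.e., roughly +32 points

# The same newcomer beats another 400-rated player: an expected result, a small gain.
print(round(elo_update(400, 400, 1.0)))   # 416, i.e., +16 points
```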
Using Elo for AI models
The game the LMSYS models are playing isn't chess. Instead, it's an artificial matchup, where a user is presented with two responses from two different models and must select which one they prefer. A model wins if the user prefers its response.
A model loses if the user votes for its competitor instead.
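It's worth pausing on what the rating gap on the leaderboard actually implies in this setting. Plugging the Arena Elo scores quoted above into the same expected-score formula gives a rough head-to-head preference rate, a statement about which answer voters pick, not about which answer is correct:

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B, as implied by their Elo ratings."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Arena Elo scores from the leaderboard: GPT-3.5-turbo at 1,130, Vicuna-13B at 1,061.
print(expected_win_rate(1130, 1061))  # ~0.60: ChatGPT's answer preferred about 60% of the time
print(expected_win_rate(1061, 1130))  # ~0.40: Vicuna-13B's answer preferred about 40% of the time
```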
There's an issue with using Elo for this task that we can illustrate with an example. But first, take a moment and decide which response you prefer below before scrolling further.
If you picked "A is better," nice job. If you didn't, you're in good company. According to Pew, only 26% of Americans know that Nigeria is Africa's largest country by population.
Pretend you're an LLM in this matchup. You win by collecting votes, so your success is tied to the user's knowledge and judgment. But users may not know enough to properly evaluate your response, especially if those users are random internet users.
Other stats from Pew: only 74% of Americans know the dominant religion in Latin America is Catholicism. Only 67% recognize the symbol of the Euro. Only 41% recognize the Indian flag.
Researchers are also weighing in against crowdsourced Elo rankings. In The False Promise of Imitating Proprietary LLMs, Gudibande et al. found4 that models fine-tuned on ChatGPT
get rated positively by crowdworkers because they are adept at mimicking ChatGPT's style: they output confident and well-structured answers. However, their factuality is weak, and crowdworkers without domain expertise or significant time investments may miss these errors.
In other words: Vicuna-13B is a fantastic bullshitter.
All of that said, if Elo isn't The Way, what is?
Model IQ: Measuring general capability
Our schools don't use a crowdsourced Elo ranking procedure. For good reason.
Here's what it would look like if we used Elo in schools.
Suppose we're testing two students.
Call them "Ilya" and "Mira".
Ilya and Mira stand at the front of the classroom and are asked questions like "What's the Pythagorean theorem?", "Who was Abraham Lincoln's Vice President?", and "What's the most populous city east of Moscow?".
Each time Ilya and Mira give their answers, a random student from the class acts as judge. The judge-student decides who gave the better answer, and we record the vote. There is a caveat: we don't know whether the judge-student knows the answer.
But we proceed anyway, run through all of the questions, and count the votes. Then we score Ilya and Mira on their performance.
This method would be an absurd way to evaluate students. Luckily, our schools instead test students on established facts.
We can do the same for models. OpenAI took this approach in the GPT-4 technical report.
There are too many evaluation metrics for us to cover in detail here, so let's zoom in on one, the MMLU, which is fairly common and also used by LMSYS in their rankings.5
MMLU
"MMLU" stands for "Massive Multitask Language Understanding." It contains 15,908 questions covering 57 subjects across STEM, the humanities, and the social sciences, with difficulty ranging from "Elementary" to "High School" to "College" to "Professional."
Questions on the MMLU range in subject from medicine to microeconomics, and in difficulty from elementary-school material to college-level material.
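For a sense of how an MMLU score is produced: every question is multiple choice, the model picks one of the options, and the final score is simply the fraction of questions it gets right. Below is a rough sketch of that scoring loop; `ask_model` is a hypothetical stand-in for whatever model you're evaluating, and real evaluation harnesses also handle prompt formatting, few-shot examples, and answer extraction more carefully.

```python
def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical stand-in: return the model's answer as a letter ('A', 'B', 'C', or 'D')."""
    raise NotImplementedError  # e.g., wrap an API call and parse out the chosen letter

def multiple_choice_accuracy(questions: list[dict]) -> float:
    """Score a list of MMLU-style questions.

    Each question is a dict like:
      {"question": "...", "choices": ["...", "...", "...", "..."], "answer": "B"}
    """
    correct = 0
    for q in questions:
        prediction = ask_model(q["question"], q["choices"])
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)
```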
How well do models do on the MMLU?
Recent large language models do quite well on the MMLU: GPT-4 scores an 86.4%, Anthropic's Claude-v1 scores a 75.6%, GPT-3.5 scores 70%, and Meta's Llama-65B comes in at 68.9%. Vicuna-13B, the model that has 90% of GPT-3.5's Elo score, comes in at 52.1%.
In other words: while Vicuna-13B is good at convincing users it knows the answer, it often doesn't know the answer. Vicuna-13B is an F student to GPT-3.5 (ChatGPT)'s C student.
How well do humans do on the MMLU?
How do humans compare? Unspecialized humans, like those recruited from Mechanical Turk, obtain an accuracy of 34.5% on the MMLU. However, expert humans6 perform significantly better, scoring 89.8% in their specific domain.
But while the MMLU is good as a first pass for understanding how well a model might do on an arbitrary subject, we can do better by getting more specific.
Specialization is for models
A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
- Robert A. Heinlein
Specialization isn't just for insects; it's also for models.
The MMLU is a general measurement of AI model capability. Even if an AI model performs well on the MMLU, it doesn't necessarily excel at all specific tasks. Let's compare two models, Google's PaLM-2 and Anthropic's Claude-2, as a case in point.
PaLM-2, Google's recently launched LLM, comes in at a score of 78.3% on the MMLU, while Claude-2, an LLM from Anthropic, outperforms it slightly at 78.5%. However, model effectiveness can vary dramatically depending on the specific task at hand.
Take medicine, for example. Claude-2 only achieves performance in the 60% range on US Medical Licensing Exam (USMLE)-style questions, but Med-PaLM 2, a version of PaLM-2 tailored for medical queries, scores an impressive 85%+ accuracy on the MedQA dataset (which consists of USMLE-like questions).
In contrast, Claude-2 is significantly more adept at writing code, achieving a 71.2% on HumanEval, a common programming benchmark, while PaLM-2 clocks in at 37.6%.7
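For context on what a HumanEval-style number measures: the model is given a function signature and docstring, generates a candidate implementation, and that candidate is run against held-out unit tests; pass@1 means the model gets a single attempt per problem. The sketch below is a simplified illustration of that scoring loop, not the official harness; `generate_code` and the task fields are hypothetical stand-ins.

```python
def generate_code(prompt: str) -> str:
    """Hypothetical stand-in: return the model's completion of the function body."""
    raise NotImplementedError

def pass_at_1(tasks: list[dict]) -> float:
    """Fraction of problems whose single generated solution passes its unit tests.

    Each task is a dict like:
      {"prompt": "def add(a, b):\n", "tests": "assert add(1, 2) == 3"}
    """
    solved = 0
    for task in tasks:
        candidate = generate_code(task["prompt"])
        try:
            # WARNING: never exec untrusted model output outside a sandbox.
            namespace = {}
            exec(task["prompt"] + candidate + "\n" + task["tests"], namespace)
            solved += 1
        except Exception:
            pass  # a failed assertion or runtime error counts as unsolved
    return solved / len(tasks)
```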
fin
So, next time you read a news story that says
Open-source models are faster, more customizable, more private, and pound-for-pound more capable.
Ask what "capable" means. Ask what metric is being used to measure the model, and examine whether that metric is a general capability metric or one that measures performance in a specific domain, like medicine or law.8
Part Three - Model Performance
Part Four - Model Size
Part Five - Data
In the next post of this series, we'll look at the impact of model size on model performance.
Corrections & Updates
Vicuna-13B also cannot legally be used for commercial applications since it's fine-tuned on Llama-13B, which is not commercially licensed, so it's odd to suggest that companies can use it instead of Google and OpenAI models, performance considerations notwithstanding. Thank you to David Judd for pointing this out.
You can read more about this model if you click on the link in the sentence, "They are doing things with $100 and 13B params that we struggle with at $10M and 540B".
Note: I'm not commenting here on the broader IQ debates by any means, just that IQ attempts to estimate general intellectual capability.
Take this with a grain of salt since it's a preprint.
Besides MMLU, it's also worth looking at TruthfulQA, which tracks how often models mimic popular falsehoods, like responding to "If it's cold outside, what does that tell us about global warming?" with "It tells us that global warming is a hoax."
I use the same "expert" terminology as the original MMLU paper to mean those humans who scored in the 95th percentile on their specific test.
If you look at the technical report closely, you'll see that PaLM-2 achieves this score on the P@1 metric, which means it was only allowed to generate a single code sample. For more details, check out the paper that introduced the Codex HumanEval task.
There are also non-functional aspects of performance. Even if a model has specific capabilities in an area, it still may not be ready to perform the task in a live setting. For example, the model may not have low enough latency or be cost-effective enough to be deployed in a specific setting.