What's going on with the Open LLM Leaderboard? (huggingface.co)
2,666 words - html page
One Line
The Open LLM Leaderboard, a public leaderboard for large language models, has sparked discussion over discrepancies in benchmark numbers that stem from the use of different evaluation methods and implementations.
Key Points
- The Open LLM Leaderboard is a public leaderboard comparing open access large language models.
- A discussion on Twitter pointed out that the MMLU evaluation numbers shown on the leaderboard for the LLaMA models were significantly lower than the numbers in the published LLaMA paper.
- The Open LLM Leaderboard uses the EleutherAI LM Evaluation Harness to run evaluations and store results (a minimal example of running MMLU with the harness follows this list).
- Different implementations of the MMLU evaluation give different numbers and change the ranking order of the models on the leaderboard.
- Evaluations are strongly tied to their implementations, including prompts and tokenization, making comparisons across models and papers difficult.
- Open, standardized, and reproducible benchmarks like the EleutherAI Eval Harness and Stanford HELM are important for comparing results and improving language models.
- The EleutherAI Harness has been updated to match the original MMLU implementation, and the leaderboard will be updated with the new scores.
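For concreteness, here is a minimal sketch of how a model might be scored on MMLU with the EleutherAI LM Evaluation Harness, the library behind the leaderboard. It assumes a recent release of the harness; task and argument names have changed across harness versions, and the checkpoint name is only an illustrative placeholder, not the leaderboard's actual configuration.

```python
# Minimal sketch: scoring a model on MMLU with the EleutherAI LM Evaluation Harness.
# Assumes a recent (v0.4+) release of lm-eval; task and argument names differ
# between harness versions, and the model repo below is only an example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=huggyllama/llama-7b",  # example checkpoint (assumption)
    tasks=["mmlu"],                               # grouped MMLU task name in recent versions
    num_fewshot=5,                                # MMLU is conventionally run 5-shot
)

# Per-task metrics (accuracy etc.) live under the "results" key.
print(results["results"])
```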
Summary
266-word summary
The Open LLM Leaderboard, a public leaderboard for large language models, has generated discussion due to discrepancies in the reported benchmark numbers. The discussion centered on one of the four evaluations on the leaderboard, MMLU, which measures Massive Multitask Language Understanding: the MMLU numbers shown on the leaderboard for the LLaMA models were lower than those reported in the published LLaMA paper. The leaderboard is a wrapper running the open-source benchmarking library, the EleutherAI LM Evaluation Harness, and displays the results obtained with it. The LLaMA team used the evaluation code from the original UC Berkeley team, and their numbers differed significantly from those produced by the harness. The Falcon team and other collaborators were involved in re-evaluating the LLaMA models.

The post discusses the challenges and differences in evaluating models with different implementations, such as the original MMLU implementation, the EleutherAI Harness, and Stanford HELM. Evaluations are strongly tied to their implementations, including prompts and tokenization, and different evaluation methods can yield different scores and rankings for the same models. The communities around the EleutherAI Harness, Stanford HELM, and other evaluation libraries are invaluable for comparing results across models and papers.

Hugging Face is updating the Open LLM Leaderboard with an updated version of the EleutherAI Eval Harness. The MMLU evaluation in the harness has been updated to match the original implementation, and the leaderboard will include scores from the Eleuther Harness v2 in the coming weeks. The blog post invites readers to join the discussion on the Open LLM Leaderboard.
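To make the implementation-dependence concrete, the toy sketch below contrasts two ways of scoring the same multiple-choice question from model log-probabilities, in the spirit of the differences between implementations described in the post (some compare only the answer letters, others the full answer texts). The log-probability values are invented for illustration and do not come from any real model or from the blog post's code.

```python
# Toy illustration: two common ways of scoring a multiple-choice question from the
# same model log-probabilities can disagree. All numbers below are made up.

# Hypothetical summed log-probabilities assigned by a model to each continuation.
letter_logprobs = {"A": -2.1, "B": -1.8, "C": -2.5, "D": -2.4}   # just the letter
full_answer_logprobs = {                                          # letter + answer text
    "A": -14.2,
    "B": -17.9,
    "C": -12.7,   # longer answers accumulate probability mass differently
    "D": -16.3,
}

pick_by_letter = max(letter_logprobs, key=letter_logprobs.get)
pick_by_full_text = max(full_answer_logprobs, key=full_answer_logprobs.get)

print(pick_by_letter)     # "B"
print(pick_by_full_text)  # "C" -- same model, same question, different verdict
```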