What's going on with the Open LLM Leaderboard? (huggingface.co)
2,666 words - html page
One Line
The Open LLM Leaderboard, a public leaderboard for large language models, has sparked discussion over discrepancies in benchmark numbers that stem from the use of different evaluation methods and implementations.
Key Points
- The Open LLM Leaderboard is a public leaderboard comparing open access large language models.
- A discussion on Twitter pointed out that the MMLU evaluation numbers shown on the leaderboard for the LLaMA models were significantly lower than the numbers in the published LLaMA paper.
- The Open LLM Leaderboard uses the EleutherAI LM Evaluation Harness to run evaluations and store results (a minimal example of running MMLU with the harness follows this list).
- Different implementations of the MMLU evaluation give different numbers and change the ranking order of the models on the leaderboard.
- Evaluations are strongly tied to their implementations, including prompts and tokenization, making comparisons across models and papers difficult.
- Open, standardized, and reproducible benchmarks like the EleutherAI Eval Harness and Stanford HELM are important for comparing results and improving language models.
- The EleutherAI Harness has been updated to match the original MMLU implementation, and the leaderboard will be updated with the new scores.
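For concreteness, here is a minimal sketch of how a model might be scored on MMLU with the EleutherAI LM Evaluation Harness, the library behind the leaderboard. It assumes a recent release of the harness; task and argument names have changed across harness versions, and the checkpoint name is only an illustrative placeholder, not the leaderboard's actual configuration.

```python
# Minimal sketch: scoring a model on MMLU with the EleutherAI LM Evaluation Harness.
# Assumes a recent (v0.4+) release of lm-eval; task and argument names differ
# between harness versions, and the model repo below is only an example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=huggyllama/llama-7b",  # example checkpoint (assumption)
    tasks=["mmlu"],                               # grouped MMLU task name in recent versions
    num_fewshot=5,                                # MMLU is conventionally run 5-shot
)

# Per-task metrics (accuracy etc.) live under the "results" key.
print(results["results"])
```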
Summary
266-word summary
The Open LLM Leaderboard, a public leaderboard for large language models, has generated discussion due to discrepancies in the reported benchmark numbers. The discussion centered on one of the four evaluations on the leaderboard, MMLU, which measures Massive Multitask Language Understanding: the MMLU numbers shown on the leaderboard for the LLaMA models were lower than those reported in the published LLaMA paper. The leaderboard is a wrapper running the open-source benchmarking library, the EleutherAI LM Evaluation Harness, and displays the results obtained with it. The LLaMA team used the evaluation code from the original UC Berkeley team, and their numbers differed significantly from those produced by the harness. The Falcon team and other collaborators were involved in re-evaluating the LLaMA models.

The post discusses the challenges and differences in evaluating models with different implementations, such as the original MMLU implementation, the EleutherAI Harness, and Stanford HELM. Evaluations are strongly tied to their implementations, including prompts and tokenization, and different evaluation methods can yield different scores and rankings for the same models. The communities around the EleutherAI Harness, Stanford HELM, and other evaluation libraries are invaluable for comparing results across models and papers.

Hugging Face is updating the Open LLM Leaderboard with an updated version of the EleutherAI Eval Harness. The MMLU evaluation in the harness has been updated to match the original implementation, and the leaderboard will include scores from the Eleuther Harness v2 in the coming weeks. The blog post invites readers to join the discussion on the Open LLM Leaderboard.
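To make the implementation-dependence concrete, the toy sketch below contrasts two ways of scoring the same multiple-choice question from model log-probabilities, in the spirit of the differences between implementations described in the post (some compare only the answer letters, others the full answer texts). The log-probability values are invented for illustration and do not come from any real model or from the blog post's code.

```python
# Toy illustration: two common ways of scoring a multiple-choice question from the
# same model log-probabilities can disagree. All numbers below are made up.

# Hypothetical summed log-probabilities assigned by a model to each continuation.
letter_logprobs = {"A": -2.1, "B": -1.8, "C": -2.5, "D": -2.4}   # just the letter
full_answer_logprobs = {                                          # letter + answer text
    "A": -14.2,
    "B": -17.9,
    "C": -12.7,   # longer answers accumulate probability mass differently
    "D": -16.3,
}

pick_by_letter = max(letter_logprobs, key=letter_logprobs.get)
pick_by_full_text = max(full_answer_logprobs, key=full_answer_logprobs.get)

print(pick_by_letter)     # "B"
print(pick_by_full_text)  # "C" -- same model, same question, different verdict
```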