
lm-evaluation-harness

EleutherAI’s standardized framework for benchmarking generative language models across 200+ evaluation tasks.

EleutherAI built this library to standardize LLM performance measurement. It supports over 200 tasks (including MMLU, GSM8K, and HellaSwag) and integrates with backends such as Hugging Face Transformers, vLLM, and the OpenAI API. The harness automates the fiddly parts of evaluation: few-shot prompting, task-specific prompt formatting, and metric calculation (e.g., accuracy and perplexity). It remains the primary engine powering the Hugging Face Open LLM Leaderboard.
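A typical invocation of the harness's CLI looks something like the sketch below; the model name, task list, and batch size are illustrative choices, and the repository's README documents the current flags:

```shell
# Evaluate a Hugging Face model on two tasks with 5-shot prompting.
# Model and task choices here are illustrative examples.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-125m \
    --tasks hellaswag,gsm8k \
    --num_fewshot 5 \
    --batch_size 8
```

The `--model` flag selects the backend (here, Hugging Face Transformers), while `--tasks` accepts a comma-separated list of the supported benchmarks.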

https://github.com/EleutherAI/lm-evaluation-harness