lm-evaluation-harness
EleutherAI’s standardized framework for benchmarking generative language models across 200+ evaluation tasks.
EleutherAI built this library to standardize LLM performance measurement. It supports over 200 tasks (including MMLU, GSM8K, and HellaSwag) and integrates with backends such as Hugging Face Transformers, vLLM, and the OpenAI API. The harness automates the tedious parts of evaluation: few-shot prompting, task-specific formatting, and metric calculation (e.g., accuracy and perplexity). It remains the primary engine powering the Hugging Face Open LLM Leaderboard.
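In practice, a whole evaluation run reduces to one call. Below is a minimal sketch using the `simple_evaluate` helper from the v0.4+ `lm_eval` Python package; the model checkpoint, task choice, and batch size are illustrative:

```python
# Sketch: zero-shot evaluation through the lm_eval Python API (v0.4+).
# The checkpoint and task here are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face Transformers backend
    model_args="pretrained=EleutherAI/pythia-160m", # any HF checkpoint
    tasks=["hellaswag"],                            # one of the 200+ registered tasks
    num_fewshot=0,                                  # few-shot prompting is a single knob
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) are returned under results["results"].
print(results["results"]["hellaswag"])
```

The same run is available from the shell through the `lm_eval` command-line entry point, which takes equivalent `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags.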