lm-evaluation-harness
EleutherAI’s standardized framework for benchmarking generative language models across 200+ evaluation tasks.
EleutherAI built this library to standardize LLM performance measurement. It supports over 200 tasks (including MMLU, GSM8K, and HellaSwag) and integrates with backends such as Hugging Face Transformers, vLLM, and the OpenAI API. The harness automates the tedious parts of evaluation: few-shot prompting, task-specific formatting, and metric calculation (e.g., accuracy and perplexity). It remains the primary engine powering the Hugging Face Open LLM Leaderboard.
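In practice, a whole evaluation run reduces to one call. Below is a minimal sketch using the `simple_evaluate` helper from the v0.4+ `lm_eval` Python package; the model checkpoint, task choice, and batch size are illustrative:

```python
# Sketch: zero-shot evaluation through the lm_eval Python API (v0.4+).
# The checkpoint and task here are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face Transformers backend
    model_args="pretrained=EleutherAI/pythia-160m", # any HF checkpoint
    tasks=["hellaswag"],                            # one of the 200+ registered tasks
    num_fewshot=0,                                  # few-shot prompting is a single knob
    batch_size=8,
)

# Per-task metrics (e.g., accuracy) are returned under results["results"].
print(results["results"]["hellaswag"])
```

The same run is available from the shell through the `lm_eval` command-line entry point, which takes equivalent `--model`, `--model_args`, `--tasks`, and `--num_fewshot` flags.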