JudgeBench
JudgeBench is a rigorous benchmark for objectively stress-testing LLM-based judges on factual and logical correctness, rather than mere alignment with human preferences.
JudgeBench is a novel evaluation framework designed to rigorously test LLM-based judges, moving beyond subjective human preferences. A specialized pipeline converts existing challenging datasets into 350 curated response pairs, each requiring the judge to detect subtle factual or logical errors. The results are clear: JudgeBench poses a significantly greater challenge than previous benchmarks, with top-tier models such as GPT-4o often performing only slightly better than random guessing, and the strongest model evaluated (Claude-3.5-Sonnet) reaching just 64% accuracy. The benchmark provides a reliable platform for assessing and advancing the next generation of LLM evaluators.
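As a rough illustration of the evaluation setup described above, the sketch below shows how a pairwise judge could be scored against ground-truth labels: each item pairs an objectively correct response with a subtly flawed one, and the judge's accuracy is compared to the 50% random-guessing baseline. The `judge_fn` callback and the field names (`question`, `response_a`, `response_b`, `label`) are illustrative assumptions, not JudgeBench's actual data format or API.

```python
import random
from typing import Callable, Dict, List


def evaluate_pairwise_judge(
    judge_fn: Callable[[str, str, str], str],
    pairs: List[Dict[str, str]],
) -> float:
    """Score a pairwise judge against ground-truth labels.

    Each pair holds a question, two candidate responses, and a label
    ("A" or "B") marking the objectively correct response.
    Returns accuracy; a random guesser scores roughly 0.5.
    """
    correct = 0
    for pair in pairs:
        verdict = judge_fn(pair["question"], pair["response_a"], pair["response_b"])
        if verdict == pair["label"]:
            correct += 1
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy example with a single pair and a random "judge" as the baseline.
    toy_pairs = [
        {
            "question": "What is 17 * 23?",
            "response_a": "391",
            "response_b": "401",
            "label": "A",
        },
    ]
    random_judge = lambda q, a, b: random.choice(["A", "B"])
    print(evaluate_pairwise_judge(random_judge, toy_pairs))
```

In this framing, the random-guessing baseline sits at about 50% accuracy, which is the reference point for the GPT-4o and Claude-3.5-Sonnet figures reported above.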