MT-Bench
A high-quality multi-turn benchmark that uses GPT-4 as a judge to evaluate how large language models handle complex, conversational instructions.
MT-Bench consists of 80 high-quality multi-turn questions across eight categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities). Developed by the LMSYS Org team (the creators of Chatbot Arena), it measures a model's ability to maintain coherence and follow instructions over two-turn interactions. The framework pairs a curated set of prompts with an automated LLM-as-a-judge system (typically GPT-4) to provide scalable, human-aligned scoring. This approach achieves roughly 0.8 agreement with human preferences while significantly reducing the time and cost of traditional manual evaluation.