MT-Bench
A high-quality multi-turn benchmark that uses GPT-4 as a judge to evaluate how large language models handle complex, conversational instructions.
MT-Bench consists of 80 high-quality multi-turn questions across eight categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities). Developed by the LMSYS Org team (the creators of Chatbot Arena), it measures a model's ability to maintain coherence and follow instructions over two-turn interactions. The framework pairs a curated set of prompts with an automated LLM-as-a-judge system (typically GPT-4) to provide scalable, human-aligned scoring. This approach achieves roughly 0.8 agreement with human preferences while significantly reducing the time and cost of traditional manual evaluation.