WebArena
WebArena is a realistic, self-hostable web environment and benchmark for rigorously evaluating autonomous agents on complex, long-horizon tasks.
WebArena provides a high-fidelity, reproducible environment for agent development. It features fully functional websites across four common domains, including e-commerce, social forums, and collaborative development. The benchmark evaluates functional correctness: a task, given as a natural language command, counts as solved only if the agent actually achieves the intended outcome. It also includes utility tools such as maps, along with external knowledge bases (e.g., Wikipedia, user manuals), to encourage human-like problem-solving. Initial results confirm the difficulty: the best GPT-4-based agent achieved a success rate of only 14.41% on the benchmark, far below human performance (78.24%).
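To make the idea of functional-correctness scoring concrete, here is a minimal Python sketch. It is not WebArena's actual evaluation code: the `Task` class, the `success_rate` helper, the state dictionaries, and the two example tasks are all hypothetical, chosen only to illustrate that a task is scored by checking the resulting state rather than the sequence of actions the agent took. The final line computes the gap between the two success rates quoted above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """Hypothetical task: a natural language intent plus an outcome check."""
    intent: str
    check: Callable[[Dict[str, str]], bool]  # inspects the final state only


def success_rate(tasks: List[Task], final_states: List[Dict[str, str]]) -> float:
    """Percentage of tasks whose final state passes its outcome check."""
    if not tasks:
        return 0.0
    passed = sum(1 for task, state in zip(tasks, final_states) if task.check(state))
    return 100.0 * passed / len(tasks)


# Two illustrative tasks (invented for this sketch, not taken from the benchmark).
tasks = [
    Task("Set the order status to shipped",
         lambda s: s.get("order_status") == "shipped"),
    Task("Post a forum thread titled 'Hello'",
         lambda s: s.get("last_post_title") == "Hello"),
]

# Hypothetical final states observed after the agent finished each task.
final_states = [
    {"order_status": "shipped"},    # intended outcome reached
    {"last_post_title": "Draft"},   # intended outcome not reached
]

rate = success_rate(tasks, final_states)

# Gap between the reported human and GPT-4 agent success rates, in percentage points.
gap = round(78.24 - 14.41, 2)
```

Scoring by outcome means that any path through the site that reaches the intended state counts as a success, which is why the check functions above look only at the final state.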