Technology
OmniParser v2
Microsoft Research's AI screen parser, which converts graphical user interface (GUI) screenshots into structured, LLM-interpretable data.
OmniParser v2 is a screen parser designed to power autonomous GUI agents: it tokenizes UI screenshots from pixel space into structured elements. The system works in two steps: a fine-tuned YOLOv8 model detects interactive elements, and the Florence-2 foundation model generates a descriptive label for each one, clarifying its function. This architecture cuts latency by 60% compared to V1. The combined speed and accuracy enable Large Language Models (LLMs) such as GPT-4o and DeepSeek R1 to comprehend and interact with complex interfaces, achieving state-of-the-art results on benchmarks such as ScreenSpot Pro.
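The two-step flow described above can be illustrated with a minimal sketch. This is not OmniParser's actual API: the names `UIElement`, `parse_screenshot`, `to_prompt`, `fake_detect`, and `fake_caption` are hypothetical, and the detector and captioner are stand-in stubs where a real system would call a YOLOv8-style detector and a Florence-2-style captioner.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class UIElement:
    """One structured element extracted from a screenshot (hypothetical type)."""
    element_id: int
    bbox: BBox
    label: str


def parse_screenshot(
    image: Any,
    detect: Callable[[Any], List[BBox]],
    caption: Callable[[Any, BBox], str],
) -> List[UIElement]:
    """Step 1: detect interactive regions. Step 2: caption each region."""
    elements = []
    for i, bbox in enumerate(detect(image)):
        elements.append(UIElement(element_id=i, bbox=bbox, label=caption(image, bbox)))
    return elements


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize the elements as plain text that an LLM can read and reference by id."""
    return "\n".join(f"[{e.element_id}] {e.label} at {e.bbox}" for e in elements)


# Stand-in stubs (assumptions): a real pipeline would run models here.
def fake_detect(image: Any) -> List[BBox]:
    return [(10, 10, 90, 40), (10, 60, 90, 90)]


def fake_caption(image: Any, bbox: BBox) -> str:
    return "Submit button" if bbox[1] < 50 else "Cancel button"


elements = parse_screenshot(None, fake_detect, fake_caption)
print(to_prompt(elements))
```

Passing the detector and captioner in as callables keeps the sketch independent of any particular model library; an agent would then hand the serialized text to an LLM and ask it to choose an element id to act on.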