Technology

OmniParser v2

Microsoft Research's AI screen parser: it converts any graphical user interface (GUI) screenshot into structured, LLM-interpretable data.

OmniParser v2 is a screen parser designed to power autonomous GUI agents: it tokenizes UI screenshots from pixel space into structured elements. The system works in two steps: a fine-tuned YOLOv8 model detects interactive elements, and the Florence-2 foundation model generates a descriptive label for each one, clarifying its function. This architecture delivers a reported 60% reduction in latency compared to v1. The combined speed and accuracy enable large language models (LLMs) such as GPT-4o and DeepSeek R1 to interpret and interact with complex interfaces, achieving state-of-the-art results on benchmarks such as ScreenSpot Pro.
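The two-step flow described above can be sketched roughly as follows. This is an illustrative outline only, not OmniParser's actual API: `UIElement`, `parse_screenshot`, `to_prompt`, and the stub functions `fake_detect` and `fake_caption` are hypothetical names, and the stubs stand in for the YOLOv8 detector and the Florence-2 captioner so the example runs without any model weights.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Callable, List, Tuple

# Bounding box as (x1, y1, x2, y2) in pixels. This layout is an assumption.
Box = Tuple[int, int, int, int]


@dataclass
class UIElement:
    """One interactive element found in a screenshot (hypothetical schema)."""
    element_id: int
    box: Box
    label: str


def parse_screenshot(
    image: Any,
    detect: Callable[[Any], List[Box]],
    caption: Callable[[Any, Box], str],
) -> List[UIElement]:
    """Step 1: detect interactive regions. Step 2: label each region."""
    elements = []
    for element_id, box in enumerate(detect(image)):
        elements.append(UIElement(element_id, box, caption(image, box)))
    return elements


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize the elements as JSON text that can be passed to an LLM."""
    return json.dumps([asdict(e) for e in elements], indent=2)


# Stand-ins for the real models (assumptions, for illustration only).
def fake_detect(image: Any) -> List[Box]:
    return [(10, 10, 110, 40), (10, 60, 210, 90)]


def fake_caption(image: Any, box: Box) -> str:
    return "Submit button" if box[1] < 50 else "Search text field"


elements = parse_screenshot(None, fake_detect, fake_caption)
print(to_prompt(elements))
```

Passing the detector and captioner in as callables keeps the two stages separate, so either one could be swapped for a real model call without changing the structured output the LLM receives.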

https://github.com/microsoft/OmniParser

Related technologies

GPT-4o (56), LangChain (438), Node (84), TypeScript (177)

Recent Talks & Demos

Mobile Agent: Vision Operator

Delhi Mar 22

GPT-4o OmniParser v2