WEBVOYAGER · 610 TASKS · 98.9% · 603/610 PASSED · BASELINE 98.5% · RANK 01 · APR 21 2026
Om Labs

Jina MCP + Claude Code + Opus 4.7 + GPT 5.4 Nano

Rank 01 · WebVoyager · 610 real-website tasks · Apr 21 2026

RESULT
98.9%
SOTA ▲NEW · APR 21 2026
603/610 passed
Baseline 98.5% · Judge gpt-5 · Evaluated with 15 screenshots/task
RUN STATS
Duration 48h 24m · avg 4m 45s
Tokens 46.7M · avg 97K
Steps 9503 · avg 15.6
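
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers shown on this page and assumes the averages are per task over all 610 tasks; the variable and function names are illustrative. The token average is not recomputed, since the page does not state what it is averaged over.

```python
# Cross-check of the reported figures, using only numbers shown on this page.
# Assumption: averages are taken per task over all 610 tasks.
TASKS = 610
PASSED = 603
DURATION_MIN = 48 * 60 + 24  # total run time, 48h 24m, in minutes
STEPS = 9503


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


def per_task_seconds(total_minutes: int, tasks: int) -> float:
    """Return the mean wall-clock time per task, in seconds."""
    return total_minutes * 60 / tasks


rate = pass_rate(PASSED, TASKS)
avg_seconds = per_task_seconds(DURATION_MIN, TASKS)
minutes, seconds = divmod(int(avg_seconds), 60)  # truncate to whole seconds
avg_steps = STEPS / TASKS

print(f"pass rate: {rate:.1f}%")
print(f"avg duration: {minutes}m {seconds}s")
print(f"avg steps: {avg_steps:.1f}")
```

Under that assumption, the pass rate, average duration, and average step count agree with the values reported above.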
ABOUT

WebVoyager benchmarks agents that browse real, live websites and return natural-language answers. Example: “Search for women’s hiking boots on Amazon filtered to waterproof, 4★+, size 6.” Other notable entrants include OpenAI’s Computer-Using Agent, Google DeepMind’s Project Mariner, and H Company’s Surfer 2, all of which have reported results on this benchmark.

The benchmark was introduced by He et al. (2024). This run follows the evaluation scaffolding from alumnium-hq/WebVoyager (run_claude_code.py).

⚠ METHODOLOGY NOTE — scores are judged by gpt-5 against up to 15 screenshots per task. A handful of task questions with expired date anchors (March 2026) were minimally adjusted to 2026/2027 so the agent could complete them in a post-March 2026 run.
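
As a rough illustration of the "up to 15 screenshots per task" cap, the sketch below picks which screenshots would be handed to the judge. It assumes the most recent screenshots of a trajectory are kept. The page does not say which ones are actually selected, and `select_screenshots` and the file names are hypothetical, not taken from the evaluation scaffolding.

```python
from typing import List

MAX_SCREENSHOTS = 15  # cap stated in the methodology note


def select_screenshots(paths: List[str], limit: int = MAX_SCREENSHOTS) -> List[str]:
    """Return at most `limit` screenshots, keeping the latest ones in order.

    Assumption: the final screenshots of a trajectory are the ones shown to
    the judge; the selection rule used in this run is not documented here.
    """
    if limit <= 0:
        return []
    return paths[-limit:]


# Hypothetical trajectory with 20 steps and one screenshot per step.
trajectory = [f"screenshot{i}.png" for i in range(1, 21)]
selected = select_screenshots(trajectory)
```

A trajectory shorter than the cap is passed through unchanged, which matches the "up to" wording of the note.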
