SOTA
WEBVOYAGER · 610 TASKS · 98.9% · 603/610 PASSED · BASELINE 98.5% · RANK 01 · APR 21 2026
WebVoyager Leaderboard·Rank 01·Apr 21 2026

Jina

Claude Code · Opus 4.7 · GPT 5.4 Nano

Om Labs’ submission to the WebVoyager benchmark: 610 live-website tasks across 15 real sites, judged by gpt-5 against up to 15 screenshots per task.

98.9%
SOTA · New · Apr 21 2026
603/610 passed
Baseline 98.5% · Judge gpt-5 · Per task 15 screenshots

Run stats

Duration 48h 24m · avg 4m 45s
Tokens 46.7M · avg 97K
Steps 9503 · avg 15.6
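
The headline figures above can be cross-checked with a little arithmetic. The sketch below is only an illustration, not part of the submission: the function names are hypothetical, and it assumes the averages are per task over all 610 tasks, with the duration truncated to whole seconds. Token figures are left out, since the page does not say what unit the 97K average is taken over.

```python
# Figures copied from the leaderboard entry above.
TASKS = 610
PASSED = 603
DURATION_S = 48 * 3600 + 24 * 60  # 48h 24m expressed in seconds
STEPS = 9503


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def avg_duration(total_seconds: int, tasks: int) -> str:
    """Mean wall-clock time per task, truncated to whole seconds (assumed)."""
    minutes, seconds = divmod(int(total_seconds / tasks), 60)
    return f"{minutes}m {seconds}s"


def avg_steps(steps: int, tasks: int) -> float:
    """Mean number of agent steps per task, rounded to one decimal place."""
    return round(steps / tasks, 1)


if __name__ == "__main__":
    print(pass_rate(PASSED, TASKS))         # 98.9
    print(avg_duration(DURATION_S, TASKS))  # 4m 45s
    print(avg_steps(STEPS, TASKS))          # 15.6
```

Under those assumptions the computed values match the reported 98.9%, 4m 45s, and 15.6.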

About

Methodology

WebVoyager benchmarks agents that browse real, live websites and return natural-language answers. Example: “Search for women’s hiking boots on Amazon filtered to waterproof, 4★+, size 6.” Other notable entrants include OpenAI’s Computer-Using Agent, Google DeepMind’s Project Mariner, and H Company’s Surfer 2, all of which have reported results on this benchmark.

WebVoyager was introduced by He et al. (2024). This run follows the evaluation scaffolding from alumnium-hq/WebVoyager (run_claude_code.py).

Note: scores are judged by gpt-5 against up to 15 screenshots per task. A handful of task questions with expired date anchors (March 2026) were minimally adjusted to 2026/2027 so the agent could complete them in a post-March 2026 run.

Per-site results
