Jina
Claude Code · Opus 4.7 · GPT 5.4 Nano
Om Labs’ submission to the WebVoyager benchmark — 610 live-website tasks across 15 real sites. Judged by gpt-5 against up to 15 screenshots per task.
About
Methodology
WebVoyager benchmarks agents that browse real, live websites and return natural-language answers. Example: “Search for women’s hiking boots on Amazon filtered to waterproof, 4★+, size 6.” Other notable entrants include OpenAI’s Computer-Using Agent, Google DeepMind’s Project Mariner, and H Company’s Surfer 2, all of which have reported results on this benchmark.
WebVoyager was introduced by He et al. (2024). This run follows the evaluation scaffolding from alumnium-hq/WebVoyager · run_claude_code.py.
Note — scores are judged by gpt-5 against up to 15 screenshots per task. A handful of task questions with expired date anchors (March 2026) were minimally adjusted to 2026/2027 so the agent could complete them in a post-March 2026 run.
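To make the screenshot cap concrete, here is a minimal sketch of how a judge input could be assembled with at most 15 screenshots per task. It is not taken from run_claude_code.py: the names `JudgeInput` and `build_judge_input` are hypothetical, and keeping the most recent screenshots (rather than the first ones or an even sample) is an assumption about how the cap is applied.

```python
from dataclasses import dataclass

MAX_SCREENSHOTS = 15  # cap stated in the note above


@dataclass
class JudgeInput:
    """Hypothetical container for what the judge model is shown."""
    task: str
    answer: str
    screenshots: list[str]


def build_judge_input(task, answer, screenshot_paths, limit=MAX_SCREENSHOTS):
    """Bundle a task, the agent's answer, and at most `limit` screenshots.

    Assumption: when a run produced more screenshots than the cap,
    the most recent ones are kept, since they show the final page state.
    """
    kept = list(screenshot_paths[-limit:]) if limit > 0 else []
    return JudgeInput(task=task, answer=answer, screenshots=kept)
```

For a run that captured 20 screenshots, this keeps the last 15 and drops the first 5; a run with fewer than 15 screenshots is passed through unchanged.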
