Jina MCP + Claude Code + Opus 4.7 + GPT 5.4 Nano
Rank 01 · WebVoyager · 610 real-website tasks · Apr 21 2026
WebVoyager benchmarks agents that browse real, live websites and return natural-language answers. Example: “Search for women’s hiking boots on Amazon filtered to waterproof, 4★+, size 6.” Other notable entrants include OpenAI’s Computer-Using Agent, Google DeepMind’s Project Mariner, and H Company’s Surfer 2, all of which have reported results on this benchmark.
WebVoyager was introduced by He et al. (2024). This run follows the evaluation scaffolding from alumnium-hq/WebVoyager · run_claude_code.py.
⚠ METHODOLOGY NOTE: each task is judged by gpt-5 against up to 15 screenshots. A handful of task questions with expired date anchors (March 2026) were minimally adjusted to 2026/2027 so the agent could complete them in a post-March 2026 run.
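As a rough illustration of the screenshot cap, here is a minimal sketch of how a judge input might be assembled. The names `JudgeInput` and `build_judge_input` are hypothetical and do not come from the scaffolding, and keeping the most recent screenshots is an assumption: the note states only the limit of 15, not which screenshots are selected.

```python
from dataclasses import dataclass
from typing import List, Sequence

MAX_SCREENSHOTS = 15  # cap stated in the methodology note


@dataclass
class JudgeInput:
    """Hypothetical container for what the judge model is shown."""
    question: str
    answer: str
    screenshots: List[str]


def build_judge_input(
    question: str,
    answer: str,
    screenshot_paths: Sequence[str],
    limit: int = MAX_SCREENSHOTS,
) -> JudgeInput:
    # Assumption: when a task produced more screenshots than the cap,
    # keep the most recent ones, in their original order.
    kept = list(screenshot_paths)[-limit:] if limit > 0 else []
    return JudgeInput(question=question, answer=answer, screenshots=kept)
```

For example, a task that produced 20 screenshots would pass only 15 to the judge under this sketch, while a task with 3 screenshots would pass all 3.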
