ClawBench: Can AI Agents Complete Everyday Online Tasks? 153 tasks, 144 live websites, best model at 33.3% [R]

We introduce ClawBench, a benchmark that evaluates AI browser agents on 153 real-world everyday tasks across 144 live websites. Unlike synthetic benchmarks, ClawBench tests agents on actual production platforms.

Key findings:

  • The best model (Claude Sonnet 4.6) achieves only 33.3% success rate
  • GLM-5 (Zhipu AI) places second at 24.2%, surprisingly strong for a text-only model
  • Finance and Academic tasks are easier (50% for the best model); Travel and Dev tasks are much harder
  • No model exceeds 50% in any category, so there is still a long way to go

What makes ClawBench different:

  • Tasks on real live websites, not sandboxed environments
  • 5 layers of behavioral data: session replay, screenshots, HTTP traffic, agent reasoning traces, browser actions
  • Request interceptor blocks the final HTTP request before irreversible actions (payments, bookings), enabling safe evaluation
  • Human ground-truth for every task
  • Agentic evaluator with step-level traceable diagnostics
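The request-interception layer above can be approximated as a simple method/URL predicate. The patterns, endpoint names, and function name below are illustrative assumptions, not ClawBench's actual rules; in a real harness this predicate would sit inside a proxy or a browser routing hook (e.g. Playwright's `page.route`) and abort matching requests before they reach the live site.

```python
import re

# Hypothetical denylist of URL patterns for irreversible actions
# (payments, bookings). Purely illustrative, not ClawBench's rules.
IRREVERSIBLE_PATTERNS = [
    re.compile(r"/(checkout|payments?|pay)\b"),
    re.compile(r"/bookings?/confirm\b"),
]

# Only state-changing HTTP methods can be irreversible.
MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def should_block(method: str, url: str) -> bool:
    """Return True if this request should be aborted instead of
    being allowed through to the production site."""
    if method.upper() not in MUTATING_METHODS:
        return False
    return any(p.search(url) for p in IRREVERSIBLE_PATTERNS)
```

The key design point is that everything up to the final mutating request runs against the real site, so the agent sees genuine production behavior; only the last, irreversible call is cut.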

Happy to answer any questions! We’re actively looking for feedback on task selection and evaluation methodology.


submitted by /u/Extreme_Play_8554
