ClawBench v1.0.0
Initial public release of ClawBench -- a benchmark for evaluating AI agents on 153 everyday online tasks across 144 live production websites.
Highlights
- 153 tasks spanning 15 life categories (daily life, travel, education, job search, etc.)
- 144 live websites -- real production sites, not sandboxed clones
- Isolated Docker containers with Chromium for each run
- Request interceptor that blocks the final irreversible action to prevent real-world side effects
- Five-layer recording: MP4 replay, screenshots, HTTP traffic, DOM actions, agent messages
- Interactive TUI for model selection, test case picking, and run management
- 6 frontier models evaluated: Claude Sonnet 4.6 (33.3%), GLM-5 (24.2%), Gemini 3 Flash (19.0%), Claude Haiku 4.5 (18.3%), GPT-5.4 (6.5%), Gemini 3.1 Flash Lite (3.3%)
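
The request interceptor above can be pictured as a small decision function sitting between the agent's browser and the live site. The sketch below is illustrative only -- the method and URL heuristics are assumptions, not ClawBench's actual rules:

```python
# Illustrative sketch: block the final irreversible action (e.g. placing an
# order) while letting ordinary page traffic through. The patterns below are
# hypothetical examples, not ClawBench's real classifier.

IRREVERSIBLE_METHODS = {"POST", "PUT", "DELETE"}
# Hypothetical path fragments that mark a state-changing endpoint.
IRREVERSIBLE_PATTERNS = ("checkout", "purchase", "submit-order", "confirm")

def should_block(method: str, url: str) -> bool:
    """Return True if a request looks like a final irreversible action."""
    if method.upper() not in IRREVERSIBLE_METHODS:
        return False
    return any(p in url.lower() for p in IRREVERSIBLE_PATTERNS)

def intercept(method: str, url: str) -> str:
    """Decide the fate of an outgoing request."""
    if should_block(method, url):
        # Record the attempt for grading, but never let it reach the site.
        return "blocked"
    return "allowed"
```

In a browser-automation setup (e.g. Playwright's route handlers), a function like this would sit in the network-routing layer so that blocked requests are fulfilled with a synthetic response instead of hitting the production site.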
Links
- Paper: https://arxiv.org/abs/2604.08523
- Project Page: https://claw-bench.com