ClawBench v1.0.0
Initial public release of ClawBench -- a benchmark for evaluating AI agents on 153 everyday online tasks across 144 live production websites.
Highlights
- 153 tasks spanning 15 life categories (daily life, travel, education, job search, etc.)
- 144 live websites -- real production sites, not sandboxed clones
- Isolated Docker containers with Chromium for each run
- Request interceptor that blocks the final irreversible action to prevent real-world side effects
- Five-layer recording: MP4 replay, screenshots, HTTP traffic, DOM actions, agent messages
- Interactive TUI for model selection, test case picking, and run management
- 6 frontier models evaluated: Claude Sonnet 4.6 (33.3%), GLM-5 (24.2%), Gemini 3 Flash (19.0%), Claude Haiku 4.5 (18.3%), GPT-5.4 (6.5%), Gemini 3.1 Flash Lite (3.3%)
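
The request interceptor above can be pictured as a small decision function sitting between the agent's browser and the live site. The sketch below is illustrative only -- the method and URL heuristics are assumptions, not ClawBench's actual rules:

```python
# Illustrative sketch: block the final irreversible action (e.g. placing an
# order) while letting ordinary page traffic through. The patterns below are
# hypothetical examples, not ClawBench's real classifier.

IRREVERSIBLE_METHODS = {"POST", "PUT", "DELETE"}
# Hypothetical path fragments that mark a state-changing endpoint.
IRREVERSIBLE_PATTERNS = ("checkout", "purchase", "submit-order", "confirm")

def should_block(method: str, url: str) -> bool:
    """Return True if a request looks like a final irreversible action."""
    if method.upper() not in IRREVERSIBLE_METHODS:
        return False
    return any(p in url.lower() for p in IRREVERSIBLE_PATTERNS)

def intercept(method: str, url: str) -> str:
    """Decide the fate of an outgoing request."""
    if should_block(method, url):
        # Record the attempt for grading, but never let it reach the site.
        return "blocked"
    return "allowed"
```

In a browser-automation setup (e.g. Playwright's route handlers), a function like this would sit in the network-routing layer so that blocked requests are fulfilled with a synthetic response instead of hitting the production site.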
Links
- Paper: https://arxiv.org/abs/2604.08523
- Project Page: https://claw-bench.com