ClawBench v1.0.0

11 Apr 22:31 · e30d4e3

Initial public release of ClawBench -- a benchmark for evaluating AI agents on 153 everyday online tasks across 144 live production websites.

Highlights

  • 153 tasks spanning 15 life categories (daily life, travel, education, job search, etc.)
  • 144 live websites -- real production sites, not sandboxed clones
  • Isolated Docker containers with Chromium for each run (see the isolation sketch after this list)
  • Request interceptor that blocks the final irreversible action to prevent real-world side effects (see the interception sketch below)
  • Five-layer recording: MP4 replay, screenshots, HTTP traffic, DOM actions, agent messages (see the recording sketch below)
  • Interactive TUI for model selection, test case picking, and run management
  • 6 frontier models evaluated, scored by overall task success rate: Claude Sonnet 4.6 (33.3%), GLM-5 (24.2%), Gemini 3 Flash (19.0%), Claude Haiku 4.5 (18.3%), GPT-5.4 (6.5%), Gemini 3.1 Flash Lite (3.3%)
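
Per-run isolation could be wired along these lines. This is a minimal sketch using the Docker SDK for Python; the image name, command, and resource flags are illustrative assumptions, not details from the ClawBench codebase.

```python
# Hypothetical per-run isolation sketch (pip install docker). The image
# name and flags below are assumptions; the notes only state that each
# run gets its own Docker container with Chromium.
import docker

client = docker.from_env()

def start_run(task_id: str):
    # A fresh container per run: no cookies, profile, or cache shared
    # between tasks; auto_remove tears it down when the run finishes.
    return client.containers.run(
        "clawbench/runner:latest",    # assumed image bundling Chromium
        command=["--task", task_id],
        detach=True,
        auto_remove=True,
        shm_size="2g",                # Chromium needs more than the 64 MB default /dev/shm
        network_mode="bridge",
    )
```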
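
The side-effect guard can be illustrated with Playwright's request-routing API. The endpoint patterns below are hypothetical stand-ins for whatever ClawBench classifies as the final irreversible action; the notes do not describe the actual interceptor.

```python
# Hedged sketch: abort requests to "irreversible" endpoints so the agent's
# final action is observed but never reaches the live site. Patterns are
# illustrative, not ClawBench's real rules.
from playwright.sync_api import Route, sync_playwright

BLOCKED_PATTERNS = ["**/checkout/**", "**/payment/**", "**/orders/submit**"]

def block_irreversible(route: Route) -> None:
    # Record the attempt, then cancel the request client-side.
    print(f"blocked: {route.request.method} {route.request.url}")
    route.abort("blockedbyclient")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for pattern in BLOCKED_PATTERNS:
        page.route(pattern, block_irreversible)
    page.goto("https://example.com")
    # ... agent drives the page; matching requests never leave the browser ...
    browser.close()
```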
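
Several of the recording layers map onto Playwright's built-in capture facilities, as in the sketch below. Playwright records WebM video natively, so producing an MP4 replay implies a conversion step not shown here, and agent-message logging would live in the harness itself.

```python
# Hedged sketch of multi-layer capture with Playwright's stock APIs.
import json
from playwright.sync_api import sync_playwright

http_log = []  # layer: HTTP traffic

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Layer: video replay (WebM; an MP4 step would convert afterwards).
    context = browser.new_context(record_video_dir="recordings/video")
    # Layers: screenshots and per-action DOM snapshots via tracing.
    context.tracing.start(screenshots=True, snapshots=True)

    page = context.new_page()
    page.on("request", lambda r: http_log.append({"method": r.method, "url": r.url}))
    page.on("response", lambda r: http_log.append({"status": r.status, "url": r.url}))

    page.goto("https://example.com")
    # ... agent actions here; agent messages are logged by the harness ...

    context.tracing.stop(path="recordings/trace.zip")
    context.close()   # finalizes the video file
    browser.close()

with open("recordings/http_traffic.json", "w") as f:
    json.dump(http_log, f, indent=2)
```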

Links