# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.

## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.

## Architecture
```
agent_eval/                  # Python package (config, runner, state)
  config.py                  # EvalConfig from eval.yaml
  state.py                   # Shared state persistence (key-value store)
  agent/
    base.py                  # EvalRunner ABC + RunResult
    claude_code.py           # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py            # MLflow experiment setup, server check, feedback logging
    datasets.py              # Dataset create/sync utilities
    traces.py                # Trace search and input extraction

skills/eval-setup/           # Skill: environment setup
  SKILL.md                   # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py             # Preflight environment checks

skills/eval-analyze/         # Skill: bootstrap eval config
  SKILL.md                   # Analyze skill, generate eval.yaml + eval.md
  scripts/
    find_skills.py           # Skill discovery (reads plugin.json for paths)
    validate_eval.py         # Config and memory validation
  prompts/
    analyze-skill.md         # Skill analysis prompt
    generate-eval-md.md      # eval.md generation prompt
  references/
    eval-yaml-template.md    # Full eval.yaml template for generation

skills/eval-dataset/         # Skill: generate test cases
  SKILL.md                   # Bootstrap, expand, or extract cases from traces

skills/eval-run/             # Skill: execute eval suite
  SKILL.md                   # Prepare, execute, collect, score, report
  scripts/
    workspace.py             # Workspace creation, batch.yaml, symlinks
    execute.py               # Skill execution via agent runner
    collect.py               # Artifact collection + case mapping
    score.py                 # Scoring: inline checks, LLM judges, pairwise, regression
    report.py                # HTML report generation (scoring summary, per-case details, diffs)
    tools.py                 # PreToolUse hook for tool interception
  prompts/
    analyze-results.md       # Results interpretation prompt
    comparison-judge.md      # Pairwise comparison judge prompt
  references/
    data-pipeline.md         # Dataset → workspace → execution → scoring flow
    tool-interception.md     # Tool interception format and field reference

skills/eval-review/          # Skill: interactive human review
  SKILL.md                   # Present results, collect feedback, propose changes
  prompts/
    review-results.md        # Analysis framework for feedback patterns

skills/eval-mlflow/          # Skill: MLflow integration
  SKILL.md                   # Dataset sync, result logging, trace feedback
  scripts/
    sync_dataset.py          # Push cases to MLflow dataset registry
    log_results.py           # Log run params, metrics, artifacts to MLflow
    attach_feedback.py       # Push/pull feedback between harness and traces
    from_traces.py           # Extract inputs from production traces

skills/eval-optimize/        # Skill: automated refinement loop
  SKILL.md                   # Composes with /eval-run via Skill tool
```

## How It Works

Skills projects create an `eval.yaml` config file with:
- `skill` — the skill to evaluate
- `arguments` — arguments string passed to the skill invocation
- `runner` — agent runner (`claude-code`, etc.); `runner_options` holds runner-specific settings
- `permissions` — `allow`/`deny` tool patterns for headless execution
- `dataset` — `path` to the test cases directory, plus a `schema` describing case structure in natural language
- `inputs.tools` — tool interception for headless eval: `match` describes what to intercept, `prompt` how to handle it
- `outputs` — list of artifact dirs (`path`) and/or tool calls (`tool`) with natural language schemas
- `traces` — execution data to capture: stdout/stderr, events, metrics (exit code, tokens, cost)
- `judges` — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
- `thresholds` — per-judge regression detection
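
Putting those fields together, a minimal `eval.yaml` might look like the sketch below. Only the top-level key names come from the list above; the nesting, placeholder syntax, file names, and values are illustrative guesses, not taken from a real project:

```yaml
skill: my-skill
arguments: "--input request.md"        # hypothetical argument string
runner: claude-code
runner_options:
  model: opus
permissions:
  allow: ["Read", "Write", "Edit"]
  deny: ["WebFetch"]
dataset:
  path: eval/cases
  schema: >
    Each case is a directory containing a request.md that describes
    the task in natural language.
inputs:
  tools:
    - match: "calls that fetch external ticket data"
      prompt: "Answer from the fixture files in the case directory."
outputs:
  - path: out/
    schema: Generated artifacts, one file per deliverable.
traces:
  - stdout
  - metrics
judges:
  - name: artifacts-present
    check: eval/checks/artifacts.py
  - name: quality
    prompt_file: eval/judges/quality.md
thresholds:
  quality: 0.8
```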

Runs are stored in `$AGENT_EVAL_RUNS_DIR` (default `eval/runs`), configured during `/eval-setup`.

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from eval.yaml directly — no extraction spec, no hardcoded field names.
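
In the same spirit, an inline `check` judge can be a plain script that inspects an output directory and reports a score. The contract between score.py and check scripts is not documented here, so this is a sketch under assumptions: the output directory arrives as `argv[1]`, the required artifact names are illustrative, and the JSON-on-stdout result shape is hypothetical.

```python
"""Hypothetical inline `check` judge for eval-run's score.py.

Assumptions (not from the harness docs): the scored output directory is
passed as argv[1], and the script reports its result as JSON on stdout.
"""
import json
import sys
from pathlib import Path

# Illustrative artifact names a skill might be expected to produce.
REQUIRED = ["report.md", "summary.json"]

def check(output_dir: str) -> dict:
    """Score an output directory by the fraction of required files present."""
    out = Path(output_dir)
    missing = [name for name in REQUIRED if not (out / name).exists()]
    return {
        "score": 1.0 - len(missing) / len(REQUIRED),
        "passed": not missing,
        "missing": missing,
    }

if __name__ == "__main__" and len(sys.argv) > 1:
    print(json.dumps(check(sys.argv[1])))
```

Because the script only receives a path and decides for itself what to look at, it needs no extraction spec, matching the design note above.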

## Usage

```
/eval-setup                      # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill   # Analyze: understand skill, generate eval.yaml
/eval-dataset                    # Dataset: generate test cases
/eval-run --model opus           # Run: execute eval suite
/eval-review --run-id <id>       # Review: interactive human feedback + changes
/eval-mlflow --run-id <id>       # MLflow: sync dataset, log results
/eval-optimize --model opus      # Optimize: automated refinement loop
```

## Key Design Decisions

1. **Schema-driven** — dataset and output structures are described in natural language in eval.yaml; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with an `--agent` flag on execute.py; Claude Code included, extensible to OpenCode/Agent SDK
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
4. **MLflow as separate skill** — `/eval-mlflow` handles dataset sync, result logging, and trace feedback; eval-run works without it
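
Decision 2 can be pictured as follows. The names `EvalRunner` and `RunResult` come from `agent_eval/agent/base.py` in the layout above, but the fields, method signature, and the `EchoRunner` backend are illustrative assumptions, not the actual implementation:

```python
"""Sketch of the agent-agnostic runner abstraction (design decision 2).

`EvalRunner` and `RunResult` exist in agent_eval/agent/base.py; the
fields and signature below are assumptions made for illustration.
"""
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class RunResult:
    """Outcome of one agent invocation, as execute.py might consume it."""
    exit_code: int
    stdout: str = ""
    stderr: str = ""
    metrics: dict = field(default_factory=dict)  # e.g. tokens, cost

class EvalRunner(ABC):
    """One subclass per agent backend; execute.py would pick one via --agent."""

    @abstractmethod
    def run(self, prompt: str, workspace: str) -> RunResult:
        """Run the agent on a prompt inside a workspace directory."""

class EchoRunner(EvalRunner):
    """Stand-in backend used here only to demonstrate the contract."""

    def run(self, prompt: str, workspace: str) -> RunResult:
        return RunResult(
            exit_code=0,
            stdout=prompt,
            metrics={"tokens": len(prompt.split())},
        )
```

A real backend such as the Claude Code runner would shell out to `claude --print` inside `run` and fill `RunResult` from the process output; swapping backends then touches nothing outside the subclass.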

## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace