# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.

## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.

## Architecture

```
agent_eval/                  # Python package (config, runner, state)
  config.py                  # EvalConfig from eval.yaml
  state.py                   # Shared state persistence (key-value store)
  agent/
    base.py                  # EvalRunner ABC + RunResult
    claude_code.py           # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py            # MLflow experiment setup (used by eval-setup)

skills/eval-setup/           # Skill: environment setup
  SKILL.md                   # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py             # Preflight environment checks

skills/eval-analyze/         # Skill: bootstrap eval config
  SKILL.md                   # Analyze skill, generate eval.yaml + eval.md
  scripts/
    discover.py              # Skills and config discovery
  prompts/
    analyze-skill.md         # Skill analysis prompt
    generate-eval-md.md      # eval.md generation prompt

skills/eval-run/             # Skill: execute eval suite
  SKILL.md                   # Prepare, execute, collect, score, report
  scripts/
    workspace.py             # Workspace creation, batch.yaml, symlinks
    execute.py               # Skill execution via agent runner
    collect.py               # Artifact collection + case mapping
    score.py                 # Scoring: inline checks, LLM judges, pairwise, regression
  prompts/
    analyze-results.md       # Results interpretation prompt
    comparison-judge.md      # Pairwise comparison judge prompt

skills/eval-mlflow/          # Skill: MLflow integration
  SKILL.md                   # Dataset sync, result logging, trace feedback

skills/eval-optimize/        # Skill: automated refinement loop
  SKILL.md                   # Composes with /eval-run via Skill tool
```

## How It Works

Skills projects create an `eval.yaml` config file with:
- `dataset.schema` — natural language description of case structure (inputs, references)
- `outputs` — list of artifact dirs with natural language schemas describing what the skill produces
- `judges` — inline `check` scripts, LLM prompts, or external code judges
- `thresholds` — regression detection

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from `eval.yaml` directly — no extraction spec, no hardcoded field names.
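A minimal `eval.yaml` sketch illustrating this shape. Only the top-level keys (`dataset.schema`, `outputs`, `judges`, `thresholds`) and the judge fields (`check`, `prompt_file`, `module`/`function`) come from this README; the paths, judge names, and placeholder variables are hypothetical:

```yaml
# Hypothetical eval.yaml sketch, not the canonical schema.
dataset:
  schema: >
    Each case has an input prompt and a reference answer file.

outputs:
  - dir: eval/artifacts/          # hypothetical artifact dir
    schema: >
      One markdown report per case, named <case_id>.md.

judges:
  - name: report_exists           # inline check script
    check: "test -f {output_dir}/{case_id}.md"
  - name: quality                 # LLM judge
    prompt_file: eval/prompts/quality-judge.md
  - name: custom                  # external code judge
    module: my_eval.judges
    function: score_report

thresholds:
  regression: 0.05                # hypothetical regression threshold
```

Note that the `schema` values are free-form prose, not a formal spec: agents and judges read them, while the scripts only consume the paths.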

## Usage

```
/eval-setup                       # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill    # Analyze: understand skill, generate eval.yaml
/eval-run --model opus            # Run: execute eval suite
/eval-mlflow --run-id <id>        # MLflow: sync dataset, log results
/eval-optimize --model opus       # Optimize: iteratively improve skill
```

## Key Design Decisions

1. **Schema-driven** — dataset and output structures described in natural language in `eval.yaml`; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with `--agent` flag on `execute.py`; Claude Code included, extensible to OpenCode/Agent SDK
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
4. **MLflow as separate skill** — `/eval-mlflow` handles dataset sync, result logging, trace feedback; eval-run works without it
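The agent-agnostic runner could look roughly like this. This is a sketch: the names mirror `agent_eval/agent/base.py` as described above, but the exact `RunResult` fields and `run` signature are assumptions, and `EchoRunner` is a toy stand-in for a real backend such as the Claude Code CLI runner:

```python
# Sketch of the EvalRunner abstraction; fields and signatures are
# assumptions, not the real agent_eval API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one skill invocation (hypothetical fields)."""
    case_id: str
    output_dir: str
    exit_code: int
    transcript: str = ""


class EvalRunner(ABC):
    """Agent-agnostic runner; concrete subclasses wrap a specific CLI/SDK."""

    @abstractmethod
    def run(self, prompt: str, case_id: str, workdir: str) -> RunResult:
        """Execute the skill for one case and return its artifacts."""


class EchoRunner(EvalRunner):
    """Toy runner for exercising the interface without any agent backend."""

    def run(self, prompt: str, case_id: str, workdir: str) -> RunResult:
        return RunResult(case_id=case_id, output_dir=workdir,
                         exit_code=0, transcript=prompt)


runner: EvalRunner = EchoRunner()
result = runner.run("Summarize the repo", case_id="case-001", workdir="/tmp/out")
print(result.case_id, result.exit_code)
```

A new backend (e.g. OpenCode) would subclass `EvalRunner` and be selected via the `--agent` flag on `execute.py`, leaving the rest of the pipeline unchanged.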

## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace