
Commit bc48951: Scaffold agent eval harness skills

Parent: 90f5996

30 files changed
Lines changed: 3370 additions & 0 deletions

.claude-plugin/plugin.json

Lines changed: 18 additions & 0 deletions

```json
{
  "name": "agent-eval-harness",
  "version": "0.1.0",
  "description": "Agent and skill evaluation harness with MLflow integration",
  "author": {
    "name": "opendatahub-io"
  },
  "homepage": "https://github.com/opendatahub-io/agent-eval-harness",
  "repository": "https://github.com/opendatahub-io/agent-eval-harness",
  "license": "Apache-2.0",
  "keywords": [
    "evaluation",
    "testing",
    "skills",
    "agents",
    "mlflow"
  ]
}
```

.gitignore

Lines changed: 25 additions & 0 deletions

```
# Python
__pycache__/
*.py[cod]
*.egg-info/
*.egg
dist/
build/
.venv/
venv/

# Eval runs and state
eval/runs/
tmp/

# MLflow
mlflow.db
mlruns/

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
```

CLAUDE.md

Lines changed: 85 additions & 0 deletions

# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.

## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.
## Architecture

```
agent_eval/                # Python package (config, runner, state)
  config.py                # EvalConfig from eval.yaml
  state.py                 # Shared state persistence (key-value store)
  agent/
    base.py                # EvalRunner ABC + RunResult
    claude_code.py         # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py          # MLflow experiment setup (used by eval-setup)

skills/eval-setup/         # Skill: environment setup
  SKILL.md                 # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py           # Preflight environment checks

skills/eval-analyze/       # Skill: bootstrap eval config
  SKILL.md                 # Analyze skill, generate eval.yaml + eval.md
  scripts/
    discover.py            # Skills and config discovery
  prompts/
    analyze-skill.md       # Skill analysis prompt
    generate-eval-md.md    # eval.md generation prompt

skills/eval-run/           # Skill: execute eval suite
  SKILL.md                 # Prepare, execute, collect, score, report
  scripts/
    workspace.py           # Workspace creation, batch.yaml, symlinks
    execute.py             # Skill execution via agent runner
    collect.py             # Artifact collection + case mapping
    score.py               # Scoring: inline checks, LLM judges, pairwise, regression
  prompts/
    analyze-results.md     # Results interpretation prompt
    comparison-judge.md    # Pairwise comparison judge prompt

skills/eval-mlflow/        # Skill: MLflow integration
  SKILL.md                 # Dataset sync, result logging, trace feedback

skills/eval-optimize/      # Skill: automated refinement loop
  SKILL.md                 # Composes with /eval-run via Skill tool
```
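The `state.py` entry above describes shared state persistence only as a key-value store. A minimal JSON-backed sketch of that pattern follows; the class name, method names, and default path are assumptions for illustration, not the module's actual API:

```python
# Hypothetical sketch of a shared key-value state store in the spirit of
# agent_eval/state.py. The real module's API may differ.
import json
from pathlib import Path


class EvalState:
    """Persist eval-run state as JSON so separate skill invocations can share it."""

    def __init__(self, path="tmp/eval_state.json"):
        self.path = Path(path)
        self._data = {}
        if self.path.exists():
            self._data = json.loads(self.path.read_text())

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        # Write through on every update so a crashed run loses at most one key.
        self._data[key] = value
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self._data, indent=2))
```

Writing through on each `set` keeps the store crash-safe at the cost of a small file write per update, which is negligible at eval-run scale.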
## How It Works

Skills projects create an `eval.yaml` config file with:

- `dataset.schema` — natural language description of case structure (inputs, references)
- `outputs` — list of artifact dirs with natural language schemas describing what the skill produces
- `judges` — inline `check` scripts, LLM prompts, or external code judges
- `thresholds` — regression detection

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from eval.yaml directly — no extraction spec, no hardcoded field names.
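A hypothetical `eval.yaml` illustrating the four sections listed above. Only the top-level keys come from this document; every nested field name and value is invented for illustration:

```yaml
# Illustrative sketch only — nested keys and values are assumptions.
dataset:
  schema: >
    Each case is a directory containing input.md (the user request)
    and reference.md (a known-good output to compare against).

outputs:
  - dir: eval/runs/{run_id}/artifacts
    schema: >
      One markdown file per case with the skill's generated result.

judges:
  - name: has-output
    check: "test -s {output_file}"   # inline check script
  - name: quality
    prompt: "Rate how well the output satisfies the request, 1 to 5."

thresholds:
  quality: 3.5   # flag a regression if the mean judge score drops below this
```

Note how the `schema` values are plain prose, consistent with the design decision that agents and judges interpret structure rather than scripts parsing it.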
## Usage

```
/eval-setup                       # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill    # Analyze: understand skill, generate eval.yaml
/eval-run --model opus            # Run: execute eval suite
/eval-mlflow --run-id <id>        # MLflow: sync dataset, log results
/eval-optimize --model opus       # Optimize: iteratively improve skill
```
## Key Design Decisions

1. **Schema-driven** — dataset and output structures described in natural language in eval.yaml; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with `--agent` flag on execute.py; Claude Code included, extensible to OpenCode/Agent SDK
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
4. **MLflow as separate skill** — `/eval-mlflow` handles dataset sync, result logging, trace feedback; eval-run works without it
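Decision 2 can be sketched as follows. The `RunResult` fields, the exact abstract method signature, and the CLI invocation beyond `claude --print` are assumptions rather than the harness's actual code:

```python
# Hedged sketch of the agent-agnostic runner pattern (EvalRunner ABC + RunResult).
# Names beyond those mentioned in this document are assumptions.
import subprocess
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RunResult:
    prompt: str
    output: str
    exit_code: int


class EvalRunner(ABC):
    """One subclass per agent backend; execute.py would pick one via --agent."""

    @abstractmethod
    def run(self, prompt: str) -> RunResult: ...


class ClaudeCodeRunner(EvalRunner):
    """Drives the Claude Code CLI in non-interactive mode (claude --print)."""

    def run(self, prompt: str) -> RunResult:
        proc = subprocess.run(
            ["claude", "--print", prompt],
            capture_output=True, text=True,
        )
        return RunResult(prompt=prompt, output=proc.stdout, exit_code=proc.returncode)
```

Adding an OpenCode or Agent SDK backend then means one more `EvalRunner` subclass, with no changes to the scoring or collection scripts.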
## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace
