
Commit 3fccefa

Merge pull request #1 from opendatahub-io/pr-01

Scaffold agent eval harness skills

2 parents 90f5996 + c5e303b commit 3fccefa

46 files changed: 8945 additions & 0 deletions


.claude-plugin/plugin.json

Lines changed: 18 additions & 0 deletions

```json
{
  "name": "agent-eval-harness",
  "version": "0.1.0",
  "description": "Agent and skill evaluation harness with MLflow integration",
  "author": {
    "name": "opendatahub-io"
  },
  "homepage": "https://github.com/opendatahub-io/agent-eval-harness",
  "repository": "https://github.com/opendatahub-io/agent-eval-harness",
  "license": "Apache-2.0",
  "keywords": [
    "evaluation",
    "testing",
    "skills",
    "agents",
    "mlflow"
  ]
}
```

.gitignore

Lines changed: 33 additions & 0 deletions

```gitignore
# Python
__pycache__/
*.py[cod]
*.egg-info/
*.egg
dist/
build/
.venv/
venv/

# Eval runs and state
eval/runs/
tmp/

# MLflow
mlflow.db
mlruns/

# Environment and secrets
.env
.env.*
*.key
*.pem
secrets/
credentials/

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
```

CLAUDE.md

Lines changed: 118 additions & 0 deletions

# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.

## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.

## Architecture

```
agent_eval/                  # Python package (config, runner, state)
  config.py                  # EvalConfig from eval.yaml
  state.py                   # Shared state persistence (key-value store)
  agent/
    base.py                  # EvalRunner ABC + RunResult
    claude_code.py           # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py            # MLflow experiment setup, server check, feedback logging
    datasets.py              # Dataset create/sync utilities
    traces.py                # Trace search and input extraction

skills/eval-setup/           # Skill: environment setup
  SKILL.md                   # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py             # Preflight environment checks

skills/eval-analyze/         # Skill: bootstrap eval config
  SKILL.md                   # Analyze skill, generate eval.yaml + eval.md
  scripts/
    find_skills.py           # Skill discovery (reads plugin.json for paths)
    validate_eval.py         # Config and memory validation
  prompts/
    analyze-skill.md         # Skill analysis prompt
    generate-eval-md.md      # eval.md generation prompt
  references/
    eval-yaml-template.md    # Full eval.yaml template for generation

skills/eval-dataset/         # Skill: generate test cases
  SKILL.md                   # Bootstrap, expand, or extract cases from traces

skills/eval-run/             # Skill: execute eval suite
  SKILL.md                   # Prepare, execute, collect, score, report
  scripts/
    workspace.py             # Workspace creation, batch.yaml, symlinks
    execute.py               # Skill execution via agent runner
    collect.py               # Artifact collection + case mapping
    score.py                 # Scoring: inline checks, LLM judges, pairwise, regression
    report.py                # HTML report generation (scoring summary, per-case details, diffs)
    tools.py                 # PreToolUse hook for tool interception
  prompts/
    analyze-results.md       # Results interpretation prompt
    comparison-judge.md      # Pairwise comparison judge prompt
  references/
    data-pipeline.md         # Dataset → workspace → execution → scoring flow
    tool-interception.md     # Tool interception format and field reference

skills/eval-review/          # Skill: interactive human review
  SKILL.md                   # Present results, collect feedback, propose changes
  prompts/
    review-results.md        # Analysis framework for feedback patterns

skills/eval-mlflow/          # Skill: MLflow integration
  SKILL.md                   # Dataset sync, result logging, trace feedback
  scripts/
    sync_dataset.py          # Push cases to MLflow dataset registry
    log_results.py           # Log run params, metrics, artifacts to MLflow
    attach_feedback.py       # Push/pull feedback between harness and traces
    from_traces.py           # Extract inputs from production traces

skills/eval-optimize/        # Skill: automated refinement loop
  SKILL.md                   # Composes with /eval-run via Skill tool
```
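The tree above describes `state.py` only as a key-value store for shared eval state. As a rough illustration of what JSON-file-backed persistence of that kind might look like (the `StateStore` class, its method names, and the file path are all hypothetical, not taken from the actual module):

```python
import json
from pathlib import Path


class StateStore:
    """Hypothetical JSON-backed key-value store for shared eval state."""

    def __init__(self, path="tmp/state.json"):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, key, default=None):
        return self.data.get(key, default)

    def set(self, key, value):
        # Persist on every write so concurrent skills see fresh state.
        self.data[key] = value
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.data, indent=2))


store = StateStore("tmp/state.json")
store.set("last_run_id", "run-001")
print(StateStore("tmp/state.json").get("last_run_id"))  # → run-001
```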
## How It Works

Skills projects create an `eval.yaml` config file with:

- `skill` — skill to evaluate
- `arguments` — arguments string passed to the skill invocation
- `runner` — agent runner (`claude-code`, etc.), `runner_options` for runner-specific settings
- `permissions` — `allow`/`deny` tool patterns for headless execution
- `dataset` — `path` to test cases directory, `schema` describing case structure in natural language
- `inputs.tools` — tool interception for headless eval: `match` describes what to intercept, `prompt` how to handle it
- `outputs` — list of artifact dirs (`path`) and/or tool calls (`tool`) with natural language schemas
- `traces` — execution data to capture: stdout/stderr, events, metrics (exit code, tokens, cost)
- `judges` — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
- `thresholds` — regression detection per judge

Runs are stored in `$AGENT_EVAL_RUNS_DIR` (default `eval/runs`), configured during `/eval-setup`.

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from eval.yaml directly — no extraction spec, no hardcoded field names.
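Assembling those fields into a concrete file might look like the sketch below. Every value is an invented example (skill name, paths, judge files); the exact nesting is an assumption — the authoritative template lives in `skills/eval-analyze/references/eval-yaml-template.md`:

```yaml
skill: my-skill
arguments: "--input {case_dir}/input.md"
runner: claude-code
runner_options:
  model: sonnet
permissions:
  allow: ["Read", "Write", "Bash(python:*)"]
  deny: ["WebFetch"]
dataset:
  path: eval/cases
  schema: >
    Each case directory contains input.md (the user request)
    and expected.md (a reference output for the judges).
inputs:
  tools:
    - match: Bash commands that would push to a remote
      prompt: Return a canned success message instead of executing.
outputs:
  - path: output/
    schema: Generated markdown document answering the request.
judges:
  - check: scripts/check_structure.py
  - prompt_file: eval/judges/quality.md
thresholds:
  quality: 0.8
```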
## Usage

```
/eval-setup                        # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill     # Analyze: understand skill, generate eval.yaml
/eval-dataset                      # Dataset: generate test cases
/eval-run --model opus             # Run: execute eval suite
/eval-review --run-id <id>         # Review: interactive human feedback + changes
/eval-mlflow --run-id <id>         # MLflow: sync dataset, log results
/eval-optimize --model opus        # Optimize: automated refinement loop
```
## Key Design Decisions

1. **Schema-driven** — dataset and output structures described in natural language in eval.yaml; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with `--agent` flag on execute.py; Claude Code included, extensible to OpenCode/Agent SDK
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
4. **MLflow as separate skill** — `/eval-mlflow` handles dataset sync, result logging, trace feedback; eval-run works without it
## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace
