Merged

53 commits
All commits authored by astefanutti.

061a430  Apr 3, 2026   Scaffold agent eval harness skills
0f69a04  Apr 6, 2026   Fix package discovery and make mlflow an optional dependency
ee3a91e  Apr 6, 2026   Add Vertex AI and Google Cloud env vars to runner allowlist
3913eed  Apr 6, 2026   Accept list-typed YAML inputs and flatten into batch
a67662e  Apr 6, 2026   Symlink .claude subdirs when .claude is skipped for hooks
6a36964  Apr 6, 2026   Add Anthropic/Vertex AI fallback for LLM judges
891a25b  Apr 6, 2026   Add HTML report generation with agent analysis
4844065  Apr 6, 2026   Accumulate token usage across stream-json result events
98af039  Apr 6, 2026   Resolve {prompt} placeholder in skill arguments from batch.yaml
a6f9fae  Apr 6, 2026   Improve report: full markdown rendering, inline HTML, no scroll
e6ef181  Apr 7, 2026   Add batch_pattern for deterministic output-to-case mapping
31d932b  Apr 7, 2026   Fix MLflow autolog hook using bare 'python' on macOS
1a59a84  Apr 7, 2026   Carry over project permissions into workspace hook settings
3dfc158  Apr 7, 2026   Add execution monitoring guidance to eval-run
baace79  Apr 7, 2026   Improve judge writing guide in eval-yaml template
638a1fe  Apr 7, 2026   Capture and display num_turns in run results and report
8bae0dd  Apr 7, 2026   Fix token count in report to include cached input tokens
24c9aa3  Apr 7, 2026   Polish report: analysis sections at top level, larger file text
1fae550  Apr 7, 2026   Render shared outputs once instead of repeating per case
6b699b1  Apr 7, 2026   Skip pairwise judges during regular scoring
9099297  Apr 7, 2026   Grant workspace access to project root and additionalDirectories
eeb660a  Apr 7, 2026   Sum num_turns across all result events
c3e1204  Apr 7, 2026   Upgrade default judge model and improve score parsing
42dbc93  Apr 7, 2026   Show judge type and model in scoring summary table
f7100c4  Apr 7, 2026   Show full model ID and subagent model in report
fe133e2  Apr 7, 2026   Show token breakdown with separate cache read/write counts
d663c5f  Apr 7, 2026   Color-code numeric judge scores using thresholds
8a35eef  Apr 7, 2026   Show agent version in report (e.g. Claude Code 2.1.92)
cf9f64d  Apr 7, 2026   Read timeout and budget from runner_options in eval.yaml
4e2b37a  Apr 7, 2026   Drop version suffix from default judge model
28b2556  Apr 8, 2026   Refactor usage extraction and capture resolved model from init event
bfaaf72  Apr 8, 2026   Improve pairwise judge JSON parsing reliability
bb8cf20  Apr 8, 2026   Integrate pairwise results into scoring table and per-case badges
3f34ef0  Apr 8, 2026   Document --subagent-model argument in eval-run
bae7ee8  Apr 8, 2026   Count tokens and turns from assistant events instead of result events
b357298  Apr 8, 2026   Track all distinct models used during execution
4c17f59  Apr 8, 2026   Exit non-zero when regressions are detected during scoring
dfa578f  Apr 8, 2026   Use modelUsage from result events for accurate token totals
71f4c47  Apr 8, 2026   Show cache hit rate percentage in report token display
dd1c1df  Apr 8, 2026   Handle single-file output paths in collect.py
0edac00  Apr 8, 2026   Polish report: case backgrounds, iframe sizing, single-file outputs
53adf1c  Apr 8, 2026   Clarify background execution: no pipes to avoid output buffering
1330935  Apr 10, 2026  Inject MLflow tracing hook into eval workspace automatically
2d9a50d  Apr 10, 2026  Fix log_table to pass dict of columns instead of list of rows
0c24315  Apr 12, 2026  Inject synthetic user event and timestamps into stream-json
109aa58  Apr 12, 2026  Set MLflow environment without Stop hook to avoid fragmented traces
70b1d72  Apr 12, 2026  Build consolidated MLflow trace from stream-json log
88bae22  Apr 12, 2026  Add preflight check step to eval-run
dfc64dc  Apr 12, 2026  Expand symlinked paths in permission patterns for macOS
7900485  Apr 12, 2026  Fix duplicated outputs when baseline is provided and improve pairwise…
2dd836a  Apr 13, 2026  Capture background agent output files before session cleanup
2cfe93e  Apr 13, 2026  Resolve subagent files from saved copies and add LLM reasoning spans
c5e303b  Apr 13, 2026  Add tool result content and context to MLflow trace spans
18 changes: 18 additions & 0 deletions .claude-plugin/plugin.json
@@ -0,0 +1,18 @@
{
  "name": "agent-eval-harness",
  "version": "0.1.0",
  "description": "Agent and skill evaluation harness with MLflow integration",
  "author": {
    "name": "opendatahub-io"
  },
  "homepage": "https://github.com/opendatahub-io/agent-eval-harness",
  "repository": "https://github.com/opendatahub-io/agent-eval-harness",
  "license": "Apache-2.0",
  "keywords": [
    "evaluation",
    "testing",
    "skills",
    "agents",
    "mlflow"
  ]
}
33 changes: 33 additions & 0 deletions .gitignore
@@ -0,0 +1,33 @@
# Python
__pycache__/
*.py[cod]
*.egg-info/
*.egg
dist/
build/
.venv/
venv/

# Eval runs and state
eval/runs/
tmp/

# MLflow
mlflow.db
mlruns/

# Environment and secrets
.env
.env.*
*.key
*.pem
secrets/
credentials/

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
118 changes: 118 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,118 @@
# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.

## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.

## Architecture

```
agent_eval/                    # Python package (config, runner, state)
  config.py                    # EvalConfig from eval.yaml
  state.py                     # Shared state persistence (key-value store)
  agent/
    base.py                    # EvalRunner ABC + RunResult
    claude_code.py             # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py              # MLflow experiment setup, server check, feedback logging
    datasets.py                # Dataset create/sync utilities
    traces.py                  # Trace search and input extraction

skills/eval-setup/             # Skill: environment setup
  SKILL.md                     # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py               # Preflight environment checks

skills/eval-analyze/           # Skill: bootstrap eval config
  SKILL.md                     # Analyze skill, generate eval.yaml + eval.md
  scripts/
    find_skills.py             # Skill discovery (reads plugin.json for paths)
    validate_eval.py           # Config and memory validation
  prompts/
    analyze-skill.md           # Skill analysis prompt
    generate-eval-md.md        # eval.md generation prompt
  references/
    eval-yaml-template.md      # Full eval.yaml template for generation

skills/eval-dataset/           # Skill: generate test cases
  SKILL.md                     # Bootstrap, expand, or extract cases from traces

skills/eval-run/               # Skill: execute eval suite
  SKILL.md                     # Prepare, execute, collect, score, report
  scripts/
    workspace.py               # Workspace creation, batch.yaml, symlinks
    execute.py                 # Skill execution via agent runner
    collect.py                 # Artifact collection + case mapping
    score.py                   # Scoring: inline checks, LLM judges, pairwise, regression
    report.py                  # HTML report generation (scoring summary, per-case details, diffs)
    tools.py                   # PreToolUse hook for tool interception
  prompts/
    analyze-results.md         # Results interpretation prompt
    comparison-judge.md        # Pairwise comparison judge prompt
  references/
    data-pipeline.md           # Dataset → workspace → execution → scoring flow
    tool-interception.md       # Tool interception format and field reference

skills/eval-review/            # Skill: interactive human review
  SKILL.md                     # Present results, collect feedback, propose changes
  prompts/
    review-results.md          # Analysis framework for feedback patterns

skills/eval-mlflow/            # Skill: MLflow integration
  SKILL.md                     # Dataset sync, result logging, trace feedback
  scripts/
    sync_dataset.py            # Push cases to MLflow dataset registry
    log_results.py             # Log run params, metrics, artifacts to MLflow
    attach_feedback.py         # Push/pull feedback between harness and traces
    from_traces.py             # Extract inputs from production traces

skills/eval-optimize/          # Skill: automated refinement loop
  SKILL.md                     # Composes with /eval-run via Skill tool
```

## How It Works

Skills projects create an `eval.yaml` config file (see the sketch after this list) with:
- `skill` — skill to evaluate
- `arguments` — arguments string passed to the skill invocation
- `runner` — agent runner (`claude-code`, etc.), `runner_options` for runner-specific settings
- `permissions` — `allow`/`deny` tool patterns for headless execution
- `dataset` — `path` to test cases directory, `schema` describing case structure in natural language
- `inputs.tools` — tool interception for headless eval: `match` describes what to intercept, `prompt` how to handle it
- `outputs` — list of artifact dirs (`path`) and/or tool calls (`tool`) with natural language schemas
- `traces` — execution data to capture: stdout/stderr, events, metrics (exit code, tokens, cost)
- `judges` — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
- `thresholds` — regression detection per judge
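
A minimal sketch of what such a config might look like. The top-level keys mirror the list above, and `timeout`/`budget` under `runner_options` and the `{prompt}` placeholder appear elsewhere in this PR; everything else (the skill name, paths, permission patterns, and the `{case}` placeholder) is an illustrative assumption, not documented syntax:

```yaml
# Hypothetical eval.yaml; key names follow the list above,
# concrete values and nested fields are illustrative.
skill: my-skill
arguments: "{prompt}"          # resolved per case from batch.yaml

runner: claude-code
runner_options:
  timeout: 600                 # seconds
  budget: 2.0                  # max USD per case

permissions:
  allow: ["Read(**)", "Write(output/**)"]
  deny: ["Bash(rm:*)"]

dataset:
  path: eval/cases
  schema: >
    One YAML file per case with a `prompt` field and an optional
    free-text description of the expected outcome.

inputs:
  tools:
    - match: Web searches for external documentation
      prompt: Answer from the fixture stored alongside the case.

outputs:
  - path: output/
    schema: One markdown report per case.

traces: [stdout, events, metrics]

judges:
  - name: report_exists
    check: test -f output/{case}.md       # inline check script
  - name: quality
    prompt: Rate the report 1-5 for completeness and accuracy.
  - name: structure
    module: my_eval.judges                # hypothetical external judge
    function: score_structure

thresholds:
  quality: 3.5                 # flag regressions below this score
```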

Runs are stored in `$AGENT_EVAL_RUNS_DIR` (default `eval/runs`), configured during `/eval-setup`.

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from eval.yaml directly — no extraction spec, no hardcoded field names.

## Usage

```
/eval-setup # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill # Analyze: understand skill, generate eval.yaml
/eval-dataset # Dataset: generate test cases
/eval-run --model opus # Run: execute eval suite
/eval-review --run-id <id> # Review: interactive human feedback + changes
/eval-mlflow --run-id <id> # MLflow: sync dataset, log results
/eval-optimize --model opus # Optimize: automated refinement loop
```

## Key Design Decisions

1. **Schema-driven** — dataset and output structures described in natural language in eval.yaml; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with `--agent` flag on execute.py; Claude Code included, extensible to OpenCode/Agent SDK (sketched after this list)
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
4. **MLflow as separate skill** — `/eval-mlflow` handles dataset sync, result logging, trace feedback; eval-run works without it
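
For decision 2, here is a minimal sketch of the runner abstraction in `agent_eval/agent/base.py` and `claude_code.py`. Only the names `EvalRunner`, `RunResult`, and the `claude --print` invocation come from this document; the fields and signatures are assumptions:

```python
# Sketch under stated assumptions, not the actual implementation.
import subprocess
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RunResult:
    exit_code: int      # process exit status
    stdout: str         # raw agent output (stream-json for Claude Code)
    stderr: str = ""


class EvalRunner(ABC):
    """One subclass per agent backend, selected via --agent on execute.py."""

    @abstractmethod
    def run(self, prompt: str, workspace: str) -> RunResult:
        """Execute one eval case headlessly in the given workspace."""


class ClaudeCodeRunner(EvalRunner):
    """Runs Claude Code headlessly via `claude --print`."""

    def run(self, prompt: str, workspace: str) -> RunResult:
        proc = subprocess.run(
            ["claude", "--print", prompt],
            cwd=workspace,
            capture_output=True,
            text=True,
        )
        return RunResult(proc.returncode, proc.stdout, proc.stderr)
```

A new backend (OpenCode, Agent SDK) would subclass `EvalRunner` and register under its own `--agent` value.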

## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace