Scaffold agent eval harness skills #1
Merged

53 commits, all by astefanutti:

- 061a430 Scaffold agent eval harness skills
- 0f69a04 Fix package discovery and make mlflow an optional dependency
- ee3a91e Add Vertex AI and Google Cloud env vars to runner allowlist
- 3913eed Accept list-typed YAML inputs and flatten into batch
- a67662e Symlink .claude subdirs when .claude is skipped for hooks
- 6a36964 Add Anthropic/Vertex AI fallback for LLM judges
- 891a25b Add HTML report generation with agent analysis
- 4844065 Accumulate token usage across stream-json result events
- 98af039 Resolve {prompt} placeholder in skill arguments from batch.yaml
- a6f9fae Improve report: full markdown rendering, inline HTML, no scroll
- e6ef181 Add batch_pattern for deterministic output-to-case mapping
- 31d932b Fix MLflow autolog hook using bare 'python' on macOS
- 1a59a84 Carry over project permissions into workspace hook settings
- 3dfc158 Add execution monitoring guidance to eval-run
- baace79 Improve judge writing guide in eval-yaml template
- 638a1fe Capture and display num_turns in run results and report
- 8bae0dd Fix token count in report to include cached input tokens
- 24c9aa3 Polish report: analysis sections at top level, larger file text
- 1fae550 Render shared outputs once instead of repeating per case
- 6b699b1 Skip pairwise judges during regular scoring
- 9099297 Grant workspace access to project root and additionalDirectories
- eeb660a Sum num_turns across all result events
- c3e1204 Upgrade default judge model and improve score parsing
- 42dbc93 Show judge type and model in scoring summary table
- f7100c4 Show full model ID and subagent model in report
- fe133e2 Show token breakdown with separate cache read/write counts
- d663c5f Color-code numeric judge scores using thresholds
- 8a35eef Show agent version in report (e.g. Claude Code 2.1.92)
- cf9f64d Read timeout and budget from runner_options in eval.yaml
- 4e2b37a Drop version suffix from default judge model
- 28b2556 Refactor usage extraction and capture resolved model from init event
- bfaaf72 Improve pairwise judge JSON parsing reliability
- bb8cf20 Integrate pairwise results into scoring table and per-case badges
- 3f34ef0 Document --subagent-model argument in eval-run
- bae7ee8 Count tokens and turns from assistant events instead of result events
- b357298 Track all distinct models used during execution
- 4c17f59 Exit non-zero when regressions are detected during scoring
- dfa578f Use modelUsage from result events for accurate token totals
- 71f4c47 Show cache hit rate percentage in report token display
- dd1c1df Handle single-file output paths in collect.py
- 0edac00 Polish report: case backgrounds, iframe sizing, single-file outputs
- 53adf1c Clarify background execution: no pipes to avoid output buffering
- 1330935 Inject MLflow tracing hook into eval workspace automatically
- 2d9a50d Fix log_table to pass dict of columns instead of list of rows
- 0c24315 Inject synthetic user event and timestamps into stream-json
- 109aa58 Set MLflow environment without Stop hook to avoid fragmented traces
- 70b1d72 Build consolidated MLflow trace from stream-json log
- 88bae22 Add preflight check step to eval-run
- dfc64dc Expand symlinked paths in permission patterns for macOS
- 7900485 Fix duplicated outputs when baseline is provided and improve pairwise…
- 2dd836a Capture background agent output files before session cleanup
- 2cfe93e Resolve subagent files from saved copies and add LLM reasoning spans
- c5e303b Add tool result content and context to MLflow trace spans
New file (plugin manifest):

```
{
  "name": "agent-eval-harness",
  "version": "0.1.0",
  "description": "Agent and skill evaluation harness with MLflow integration",
  "author": {
    "name": "opendatahub-io"
  },
  "homepage": "https://github.com/opendatahub-io/agent-eval-harness",
  "repository": "https://github.com/opendatahub-io/agent-eval-harness",
  "license": "Apache-2.0",
  "keywords": [
    "evaluation",
    "testing",
    "skills",
    "agents",
    "mlflow"
  ]
}
```
New file (ignore rules):

```
# Python
__pycache__/
*.py[cod]
*.egg-info/
*.egg
dist/
build/
.venv/
venv/

# Eval runs and state
eval/runs/
tmp/

# MLflow
mlflow.db
mlruns/

# Environment and secrets
.env
.env.*
*.key
*.pem
secrets/
credentials/

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
```
New file (README):

# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.
## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.
## Architecture

```
agent_eval/                  # Python package (config, runner, state)
  config.py                  # EvalConfig from eval.yaml
  state.py                   # Shared state persistence (key-value store)
  agent/
    base.py                  # EvalRunner ABC + RunResult
    claude_code.py           # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py            # MLflow experiment setup, server check, feedback logging
    datasets.py              # Dataset create/sync utilities
    traces.py                # Trace search and input extraction

skills/eval-setup/           # Skill: environment setup
  SKILL.md                   # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py             # Preflight environment checks

skills/eval-analyze/         # Skill: bootstrap eval config
  SKILL.md                   # Analyze skill, generate eval.yaml + eval.md
  scripts/
    find_skills.py           # Skill discovery (reads plugin.json for paths)
    validate_eval.py         # Config and memory validation
  prompts/
    analyze-skill.md         # Skill analysis prompt
    generate-eval-md.md      # eval.md generation prompt
  references/
    eval-yaml-template.md    # Full eval.yaml template for generation

skills/eval-dataset/         # Skill: generate test cases
  SKILL.md                   # Bootstrap, expand, or extract cases from traces

skills/eval-run/             # Skill: execute eval suite
  SKILL.md                   # Prepare, execute, collect, score, report
  scripts/
    workspace.py             # Workspace creation, batch.yaml, symlinks
    execute.py               # Skill execution via agent runner
    collect.py               # Artifact collection + case mapping
    score.py                 # Scoring: inline checks, LLM judges, pairwise, regression
    report.py                # HTML report generation (scoring summary, per-case details, diffs)
    tools.py                 # PreToolUse hook for tool interception
  prompts/
    analyze-results.md       # Results interpretation prompt
    comparison-judge.md      # Pairwise comparison judge prompt
  references/
    data-pipeline.md         # Dataset → workspace → execution → scoring flow
    tool-interception.md     # Tool interception format and field reference

skills/eval-review/          # Skill: interactive human review
  SKILL.md                   # Present results, collect feedback, propose changes
  prompts/
    review-results.md        # Analysis framework for feedback patterns

skills/eval-mlflow/          # Skill: MLflow integration
  SKILL.md                   # Dataset sync, result logging, trace feedback
  scripts/
    sync_dataset.py          # Push cases to MLflow dataset registry
    log_results.py           # Log run params, metrics, artifacts to MLflow
    attach_feedback.py       # Push/pull feedback between harness and traces
    from_traces.py           # Extract inputs from production traces

skills/eval-optimize/        # Skill: automated refinement loop
  SKILL.md                   # Composes with /eval-run via Skill tool
```
## How It Works

Skills projects create an `eval.yaml` config file with the following keys (a minimal sketch appears after the list):
- `skill` — skill to evaluate
- `arguments` — argument string passed to the skill invocation
- `runner` — agent runner (`claude-code`, etc.), plus `runner_options` for runner-specific settings
- `permissions` — `allow`/`deny` tool patterns for headless execution
- `dataset` — `path` to the test cases directory and a `schema` describing case structure in natural language
- `inputs.tools` — tool interception for headless eval: `match` describes what to intercept, `prompt` describes how to handle it
- `outputs` — list of artifact dirs (`path`) and/or tool calls (`tool`) with natural-language schemas
- `traces` — execution data to capture: stdout/stderr, events, metrics (exit code, tokens, cost)
- `judges` — inline `check` scripts, LLM `prompt`/`prompt_file`, and external `module`/`function` judges
- `thresholds` — regression detection per judge
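The sketch below shows how these keys might fit together. The top-level key names match the list above; the nesting, the fields inside each block, and all concrete values (skill name, paths, judge prompts, threshold numbers) are illustrative assumptions rather than the harness's actual schema, which the `/eval-analyze` skill generates from its eval-yaml template.

```yaml
# Illustrative eval.yaml sketch. Top-level keys follow the README; the exact
# nesting and every concrete value are assumptions, not the real schema.
skill: my-skill
arguments: "--topic {prompt}"       # {prompt} is resolved from batch.yaml

runner: claude-code
runner_options:
  timeout: 600                      # timeout/budget live under runner_options; units assumed
  budget: 2.0

permissions:
  allow: [Read, Write, Bash]        # tool patterns for headless execution
  deny: [WebFetch]

dataset:
  path: eval/cases
  schema: >
    Each case is a YAML file with a prompt and optional reference
    material that judges can compare the output against.

inputs:
  tools:
    - match: Web searches issued by the skill
      prompt: Return the canned results stored alongside the case.

outputs:
  - path: out/                      # artifact directory produced by the skill
    schema: One markdown report per case.

traces: [stdout, events, metrics]

judges:
  - name: report-exists
    check: test -s out/report.md    # inline check script
  - name: quality
    prompt: Rate the report from 1 to 5 for completeness and accuracy.

thresholds:
  quality: 3.5                      # flag a regression if this judge's score drops below 3.5
```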
Runs are stored in `$AGENT_EVAL_RUNS_DIR` (default `eval/runs`), configured during `/eval-setup`.

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from eval.yaml directly — no extraction spec, no hardcoded field names.
## Usage

```
/eval-setup                        # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill     # Analyze: understand skill, generate eval.yaml
/eval-dataset                      # Dataset: generate test cases
/eval-run --model opus             # Run: execute eval suite
/eval-review --run-id <id>         # Review: interactive human feedback + changes
/eval-mlflow --run-id <id>         # MLflow: sync dataset, log results
/eval-optimize --model opus        # Optimize: automated refinement loop
```
## Key Design Decisions

1. **Schema-driven** — dataset and output structures are described in natural language in eval.yaml; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with an `--agent` flag on execute.py; Claude Code is included, and the design is extensible to OpenCode or the Agent SDK
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, and external `module`/`function` judges (see the sketch after this list)
4. **MLflow as a separate skill** — `/eval-mlflow` handles dataset sync, result logging, and trace feedback; eval-run works without it
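As a side-by-side sketch of the three judge forms: the fields that distinguish them (`check`, `prompt`/`prompt_file`, `module`/`function`) come from the README, while the surrounding structure, the `name` field, and the example values are assumptions for illustration.

```yaml
judges:
  # 1. Inline check: a shell snippet; pass/fail semantics are assumed
  - name: output-present
    check: test -s out/report.md

  # 2. LLM judge: a prompt (inline) or prompt_file (path) scored by the judge model
  - name: faithfulness
    prompt_file: eval/judges/faithfulness.md    # hypothetical path

  # 3. External judge: a Python callable resolved from module/function
  - name: link-validity
    module: my_project.judges                   # hypothetical module
    function: check_links                       # hypothetical function
```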
## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace