# Agent Eval Harness

Generic evaluation framework for Claude Code skills projects. Uses MLflow as the backbone for tracing, evaluation, datasets, and reporting.

## Project Status

Phase 1 (core framework) and Phase 2 (scoring integration) are implemented. See `eval/plans/agent-eval-harness-design.md` in the rfe-creator project for the full design doc.

## Architecture

```
agent_eval/                  # Python package (config, runner, state)
  config.py                  # EvalConfig from eval.yaml
  state.py                   # Shared state persistence (key-value store)
  agent/
    base.py                  # EvalRunner ABC + RunResult
    claude_code.py           # Claude Code CLI runner (claude --print)
  mlflow/
    experiment.py            # MLflow experiment setup (used by eval-setup)

skills/eval-setup/           # Skill: environment setup
  SKILL.md                   # Dependencies, MLflow, API keys, directories
  scripts/
    check_env.py             # Preflight environment checks

skills/eval-analyze/         # Skill: bootstrap eval config
  SKILL.md                   # Analyze skill, generate eval.yaml + eval.md
  scripts/
    discover.py              # Skills and config discovery
  prompts/
    analyze-skill.md         # Skill analysis prompt
    generate-eval-md.md      # eval.md generation prompt

skills/eval-run/             # Skill: execute eval suite
  SKILL.md                   # Prepare, execute, collect, score, report
  scripts/
    workspace.py             # Workspace creation, batch.yaml, symlinks
    execute.py               # Skill execution via agent runner
    collect.py               # Artifact collection + case mapping
    score.py                 # Scoring: inline checks, LLM judges, pairwise, regression
  prompts/
    analyze-results.md       # Results interpretation prompt
    comparison-judge.md      # Pairwise comparison judge prompt

skills/eval-mlflow/          # Skill: MLflow integration
  SKILL.md                   # Dataset sync, result logging, trace feedback

skills/eval-optimize/        # Skill: automated refinement loop
  SKILL.md                   # Composes with /eval-run via Skill tool
```

## How It Works

Skills projects create an `eval.yaml` config file with:
- `dataset.schema` — natural language description of case structure (inputs, references)
- `outputs` — list of artifact dirs with natural language schemas describing what the skill produces
- `judges` — inline `check` scripts, LLM prompts, or external code judges
- `thresholds` — regression detection

The `schema` descriptions are documentation for the LLM agents and judges. Scripts operate on file paths from `eval.yaml` directly — no extraction spec, no hardcoded field names.
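A minimal `eval.yaml` sketch illustrating this shape. Only the top-level keys (`dataset.schema`, `outputs`, `judges`, `thresholds`) and the judge fields (`check`, `prompt_file`, `module`/`function`) come from this README; the paths, judge names, and placeholder variables are hypothetical:

```yaml
# Hypothetical eval.yaml sketch, not the canonical schema.
dataset:
  schema: >
    Each case has an input prompt and a reference answer file.

outputs:
  - dir: eval/artifacts/          # hypothetical artifact dir
    schema: >
      One markdown report per case, named <case_id>.md.

judges:
  - name: report_exists           # inline check script
    check: "test -f {output_dir}/{case_id}.md"
  - name: quality                 # LLM judge
    prompt_file: eval/prompts/quality-judge.md
  - name: custom                  # external code judge
    module: my_eval.judges
    function: score_report

thresholds:
  regression: 0.05                # hypothetical regression threshold
```

Note that the `schema` values are free-form prose, not a formal spec: agents and judges read them, while the scripts only consume the paths.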

## Usage

```
/eval-setup                       # Setup: dependencies, MLflow, API keys
/eval-analyze --skill my-skill    # Analyze: understand skill, generate eval.yaml
/eval-run --model opus            # Run: execute eval suite
/eval-mlflow --run-id <id>        # MLflow: sync dataset, log results
/eval-optimize --model opus       # Optimize: iteratively improve skill
```

## Key Design Decisions

1. **Schema-driven** — dataset and output structures described in natural language in `eval.yaml`; agents and judges interpret them, scripts just move files
2. **Agent-agnostic runner** — `EvalRunner` ABC with `--agent` flag on `execute.py`; Claude Code included, extensible to OpenCode/Agent SDK
3. **Three judge types** — inline `check` scripts, LLM `prompt`/`prompt_file`, external `module`/`function`
4. **MLflow as separate skill** — `/eval-mlflow` handles dataset sync, result logging, trace feedback; eval-run works without it
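The agent-agnostic runner could look roughly like this. This is a sketch: the names mirror `agent_eval/agent/base.py` as described above, but the exact `RunResult` fields and `run` signature are assumptions, and `EchoRunner` is a toy stand-in for a real backend such as the Claude Code CLI runner:

```python
# Sketch of the EvalRunner abstraction; fields and signatures are
# assumptions, not the real agent_eval API.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one skill invocation (hypothetical fields)."""
    case_id: str
    output_dir: str
    exit_code: int
    transcript: str = ""


class EvalRunner(ABC):
    """Agent-agnostic runner; concrete subclasses wrap a specific CLI/SDK."""

    @abstractmethod
    def run(self, prompt: str, case_id: str, workdir: str) -> RunResult:
        """Execute the skill for one case and return its artifacts."""


class EchoRunner(EvalRunner):
    """Toy runner for exercising the interface without any agent backend."""

    def run(self, prompt: str, case_id: str, workdir: str) -> RunResult:
        return RunResult(case_id=case_id, output_dir=workdir,
                         exit_code=0, transcript=prompt)


runner: EvalRunner = EchoRunner()
result = runner.run("Summarize the repo", case_id="case-001", workdir="/tmp/out")
print(result.case_id, result.exit_code)
```

A new backend (e.g. OpenCode) would subclass `EvalRunner` and be selected via the `--agent` flag on `execute.py`, leaving the rest of the pipeline unchanged.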

## Remaining Work

- Skills and refinement loop (`/eval-optimize` implementation)
- MLflow tracing integration (extended transcript parser with subagent hierarchy)
- CI integration patterns
- Testing and documentation
- Publish to PyPI or marketplace