
Scaffold agent eval harness skills#1

Merged
astefanutti merged 53 commits into main from pr-01 on Apr 13, 2026

Conversation

@astefanutti
Contributor

astefanutti commented Apr 3, 2026

Generic evaluation framework for agents and skills: analyze, run, score, and improve skills across agent harnesses (Claude Code, OpenCode, ...) and models.

Features

  • Natural language evaluation — Describe dataset structure and skill outputs in plain English in eval.yaml. LLM agents interpret the descriptions to prepare, collect, and score.
  • Agent-agnostic execution — Pluggable runner abstraction (EvalRunner ABC). Ships with Claude Code; extensible to OpenCode, Agent SDK, or any CLI-based agent. Generic permissions (allow/deny) plus runner-specific options.
  • Flexible judges — Inline Python checks, LLM judges with prompt files and context, or external code modules.
  • MLflow integration — via a dedicated /eval-mlflow skill. Sync datasets, log metrics, attach judge feedback to traces. The eval pipeline works without MLflow.
  • Automated refinement — /eval-optimize reads failures, edits the skill, re-runs, and checks for regressions in a loop.
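
To make the natural-language configuration concrete, here is a hypothetical sketch of an eval.yaml (field names are illustrative, not necessarily the harness's actual schema), loaded with yaml.safe_load the way the harness's scripts do:

```python
import yaml

# Hypothetical eval.yaml content; field names are illustrative only.
EXAMPLE_EVAL_YAML = """
skill: my-skill
dataset:
  path: evals/cases
  description: Each case directory holds an input.yaml with a 'prompt' key.
outputs:
  - name: answer
    description: The final answer, written to output.md in the workspace.
judges:
  - name: contains_answer
    type: inline
    check: "'answer' in output"
"""

config = yaml.safe_load(EXAMPLE_EVAL_YAML)
print(config["skill"], len(config["judges"]))
```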

Skills

Skill           Purpose
/eval-setup     Environment setup: dependencies, MLflow, API keys, directories
/eval-analyze   Analyze skill recursively, generate eval.yaml + eval.md
/eval-dataset   Generate test cases: bootstrap, expand, or extract from MLflow traces
/eval-run       Prepare, execute, collect, score, report
/eval-review    Interactive human review: present results, collect feedback, propose SKILL.md
/eval-mlflow    Sync dataset, log results, trace feedback
/eval-optimize  Automated improvement loop: analyze failures, edit skill, re-run, check regressions

Summary by CodeRabbit

  • New Features

    • Added Agent Eval Harness v0.1.0: end-to-end eval workflows (setup, analyze, dataset, run, review, MLflow sync, optimize)
    • Headless execution with pluggable runners, standardized run results, artifact collection, judges (inline/LLM/external), pairwise and regression detection
    • MLflow experiment integration and workspace/tool interception for automated runs
  • Documentation

    • Full README, design notes, examples, templates, and CLI usage guides for eval.yaml and workflows
  • Chores

    • Project metadata, packaging, and ignore rules added

@coderabbitai

coderabbitai bot commented Apr 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

This pull request introduces the Agent Eval Harness: a new Python package (agent_eval) and CLI-driven skill suite for MLflow-backed evaluation of Claude Code skills. It adds configuration parsing (agent_eval/config.py), a runner abstraction and a Claude Code runner (agent_eval/agent/*), MLflow helpers (agent_eval/mlflow/*), filesystem state tooling (agent_eval/state.py), workspace/collection/execute/score scripts under skills/eval-run/scripts, tool interception hooks, and multiple user-invocable skill specs (eval-setup, eval-analyze, eval-dataset, eval-run, eval-mlflow, eval-review, eval-optimize). The repo gains packaging/docs (pyproject.toml, README/CLAUDE.md), sample eval.yaml, example eval workspaces, and a plugin manifest.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Security Issues

CWE-78 (OS Command Injection) — agent_eval/agent/claude_code.py

  • Unescaped user input is embedded into CLI prompts/arguments and passed to subprocesses; sanitize/validate inputs or use structured argument passing and avoid shell=True.
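
A minimal sketch of structured argument passing (the echo command stands in for the real agent CLI; the key point is that shell=False keeps the prompt as inert data):

```python
import subprocess
import sys

def run_agent(prompt: str) -> str:
    """Pass user input as a discrete argv element; never build a shell string.

    With shell=False (the default), the prompt reaches the child process as a
    single argument, so quotes, semicolons, and backticks in it are inert.
    """
    # sys.executable echoing its argument stands in for the agent CLI here.
    argv = [sys.executable, "-c", "import sys; print(sys.argv[1])", prompt]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# A hostile-looking prompt is treated as plain data, not interpreted:
print(run_agent("hello; rm -rf / `whoami`"))
```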

CWE-400 (Uncontrolled Resource Consumption) — agent_eval/agent/claude_code.py

  • Subprocesses are only bounded by wall-clock timeout; enforce resource limits (e.g., resource.setrlimit) or run under sandboxed cgroups.
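
One way to bound subprocesses beyond wall-clock time, as suggested, is resource.setrlimit in a preexec_fn (POSIX only; preexec_fn is also unsafe in multi-threaded parents, so this is a sketch, not a drop-in fix):

```python
import resource
import subprocess
import sys

def limited_run(argv, cpu_seconds=30, mem_bytes=1024 * 1024 * 1024):
    """Run a subprocess with CPU-time and address-space limits (POSIX only).

    preexec_fn runs in the child between fork and exec, so the limits apply
    only to the evaluated process, not to the harness itself.
    """
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        argv, preexec_fn=set_limits, capture_output=True, text=True, timeout=60
    )

result = limited_run([sys.executable, "-c", "print('ok')"])
print(result.stdout.strip())
```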

CWE-22 (Path Traversal) — skills/eval-run/scripts/collect.py and workspace.py

  • Relative-path checks reject .. but do not validate resolved paths post-symlink; validate Path.resolve() stays within intended directory roots before reading/writing.
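
The post-symlink containment check described above can be sketched like this (paths are illustrative):

```python
from pathlib import Path

def is_within(root: Path, candidate: Path) -> bool:
    """True if candidate, after resolving symlinks and '..', stays under root."""
    root = root.resolve()
    try:
        candidate.resolve().relative_to(root)
        return True
    except ValueError:
        return False

workspace = Path("/tmp/eval-workspace")
print(is_within(workspace, workspace / "artifacts" / "out.md"))   # True
print(is_within(workspace, workspace / ".." / "etc" / "passwd"))  # False
```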

CWE-73 (External Control of File Name or Path) — skills/eval-run/scripts/workspace.py

  • Symlink creation to project paths uses unvalidated targets; explicitly whitelist targets and validate resolved destinations.

CWE-434 (Unrestricted File Upload) — skills/eval-run/scripts/collect.py

  • Artifacts are copied without content/type/size checks; restrict accepted file types/extensions or scan content before storing/executing.

CWE-377 (Insecure Temporary File/Directory) — skills/eval-run/scripts/workspace.py

  • Files written inside workspaces inherit process umask; set explicit, restrictive file permissions for sensitive files (e.g., settings, credentials).

CWE-269 (Improper Access Control / Policy Enforcement) — permissions handling

  • The permissions config is passed into runners but lacks verification/auditing that denies were enforced. Log/verify policy matches at runtime and fail closed on enforcement failures.

CWE-522 / CVE-risk (Credential Exposure) — skills/eval-setup/scripts/check_env.py & mlflow integration

  • Scripts read and suggest environment-based credentials (Anthropic/MLflow) and may write settings files; ensure secrets are not written with lax permissions and avoid echoing secrets to logs.
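
A sketch of creating secret-bearing files with restrictive permissions from the start (the file name and payload are illustrative; chmod-after-write leaves a window where the umask-derived default mode applies):

```python
import os
import stat
import tempfile
from pathlib import Path

def write_secret(path: Path, content: str) -> None:
    """Create the file as 0600 atomically via os.open's mode argument."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(content)

target = Path(tempfile.mkdtemp()) / "settings.local.json"
write_secret(target, '{"env": {"ANTHROPIC_API_KEY": "<redacted>"}}')
mode = stat.S_IMODE(target.stat().st_mode)
# umask can only remove bits, so group/other access is never granted.
print(oct(mode))
```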

Code Quality & Design Issues

High-complexity conditional branching — skills/eval-run/scripts/score.py

  • Judge instantiation supports inline Python, LLM-based judges, and external modules in ad-hoc branches. Introduce a formal Judge interface/factory to centralize error handling and testing.

Fragile subprocess output parsing — agent_eval/agent/claude_code.py

  • Line-oriented JSON parsing assumes one event per line. Use a resilient streaming JSON parser or robust framing to avoid silent parse failures.
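
A resilient alternative to one-event-per-line parsing is incremental decoding with json.JSONDecoder.raw_decode over an accumulating buffer (a sketch; a non-JSON prefix in the stream would still stall it and needs separate handling):

```python
import json

def iter_events(chunks):
    """Incrementally decode concatenated JSON values from arbitrary chunks,
    so events split across lines (or sharing a line) still parse."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while buffer:
            stripped = buffer.lstrip()
            if not stripped:
                buffer = ""
                break
            try:
                event, end = decoder.raw_decode(stripped)
            except json.JSONDecodeError:
                break  # incomplete value; wait for more data
            buffer = stripped[end:]
            yield event

# One event split across chunks, and two events sharing a chunk:
chunks = ['{"type": "text", "con', 'tent": "hi"}{"type": ', '"result"}']
events = list(iter_events(chunks))
print(events)
```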

Global state / logging concurrency — agent_eval/agent/claude_code.py

  • Module-level print or lock-based progress output is brittle under concurrent runs. Replace with thread-safe logging handlers and structured logs.

Insufficient input validation — agent_eval/state.py and workspace tooling

  • ID lists, case IDs, and other user-supplied identifiers are accepted with minimal sanitization; enforce strict character sets and escape/serialize when writing YAML/JSON.
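
Strict identifier validation can be as small as one anchored regex (the exact character set and length cap here are assumptions to adapt):

```python
import re

# Alphanumeric start, then alphanumerics, dot, underscore, hyphen; max 64 chars.
CASE_ID_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,63}")

def validate_case_id(case_id: str) -> str:
    """Reject IDs that could smuggle path separators or YAML/JSON syntax."""
    if not CASE_ID_RE.fullmatch(case_id):
        raise ValueError(f"invalid case id: {case_id!r}")
    return case_id

print(validate_case_id("case-001"))
# validate_case_id("../etc/passwd")  # would raise ValueError
```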

Silent fallback and error reporting — agent_eval/mlflow/experiment.py and execute.py

  • MLflow import/availability is silently degraded; callers cannot distinguish missing dependency vs transient error. Return deterministic error codes/structured error objects and avoid hiding stderr.
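
Returning a structured status instead of silently degrading might look like this (names such as BackendStatus are hypothetical):

```python
import importlib
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class BackendStatus:
    available: bool
    module: Optional[Any]
    reason: str = ""

def load_backend(name: str) -> BackendStatus:
    """Report *why* a backend is unavailable so callers can distinguish a
    missing dependency from a transient runtime error."""
    try:
        return BackendStatus(True, importlib.import_module(name))
    except ImportError as exc:
        return BackendStatus(False, None, f"missing dependency: {exc}")

status = load_backend("mlflow_definitely_not_installed")
print(status.available, status.reason)
```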

Missing schema versioning — agent_eval/config.py

  • EvalConfig lacks explicit schema/version metadata and migration hooks; add a schema_version field and validation/migration path.
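
A minimal schema_version gate, sketched against a stub config class (EvalConfigStub and the field names are illustrative, not the harness's actual types):

```python
from dataclasses import dataclass

SUPPORTED_SCHEMA_VERSIONS = {1}

@dataclass
class EvalConfigStub:
    schema_version: int
    skill: str

def parse_config(raw: dict) -> EvalConfigStub:
    """Fail loudly on unknown schema versions instead of misreading fields."""
    version = raw.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(
            f"Unsupported schema_version {version!r}; "
            f"supported: {sorted(SUPPORTED_SCHEMA_VERSIONS)}"
        )
    return EvalConfigStub(schema_version=version, skill=raw.get("skill", ""))

cfg = parse_config({"schema_version": 1, "skill": "my-skill"})
print(cfg)
```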

Architectural Concerns

Tight coupling to MLflow and external LLM providers

  • MLflow is used directly across code paths and LLM judge creation depends on mlflow.genai or Anthropic; consider dependency inversion (pluggable telemetry/judge backends) and clear runtime failures when providers are unavailable.

Wide attack surface from executing inline Python judges

  • Inline check execution compiles and runs arbitrary code from config; this enables remote code execution risk if configs are attacker-controlled. Restrict inline checks to a sandboxed evaluator or require signed/trusted judge modules.
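
The "signed/trusted judge" idea could start as a digest allowlist gate in front of the existing inline evaluator (a sketch; this limits *which* code runs but is not a sandbox, as the comment notes):

```python
import hashlib

# Hypothetical allowlist of sha256 digests for reviewed inline checks.
TRUSTED_CHECK_DIGESTS = set()

def register_trusted(check_source: str) -> str:
    digest = hashlib.sha256(check_source.encode()).hexdigest()
    TRUSTED_CHECK_DIGESTS.add(digest)
    return digest

def run_inline_check(check_source: str, output: str) -> bool:
    """Refuse to execute inline code unless its digest was explicitly trusted."""
    digest = hashlib.sha256(check_source.encode()).hexdigest()
    if digest not in TRUSTED_CHECK_DIGESTS:
        raise PermissionError(f"untrusted inline check (sha256={digest[:12]})")
    # Still executes arbitrary trusted code: pair with a sandbox for isolation.
    return bool(eval(check_source, {"__builtins__": {}}, {"output": output}))

register_trusted("'answer' in output")
print(run_inline_check("'answer' in output", "the answer is 42"))
```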

Broad filesystem and subprocess privileges

  • Workspace creation, symlink mirroring, and subprocess execution operate with host-level permissions. For untrusted or third-party skills, run evaluations in isolated containers or VMs and minimize host filesystem exposure.

No eval.yaml schema/version enforcement or migrations

  • Lack of schema versioning means future changes risk silent misinterpretation. Add explicit schema versioning, strict validation, and migration tooling.

Recommendation (actionable): before merging, address the above critical security items: sanitize subprocess inputs, validate resolved filesystem paths (post-symlink), restrict and sandbox inline judge execution, enforce resource limits for subprocesses, and ensure secrets and generated config files are written with strict permissions.

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name         Status     Explanation
Title check        ✅ Passed  The title 'Scaffold agent eval harness skills' accurately describes the primary change: introducing a comprehensive evaluation framework with multiple skill definitions and supporting infrastructure.
Description check  ✅ Passed  Check skipped - CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 9

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (19)
agent_eval/__main__.py-116-123 (1)

116-123: ⚠️ Potential issue | 🟠 Major

Validate --data payload shape and fail fast when SKILL.md is missing.

Line 116 can parse to scalar/None, then Line 130 raises AttributeError. Also, when no SKILL.md is found (Line 118-123), an empty skill_hash is written, creating invalid eval metadata.

Proposed fix
-    data = yaml.safe_load(data_yaml)
+    data = yaml.safe_load(data_yaml) or {}
+    if not isinstance(data, dict):
+        print("Invalid --data: expected YAML mapping", file=sys.stderr)
+        sys.exit(1)
@@
     skill_hash = ""
@@
         if skill_path.exists():
-            skill_hash = hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]
+            skill_hash = hashlib.sha256(skill_path.read_bytes()).hexdigest()[:12]
             break
+    if not skill_hash:
+        print(f"STALE: skill {skill} not found", file=sys.stderr)
+        sys.exit(1)
@@
-    fm.update(data.get("frontmatter", data))
+    frontmatter = data.get("frontmatter", data)
+    if not isinstance(frontmatter, dict):
+        print("Invalid --data.frontmatter: expected mapping", file=sys.stderr)
+        sys.exit(1)
+    fm.update(frontmatter)

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 130-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/__main__.py` around lines 116 - 123, The parsed YAML payload in
data (from yaml.safe_load(data_yaml)) can be scalar/None and leads to
AttributeError when accessing data["skill"]; update the code to validate that
data is a dict and contains a non-empty "skill" string (e.g., check
isinstance(data, dict) and "skill" in data) and fail fast with a clear error if
not. For the SKILL.md lookup, change the skill discovery logic that uses
skill_hash/skill_path so it only computes hashlib.md5 when SKILL.md exists and
instead of leaving skill_hash as an empty string, raise an explicit error if no
SKILL.md is found for the given skill name (referencing variables skill_hash,
skill_path, and the skill value) so invalid eval metadata is never written.
agent_eval/state.py-50-51 (1)

50-51: ⚠️ Potential issue | 🟠 Major

Guard positional arguments before indexing sys.argv.

Line 51, Line 58, Line 68, Line 80, and Line 86 assume sys.argv[2] exists; malformed invocations crash with IndexError instead of emitting a controlled CLI error.

Proposed fix
+def _require_args(min_len: int, usage: str) -> None:
+    if len(sys.argv) < min_len:
+        print(f"Usage: {usage}", file=sys.stderr)
+        sys.exit(1)
+
 def main():
@@
     if cmd == "init":
+        _require_args(3, "state.py init <path> [key=value ...]")
         path = Path(sys.argv[2])
@@
     elif cmd == "set":
+        _require_args(3, "state.py set <path> [key=value ...]")
         path = Path(sys.argv[2])
@@
     elif cmd == "read":
+        _require_args(3, "state.py read <path>")
         path = sys.argv[2]
@@
     elif cmd == "write-ids":
+        _require_args(3, "state.py write-ids <path> [ID ...]")
         path = Path(sys.argv[2])
@@
     elif cmd == "read-ids":
+        _require_args(3, "state.py read-ids <path>")
         path = Path(sys.argv[2])

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 57-58, 67-68, 79-80, 85-86

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` around lines 50 - 51, The code indexes sys.argv[2]
directly in the command-handling branches (e.g., when cmd == "init" calling
Path(sys.argv[2])) which crashes on malformed invocations; add guards that
verify enough positional args before indexing (for example check len(sys.argv) >
2 or use argparse to parse required positional arguments) and emit a clear CLI
error/usage message instead of letting IndexError propagate; update every branch
that reads sys.argv[2] (the branches handling "init" and the other commands
referencing sys.argv[2]) to perform the guard and return a controlled error/exit
on missing args.
agent_eval/mlflow/experiment.py-53-56 (1)

53-56: ⚠️ Potential issue | 🟠 Major

Add a timeout to the MLflow subprocess invocation.

Line 53 can block forever if mlflow hangs, stalling the eval pipeline and causing cascading failures in orchestration.

Proposed fix
-    result = subprocess.run(cmd, capture_output=True, text=True)
+    try:
+        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
+    except subprocess.TimeoutExpired as exc:
+        print(f"MLflow autolog timed out: {exc}", file=sys.stderr)
+        return False

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/mlflow/experiment.py` around lines 53 - 56, The subprocess call
that runs the MLflow setup (the line using result = subprocess.run(cmd,
capture_output=True, text=True)) can hang; add a timeout argument (e.g.
timeout=some_seconds) to subprocess.run and wrap the call in a try/except for
subprocess.TimeoutExpired to handle timeouts: on timeout, print or log an error
including the timeout duration and return False, and keep the existing stderr
handling for non-zero return codes so the function returns True only on
successful completion.
skills/eval-analyze/scripts/discover.py-106-106 (1)

106-106: ⚠️ Potential issue | 🟠 Major

Replace MD5 with SHA-256 for file freshness hashing (CWE-327).

MD5 is cryptographically broken with practical collision attacks feasible in seconds to minutes on modern hardware. An attacker can craft two different skill files with identical MD5 hashes, allowing malicious skill content to bypass freshness detection. This violates NIST and IETF (RFC 6151) guidance against using MD5 where collision resistance is required—which applies to file integrity verification against tampering. Recent exploitation: CVE-2024-3596 (Blast-RADIUS) demonstrated real-world MD5 collision exploits in production systems.

Fix
-            current_hash = hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]
+            current_hash = hashlib.sha256(skill_path.read_bytes()).hexdigest()[:12]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/discover.py` at line 106, The line computing
current_hash uses hashlib.md5, which is vulnerable; replace MD5 with a SHA-256
digest by using hashlib.sha256 on the file bytes and take an appropriate prefix
(or full hex) for freshness checks. Update the expression that sets current_hash
(currently using hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]) to use
hashlib.sha256(skill_path.read_bytes()).hexdigest() (or a safer-length slice) so
the freshness comparison continues to work but with collision-resistant hashing.
agent_eval/config.py-124-127 (1)

124-127: ⚠️ Potential issue | 🟠 Major

project_root returns the wrong directory for non-CWD config loads.

Line 127 returns Path.cwd() but the docstring says “where eval.yaml lives.” This breaks relative path resolution whenever --config points outside the current working directory.

Remediation patch
     `@property`
     def project_root(self) -> Path:
         """Project root (where eval.yaml lives)."""
-        return Path.cwd()
+        return self.config_dir
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/config.py` around lines 124 - 127, The project_root property
currently always returns Path.cwd(), which is wrong when the config was loaded
from a file outside CWD; change project_root to derive the directory from the
loaded config file path (e.g., return Path(self.config_path).parent if
self.config_path or self._config_path exists), falling back to Path.cwd() when
no config path is available; update the property in config.py (project_root) to
use that attribute so relative paths resolve relative to where eval.yaml
actually lives.
skills/eval-run/scripts/workspace.py-111-123 (1)

111-123: ⚠️ Potential issue | 🟠 Major

Handle YAML/JSON parse errors per case instead of aborting the run.

Line 113-120 can raise parser exceptions and terminate workspace.py on one malformed case file.

Remediation patch
 def _read_input(case_dir):
@@
         if name.is_file() and name.suffix in (".yaml", ".yml"):
-            with open(name) as f:
-                data = yaml.safe_load(f)
+            try:
+                with open(name, encoding="utf-8") as f:
+                    data = yaml.safe_load(f)
+            except yaml.YAMLError:
+                continue
             if isinstance(data, dict):
                 return data
         elif name.is_file() and name.suffix == ".json":
             import json
-            with open(name) as f:
-                data = json.load(f)
+            try:
+                with open(name, encoding="utf-8") as f:
+                    data = json.load(f)
+            except json.JSONDecodeError:
+                continue
             if isinstance(data, dict):
                 return data

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/workspace.py` around lines 111 - 123, The loop that
reads case files (the for name in sorted(case_dir.iterdir()) block) can raise
YAML/JSON parser exceptions and abort the run; wrap each file's parsing in a
try/except that catches yaml.YAMLError, json.JSONDecodeError (and a broad
Exception fallback), log a warning including the filename (name) and error, and
continue to the next file instead of letting the exception propagate; only
return data when isinstance(data, dict) and skip non-dict or failed parses.
agent_eval/config.py-81-117 (1)

81-117: ⚠️ Potential issue | 🟠 Major

Harden YAML shape validation to avoid runtime crashes.

Line 85/98/105 assumes mapping/list item types. Malformed YAML (e.g., scalar outputs, non-dict judge entries) will crash with AttributeError instead of a controlled config error.

Remediation patch
-        with open(path) as f:
+        with open(path, encoding="utf-8") as f:
             raw = yaml.safe_load(f) or {}
+        if not isinstance(raw, dict):
+            raise ValueError("Config root must be a mapping/object")
@@
-        dataset = raw.get("dataset", {})
+        dataset = raw.get("dataset", {})
+        if not isinstance(dataset, dict):
+            raise ValueError("'dataset' must be a mapping/object")
@@
-        for o in raw.get("outputs", []):
+        outputs = raw.get("outputs", [])
+        if not isinstance(outputs, list):
+            raise ValueError("'outputs' must be a list")
+        for o in outputs:
+            if not isinstance(o, dict):
+                raise ValueError("Each outputs entry must be a mapping/object")
             config.outputs.append(OutputConfig(
@@
-        for j in raw.get("judges", []):
+        judges = raw.get("judges", [])
+        if not isinstance(judges, list):
+            raise ValueError("'judges' must be a list")
+        for j in judges:
+            if not isinstance(j, dict):
+                raise ValueError("Each judges entry must be a mapping/object")
             config.judges.append(JudgeConfig(

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/config.py` around lines 81 - 117, The YAML parsing assumes
specific shapes and will AttributeError on malformed values; validate types
before iterating or accessing mapping keys: ensure dataset = raw.get("dataset",
{}) is a dict (otherwise raise a clear ConfigError/ValueError), ensure
raw.get("outputs", []) is a list and each output item is a dict before creating
OutputConfig in the outputs loop, and ensure raw.get("judges", []) is a list and
each judge item is a dict before creating JudgeConfig in the judges loop; add
concise error messages that include the top-level config name (from cls(...)) or
path to help debugging and bail early instead of letting AttributeError
propagate.
skills/eval-run/scripts/collect.py-43-45 (1)

43-45: ⚠️ Potential issue | 🟠 Major

Validate case_order.yaml shape before use.

Line 44 can load None/scalar/object; later len(case_order) and index access then fail unpredictably.

Remediation patch
     with open(order_path) as f:
         case_order = yaml.safe_load(f)
+    if not isinstance(case_order, list):
+        print("ERROR: case_order.yaml must contain a list", file=sys.stderr)
+        sys.exit(1)

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/collect.py` around lines 43 - 45, The YAML load of
case_order (from order_path) can return None or a non-list which later causes
len(case_order) and index access to fail; update the code in collect.py after
the yaml.safe_load call to validate that case_order is a list (e.g.,
isinstance(case_order, list)) and raise a clear ValueError or exit with an
explanatory message including order_path when it is None or not a list so
downstream uses of case_order (len(case_order), indexing) are safe.
skills/eval-run/SKILL.md-23-27 (1)

23-27: ⚠️ Potential issue | 🟠 Major

Respect parsed config path; avoid hardcoded eval.yaml.

Line 26 and Line 44 bypass the user-supplied config path, so runs can validate one file and execute another.

Proposed fix
-Check if eval.yaml exists:
+Check if <config> exists:
@@
-test -f eval.yaml && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
+test -f <config> && echo "CONFIG_EXISTS" || echo "NO_CONFIG"
@@
-Read `dataset.path` from eval.yaml, verify the directory exists and contains case subdirectories.
+Read `dataset.path` from <config>, verify the directory exists and contains case subdirectories.

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 44-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/SKILL.md` around lines 23 - 27, The check currently hardcodes
"eval.yaml" (seen on the test command lines around line 26 and 44) which ignores
the user-supplied config path; update the commands to use the parsed config path
variable instead of the literal "eval.yaml" (e.g., replace occurrences of
"eval.yaml" with the variable that holds the supplied config path such as
config_path/CONFIG_PATH/parsedConfigPath in the script or template), ensuring
both the existence check and any subsequent uses reference that variable so
validation and execution use the same file.
skills/eval-optimize/SKILL.md-42-43 (1)

42-43: ⚠️ Potential issue | 🟠 Major

Remove unsupported --score from /eval-run invocations.

The workflow calls /eval-run with a flag that is not defined in the eval-run skill interface, so this step is brittle/failing by construction.

Proposed fix
-Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N> --score
+Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N>
@@
-Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N+1> --score --baseline <id>-iter-<N>
+Use the Skill tool to invoke /eval-run --config <config> --model <model> --run-id <id>-iter-<N+1> --baseline <id>-iter-<N>

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

Also applies to: 60-61

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-optimize/SKILL.md` around lines 42 - 43, The /eval-run example
invocation in SKILL.md includes an unsupported --score flag (e.g., "Use the
Skill tool to invoke /eval-run --config <config> --model <model> --run-id
<id>-iter-<N> --score"); remove the --score flag from that command and any other
occurrences (the other instance noted around the same block) so the listed Skill
invocation matches the eval-run skill interface, and update the example text to
use "/eval-run --config <config> --model <model> --run-id <id>-iter-<N>"
instead.
skills/eval-run/scripts/execute.py-45-45 (1)

45-45: ⚠️ Potential issue | 🟠 Major

High: --disallowed-tools is silently ignored, enabling policy bypass (CWE-693).

At Line 45 the flag is accepted, but it is never enforced before Line 71-Line 79 execution. Exploit scenario: an operator believes risky tools are blocked, but the run executes without those restrictions.

Proposed fix
@@
     parser.add_argument("--disallowed-tools", default=None)
     args = parser.parse_args()
+
+    if args.disallowed_tools:
+        parser.error(
+            "--disallowed-tools is not enforced by execute.py yet; refusing insecure no-op"
+        )

As per coding guidelines, "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)".

Also applies to: 71-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/execute.py` at line 45, The CLI flag
"--disallowed-tools" is parsed (parser.add_argument("--disallowed-tools")) but
never enforced, allowing policy bypass; parse args.disallowed_tools into a
normalized list (split comma/whitespace), validate entries, and before the
execution block that runs tools (the code around the current Line 71-79
execution), check and either remove those tools from the tool list or abort with
an error if a disallowed tool is requested; update any call sites that invoke
tool execution (e.g., the function or block that uses parsed args to select
tools) to consult args.disallowed_tools and enforce the restriction.
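
Fail-closed enforcement of the deny list might look like this sketch (the tool names and comma-separated flag format are assumptions based on the --disallowed-tools flag above):

```python
import sys

def enforce_tool_policy(requested, disallowed_arg):
    """Fail closed: abort the run if any requested tool is on the deny list."""
    disallowed = {t.strip() for t in (disallowed_arg or "").split(",") if t.strip()}
    blocked = sorted(set(requested) & disallowed)
    if blocked:
        print(f"ERROR: disallowed tools requested: {', '.join(blocked)}",
              file=sys.stderr)
        sys.exit(2)
    return requested

tools = enforce_tool_policy(["Read", "Bash"], "WebFetch,Write")
print(tools)
```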
agent_eval/agent/base.py-35-45 (1)

35-45: ⚠️ Potential issue | 🟠 Major

Use structured argv instead of raw args: str to reduce injection risk (CWE-88).

A raw string argument encourages downstream shell-style composition. If any runner implementation interpolates this into a command, crafted input can alter execution semantics.

Proposed refactor
-from typing import Optional
+from typing import Optional, Sequence
@@
-        args: str,
+        args: Sequence[str],

As per coding guidelines, "REVIEW PRIORITIES: 1. Security vulnerabilities" and "2. Architectural issues and anti-patterns".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/base.py` around lines 35 - 45, The run_skill signature
currently accepts a raw string args which risks command injection; change the
parameter to a structured sequence (e.g., argv: Sequence[str] or List[str]) in
the run_skill method and any overrides/implementations so callers pass
individual argument tokens, update all call sites to supply a list of args
instead of a joined string, and ensure any runner logic (e.g., subprocess
invocations) uses the argv list directly (no shell=True or string concatenation)
and documents the new signature in RunResult-related usage.
skills/eval-setup/scripts/check_env.py-64-74 (1)

64-74: ⚠️ Potential issue | 🟠 Major

Don't report MLflow as OK when it was never validated.

A non-empty MLFLOW_TRACKING_URI is accepted without any connectivity check, and the unset branch still appends ("mlflow_server", True, ...). That turns typos, dead remote URIs, and the no-server case into a green preflight result, which then lets /eval-setup skip actual tracking configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-setup/scripts/check_env.py` around lines 64 - 74, The code
currently treats any non-empty MLFLOW_TRACKING_URI as "OK" without validating
connectivity; update the logic so that when MLFLOW_TRACKING_URI is present you
still call _check_mlflow_server (or otherwise validate the URI) to determine
mlflow_ok and set mlflow_detail based on the result (e.g., actual URI on success
or an error/reachability message on failure); also change the checks.append call
to use the computed mlflow_ok (not the hardcoded True) and include the
validation result in the message so typos or unreachable URIs are reported
correctly.
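
A cheap structural sanity check for MLFLOW_TRACKING_URI could look like this sketch (the accepted scheme list is an assumption; a real preflight should also issue a short-timeout request to confirm the server is reachable):

```python
from urllib.parse import urlparse

# Schemes MLflow commonly accepts for tracking stores (partial, illustrative).
_DB_SCHEMES = {"sqlite", "postgresql", "mysql", "mssql"}

def looks_valid_tracking_uri(uri: str) -> bool:
    """Structural check only; it cannot prove the server is actually up."""
    parsed = urlparse(uri)
    if parsed.scheme in ("http", "https"):
        return bool(parsed.netloc)
    if parsed.scheme in _DB_SCHEMES or parsed.scheme == "file":
        return True
    # Bare local paths are valid file-store URIs and need no server.
    return uri.startswith("./") or uri.startswith("/")

print(looks_valid_tracking_uri("http://localhost:5000"))  # True
print(looks_valid_tracking_uri("htp://typo:5000"))        # False
```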
skills/eval-mlflow/SKILL.md-86-99 (1)

86-99: ⚠️ Potential issue | 🟠 Major

Persist and query a stable harness run identifier before touching MLflow.

This flow always starts a fresh MLflow run, searches traces only by experiment, and caps the trace search at 100 results. Reruns will duplicate MLflow runs, and feedback attachment can hit the wrong traces or miss later ones in larger evals. Tag runs and traces with the harness run_id/case_id, reuse the existing run when present, and paginate/filter the trace query.

Also applies to: 115-123, 139-140

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-mlflow/SKILL.md` around lines 86 - 99, Before calling
mlflow.start_run, persist and look up the harness run identifier (run_id and
case_id) and use it to find an existing MLflow run (querying by tag on the
harness run_id and using paginated/filtering trace results rather than a fixed
100 limit); if a matching run exists reuse it with
mlflow.start_run(run_id=existing_run_id) otherwise start a new run and
immediately mlflow.set_tag('harness_run_id', run_id) and tag each trace with
'harness_case_id' where appropriate; ensure all
mlflow.log_param/mlflow.log_metric/mlflow.log_artifact calls reference that
stable run and remove hard-coded placeholders so reruns attach to the correct
existing run and traces are discovered via paginated/filtered queries.
skills/eval-setup/SKILL.md-16-18 (1)

16-18: ⚠️ Potential issue | 🟠 Major

Don't convert an invalid setup state into a false "ready".

check_env.py does not validate the tracing hook, so the Step 1 shortcut can skip Steps 3-5 while .claude/settings.json is still missing. These later snippets also suppress EvalConfig.from_yaml() failures and retry without --config, which turns a malformed eval.yaml into a green no-config result. Fail fast on invalid config, and only short-circuit once tracing is already configured.

Also applies to: 97-107, 114-115
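The intended fast-path can be sketched like this; `tracing_configured`, `check_ready`, and the `"hooks"` key shape are illustrative assumptions about the settings file, and `load_config` stands in for `EvalConfig.from_yaml()` so its errors propagate rather than being swallowed:

```python
import json
from pathlib import Path

def tracing_configured(project_root: Path) -> bool:
    """True only if .claude/settings.json exists, parses, and declares hooks."""
    settings = project_root / ".claude" / "settings.json"
    if not settings.is_file():
        return False
    try:
        data = json.loads(settings.read_text())
    except json.JSONDecodeError:
        return False  # a corrupt settings file is not a configured one
    return isinstance(data, dict) and bool(data.get("hooks"))

def check_ready(project_root: Path, load_config) -> bool:
    """Short-circuit to 'ready' only when tracing is set up and config parses.

    load_config stands in for EvalConfig.from_yaml; any exception it raises
    propagates, so a malformed eval.yaml fails fast instead of going green.
    """
    if not tracing_configured(project_root):
        return False
    load_config()  # no try/except: parse errors must surface, not retry bare
    return True
```

The key property: a missing hook or a broken config both leave the environment non-ready; neither is silently converted into success.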

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-setup/SKILL.md` around lines 16 - 18, check_env.py currently
short-circuits to "ready" without verifying the tracing hook or valid config;
update the readiness shortcut so it only returns success after confirming
tracing is configured (presence and validity of .claude/settings.json or the
tracing hook) and do not swallow EvalConfig.from_yaml() exceptions — let
from_yaml raise and fail fast instead of falling back to retry without --config.
Specifically: in check_env.py ensure the Step 1 fast-path validates tracing
setup (the tracing hook check and .claude/settings.json) before skipping Steps
3–5, and in the code paths that call EvalConfig.from_yaml() do not catch/ignore
parsing errors and retry silently with no --config; propagate the error to
surface malformed eval.yaml and treat it as a non-ready state.
skills/eval-mlflow/SKILL.md-76-97 (1)

76-97: ⚠️ Potential issue | 🟠 Major

regressions_detected is not available in summary.yaml.

skills/eval-run/scripts/score.py only merges run_id, judges, per_case, and pairwise into the summary; regression detection is printed to stdout and never persisted. This step needs to recompute regressions from config.thresholds or have the scorer write them explicitly before it can set this tag correctly.
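Either way, the derivation is the same comparison; a sketch assuming thresholds map judge name to a minimum mean score (the `judges`/`mean` key names are illustrative stand-ins for whatever score.py actually writes into summary.yaml):

```python
def detect_regressions(summary: dict, thresholds: dict) -> dict:
    """Return {judge: (value, minimum)} for every judge below its threshold."""
    regressions = {}
    judges = summary.get("judges", {}) or {}
    for judge, minimum in (thresholds or {}).items():
        value = (judges.get(judge) or {}).get("mean")
        if value is None or value < minimum:
            regressions[judge] = (value, minimum)
    return regressions

summary = {"judges": {"accuracy": {"mean": 0.91}, "style": {"mean": 0.55}}}
thresholds = {"accuracy": 0.9, "style": 0.7}
regs = detect_regressions(summary, thresholds)
# summary.yaml could then carry the boolean the MLflow step needs:
summary["regressions_detected"] = bool(regs)
```

Running the same function in both the scorer and the MLflow step is what keeps the tag consistent with the report.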

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-mlflow/SKILL.md` around lines 76 - 97, The MLflow logging step
cannot set the 'regressions_detected' tag because summary.yaml lacks that field;
update either the scorer or the MLflow step: modify
skills/eval-run/scripts/score.py (the function that builds the summary) to
compute regressions by comparing each judge's metric(s) against
config.thresholds and persist a boolean/enum 'regressions_detected' into the
summary before writing summary.yaml, or keep score.py as-is and recompute
regressions in SKILL.md's MLflow logging block by loading config.thresholds and
the judges' mean/pass_rate from summary.yaml and deriving the same boolean to
pass to mlflow.set_tag('regressions_detected', ...). Ensure you reference the
same judge metric names used in the summary and config.thresholds so the
computed tag matches the scorer's logic.
skills/eval-setup/SKILL.md-78-82 (1)

78-82: ⚠️ Potential issue | 🟠 Major

[Major][CWE-94] Stop embedding shell-expanded values inside python3 -c.

$(pwd) and MLFLOW_TRACKING_URI are injected into single-quoted Python literals here. A quote or newline in either value breaks the command, and a crafted tracking URI can execute arbitrary Python on the operator machine. Read both values from Python instead.

Suggested fix
-```bash
-python3 -c "
-from agent_eval.mlflow.experiment import setup_autolog
-setup_autolog('$(pwd)', tracking_uri='${MLFLOW_TRACKING_URI:-http://127.0.0.1:5000}')
-"
-```
+```bash
+python3 - <<'PY'
+import os
+from agent_eval.mlflow.experiment import setup_autolog
+
+setup_autolog(
+    os.getcwd(),
+    tracking_uri=os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000"),
+)
+PY
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-setup/SKILL.md` around lines 78 - 82, The current shell snippet
injects $(pwd) and MLFLOW_TRACKING_URI into a single-quoted Python literal which
is unsafe; change the invocation to run a Python script that reads values inside
Python (use os.getcwd() for the working dir and
os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000") for the tracking
URI) and then call setup_autolog from agent_eval.mlflow.experiment with those
values (refer to setup_autolog to locate the call); use a here-doc or a small
Python file instead of python3 -c with shell-expanded values.
skills/eval-run/scripts/score.py-538-556 (1)

538-556: ⚠️ Potential issue | 🟠 Major

Handle missing thresholds before calling detect_regressions().

cmd_regression() passes config.thresholds straight through. When eval.yaml omits thresholds, this crashes with 'NoneType' object has no attribute 'items' instead of returning a clean "no thresholds configured" result.

Suggested fix
 def cmd_regression(args):
     config = EvalConfig.from_yaml(args.config)
+    if not config.thresholds:
+        print("No thresholds configured.")
+        return
     summary_path = RUNS_DIR / args.run_id / "summary.yaml"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 538 - 556, cmd_regression
currently passes config.thresholds directly to detect_regressions which breaks
when thresholds is None; update cmd_regression (after config =
EvalConfig.from_yaml(...)) to check if config.thresholds is falsy and either set
thresholds = {} before calling detect_regressions or short-circuit with a clear
message/exit indicating "no thresholds configured" so detect_regressions is
never invoked with None; reference config.thresholds, cmd_regression, and
detect_regressions when making the change.
agent_eval/agent/claude_code.py-79-111 (1)

79-111: ⚠️ Potential issue | 🟠 Major

Subprocess deadlock risk and timeout not enforced on process lifetime.

This pattern creates two critical issues:

  1. Stderr deadlock (CWE-667): stdout is read in a blocking loop while stderr is drained only after the loop completes. If the subprocess fills the stderr buffer (default ~64KB on Linux) before closing stdout, it blocks on the write while the parent is blocked reading stdout: a classic deadlock.

  2. Timeout ineffective: The hard-coded wait(timeout=30) applies only after stdout iteration finishes. If stdout flow stops while the process is hung, the loop blocks indefinitely. The timeout_s parameter is never applied to the process itself.

Fix: Drain stdout and stderr concurrently using threading or concurrent.futures, and apply timeout_s to proc.wait().
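The concurrent-drain shape looks roughly like this (names are illustrative; the real runner would keep its existing JSON-line parsing inside the stdout sink):

```python
import subprocess
import threading

def run_with_timeout(cmd, timeout_s):
    """Drain stdout and stderr concurrently and bound the whole process lifetime."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    stdout_lines, stderr_chunks = [], []

    def drain(stream, sink):
        for line in stream:  # readline loop; returns when the pipe closes
            sink.append(line)

    threads = [
        threading.Thread(target=drain, args=(proc.stdout, stdout_lines)),
        threading.Thread(target=drain, args=(proc.stderr, stderr_chunks)),
    ]
    for t in threads:
        t.start()
    try:
        proc.wait(timeout=timeout_s)  # timeout_s bounds the process, not just cleanup
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    for t in threads:
        t.join()
    return proc.returncode, stdout_lines, "".join(stderr_chunks)
```

Each pipe has a dedicated reader, so a chatty stderr can no longer wedge the stdout loop, and killing the process on timeout closes both pipes and unblocks the reader threads.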

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 79 - 111, The current
subprocess handling reads proc.stdout in a blocking loop and only reads
proc.stderr after, which risks deadlock and never applies the provided timeout_s
to the process lifetime; change the logic around the subprocess.Popen(...) call
and the subsequent I/O so that stdout and stderr are drained concurrently (e.g.,
spawn threads or use concurrent.futures to read proc.stdout and proc.stderr in
parallel or use proc.communicate with a timeout), ensure proc.wait or
proc.communicate is called with the timeout_s parameter, and preserve existing
behavior of parsing JSON lines and setting result_obj from proc.stdout
(references: the subprocess.Popen(...) call, proc.stdout loop,
proc.stderr.read(), proc.wait(timeout=30), and the timeout_s parameter).
🟡 Minor comments (4)
agent_eval/state.py-88-88 (1)

88-88: ⚠️ Potential issue | 🟡 Minor

Rename ambiguous loop variable l.

Line 88 uses l, which is ambiguous and hurts readability. Use line (or similar) to avoid confusion and satisfy linting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` at line 88, Rename the ambiguous list comprehension
variable `l` to a clearer name like `line` in the assignment that builds `ids`
(the expression using path.read_text().splitlines()); update the comprehension
from `[l.strip() for l in ... if l.strip()]` to use `line` so it becomes
`[line.strip() for line in ... if line.strip()]` to improve readability and
satisfy linters.
README.md-7-7 (1)

7-7: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced blocks (MD040).

These fences are missing language tags, so markdownlint keeps flagging them. Use explicit tags (text, bash, yaml, json) for each block.

Also applies to: 60-60, 68-68, 79-79, 294-294, 303-303, 314-314, 324-324, 330-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 7, The fenced code blocks in README.md are missing
language identifiers and failing markdownlint (MD040); update each
triple-backtick fence at the reported locations to include an explicit language
tag (e.g., ```text, ```bash, ```yaml, ```json) appropriate to the block contents
so markdownlint passes — search for the unnamed fenced code blocks and add the
correct language identifier for each occurrence.
skills/eval-run/SKILL.md-31-31 (1)

31-31: ⚠️ Potential issue | 🟡 Minor

Add language tags to fenced blocks (MD040).

These code fences should declare a language (for example text) to satisfy markdownlint.

Also applies to: 141-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/SKILL.md` at line 31, The markdown file SKILL.md contains
unnamed fenced code blocks that trigger markdownlint MD040; update each fenced
block by appending an appropriate language tag (e.g., ```text, ```bash,
```javascript) to the opening fence so all code fences declare a language and
satisfy the linter.
skills/eval-optimize/SKILL.md-12-12 (1)

12-12: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to fenced blocks (MD040).

Use explicit fence languages (text or bash) for these blocks to clear markdownlint warnings.

Also applies to: 41-41, 59-59

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-optimize/SKILL.md` at line 12, Replace the bare triple-backtick
fenced code blocks in SKILL.md with language-tagged fences (e.g., change ``` to
```text or ```bash) so markdownlint MD040 is satisfied; specifically update the
fenced blocks at the reported locations (the ``` fence at line 12 and the other
occurrences around lines 41 and 59) to use the appropriate language identifier
for each block.
🧹 Nitpick comments (2)
.gitignore (1)

1-9: Consider adding common Python development artifacts.

Standard development tools produce artifacts that should be ignored: .pytest_cache/, .coverage, .mypy_cache/, .ruff_cache/, .tox/, htmlcov/.

📦 Proposed additions
 # Python
 __pycache__/
 *.py[cod]
 *.egg-info/
 *.egg
 dist/
 build/
 .venv/
 venv/
+.pytest_cache/
+.coverage
+.mypy_cache/
+.ruff_cache/
+.tox/
+htmlcov/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 1 - 9, Update the project .gitignore to include
common Python dev artifact entries that are missing: add .pytest_cache/,
.coverage, .mypy_cache/, .ruff_cache/, .tox/, and htmlcov/ alongside the
existing entries (__pycache__/, *.py[cod], *.egg-info/, *.egg, dist/, build/,
.venv/, venv/); ensure these exact patterns are appended so test, coverage,
type-checker and formatter caches and test reports are ignored by VCS.
agent_eval/agent/__init__.py (1)

10-10: Sort __all__ to satisfy Ruff RUF022.

Line 10 is currently unsorted and will keep producing lint noise if Ruff is enforced.

Remediation patch
-__all__ = ["EvalRunner", "RunResult", "ClaudeCodeRunner", "RUNNERS"]
+__all__ = ["ClaudeCodeRunner", "EvalRunner", "RUNNERS", "RunResult"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/__init__.py` at line 10, The module-level __all__ list is
unsorted (currently contains "EvalRunner", "RunResult", "ClaudeCodeRunner",
"RUNNERS") which triggers Ruff RUF022; reorder the entries in __all__ to be
alphabetically sorted (e.g., "ClaudeCodeRunner", "EvalRunner", "RUNNERS",
"RunResult") so the symbol list is deterministic and lint-clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Around line 1-25: Update the .gitignore to explicitly exclude common secret
and credential files so keys and envs never get committed: add patterns for .env
and .env.* files, .secrets, credentials.json, secrets.json,
service_account.json, api_key* and *-key.*, *.pem, *.key, .netrc, .aws/
(including credentials), .gcp/, .azure/, .ssh/* (except known_hosts if desired),
and any local IDE/OS-specific secret files; also ensure mlflow.db and mlruns/
remain ignored and add any project-specific secret filenames used by the Claude
evaluation harness (e.g., claude_api_key.txt or eval_secrets/) so these unique
filenames are covered in the .gitignore.

In `@agent_eval/agent/claude_code.py`:
- Around line 24-30: The constructor currently defaults to forwarding the entire
parent environment and only supports a denylist via _env_strip; change this to
build child process env from a minimal allowlist instead of os.environ.copy():
define a small default ALLOWED_ENV_KEYS (e.g., PATH, HOME, LANG or the explicit
vars the harness needs) and construct the environment by selecting only those
keys from os.environ and then applying any explicit opt-ins; update the places
that currently use _env_strip and any other env-building code (the constructor
field _env_strip and the similar env-construction later in the file) to accept
an allowlist/opt-in mechanism and never copy the full os.environ into child
skill runs.

In `@agent_eval/config.py`:
- Around line 98-102: The loop that appends OutputConfig entries from
raw.get("outputs") currently accepts uncontrolled paths (see config.outputs and
OutputConfig construction), enabling path traversal; validate and sanitize each
o.get("path") before appending by (1) rejecting absolute paths and any path
containing parent-directory references like "..", (2) normalizing the path with
os.path.normpath, (3) resolving it against the allowed workspace root (join with
the workspace directory) and ensuring the resulting path stays inside that
workspace using os.path.commonpath, and (4) raise/log an error or skip the
output entry if validation fails so only safe, workspace-contained relative
paths are stored in config.outputs.

In `@skills/eval-run/scripts/collect.py`:
- Around line 51-53: Validate and canonicalize the user-controlled path
fragments before joining: ensure case_id and output_cfg.path (out_path) are not
absolute and do not contain path traversal (..); resolve the joined path (e.g.,
workspace / out_path and workspace / "cases" / case_id) to an absolute path and
verify it is a child of the intended base (workspace.resolve() or
output_root.resolve()) before any exists()/mkdir()/copy operations; reject or
sanitize inputs that start with "/" or contain ".." or produce a resolved path
outside the base, and enforce a whitelist of allowed characters/length for
case_id to prevent tricks — update the logic around variables src_dir, out_path,
and any code that constructs case-specific paths to perform these checks before
proceeding.

In `@skills/eval-run/scripts/score.py`:
- Around line 207-223: Ensure prompt_file and each context path are resolved
under project_root and reject symlinks/escapes: when handling jc.prompt_file
(prompt_path) and each ctx_path (path) call Path.resolve() against
project_root.resolve(), verify the resolved path is inside project_root (use
Path.is_relative_to(project_root.resolve()) or equivalent), reject and raise
ValueError/FileNotFoundError if the resolved path is outside project_root or if
path.is_symlink() is True, and only then read_text(); keep the existing
variables prompt_path, path, jc.prompt_file, jc.context, project_root and kwargs
("name"/"instructions") unchanged.
- Around line 51-63: The loop is vulnerable to path traversal and symlink escape
via config.outputs/ output.path and rglob; before iterating, resolve and
validate artifact_dir against case_dir (use case_dir_resolved =
case_dir.resolve() and artifact_dir_resolved = (case_dir /
out_path).resolve(strict=False)) and skip if artifact_dir_resolved is not inside
case_dir_resolved; then for each discovered file f, skip non-files and any
symlinks (f.is_symlink()) and resolve f (f_resolved = f.resolve(strict=False))
and ensure f_resolved is inside case_dir_resolved (use relative_to or equivalent
check) before reading; only then read_text() or mark as binary, and record using
the path relative to case_dir_resolved.
- Around line 179-186: The _make_inline_check function currently compiles and
execs untrusted jc.check source and must be removed or replaced; instead, change
_make_inline_check to reject inline Python code and accept only trusted
references (e.g., "module:function" strings) from jc.check, validate the format,
whitelist allowed module names, use importlib.import_module to import the module
and getattr to fetch the function (e.g., check_fn = getattr(module, func_name)),
verify the resolved object is callable, and raise a clear error if validation
fails; do not use compile/exec or pass __builtins__ into exec anywhere in
_make_inline_check or related helpers.

In `@skills/eval-run/scripts/workspace.py`:
- Around line 91-99: Reject any absolute or path-traversal symlink entries
before creating links: validate entries from args.symlinks (and resulting
symlink_names) by rejecting strings that are absolute (Path.is_absolute()) or
contain parent segments (".." or normalized path that escapes) and ensure
resolved target path is within project_root (e.g., verify
target.resolve().is_relative_to(project_root) or equivalent check); if
validation fails, skip creating the symlink and emit a clear warning/error; also
validate that the link path will reside inside workspace before calling
link.symlink_to(target.resolve()) to prevent creating links outside allowed
directories.
- Around line 55-58: The code builds a workspace Path with untrusted args.run_id
and blindly rmtree's it (variables: workspace, args.run_id, Path,
shutil.rmtree); to fix, validate/sanitize run_id (allow only a strict safe regex
e.g. alphanumerics, hyphen, underscore) or reject otherwise, derive the
workspace under a fixed canonical base like Path("/tmp/agent-eval").resolve(),
construct the candidate path with base.joinpath(safe_run_id) and call .resolve()
then verify the resolved path is a subpath of the base (reject if not); create
the directory atomically (use tempfile.mkdtemp or Path.mkdir(parents=True,
exist_ok=False) with safe permissions) instead of predicting names, and replace
the unconditional shutil.rmtree with a safe removal that ensures the path is
inside the base (and is not a symlink) before deleting.

---

Major comments:
In `@agent_eval/__main__.py`:
- Around line 116-123: The parsed YAML payload in data (from
yaml.safe_load(data_yaml)) can be scalar/None and leads to AttributeError when
accessing data["skill"]; update the code to validate that data is a dict and
contains a non-empty "skill" string (e.g., check isinstance(data, dict) and
"skill" in data) and fail fast with a clear error if not. For the SKILL.md
lookup, change the skill discovery logic that uses skill_hash/skill_path so it
only computes hashlib.md5 when SKILL.md exists and instead of leaving skill_hash
as an empty string, raise an explicit error if no SKILL.md is found for the
given skill name (referencing variables skill_hash, skill_path, and the skill
value) so invalid eval metadata is never written.

In `@agent_eval/agent/base.py`:
- Around line 35-45: The run_skill signature currently accepts a raw string args
which risks command injection; change the parameter to a structured sequence
(e.g., argv: Sequence[str] or List[str]) in the run_skill method and any
overrides/implementations so callers pass individual argument tokens, update all
call sites to supply a list of args instead of a joined string, and ensure any
runner logic (e.g., subprocess invocations) uses the argv list directly (no
shell=True or string concatenation) and documents the new signature in
RunResult-related usage.

In `@agent_eval/agent/claude_code.py`:
- Around line 79-111: The current subprocess handling reads proc.stdout in a
blocking loop and only reads proc.stderr after, which risks deadlock and never
applies the provided timeout_s to the process lifetime; change the logic around
the subprocess.Popen(...) call and the subsequent I/O so that stdout and stderr
are drained concurrently (e.g., spawn threads or use concurrent.futures to read
proc.stdout and proc.stderr in parallel or use proc.communicate with a timeout),
ensure proc.wait or proc.communicate is called with the timeout_s parameter, and
preserve existing behavior of parsing JSON lines and setting result_obj from
proc.stdout (references: the subprocess.Popen(...) call, proc.stdout loop,
proc.stderr.read(), proc.wait(timeout=30), and the timeout_s parameter).

In `@agent_eval/config.py`:
- Around line 124-127: The project_root property currently always returns
Path.cwd(), which is wrong when the config was loaded from a file outside CWD;
change project_root to derive the directory from the loaded config file path
(e.g., return Path(self.config_path).parent if self.config_path or
self._config_path exists), falling back to Path.cwd() when no config path is
available; update the property in config.py (project_root) to use that attribute
so relative paths resolve relative to where eval.yaml actually lives.
- Around line 81-117: The YAML parsing assumes specific shapes and will
AttributeError on malformed values; validate types before iterating or accessing
mapping keys: ensure dataset = raw.get("dataset", {}) is a dict (otherwise raise
a clear ConfigError/ValueError), ensure raw.get("outputs", []) is a list and
each output item is a dict before creating OutputConfig in the outputs loop, and
ensure raw.get("judges", []) is a list and each judge item is a dict before
creating JudgeConfig in the judges loop; add concise error messages that include
the top-level config name (from cls(...)) or path to help debugging and bail
early instead of letting AttributeError propagate.

In `@agent_eval/mlflow/experiment.py`:
- Around line 53-56: The subprocess call that runs the MLflow setup (the line
using result = subprocess.run(cmd, capture_output=True, text=True)) can hang;
add a timeout argument (e.g. timeout=some_seconds) to subprocess.run and wrap
the call in a try/except for subprocess.TimeoutExpired to handle timeouts: on
timeout, print or log an error including the timeout duration and return False,
and keep the existing stderr handling for non-zero return codes so the function
returns True only on successful completion.

In `@agent_eval/state.py`:
- Around line 50-51: The code indexes sys.argv[2] directly in the
command-handling branches (e.g., when cmd == "init" calling Path(sys.argv[2]))
which crashes on malformed invocations; add guards that verify enough positional
args before indexing (for example check len(sys.argv) > 2 or use argparse to
parse required positional arguments) and emit a clear CLI error/usage message
instead of letting IndexError propagate; update every branch that reads
sys.argv[2] (the branches handling "init" and the other commands referencing
sys.argv[2]) to perform the guard and return a controlled error/exit on missing
args.

In `@skills/eval-analyze/scripts/discover.py`:
- Line 106: The line computing current_hash uses hashlib.md5, which is
vulnerable; replace MD5 with a SHA-256 digest by using hashlib.sha256 on the
file bytes and take an appropriate prefix (or full hex) for freshness checks.
Update the expression that sets current_hash (currently using
hashlib.md5(skill_path.read_bytes()).hexdigest()[:12]) to use
hashlib.sha256(skill_path.read_bytes()).hexdigest() (or a safer-length slice) so
the freshness comparison continues to work but with collision-resistant hashing.

In `@skills/eval-mlflow/SKILL.md`:
- Around line 86-99: Before calling mlflow.start_run, persist and look up the
harness run identifier (run_id and case_id) and use it to find an existing
MLflow run (querying by tag on the harness run_id and using paginated/filtering
trace results rather than a fixed 100 limit); if a matching run exists reuse it
with mlflow.start_run(run_id=existing_run_id) otherwise start a new run and
immediately mlflow.set_tag('harness_run_id', run_id) and tag each trace with
'harness_case_id' where appropriate; ensure all
mlflow.log_param/mlflow.log_metric/mlflow.log_artifact calls reference that
stable run and remove hard-coded placeholders so reruns attach to the correct
existing run and traces are discovered via paginated/filtered queries.
- Around line 76-97: The MLflow logging step cannot set the
'regressions_detected' tag because summary.yaml lacks that field; update either
the scorer or the MLflow step: modify skills/eval-run/scripts/score.py (the
function that builds the summary) to compute regressions by comparing each
judge's metric(s) against config.thresholds and persist a boolean/enum
'regressions_detected' into the summary before writing summary.yaml, or keep
score.py as-is and recompute regressions in SKILL.md's MLflow logging block by
loading config.thresholds and the judges' mean/pass_rate from summary.yaml and
deriving the same boolean to pass to mlflow.set_tag('regressions_detected',
...). Ensure you reference the same judge metric names used in the summary and
config.thresholds so the computed tag matches the scorer's logic.

In `@skills/eval-optimize/SKILL.md`:
- Around line 42-43: The /eval-run example invocation in SKILL.md includes an
unsupported --score flag (e.g., "Use the Skill tool to invoke /eval-run --config
<config> --model <model> --run-id <id>-iter-<N> --score"); remove the --score
flag from that command and any other occurrences (the other instance noted
around the same block) so the listed Skill invocation matches the eval-run skill
interface, and update the example text to use "/eval-run --config <config>
--model <model> --run-id <id>-iter-<N>" instead.

In `@skills/eval-run/scripts/collect.py`:
- Around line 43-45: The YAML load of case_order (from order_path) can return
None or a non-list which later causes len(case_order) and index access to fail;
update the code in collect.py after the yaml.safe_load call to validate that
case_order is a list (e.g., isinstance(case_order, list)) and raise a clear
ValueError or exit with an explanatory message including order_path when it is
None or not a list so downstream uses of case_order (len(case_order), indexing)
are safe.

In `@skills/eval-run/scripts/execute.py`:
- Line 45: The CLI flag "--disallowed-tools" is parsed
(parser.add_argument("--disallowed-tools")) but never enforced, allowing policy
bypass; parse args.disallowed_tools into a normalized list (split
comma/whitespace), validate entries, and before the execution block that runs
tools (the code around the current Line 71-79 execution), check and either
remove those tools from the tool list or abort with an error if a disallowed
tool is requested; update any call sites that invoke tool execution (e.g., the
function or block that uses parsed args to select tools) to consult
args.disallowed_tools and enforce the restriction.

In `@skills/eval-run/scripts/score.py`:
- Around line 538-556: cmd_regression currently passes config.thresholds
directly to detect_regressions which breaks when thresholds is None; update
cmd_regression (after config = EvalConfig.from_yaml(...)) to check if
config.thresholds is falsy and either set thresholds = {} before calling
detect_regressions or short-circuit with a clear message/exit indicating "no
thresholds configured" so detect_regressions is never invoked with None;
reference config.thresholds, cmd_regression, and detect_regressions when making
the change.

In `@skills/eval-run/scripts/workspace.py`:
- Around line 111-123: The loop that reads case files (the for name in
sorted(case_dir.iterdir()) block) can raise YAML/JSON parser exceptions and
abort the run; wrap each file's parsing in a try/except that catches
yaml.YAMLError, json.JSONDecodeError (and a broad Exception fallback), log a
warning including the filename (name) and error, and continue to the next file
instead of letting the exception propagate; only return data when
isinstance(data, dict) and skip non-dict or failed parses.

In `@skills/eval-run/SKILL.md`:
- Around line 23-27: The check currently hardcodes "eval.yaml" (seen on the test
command lines around line 26 and 44) which ignores the user-supplied config
path; update the commands to use the parsed config path variable instead of the
literal "eval.yaml" (e.g., replace occurrences of "eval.yaml" with the variable
that holds the supplied config path such as
config_path/CONFIG_PATH/parsedConfigPath in the script or template), ensuring
both the existence check and any subsequent uses reference that variable so
validation and execution use the same file.

In `@skills/eval-setup/scripts/check_env.py`:
- Around line 64-74: The code currently treats any non-empty MLFLOW_TRACKING_URI
as "OK" without validating connectivity; update the logic so that when
MLFLOW_TRACKING_URI is present you still call _check_mlflow_server (or otherwise
validate the URI) to determine mlflow_ok and set mlflow_detail based on the
result (e.g., actual URI on success or an error/reachability message on
failure); also change the checks.append call to use the computed mlflow_ok (not
the hardcoded True) and include the validation result in the message so typos or
unreachable URIs are reported correctly.

In `@skills/eval-setup/SKILL.md`:
- Around line 16-18: check_env.py currently short-circuits to "ready" without
verifying the tracing hook or valid config; update the readiness shortcut so it
only returns success after confirming tracing is configured (presence and
validity of .claude/settings.json or the tracing hook) and do not swallow
EvalConfig.from_yaml() exceptions — let from_yaml raise and fail fast instead of
falling back to retry without --config. Specifically: in check_env.py ensure the
Step 1 fast-path validates tracing setup (the tracing hook check and
.claude/settings.json) before skipping Steps 3–5, and in the code paths that
call EvalConfig.from_yaml() do not catch/ignore parsing errors and retry
silently with no --config; propagate the error to surface malformed eval.yaml
and treat it as a non-ready state.
- Around line 78-82: The current shell snippet injects $(pwd) and
MLFLOW_TRACKING_URI into a single-quoted Python literal which is unsafe; change
the invocation to run a Python script that reads values inside Python (use
os.getcwd() for the working dir and os.environ.get("MLFLOW_TRACKING_URI",
"http://127.0.0.1:5000") for the tracking URI) and then call setup_autolog from
agent_eval.mlflow.experiment with those values (refer to setup_autolog to locate
the call); use a here-doc or a small Python file instead of python3 -c with
shell-expanded values.

---

Minor comments:
In `@agent_eval/state.py`:
- Line 88: Rename the ambiguous list comprehension variable `l` to a clearer
name like `line` in the assignment that builds `ids` (the expression using
path.read_text().splitlines()); update the comprehension from `[l.strip() for l
in ... if l.strip()]` to use `line` so it becomes `[line.strip() for line in ...
if line.strip()]` to improve readability and satisfy linters.

In `@README.md`:
- Line 7: The fenced code blocks in README.md are missing language identifiers
and failing markdownlint (MD040); update each triple-backtick fence at the
reported locations to include an explicit language tag (e.g., ```text, ```bash,
```yaml, ```json) appropriate to the block contents so markdownlint passes —
search for the unnamed fenced code blocks and add the correct language
identifier for each occurrence.

In `@skills/eval-optimize/SKILL.md`:
- Line 12: Replace the bare triple-backtick fenced code blocks in SKILL.md with
language-tagged fences (e.g., change ``` to ```text or ```bash) so markdownlint
MD040 is satisfied; specifically update the fenced blocks at the reported
locations (the ``` fence at line 12 and the other occurrences around lines 41
and 59) to use the appropriate language identifier for each block.

In `@skills/eval-run/SKILL.md`:
- Line 31: The markdown file SKILL.md contains unnamed fenced code blocks that
trigger markdownlint MD040; update each fenced block by appending an appropriate
language tag (e.g., ```text, ```bash, ```javascript) to the opening fence so all
code fences declare a language and satisfy the linter.

---

Nitpick comments:
In @.gitignore:
- Around line 1-9: Update the project .gitignore to include common Python dev
artifact entries that are missing: add .pytest_cache/, .coverage, .mypy_cache/,
.ruff_cache/, .tox/, and htmlcov/ alongside the existing entries (__pycache__/,
*.py[cod], *.egg-info/, *.egg, dist/, build/, .venv/, venv/); ensure these exact
patterns are appended so test, coverage, type-checker and formatter caches and
test reports are ignored by VCS.
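Spelled out, the suggested additions (appended verbatim to the existing entries):

```text
.pytest_cache/
.coverage
.mypy_cache/
.ruff_cache/
.tox/
htmlcov/
```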

In `@agent_eval/agent/__init__.py`:
- Line 10: The module-level __all__ list is unsorted (currently contains
"EvalRunner", "RunResult", "ClaudeCodeRunner", "RUNNERS") which triggers Ruff
RUF022; reorder the entries in __all__ to be alphabetically sorted (e.g.,
"ClaudeCodeRunner", "EvalRunner", "RUNNERS", "RunResult") so the symbol list is
deterministic and lint-clean.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 9269d9b3-c24c-4275-90ed-df28d5bd0b14

📥 Commits

Reviewing files that changed from the base of the PR and between 90f5996 and bc48951.

📒 Files selected for processing (30)
  • .claude-plugin/plugin.json
  • .gitignore
  • CLAUDE.md
  • README.md
  • agent_eval/__init__.py
  • agent_eval/__main__.py
  • agent_eval/agent/__init__.py
  • agent_eval/agent/base.py
  • agent_eval/agent/claude_code.py
  • agent_eval/config.py
  • agent_eval/mlflow/__init__.py
  • agent_eval/mlflow/experiment.py
  • agent_eval/state.py
  • eval.yaml
  • pyproject.toml
  • skills/eval-analyze/SKILL.md
  • skills/eval-analyze/prompts/analyze-skill.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • skills/eval-analyze/scripts/discover.py
  • skills/eval-mlflow/SKILL.md
  • skills/eval-optimize/SKILL.md
  • skills/eval-run/SKILL.md
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-run/scripts/collect.py
  • skills/eval-run/scripts/execute.py
  • skills/eval-run/scripts/score.py
  • skills/eval-run/scripts/workspace.py
  • skills/eval-setup/SKILL.md
  • skills/eval-setup/scripts/check_env.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

♻️ Duplicate comments (5)
skills/eval-run/scripts/workspace.py (2)

91-99: ⚠️ Potential issue | 🔴 Critical

Critical: unvalidated symlink entries allow workspace/path escape (CWE-22, CWE-73).

Lines 91-99 accept raw --symlinks entries and use them directly for both source and link paths. Exploit scenario: entries like ../../outside or absolute paths can create links outside the workspace or expose unintended host files.

Remediation patch
+def _validate_relative_entry(name: str) -> str:
+    p = Path(name)
+    if p.is_absolute() or ".." in p.parts:
+        raise ValueError(f"Invalid symlink entry: {name}")
+    return name
@@
     for name in symlink_names:
-        target = project_root / name
-        link = workspace / name
+        try:
+            safe_name = _validate_relative_entry(name)
+        except ValueError as e:
+            print(f"ERROR: {e}", file=sys.stderr)
+            sys.exit(1)
+
+        target = (project_root / safe_name).resolve()
+        link = (workspace / safe_name).resolve()
+        if project_root.resolve() not in target.parents and target != project_root.resolve():
+            print(f"ERROR: target escapes project root: {safe_name}", file=sys.stderr)
+            sys.exit(1)
+        if workspace.resolve() not in link.parents and link != workspace.resolve():
+            print(f"ERROR: link escapes workspace: {safe_name}", file=sys.stderr)
+            sys.exit(1)
         if target.exists():
+            link.parent.mkdir(parents=True, exist_ok=True)
             link.symlink_to(target.resolve())

As per coding guidelines, **/*.py: Validate file paths (prevent path traversal) and **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/workspace.py` around lines 91 - 99, The code creates
symlinks from raw args.symlinks without validating paths, allowing path
traversal or absolute paths to escape the workspace; update the logic that
builds symlink_names and creates target/link (referencing args.symlinks,
symlink_names, target, link, project_root, workspace, and link.symlink_to) to
validate each entry: reject or normalize entries containing absolute paths or
parent-traversal segments (".." or starting with "/"), resolve the candidate
target with project_root.resolve() and ensure
target.resolve().is_relative_to(project_root) (or catch ValueError from
relative_to) and similarly ensure the final link would reside under workspace
before calling link.symlink_to; log or raise and skip invalid entries.

55-58: ⚠️ Potential issue | 🔴 Critical

Critical: path traversal + unsafe recursive delete in workspace setup (CWE-22, CWE-377, CWE-59).

Line 55 interpolates untrusted --run-id into a filesystem path, and Line 57 recursively deletes it. Exploit scenario: --run-id ../../target (or symlink tricks) can delete paths outside the intended temp root.

Remediation patch
 import argparse
+import re
 import shutil
 import sys
+import tempfile
 from pathlib import Path
@@
-    workspace = Path(f"/tmp/agent-eval/{args.run_id}")
+    if not re.fullmatch(r"[A-Za-z0-9._-]+", args.run_id):
+        print("ERROR: invalid run-id", file=sys.stderr)
+        sys.exit(1)
+
+    base_dir = (Path(tempfile.gettempdir()) / "agent-eval").resolve()
+    workspace = (base_dir / args.run_id).resolve()
+    if base_dir not in workspace.parents:
+        print("ERROR: run-id escapes base dir", file=sys.stderr)
+        sys.exit(1)
+
     if workspace.exists():
+        if workspace.is_symlink():
+            print("ERROR: refusing to delete symlink workspace", file=sys.stderr)
+            sys.exit(1)
         shutil.rmtree(workspace)
-    workspace.mkdir(parents=True)
+    workspace.mkdir(parents=True, mode=0o700)

As per coding guidelines, **/*.py: Validate file paths (prevent path traversal) and **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/workspace.py` around lines 55 - 58, The current
workspace setup uses args.run_id directly and recursively deletes the resolved
path (workspace), allowing path traversal and unsafe deletes; fix by deriving
the workspace under a fixed base (e.g., base = Path("/tmp/agent-eval")),
sanitize and validate args.run_id (reject path separators, .., empty or absolute
values) or better generate a safe unique name (use tempfile or uuid) instead of
trusting input, resolve and confirm
workspace.resolve().is_relative_to(base.resolve()) before any rmtree, and only
call shutil.rmtree(workspace) when that check passes; update references to
workspace and args.run_id accordingly and ensure directory creation uses
workspace.mkdir(parents=True, exist_ok=False).
skills/eval-run/scripts/score.py (3)

207-223: ⚠️ Potential issue | 🔴 Critical

Critical: judge prompt/context file paths are unconstrained (CWE-22, CWE-73, CWE-200).

Paths from jc.prompt_file and jc.context are read directly. Exploit scenario: ../../.env or absolute paths can exfiltrate host secrets to external judge models.

Remediation patch
+def _resolve_project_file(project_root: Path, raw_path: str) -> Path:
+    p = Path(raw_path)
+    p = p if p.is_absolute() else project_root / p
+    if p.is_symlink():
+        raise ValueError(f"Symlink not allowed: {raw_path}")
+    resolved = p.resolve()
+    root = project_root.resolve()
+    if root not in resolved.parents and resolved != root:
+        raise ValueError(f"Path escapes project root: {raw_path}")
+    return resolved
@@
     if not prompt and jc.prompt_file:
-        prompt_path = Path(jc.prompt_file)
-        if project_root and not prompt_path.is_absolute():
-            prompt_path = project_root / prompt_path
+        prompt_path = _resolve_project_file(project_root or Path.cwd(), jc.prompt_file)
         if not prompt_path.exists():
             raise FileNotFoundError(f"Judge prompt not found: {prompt_path}")
         prompt = prompt_path.read_text()
@@
     for ctx_path in jc.context:
-        path = Path(ctx_path)
-        if project_root and not path.is_absolute():
-            path = project_root / path
+        path = _resolve_project_file(project_root or Path.cwd(), ctx_path)
         if path.exists():
             prompt += f"\n\n## Context: {path.name}\n\n{path.read_text()}"

As per coding guidelines, **/*.py: Validate file paths (prevent path traversal) and **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 207 - 223, The prompt and
context file handling (variables prompt_path, path; logic around jc.prompt_file
and jc.context) currently reads arbitrary paths; fix by validating and
sandboxing paths before reading: resolve each Path with .resolve() and ensure
the resolved path is inside a permitted base (e.g., project_root.resolve() or a
specific judges directory) — reject absolute or traversing paths that fall
outside that base, raise an error/log and skip or abort; additionally enforce an
allowlist of allowed filename patterns or extensions if appropriate and avoid
reading dotfiles or known secret files. Use the same checks for prompt_path and
each ctx_path and only call read_text() after validation, updating any error
messages to reference jc.name.

179-186: ⚠️ Potential issue | 🔴 Critical

Critical: inline judge code execution enables arbitrary code execution (CWE-94).

Lines 179-186 compile and execute config-supplied Python (jc.check) with builtins available. Exploit scenario: malicious eval config can execute OS commands, read secrets, and tamper scorer outputs.

Remediation patch
 def _make_inline_check(jc):
-    """Create a scorer from an inline check script."""
-    source = jc.check
-    wrapped = f"def _check(outputs):\n{textwrap.indent(source, '    ')}"
-    code = compile(wrapped, f"<check:{jc.name}>", "exec")
-    ns = {"__builtins__": __builtins__}
-    exec(code, ns)
-    check_fn = ns["_check"]
+    """Load a trusted check callable from module:function reference."""
+    ref = jc.check.strip()
+    if ":" not in ref:
+        raise ValueError(f"Judge '{jc.name}' check must be module:function")
+    module_name, fn_name = ref.split(":", 1)
+    mod = importlib.import_module(module_name)
+    check_fn = getattr(mod, fn_name, None)
+    if not callable(check_fn):
+        raise ValueError(f"Judge '{jc.name}' function not callable: {ref}")

As per coding guidelines, **/*.py: No eval() or exec() with untrusted input (CWE-94) and **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 179 - 186, The current
_make_inline_check compiles and execs untrusted jc.check with full builtins,
enabling arbitrary code execution; replace this by disallowing direct exec of
jc.check and instead run checks in a hardened sandbox (e.g., execute the check
in a separate, isolated process with no inherited secrets, restricted
environment, resource/time limits, and no access to host builtins) or
validate/transform jc.check into a safe subset via AST whitelisting before
compilation; specifically, remove the direct exec(code, ns) flow in
_make_inline_check and either (a) invoke a sandboxed runner that receives the
wrapped function string and returns results, or (b) replace ns with a minimal,
explicit safe API and perform AST checks on jc.check to block dangerous nodes
(Import, Exec, Eval, Attribute access to os/sys, etc.) before creating check_fn.

51-79: ⚠️ Potential issue | 🔴 Critical

Critical: artifact loading allows path/symlink escape and data exfiltration (CWE-22, CWE-59, CWE-200).

output.path is config-controlled and file traversal uses rglob without root containment checks. Exploit scenario: malicious paths/symlinks can pull host files into summary.yaml/downstream logging.

Remediation patch
+def _resolve_under(root: Path, candidate: Path) -> Path:
+    root_resolved = root.resolve()
+    resolved = candidate.resolve()
+    if root_resolved not in resolved.parents and resolved != root_resolved:
+        raise ValueError(f"Path escapes case dir: {candidate}")
+    return resolved
@@
     for output in config.outputs:
         out_path = output.path or "."
-        artifact_dir = case_dir / out_path
+        artifact_dir = _resolve_under(case_dir, case_dir / out_path)
         if not artifact_dir.exists():
             continue
         for f in sorted(artifact_dir.rglob("*")):
-            if not f.is_file():
+            if not f.is_file() or f.is_symlink():
                 continue
-            rel = str(f.relative_to(case_dir))
+            safe_f = _resolve_under(case_dir, f)
+            rel = str(f.relative_to(case_dir))
             try:
-                record["files"][rel] = f.read_text()
+                record["files"][rel] = safe_f.read_text()

As per coding guidelines, **/*.py: Validate file paths (prevent path traversal) and **: REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 51 - 79, The artifact loading
allows untrusted output.path to escape the intended case_dir via path traversal
or symlinks; update the loops that use config.outputs / output.path /
artifact_dir and the rglob/iterdir traversal to validate and constrain file
access: normalize and reject absolute or parent-traversal paths (resolve
output.path against case_dir), compute resolved_artifact_dir = (case_dir /
out_path).resolve() and ensure resolved_artifact_dir is within
case_dir.resolve() (skip and log if not), and when iterating files check each
file's resolved path (f.resolve()) is also within case_dir.resolve() before
reading into record["files"] or setting {key}_content/_file; also avoid
following symlinks if you intend to disallow them (use lstat or skip
f.is_symlink()), and skip any files that fail containment checks to prevent
exfiltration.
🧹 Nitpick comments (5)
agent_eval/mlflow/experiment.py (2)

27-39: Inconsistent health-check endpoint vs. check_env.py.

ensure_server probes /api/2.0/mlflow/experiments/search while skills/eval-setup/scripts/check_env.py:140 uses /health. For a simple liveness probe, /health is the better choice: it is lighter and dedicated to exactly that purpose.

 def ensure_server(port: int = 5000) -> bool:
-    """Check if MLflow server is running, optionally start it.
+    """Check if MLflow server is running.
 
     Returns:
         True if server is available.
     """
     import urllib.request
     try:
-        urllib.request.urlopen(f"http://127.0.0.1:{port}/api/2.0/mlflow/experiments/search",
+        urllib.request.urlopen(f"http://127.0.0.1:{port}/health",
                               timeout=2)
         return True
     except Exception:
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/mlflow/experiment.py` around lines 27 - 39, The ensure_server
function is using the heavyweight experiments/search endpoint for health checks;
change it to probe the lightweight /health endpoint instead by calling
urllib.request.urlopen against f"http://127.0.0.1:{port}/health" (keep the
existing timeout and try/except semantics), i.e., update the URL used in
ensure_server so the health probe matches the /health check used in check_env.py
and still returns True on success and False on exception.

72-74: Silent exception swallowing loses diagnostics.

log_feedback catches and discards all exceptions. A trace-level log would aid debugging without disrupting the caller.

     except Exception:
-        pass
+        pass  # Feedback logging is best-effort; failures are non-fatal

Or better, add optional logging:

except Exception as e:
    import logging
    logging.debug("Failed to log feedback for trace %s: %s", trace_id, e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/mlflow/experiment.py` around lines 72 - 74, In log_feedback, don't
silently swallow all exceptions; replace the bare except/pass with an except
Exception as e that logs the failure including trace_id and the exception
details at debug/trace level (e.g., use logging.getLogger(__name__).debug with a
message like "Failed to log feedback for trace %s: %s" and include exc_info=True
or the exception string) so diagnostics are preserved without changing caller
behavior.
agent_eval/__main__.py (1)

19-25: Mutating sys.argv is fragile.

Slicing sys.argv in-place works but is non-obvious. Consider passing arguments explicitly to subcommand handlers.

     if command == "config":
-        sys.argv = sys.argv[1:]
-        _config_command()
+        _config_command(sys.argv[2:])
     elif command == "state":
-        sys.argv = sys.argv[1:]
         from .state import main as state_main
-        state_main()
+        state_main(sys.argv[2:])

This requires updating _config_command and state.main to accept args.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/__main__.py` around lines 19 - 25, The current code mutates
sys.argv in-place before delegating to subcommands (_config_command and
state.main), which is fragile; instead, change the call sites to pass the slice
explicitly (e.g., args = sys.argv[1:]) and update _config_command and the
state.main function (imported as state_main) to accept an args parameter (e.g.,
def _config_command(args): and def main(args):) and use that passed-in list for
argument parsing; provide a sensible default signature (args=None) if needed for
backwards compatibility and update internal parsing to use the provided args
rather than reading sys.argv directly.
agent_eval/state.py (1)

85-89: Rename ambiguous loop variable l to line.

Static analysis flags l as ambiguous (easily confused with 1 or I).

     elif cmd == "read-ids":
         path = Path(sys.argv[2])
         if path.exists():
-            ids = [l.strip() for l in path.read_text().splitlines() if l.strip()]
+            ids = [line.strip() for line in path.read_text().splitlines() if line.strip()]
             print(" ".join(ids))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` around lines 85 - 89, In the "read-ids" branch (the cmd
== "read-ids" block) the list comprehension that builds ids uses the ambiguous
loop variable named `l`; rename that variable to `line` in the comprehension
(i.e., change `[l.strip() for l in path.read_text().splitlines() if l.strip()]`
to use `line`) so the variable is clearer and avoid confusion with `1`/`I`;
ensure the variable name change is only within the list comprehension that
assigns to `ids`.
agent_eval/agent/claude_code.py (1)

124-125: Clarify the hardcoded 30s timeout in proc.wait().

The function accepts timeout_s (default 600), but proc.wait(timeout=30) uses a hardcoded 30s. This is for draining stderr after stdout completes, not the main execution timeout—consider adding a brief comment to clarify intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 124 - 125, The code uses a
magic literal proc.wait(timeout=30) while the function accepts timeout_s
(default 600); replace the hardcoded 30 with a named drain timeout variable
(e.g., drain_timeout = 30) or derive it from timeout_s (e.g., min(30,
timeout_s)) and add a brief inline comment next to proc.wait explaining this is
a short drain wait to read stderr after stdout completion—not the main process
execution timeout governed by timeout_s; reference the proc.wait call and the
timeout_s parameter to locate and update the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent_eval/__main__.py`:
- Line 138: The module calls main() unconditionally which breaks importability;
change the unconditional invocation to a guarded entry point by removing the
bare main() call and invoking main() only inside a module guard like if __name__
== "__main__": main() so imports don't execute the main() function when the
module (and symbols like main) are imported for tests or other code.
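The guarded form, sketched (`main` here is a stand-in for the real entry point in `agent_eval/__main__.py`):

```python
def main() -> None:
    """Stand-in for the real agent_eval entry point."""
    print("running main")

# Only execute when run as a script; a plain import of the module
# (e.g., from tests) no longer triggers main().
if __name__ == "__main__":
    main()
```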

In `@agent_eval/state.py`:
- Around line 50-55: The code accesses sys.argv[2] without checking length for
the "init" (and similarly "set", "read", "write-ids", "read-ids") commands which
causes IndexError; update the command handling around cmd (the variable) to
validate len(sys.argv) before reading sys.argv[2] (and any other required
positional args), and on missing args call a clear error/usage path (e.g., print
a usage message or raise SystemExit) instead of indexing directly; locate the
blocks using sys.argv[2] and _parse_kwargs and add the bounds checks and
appropriate error handling for each command.
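One way to sketch the bounds check (command names are taken from the comment above; the helper name is hypothetical):

```python
import sys

USAGE = "usage: state.py {init|set|read|write-ids|read-ids} <args...>"

def require_arg(argv: list[str], index: int, name: str) -> str:
    """Return argv[index], or exit with a usage message instead of IndexError."""
    if len(argv) <= index:
        raise SystemExit(f"error: missing <{name}>\n{USAGE}")
    return argv[index]

# e.g. in the "init" branch:
# path = require_arg(sys.argv, 2, "path")
```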

In `@skills/eval-analyze/scripts/discover.py`:
- Line 106: The freshness check uses a weak collision-prone MD5 hash: replace
the MD5 computation at current_hash =
hashlib.md5(skill_path.read_bytes()).hexdigest()[:12] with a secure hash (e.g.,
hashlib.sha256) and avoid truncating the digest (or at minimum use a much longer
slice) so the check becomes current_hash =
hashlib.sha256(skill_path.read_bytes()).hexdigest() (or a sufficiently long
prefix); update any dependent comparisons that expect the 12-char MD5 prefix
accordingly to use the new SHA-256 value (references: current_hash, skill_path
in discover.py).
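Sketched with the full digest retained (`current_hash`/`skill_path` match the names quoted above; the function wrapper is illustrative):

```python
import hashlib
from pathlib import Path

def content_hash(skill_path: Path) -> str:
    # SHA-256 with the full, untruncated digest replaces the 12-char MD5 prefix.
    return hashlib.sha256(skill_path.read_bytes()).hexdigest()
```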
- Around line 35-36: The bare "except Exception: pass" in discover.py hides I/O
and frontmatter parse errors; replace it with explicit exception handling that
catches specific failures (e.g., OSError/IOError, UnicodeDecodeError, and
yaml.YAMLError) and log a warning that includes the file path and the exception
(use the module's logger or stderr) so discovery continues but failures are
visible; do NOT swallow all Exceptions—only handle expected parse/read errors
and let truly unexpected exceptions propagate.

In `@skills/eval-optimize/SKILL.md`:
- Around line 12-14: The fenced code blocks (e.g., the command example
containing "/eval-optimize [config_file] --model <model> [--max-iterations <N>]"
and the other blocks at the noted locations) are untyped and trigger
markdownlint MD040; update each triple-backtick fence to include an appropriate
language identifier such as text or bash (for example change ``` to ```text or
```bash) so all code fences are explicitly typed (apply the same change to the
blocks around lines 41-43 and 59-61).

In `@skills/eval-run/scripts/execute.py`:
- Line 71: The code dereferences args.settings when computing settings_path
causing an AttributeError because the CLI parser never defines --settings;
either add a proper CLI option (e.g., parser.add_argument('--settings',
help=..., type=str, default=None)) to the argument parser that produces the args
object, or defensively access the attribute (e.g., use getattr(args, 'settings',
None)) before creating settings_path; update the settings_path assignment that
references args.settings accordingly so run_skill() no longer crashes.
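The defensive variant, as a sketch (`Namespace()` stands in for an argparse result whose parser never registered `--settings`):

```python
import argparse
from pathlib import Path

args = argparse.Namespace()  # stand-in: parser without a --settings option

# getattr avoids AttributeError when the option was never registered.
settings = getattr(args, "settings", None)
settings_path = Path(settings) if settings else None
```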
- Around line 55-57: The code at the agent resolution (where agent = args.agent
or config.runner and runner_cls = RUNNERS[agent]) can raise an unhandled
KeyError for invalid agent names; update the logic to validate the chosen agent
before indexing RUNNERS: check whether the resolved agent value is in RUNNERS
(or use RUNNERS.get(agent)) and if not raise a clear ValueError (or SystemExit)
that includes the invalid value and a sorted/listed set of valid runner keys
(i.e., reference the agent variable and RUNNERS mapping) so callers see
available options and the program fails closed with an explanatory message.
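Sketched with a stand-in mapping (the real RUNNERS dict lives in agent_eval.agent and maps names to runner classes):

```python
RUNNERS = {"claude-code": object}  # stand-in for the real runner registry

def resolve_runner(agent: str):
    """Fail closed with the list of valid runners instead of raising KeyError."""
    runner_cls = RUNNERS.get(agent)
    if runner_cls is None:
        valid = ", ".join(sorted(RUNNERS))
        raise SystemExit(f"error: unknown agent '{agent}' (valid runners: {valid})")
    return runner_cls
```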

In `@skills/eval-run/scripts/score.py`:
- Line 489: Remove the unnecessary f-string prefixes on static print calls in
score.py: replace print(f"\n  REGRESSIONS: 0") and the similar print(...) at the
other occurrence (line referenced by the review) with plain string prints (e.g.,
print("\n  REGRESSIONS: 0")) to fix F541 lint errors; locate these in the same
scope where the "REGRESSIONS" print appears in the scoring logic and update both
print calls accordingly.
- Around line 117-118: The code computes parallelism using parallelism =
min(len(case_dirs), os.cpu_count() or 4) and then creates a ThreadPoolExecutor
which fails when case_dirs is empty; add an explicit check on case_dirs (e.g.,
if not case_dirs) before computing parallelism/instantiating ThreadPoolExecutor
and handle it by returning early or skipping work with an appropriate
message/exit code; update the area around the parallelism and lock variables
(parallelism, case_dirs, lock) and any ThreadPoolExecutor creation to use this
early-exit guard so a zero-length case_dirs never leads to worker-count = 0.
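A minimal sketch of the early-exit guard (`score_all` and its placeholder work are hypothetical; only the guard pattern is the point):

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def score_all(case_dirs: list[Path]) -> list[str]:
    if not case_dirs:
        print("No case directories found; nothing to score.")
        return []
    # Safe now: parallelism is always >= 1.
    parallelism = min(len(case_dirs), os.cpu_count() or 4)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(lambda d: d.name, case_dirs))  # placeholder work
```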

In `@skills/eval-run/scripts/workspace.py`:
- Around line 113-121: The YAML/JSON parsing in workspace.py currently calls
yaml.safe_load(name) and json.load(name) without exception handling, so a
malformed case file can abort the run; wrap the YAML branch (the block using
yaml.safe_load on variable name) in a try/except that catches yaml.YAMLError
(and general Exception/ValueError as fallback), log or skip the file and
continue instead of returning/raising, and likewise wrap the JSON branch (the
block using json.load on variable name) in a try/except that catches
json.JSONDecodeError (and ValueError/Exception fallback), skip the invalid file
and continue; ensure you reference the same control flow around the variable
name and preserve the existing behavior of returning a dict only when parsing
succeeds.

In `@skills/eval-run/SKILL.md`:
- Around line 31-33: The two fenced code blocks in SKILL.md (one containing "Use
the Skill tool to invoke /eval-analyze [--skill <skill>]" and the other at the
block around lines 141-143) lack language identifiers and trigger MD040; add the
appropriate fenced code language specifier (e.g., bash, sh, or text as
appropriate for the snippet) after the opening ``` so each fence becomes ```bash
(or another correct language) to satisfy the linter.
- Around line 25-27: The config-existence check currently hardcodes "eval.yaml"
and ignores the CLI --config argument; update the check to use the parsed config
variable used by the script (the same variable populated from --config, e.g.,
CONFIG or CONFIG_PATH) instead of the literal "eval.yaml", ensure you quote the
variable when testing (to handle spaces) and fall back to "eval.yaml" only if
the parsed config variable is empty so the documented --config <path> flow is
respected.
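A sketch in shell (names are illustrative; the real snippet's `--config` parsing is assumed to supply the argument):

```shell
# Fall back to eval.yaml only when no --config value was parsed.
check_config() {
  cfg="${1:-eval.yaml}"
  if [ ! -f "$cfg" ]; then          # quoted: paths with spaces are handled
    echo "Config not found: $cfg" >&2
    return 1
  fi
  return 0
}
```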

In `@skills/eval-setup/SKILL.md`:
- Around line 97-108: The current shell snippet silences stderr (2>/dev/null)
which masks errors from EvalConfig.from_yaml and makes the fallback message
misleading; update the snippet to either remove the stderr redirection so
exceptions surface, or explicitly check for file existence and catch errors from
EvalConfig.from_yaml to log the real error before printing the fallback.
Specifically adjust the block that calls EvalConfig.from_yaml and
setup_experiment (referencing EvalConfig.from_yaml and setup_experiment and the
config.mlflow_experiment branch) to surface or log parse/validation exceptions
instead of redirecting stderr to /dev/null.
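Sketched with a stub in place of the real class (EvalConfig.from_yaml and its failure mode are assumptions here; the stub simulates a malformed file):

```python
from pathlib import Path

class EvalConfig:  # stand-in for the project's real config class
    @classmethod
    def from_yaml(cls, path: str) -> "EvalConfig":
        raise ValueError("simulated parse error")

def load_config(path: str = "eval.yaml"):
    if not Path(path).exists():
        print(f"{path} not found; skipping experiment setup.")
        return None
    try:
        return EvalConfig.from_yaml(path)
    except Exception as e:
        # Surface the real failure instead of redirecting stderr to /dev/null.
        print(f"Failed to parse {path}: {e}")
        return None
```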

---

Duplicate comments:
In `@skills/eval-run/scripts/score.py`:
- Around line 207-223: The prompt and context file handling (variables
prompt_path, path; logic around jc.prompt_file and jc.context) currently reads
arbitrary paths; fix by validating and sandboxing paths before reading: resolve
each Path with .resolve() and ensure the resolved path is inside a permitted
base (e.g., project_root.resolve() or a specific judges directory) — reject
absolute or traversing paths that fall outside that base, raise an error/log and
skip or abort; additionally enforce an allowlist of allowed filename patterns or
extensions if appropriate and avoid reading dotfiles or known secret files. Use
the same checks for prompt_path and each ctx_path and only call read_text()
after validation, updating any error messages to reference jc.name.
- Around line 179-186: The current _make_inline_check compiles and execs
untrusted jc.check with full builtins, enabling arbitrary code execution;
replace this by disallowing direct exec of jc.check and instead run checks in a
hardened sandbox (e.g., execute the check in a separate, isolated process with
no inherited secrets, restricted environment, resource/time limits, and no
access to host builtins) or validate/transform jc.check into a safe subset via
AST whitelisting before compilation; specifically, remove the direct exec(code,
ns) flow in _make_inline_check and either (a) invoke a sandboxed runner that
receives the wrapped function string and returns results, or (b) replace ns with
a minimal, explicit safe API and perform AST checks on jc.check to block
dangerous nodes (Import, Exec, Eval, Attribute access to os/sys, etc.) before
creating check_fn.
- Around line 51-79: The artifact loading allows untrusted output.path to escape
the intended case_dir via path traversal or symlinks; update the loops that use
config.outputs / output.path / artifact_dir and the rglob/iterdir traversal to
validate and constrain file access: normalize and reject absolute or
parent-traversal paths (resolve output.path against case_dir), compute
resolved_artifact_dir = (case_dir / out_path).resolve() and ensure
resolved_artifact_dir is within case_dir.resolve() (skip and log if not), and
when iterating files check each file's resolved path (f.resolve()) is also
within case_dir.resolve() before reading into record["files"] or setting
{key}_content/_file; also avoid following symlinks if you intend to disallow
them (use lstat or skip f.is_symlink()), and skip any files that fail
containment checks to prevent exfiltration.

In `@skills/eval-run/scripts/workspace.py`:
- Around line 91-99: The code creates symlinks from raw args.symlinks without
validating paths, allowing path traversal or absolute paths to escape the
workspace; update the logic that builds symlink_names and creates target/link
(referencing args.symlinks, symlink_names, target, link, project_root,
workspace, and link.symlink_to) to validate each entry: reject or normalize
entries containing absolute paths or parent-traversal segments (".." or starting
with "/"), resolve the candidate target with project_root.resolve() and ensure
target.resolve().is_relative_to(project_root) (or catch ValueError from
relative_to) and similarly ensure the final link would reside under workspace
before calling link.symlink_to; log or raise and skip invalid entries.
- Around line 55-58: The current workspace setup uses args.run_id directly and
recursively deletes the resolved path (workspace), allowing path traversal and
unsafe deletes; fix by deriving the workspace under a fixed base (e.g., base =
Path("/tmp/agent-eval")), sanitize and validate args.run_id (reject path
separators, .., empty or absolute values) or better generate a safe unique name
(use tempfile or uuid) instead of trusting input, resolve and confirm
workspace.resolve().is_relative_to(base.resolve()) before any rmtree, and only
call shutil.rmtree(workspace) when that check passes; update references to
workspace and args.run_id accordingly and ensure directory creation uses
workspace.mkdir(parents=True, exist_ok=False).
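Both workspace.py findings above reduce to the same hardening pattern — validate the identifier, resolve, and prove containment before touching the filesystem. A minimal sketch (names, regex, and the `/tmp/agent-eval` base are assumptions, not the PR's code):

```python
import re
import shutil
from pathlib import Path

BASE = Path("/tmp/agent-eval")  # fixed base directory (name assumed)
_RUN_ID = re.compile(r"[A-Za-z0-9._-]+")

def safe_workspace(run_id):
    """Derive the workspace path from a validated run id; refuse
    separators, traversal segments, and anything escaping BASE."""
    if not _RUN_ID.fullmatch(run_id) or run_id in {".", ".."}:
        raise ValueError(f"unsafe run id: {run_id!r}")
    workspace = BASE / run_id
    if not workspace.resolve().is_relative_to(BASE.resolve()):
        raise ValueError(f"workspace escapes base: {workspace}")
    return workspace

def remove_workspace(workspace):
    """Only rmtree paths proven to live under BASE."""
    if workspace.resolve().is_relative_to(BASE.resolve()):
        shutil.rmtree(workspace, ignore_errors=True)

def safe_symlink_target(project_root, entry):
    """Validate a symlink entry: no absolute paths or '..' segments,
    and the resolved target must stay under project_root."""
    p = Path(entry)
    if p.is_absolute() or ".." in p.parts:
        return None
    target = (project_root / p).resolve()
    return target if target.is_relative_to(project_root.resolve()) else None
```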

---

Nitpick comments:
In `@agent_eval/__main__.py`:
- Around line 19-25: The current code mutates sys.argv in-place before
delegating to subcommands (_config_command and state.main), which is fragile;
instead, change the call sites to pass the slice explicitly (e.g., args =
sys.argv[1:]) and update _config_command and the state.main function (imported
as state_main) to accept an args parameter (e.g., def _config_command(args): and
def main(args):) and use that passed-in list for argument parsing; provide a
sensible default signature (args=None) if needed for backwards compatibility and
update internal parsing to use the provided args rather than reading sys.argv
directly.
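The shape of that refactor might look like this (a sketch under assumed names — the return values here are placeholders standing in for the real subcommand logic):

```python
import sys

def _config_command(args=None):
    """Handle the 'config' subcommand from an explicit args list."""
    if args is None:  # backwards-compatible default
        args = sys.argv[1:]
    return {"command": "config", "rest": list(args)}

def main(argv=None):
    """Dispatch on a passed-in argv slice instead of mutating sys.argv."""
    args = sys.argv[1:] if argv is None else list(argv)
    if args and args[0] == "config":
        return _config_command(args[1:])
    return {"command": "state", "rest": args}
```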

In `@agent_eval/agent/claude_code.py`:
- Around line 124-125: The code uses a magic literal proc.wait(timeout=30) while
the function accepts timeout_s (default 600); replace the hardcoded 30 with a
named drain timeout variable (e.g., drain_timeout = 30) or derive it from
timeout_s (e.g., min(30, timeout_s)) and add a brief inline comment next to
proc.wait explaining this is a short drain wait to read stderr after stdout
completion—not the main process execution timeout governed by timeout_s;
reference the proc.wait call and the timeout_s parameter to locate and update
the behavior.
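The `min(30, timeout_s)` variant suggested above amounts to a small named helper:

```python
def drain_timeout(timeout_s):
    """Short drain wait to read leftover stderr after stdout closes;
    NOT the main execution timeout, which is governed by timeout_s."""
    return min(30.0, timeout_s)
```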

In `@agent_eval/mlflow/experiment.py`:
- Around line 27-39: The ensure_server function is using the heavyweight
experiments/search endpoint for health checks; change it to probe the
lightweight /health endpoint instead by calling urllib.request.urlopen against
f"http://127.0.0.1:{port}/health" (keep the existing timeout and try/except
semantics), i.e., update the URL used in ensure_server so the health probe
matches the /health check used in check_env.py and still returns True on success
and False on exception.
- Around line 72-74: In log_feedback, don't silently swallow all exceptions;
replace the bare except/pass with an except Exception as e that logs the failure
including trace_id and the exception details at debug/trace level (e.g., use
logging.getLogger(__name__).debug with a message like "Failed to log feedback
for trace %s: %s" and include exc_info=True or the exception string) so
diagnostics are preserved without changing caller behavior.
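Both experiment.py suggestions can be sketched together (illustrative only; `client` here is a stand-in for whatever MLflow client object the module actually holds, and the `/health` route matches the one check_env.py probes):

```python
import logging
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)

def server_healthy(port, timeout=2.0):
    """Probe MLflow's lightweight /health endpoint rather than the
    heavier experiments/search API."""
    try:
        with urllib.request.urlopen(
            f"http://127.0.0.1:{port}/health", timeout=timeout
        ) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def log_feedback(client, trace_id, name, value):
    """Attach judge feedback to a trace; on failure, preserve caller
    behavior but record diagnostics at debug level."""
    try:
        client.log_feedback(trace_id=trace_id, name=name, value=value)
        return True
    except Exception:
        logger.debug("Failed to log feedback for trace %s", trace_id,
                     exc_info=True)
        return False
```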

In `@agent_eval/state.py`:
- Around line 85-89: In the "read-ids" branch (the cmd == "read-ids" block) the
list comprehension that builds ids uses the ambiguous loop variable named `l`;
rename that variable to `line` in the comprehension (i.e., change `[l.strip()
for l in path.read_text().splitlines() if l.strip()]` to use `line`) so the
name is clearer and avoids confusion with `1`/`I`; ensure the variable name
change is only within the list comprehension that assigns to `ids`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 1d9d3416-66d7-4ee6-82e5-0823d93777b4

📥 Commits

Reviewing files that changed from the base of the PR and between bc48951 and 4c6526f.

📒 Files selected for processing (30)
  • .claude-plugin/plugin.json
  • .gitignore
  • CLAUDE.md
  • README.md
  • agent_eval/__init__.py
  • agent_eval/__main__.py
  • agent_eval/agent/__init__.py
  • agent_eval/agent/base.py
  • agent_eval/agent/claude_code.py
  • agent_eval/config.py
  • agent_eval/mlflow/__init__.py
  • agent_eval/mlflow/experiment.py
  • agent_eval/state.py
  • eval.yaml
  • pyproject.toml
  • skills/eval-analyze/SKILL.md
  • skills/eval-analyze/prompts/analyze-skill.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • skills/eval-analyze/scripts/discover.py
  • skills/eval-mlflow/SKILL.md
  • skills/eval-optimize/SKILL.md
  • skills/eval-run/SKILL.md
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-run/scripts/collect.py
  • skills/eval-run/scripts/execute.py
  • skills/eval-run/scripts/score.py
  • skills/eval-run/scripts/workspace.py
  • skills/eval-setup/SKILL.md
  • skills/eval-setup/scripts/check_env.py
✅ Files skipped from review due to trivial changes (13)
  • agent_eval/mlflow/init.py
  • agent_eval/init.py
  • .gitignore
  • .claude-plugin/plugin.json
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • pyproject.toml
  • CLAUDE.md
  • eval.yaml
  • skills/eval-analyze/prompts/analyze-skill.md
  • skills/eval-mlflow/SKILL.md
  • agent_eval/config.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • agent_eval/agent/base.py
  • skills/eval-run/scripts/collect.py


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (1)
skills/eval-run/scripts/score.py (1)

327-327: ⚠️ Potential issue | 🟡 Minor

Empty case_ids causes ThreadPoolExecutor(max_workers=0) failure.

compare_runs lacks the early-return guard present in score_cases (line 125). When case_ids is empty, parallelism becomes 0.

Proposed fix
 def compare_runs(run_a_dir, run_b_dir, config, case_ids,
                  prompt=None, prompt_file=None, model="claude-sonnet-4-6"):
     """Compare two runs using position-swapped LLM judge."""
+    if not case_ids:
+        return {"run_a": run_a_dir.name, "run_b": run_b_dir.name,
+                "cases_compared": 0, "wins_a": 0, "wins_b": 0,
+                "ties": 0, "errors": 0, "per_case": []}
     comparison_prompt = prompt
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` at line 327, compare_runs computes
parallelism as min(len(case_ids), os.cpu_count() or 4) which yields 0 when
case_ids is empty and causes ThreadPoolExecutor(max_workers=0) to fail; add the
same early-return guard used in score_cases (or otherwise ensure parallelism is
at least 1) so compare_runs returns immediately when case_ids is empty (or sets
parallelism = max(1, min(len(case_ids), os.cpu_count() or 4))). Locate the
comparison in the compare_runs function and implement the empty-case guard or
the min-1 clamp accordingly.
🧹 Nitpick comments (10)
agent_eval/agent/claude_code.py (1)

178-183: Class attribute should be immutable.

_SAFE_ENV_KEYS is a mutable set. Use frozenset to prevent accidental modification.

Proposed fix
     # Environment keys safe to forward to evaluated skills
-    _SAFE_ENV_KEYS = {
+    _SAFE_ENV_KEYS = frozenset({
         "PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL", "TERM",
         "ANTHROPIC_API_KEY", "ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION",
         "MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME",
         "CLAUDE_CODE_SUBAGENT_MODEL",
-    }
+    })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 178 - 183, The class attribute
_SAFE_ENV_KEYS is currently a mutable set; change it to an immutable frozenset
to prevent accidental modification by replacing the set literal with a
frozenset(...) construction (e.g., _SAFE_ENV_KEYS = frozenset({ "PATH", "HOME",
"USER", "SHELL", "LANG", "LC_ALL", "TERM", "ANTHROPIC_API_KEY",
"ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION", "MLFLOW_TRACKING_URI",
"MLFLOW_EXPERIMENT_NAME", "CLAUDE_CODE_SUBAGENT_MODEL", })) ensuring all
existing references to _SAFE_ENV_KEYS continue to work.
skills/eval-run/scripts/collect.py (2)

112-112: Remove redundant import re—already imported at module level (line 16).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/collect.py` at line 112, There is a redundant
duplicate "import re" in the module; remove the second occurrence (the extra
import statement seen in the diff) so only the single module-level "import re"
remains (the original import is already at the top of collect.py), leaving no
duplicate imports in the file.

122-124: Document the 50% threshold heuristic for prefix grouping.

When len(prefixes) >= num_cases * 0.5, prefix-based grouping activates even if nearly half the files lack a matching prefix. This could silently drop files or produce unexpected groupings. Consider requiring exact match (len(prefixes) == num_cases) or logging when files are excluded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/collect.py` around lines 122 - 124, The current
heuristic in the prefix grouping block uses len(prefixes) >= num_cases * 0.5
which can silently exclude up to half the files; update the logic in collect.py
around the prefixes / num_cases check to either require an exact match (use
len(prefixes) == num_cases) or, if you want to keep a relaxed threshold, add
explicit logging that lists which files were excluded and why before returning
(use the existing prefixes dict and sorted(prefixes.keys()) to compute excluded
entries), so callers can see when files are dropped and why.
agent_eval/state.py (2)

92-93: Rename ambiguous loop variable l to line.

Ruff E741 flags single-letter l as visually confusable with 1.

Proposed fix
-            ids = [l.strip() for l in path.read_text().splitlines() if l.strip()]
+            ids = [line.strip() for line in path.read_text().splitlines() if line.strip()]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` around lines 92 - 93, Rename the ambiguous
list-comprehension loop variable `l` to a clearer name like `line` to avoid
confusable single-letter identifiers; update the expression to ids =
[line.strip() for line in path.read_text().splitlines() if line.strip()] and
leave the subsequent print(" ".join(ids)) unchanged so behavior remains
identical.

95-97: clean removes relative tmp/ from CWD—document or parameterize.

If the script is invoked from an unexpected working directory, shutil.rmtree("tmp") could delete unintended data. Consider accepting an explicit path argument or resolving against a known base.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` around lines 95 - 97, The "clean" branch currently calls
shutil.rmtree("tmp") which deletes a relative tmp/ from the current working
directory (in function/branch where cmd == "clean"), risking removal of
unintended files; change it to accept or resolve an explicit path instead: add a
parameter or CLI arg for the tmp directory (or compute a safe base like the
repository/app root using __file__ or a BASE_DIR constant) and call
shutil.rmtree on the resolved absolute path (use os.path.abspath/os.path.join)
and validate it is inside the intended base before deleting; reference the cmd
== "clean" branch and the shutil.rmtree("tmp", ignore_errors=True) call when
making the change.
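A guarded version of the clean branch could look like this (sketch only; the base directory and helper name are assumptions):

```python
import shutil
from pathlib import Path

def clean_tmp(base, tmp_name="tmp"):
    """Delete base/tmp only after confirming the resolved path stays
    inside base, so a stray CWD can't trigger an unintended rmtree."""
    target = (base / tmp_name).resolve()
    if not target.is_relative_to(base.resolve()):
        return False
    shutil.rmtree(target, ignore_errors=True)
    return True
```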
skills/eval-analyze/scripts/discover.py (1)

32-34: Frontmatter parsing is fragile when extra --- delimiters appear.

content.split("---")[1] grabs the text between the first two --- markers, so a later horizontal rule in the body is harmless; the real risk is malformed frontmatter containing extra --- delimiters, which lets yaml.safe_load silently parse partial content. Consider using content.split("---", 2) and validating len(parts) >= 3, as done in check_eval_md.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/discover.py` around lines 32 - 34, Frontmatter
parsing in discover.py uses content.split("---")[1] which can mis-handle
multiple separators; change it to split with a max of 2 (use
content.split("---", 2)), verify that parts length is >= 3 (matching the pattern
used in check_eval_md), then pass parts[1] to yaml.safe_load and only extract
desc = fm.get("description", "")[:80] when parsing succeeds; if validation
fails, skip or handle as a non-frontmatter file to avoid silent partial YAML
parsing.
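The capped-split shape described above can be sketched without the YAML step (the helper name is an assumption; in discover.py the first component would then go to yaml.safe_load):

```python
def split_frontmatter(content):
    """Split '---' frontmatter safely: cap the split at 2 so '---'
    rules in the body can't shift the result, and validate the shape.

    Returns (frontmatter_text, body); frontmatter_text is None when
    the file has no well-formed frontmatter block.
    """
    if not content.startswith("---"):
        return None, content
    parts = content.split("---", 2)
    if len(parts) < 3:
        return None, content  # malformed frontmatter; treat as body
    return parts[1], parts[2]
```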
skills/eval-run/scripts/workspace.py (1)

134-138: Move import json to module level.

Importing inside the loop on each .json file iteration is wasteful.

Proposed fix
 import argparse
+import json
 import re

Then remove line 135.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/workspace.py` around lines 134 - 138, The inline
import inside the file-iteration block should be moved to the module level:
remove the "import json" from inside the elif branch in workspace.py (the block
that checks name.is_file() and name.suffix == ".json") and add a single "import
json" at the top of the module; keep the rest of the logic that opens and
json.load(f) unchanged (refer to the branch handling name.is_file() /
name.suffix == ".json").
skills/eval-run/scripts/score.py (2)

557-557: Remove extraneous f prefix on static string (F541).

-        print(f"No summary found. Run judges first.", file=sys.stderr)
+        print("No summary found. Run judges first.", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` at line 557, The print call currently uses
an unnecessary f-string: replace print(f"No summary found. Run judges first.",
file=sys.stderr) with a plain string print("No summary found. Run judges
first.", file=sys.stderr) to remove the extraneous `f` prefix (fix F541) in the
scoring script.

217-217: Chain exception with raise ... from err (B904).

-            raise ImportError("mlflow[genai] required for LLM judges")
+            raise ImportError("mlflow[genai] required for LLM judges") from None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` at line 217, The ImportError raised with
the message "mlflow[genai] required for LLM judges" should preserve the original
exception context by chaining it; modify the except block that detects the
missing/failed import so that you capture the original exception (e.g., "except
Exception as err:" or the specific caught exception) and replace the plain
"raise ImportError(\"mlflow[genai] required for LLM judges\")" with "raise
ImportError(\"mlflow[genai] required for LLM judges\") from err" so the original
error is propagated (locate the raise in score.py where that exact message is
used).
skills/eval-optimize/SKILL.md (1)

63-63: Define a deterministic rollback procedure before edits; “revert” is currently non-actionable.

Line 63 requires revert-on-regression, but the steps never require creating a backup/snapshot before editing SKILL.md. In iterative loops, this can make regressions irreversible and contaminate subsequent iterations.

Use a mandatory pre-edit checkpoint step (copy original file or patch snapshot) and a concrete rollback action tied to that checkpoint.

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-optimize/SKILL.md` at line 63, The "f. Check for regressions"
step mentions "revert" but lacks a deterministic rollback procedure; before any
edits to SKILL.md add an explicit pre-edit checkpoint (e.g., create a
timestamped backup copy or create a git branch/patch snapshot) and update the
checklist to require creating that checkpoint; then replace the vague "revert"
action with a concrete rollback instruction that restores the checkpoint
(restore the backup file or checkout the branch/apply the patch) tied to the
checkpoint name so regressions can be undone reliably.
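The checkpoint/rollback pair described above is small in code terms (a sketch; where the backup lands and what it is named are assumptions):

```python
import shutil
from pathlib import Path

def checkpoint(skill_md, run_dir):
    """Snapshot SKILL.md before editing so a regression can be
    rolled back deterministically."""
    backup = Path(run_dir) / f"{Path(skill_md).name}.bak"
    shutil.copy2(skill_md, backup)
    return backup

def rollback(skill_md, backup):
    """Restore the pre-edit snapshot created by checkpoint()."""
    shutil.copy2(backup, skill_md)
```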
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent_eval/__main__.py`:
- Line 116: yaml.safe_load(data_yaml) can return None which later causes
data.get("frontmatter", data) to raise AttributeError; after calling
yaml.safe_load in the module (the variable `data`), coerce None (or non-dict
results) to a dict before using .get — e.g., replace the direct assignment with
a guarded one such as assigning data = yaml.safe_load(data_yaml) or {} and if
not isinstance(data, dict): data = {'__raw__': data} (or simply {}) so
subsequent calls to data.get("frontmatter", data) are safe.
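The coercion guard might be factored like this (helper names are assumptions; the point is only that None and scalar results from yaml.safe_load never reach `.get`):

```python
def as_mapping(parsed):
    """Normalize yaml.safe_load output: None (empty document) and
    scalar results both become {} so .get() calls are safe."""
    return parsed if isinstance(parsed, dict) else {}

def frontmatter_of(parsed):
    """Mirror the data.get("frontmatter", data) access safely."""
    data = as_mapping(parsed)
    return data.get("frontmatter", data)
```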

In `@agent_eval/agent/claude_code.py`:
- Around line 51-53: The timeout_s parameter is declared but never used: replace
the hardcoded 30s timeout in the subprocess invocation with the timeout_s
argument and update the subprocess timeout handling/error message to reference
timeout_s; specifically, locate the subprocess.run call (where timeout=30 is
currently hardcoded) and change it to use timeout=timeout_s, and ensure the
exception logging that currently mentions timeout_s actually uses the same
timeout_s variable so the message matches the runtime behavior.
- Around line 137-141: The current except Exception as e block in claude_code.py
(the handler that builds a RunResult using start/duration and stderr=str(e)) is
too broad; update it to catch only the expected exceptions (e.g.,
subprocess.CalledProcessError, TimeoutError, OSError) and return a RunResult for
those, and ensure KeyboardInterrupt and SystemExit (or BaseException) are
re-raised rather than swallowed—alternatively add a dedicated except
(KeyboardInterrupt, SystemExit): raise before the narrower except so only
intended errors are caught while preserving proper signal/interrupt propagation.

In `@agent_eval/config.py`:
- Around line 140-143: The project_root property currently returns Path.cwd(),
which breaks relative-path resolution when a config is loaded from a file in a
different directory; update the config loader (e.g., EvalConfig.from_yaml) to
capture the YAML file's parent directory into a stored attribute (e.g.,
self.config_dir) and change project_root to return that stored config_dir (the
parent of the loaded YAML path) instead of Path.cwd(); ensure project_root falls
back to Path.cwd() only if no file path was provided.
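A minimal sketch of that change, assuming a constructor shape the real EvalConfig may not have:

```python
from pathlib import Path

class EvalConfig:
    """Remember the config file's directory so relative paths resolve
    against it rather than against the process CWD."""

    def __init__(self, data, config_dir=None):
        self.data = data
        self.config_dir = config_dir

    @classmethod
    def from_yaml_path(cls, path, data):
        # Capture the YAML file's parent directory at load time.
        return cls(data, config_dir=Path(path).parent.resolve())

    @property
    def project_root(self):
        # Fall back to CWD only when no file path was provided.
        return self.config_dir if self.config_dir is not None else Path.cwd()
```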

In `@skills/eval-optimize/SKILL.md`:
- Around line 45-53: Validate and constrain any dynamic identifiers before using
them in filesystem paths: enforce an allowlist regex (e.g., ^[A-Za-z0-9._-]+$)
for config.skill and run-id, reject values containing path separators or "..",
canonicalize with realpath/resolution, and verify the resolved paths remain
under the expected root directories (e.g., `.claude/skills/` for config.skill
and `eval/runs/` for run-id) before performing any read/write (references:
config.skill, run-id, `.claude/skills/<skill>/SKILL.md`,
`eval/runs/<id>-iter-<N>/summary.yaml`, stdout.log).

In `@skills/eval-run/scripts/score.py`:
- Around line 191-203: The function _make_inline_check currently compiles and
execs jc.check with full __builtins__, enabling arbitrary code execution; remove
the direct exec-based approach and either (A) disallow inline checks by raising
an error when jc.check is present (so callers must supply safe, pre-defined
check functions), or (B) run the inline check inside a sandboxed subprocess:
spawn a separate Python process (not exec in-process) that receives the outputs
via stdin (e.g., JSON), executes the provided jc.check string in a restricted
environment with no __builtins__/empty env, resource limits, and no
network/filesystem access, and returns the result via stdout; update
_make_inline_check to create a scorer that communicates with that subprocess
instead of calling check_fn directly, and remove passing __builtins__ into exec
to avoid in-process evaluation.
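Option (B) above — running the inline check in a child interpreter — can be sketched as follows (illustrative only: real sandboxing would also drop privileges and apply resource limits, which this does not):

```python
import json
import subprocess
import sys
import textwrap

def run_inline_check(check_code, outputs, timeout=10):
    """Execute an inline check string in a separate Python process.

    Outputs go in via stdin as JSON; the check runs under an empty
    __builtins__ in the child, and a boolean comes back on stdout.
    """
    runner = textwrap.dedent("""
        import json, sys
        payload = json.load(sys.stdin)
        ns = {"outputs": payload["outputs"], "__builtins__": {}}
        exec(payload["check"], ns)
        print(json.dumps(bool(ns.get("result", False))))
    """)
    proc = subprocess.run(
        [sys.executable, "-c", runner],
        input=json.dumps({"check": check_code, "outputs": outputs}),
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode == 0 and json.loads(proc.stdout) is True
```

A crashed or malicious check (nonzero exit, e.g. an attempted import under the empty `__builtins__`) simply scores as a failure.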

---

Duplicate comments:
In `@skills/eval-run/scripts/score.py`:
- Line 327: compare_runs computes parallelism as min(len(case_ids),
os.cpu_count() or 4) which yields 0 when case_ids is empty and causes
ThreadPoolExecutor(max_workers=0) to fail; add the same early-return guard used
in score_cases (or otherwise ensure parallelism is at least 1) so compare_runs
returns immediately when case_ids is empty (or sets parallelism = max(1,
min(len(case_ids), os.cpu_count() or 4))). Locate the comparison in the
compare_runs function and implement the empty-case guard or the min-1 clamp
accordingly.

---

Nitpick comments:
In `@agent_eval/agent/claude_code.py`:
- Around line 178-183: The class attribute _SAFE_ENV_KEYS is currently a mutable
set; change it to an immutable frozenset to prevent accidental modification by
replacing the set literal with a frozenset(...) construction (e.g.,
_SAFE_ENV_KEYS = frozenset({ "PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL",
"TERM", "ANTHROPIC_API_KEY", "ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION",
"MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME", "CLAUDE_CODE_SUBAGENT_MODEL",
})) ensuring all existing references to _SAFE_ENV_KEYS continue to work.

In `@agent_eval/state.py`:
- Around line 92-93: Rename the ambiguous list-comprehension loop variable `l`
to a clearer name like `line` to avoid confusable single-letter identifiers;
update the expression to ids = [line.strip() for line in
path.read_text().splitlines() if line.strip()] and leave the subsequent print("
".join(ids)) unchanged so behavior remains identical.
- Around line 95-97: The "clean" branch currently calls shutil.rmtree("tmp")
which deletes a relative tmp/ from the current working directory (in
function/branch where cmd == "clean"), risking removal of unintended files;
change it to accept or resolve an explicit path instead: add a parameter or CLI
arg for the tmp directory (or compute a safe base like the repository/app root
using __file__ or a BASE_DIR constant) and call shutil.rmtree on the resolved
absolute path (use os.path.abspath/os.path.join) and validate it is inside the
intended base before deleting; reference the cmd == "clean" branch and the
shutil.rmtree("tmp", ignore_errors=True) call when making the change.

In `@skills/eval-analyze/scripts/discover.py`:
- Around line 32-34: Frontmatter parsing in discover.py uses
content.split("---")[1] which can mis-handle multiple separators; change it to
split with a max of 2 (use content.split("---", 2)), verify that parts length is
>= 3 (matching the pattern used in check_eval_md), then pass parts[1] to
yaml.safe_load and only extract desc = fm.get("description", "")[:80] when
parsing succeeds; if validation fails, skip or handle as a non-frontmatter file
to avoid silent partial YAML parsing.

In `@skills/eval-optimize/SKILL.md`:
- Line 63: The "f. Check for regressions" step mentions "revert" but lacks a
deterministic rollback procedure; before any edits to SKILL.md add an explicit
pre-edit checkpoint (e.g., create a timestamped backup copy or create a git
branch/patch snapshot) and update the checklist to require creating that
checkpoint; then replace the vague "revert" action with a concrete rollback
instruction that restores the checkpoint (restore the backup file or checkout
the branch/apply the patch) tied to the checkpoint name so regressions can be
undone reliably.

In `@skills/eval-run/scripts/collect.py`:
- Line 112: There is a redundant duplicate "import re" in the module; remove the
second occurrence (the extra import statement seen in the diff) so only the
single module-level "import re" remains (the original import is already at the
top of collect.py), leaving no duplicate imports in the file.
- Around line 122-124: The current heuristic in the prefix grouping block uses
len(prefixes) >= num_cases * 0.5 which can silently exclude up to half the
files; update the logic in collect.py around the prefixes / num_cases check to
either require an exact match (use len(prefixes) == num_cases) or, if you want
to keep a relaxed threshold, add explicit logging that lists which files were
excluded and why before returning (use the existing prefixes dict and
sorted(prefixes.keys()) to compute excluded entries), so callers can see when
files are dropped and why.

In `@skills/eval-run/scripts/score.py`:
- Line 557: The print call currently uses an unnecessary f-string: replace
print(f"No summary found. Run judges first.", file=sys.stderr) with a plain
string print("No summary found. Run judges first.", file=sys.stderr) to remove
the extraneous `f` prefix (fix F541) in the scoring script.
- Line 217: The ImportError raised with the message "mlflow[genai] required for
LLM judges" should preserve the original exception context by chaining it;
modify the except block that detects the missing/failed import so that you
capture the original exception (e.g., "except Exception as err:" or the specific
caught exception) and replace the plain "raise ImportError(\"mlflow[genai]
required for LLM judges\")" with "raise ImportError(\"mlflow[genai] required for
LLM judges\") from err" so the original error is propagated (locate the raise in
score.py where that exact message is used).

In `@skills/eval-run/scripts/workspace.py`:
- Around line 134-138: The inline import inside the file-iteration block should
be moved to the module level: remove the "import json" from inside the elif
branch in workspace.py (the block that checks name.is_file() and name.suffix ==
".json") and add a single "import json" at the top of the module; keep the rest
of the logic that opens and json.load(f) unchanged (refer to the branch handling
name.is_file() / name.suffix == ".json").
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0b3fcd89-3363-4b9f-80ec-21470da4a1db

📥 Commits

Reviewing files that changed from the base of the PR and between 4c6526f and 9219ee1.

📒 Files selected for processing (13)
  • .gitignore
  • agent_eval/__main__.py
  • agent_eval/agent/claude_code.py
  • agent_eval/config.py
  • agent_eval/state.py
  • skills/eval-analyze/scripts/discover.py
  • skills/eval-optimize/SKILL.md
  • skills/eval-run/SKILL.md
  • skills/eval-run/scripts/collect.py
  • skills/eval-run/scripts/execute.py
  • skills/eval-run/scripts/score.py
  • skills/eval-run/scripts/workspace.py
  • skills/eval-setup/SKILL.md
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (2)
  • skills/eval-run/scripts/execute.py
  • skills/eval-setup/SKILL.md

@astefanutti astefanutti self-assigned this Apr 3, 2026
@astefanutti
Contributor Author

/approve


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (3)
agent_eval/agent/claude_code.py (2)

125-136: ⚠️ Potential issue | 🔴 Critical

Bug: timeout_s parameter ignored; hardcoded 30s used instead.

Line 125 uses timeout=30 while timeout_s defaults to 600. The error message at line 134 misleadingly reports timeout_s. This was flagged previously and remains unfixed.

Proposed fix
             stderr = proc.stderr.read()
-            proc.wait(timeout=30)
+            proc.wait(timeout=timeout_s)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 125 - 136, The timeout handling
incorrectly uses a hardcoded 30s in the proc.wait call and then reports
timeout_s in the error; update the proc.wait call to use the timeout_s parameter
(instead of 30) so the TimeoutExpired branch and the RunResult(stderr=...)
message reflect the actual timeout value; locate the proc.wait(...) call and the
except subprocess.TimeoutExpired block in claude_code.py (symbols: proc.wait,
timeout_s, start, RunResult) and replace the hardcoded literal with the
timeout_s variable so the duration and stderr string are consistent.

137-141: ⚠️ Potential issue | 🟡 Minor

Overly broad exception handler (CWE-755).

Catching bare Exception swallows KeyboardInterrupt propagation issues and obscures root causes. Narrow to expected exceptions.

Proposed fix
-        except Exception as e:
+        except (OSError, ValueError) as e:
             duration = time.monotonic() - start
             return RunResult(
                 exit_code=-1, stdout="", stderr=str(e), duration_s=duration,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 137 - 141, The except Exception
as e block that returns a RunResult with stderr=str(e) is too broad; replace it
by catching only the expected exception types (for example RuntimeError,
ValueError, OSError, asyncio.TimeoutError or asyncio.CancelledError as
appropriate for this function) and convert those to the RunResult, while
re-raising any other/unexpected exceptions (and always let BaseExceptions like
KeyboardInterrupt and SystemExit propagate). Locate the except block that builds
RunResult using start and only handle the specific error classes your code can
reasonably produce, using "raise" to bubble up everything else.
skills/eval-run/scripts/score.py (1)

327-327: ⚠️ Potential issue | 🟡 Minor

Empty case_ids will crash ThreadPoolExecutor(max_workers=0).

Unlike cmd_judges which handles empty case lists at line 125, compare_runs has no early return. If both runs exist but share no case IDs, this crashes.

Fix
 def compare_runs(run_a_dir, run_b_dir, config, case_ids,
                  prompt=None, prompt_file=None, model="claude-sonnet-4-6"):
     """Compare two runs using position-swapped LLM judge."""
+    if not case_ids:
+        return {"run_a": run_a_dir.name, "run_b": run_b_dir.name,
+                "cases_compared": 0, "wins_a": 0, "wins_b": 0,
+                "ties": 0, "errors": 0, "per_case": []}
     comparison_prompt = prompt
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` at line 327, The code computes parallelism
as min(len(case_ids), os.cpu_count() or 4) which can be zero when case_ids is
empty and will crash ThreadPoolExecutor(max_workers=0); in the compare_runs flow
either return early when case_ids is empty (similar to cmd_judges) or ensure
parallelism is at least 1 (e.g., set parallelism = max(1, min(len(case_ids),
os.cpu_count() or 4))); update the variable where it’s defined and any
ThreadPoolExecutor(max_workers=parallelism) usage accordingly to avoid creating
a pool with zero workers.
🧹 Nitpick comments (8)
skills/eval-run/scripts/collect.py (1)

106-127: Duplicate re import.

re is already imported at line 16. Remove the redundant import inside the function.

Proposed fix
 def _group_files(files, num_cases):
     """Group files into per-case bundles.

     Tries to detect a common ID prefix pattern (e.g., "RFE-001", "TASK-002").
     If found, groups by prefix. Otherwise, distributes one file per case.
     """
-    import re
-
     # Try to find a common prefix pattern: WORD-NNN at start of filename
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/collect.py` around lines 106 - 127, Remove the
redundant local import of the regex module inside _group_files; instead rely on
the existing top-level import (re) already present, so delete the line "import
re" within the _group_files function and keep the rest of the logic unchanged;
reference the function name _group_files to locate where to remove the duplicate
import.
agent_eval/state.py (1)

89-93: Ambiguous variable name l.

Per E741, single-letter l is easily confused with 1. Use line for clarity.

Proposed fix
     elif cmd == "read-ids":
         path = Path(sys.argv[2])
         if path.exists():
-            ids = [l.strip() for l in path.read_text().splitlines() if l.strip()]
+            ids = [line.strip() for line in path.read_text().splitlines() if line.strip()]
             print(" ".join(ids))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` around lines 89 - 93, Replace the ambiguous
single-letter variable `l` in the list comprehension inside the "read-ids"
branch (the code that builds `ids` when `cmd == "read-ids"`) with a clearer name
like `line`; update the comprehension to use `line.strip()` and the filter `if
line.strip()` so the behavior of `ids = [ ... ]` remains identical but is more
readable and avoids confusion with `1`.
skills/eval-run/scripts/score.py (1)

217-217: Chain exception with raise ... from err for clearer tracebacks (B904).

Fix
-    except ImportError:
-        raise ImportError("mlflow[genai] required for LLM judges")
+    except ImportError as err:
+        raise ImportError("mlflow[genai] required for LLM judges") from err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` at line 217, The ImportError raise in the
mlflow import failure should chain the original exception to preserve traceback;
update the except block around the mlflow[genai] import in score.py (the block
that currently does `raise ImportError("mlflow[genai] required for LLM
judges")`) to capture the caught exception (e.g., `except Exception as err:`)
and re-raise using `raise ImportError("mlflow[genai] required for LLM judges")
from err` so the original error is chained.
CLAUDE.md (1)

11-11: Add language specifiers to fenced code blocks.

Line 11 (architecture diagram) and line 65 (CLI usage) lack language hints. Use text for the diagram and bash for the commands.

Also applies to: 65-65

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` at line 11, Add language specifiers to the two fenced code blocks:
update the architecture diagram fenced block (the "architecture diagram" block)
to start with ```text and update the CLI commands fenced block (the "CLI usage"
block) to start with ```bash so the diagram and shell commands render correctly
with syntax highlighting.
skills/eval-analyze/scripts/find_skills.py (1)

43-44: Silent except: pass hides plugin.json parse errors.

Swallowing all exceptions makes debugging difficult when plugin.json contains malformed JSON or unexpected structure. Log or emit a warning.

Fix
-        except Exception:
-            pass
+        except Exception as e:
+            print(f"  WARNING: failed to parse {plugin_json}: {e}", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/find_skills.py` around lines 43 - 44, Replace the
silent "except Exception: pass" in find_skills.py (the try/except around
reading/parsing each plugin.json) with targeted error handling: catch
json.JSONDecodeError and log a warning that includes the plugin file path and
the exception details, and for other unexpected exceptions catch Exception as e
and log an error with context (file name and exception) before continuing;
ensure you use the module logger (or logging.warning/logging.error) instead of
swallowing errors so malformed plugin.json is visible during debugging.
skills/eval-analyze/SKILL.md (1)

139-139: A path containing quotes or shell metacharacters could break this shell one-liner.

If the skill path contains quotes or shell metacharacters, this inline Python command will fail or behave unexpectedly. Consider recommending that the agent use a script or pass the path as an argument with proper quoting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/SKILL.md` at line 139, The one-liner using python3 -c
with a raw '<skill-path>/SKILL.md' can break if the path contains quotes or
shell metacharacters; replace the inline command with a safe approach such as
passing the path as an argument (e.g. python3 -c "import hashlib,sys; from
pathlib import Path;
print(hashlib.sha256(Path(sys.argv[1]).read_bytes()).hexdigest()[:12])" --
"<skill-path>/SKILL.md") or recommend using a short helper script that accepts
the path as argv to avoid shell quoting issues; update the SKILL.md example to
show the argument-based command and note that callers should quote the path.
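The argv-based approach from the prompt can be verified with a throwaway file whose name contains shell-hostile characters (paths and the temp file here are illustrative, not the actual SKILL.md):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A filename with an apostrophe and spaces would break naive shell quoting
tmpdir = Path(tempfile.mkdtemp())
tricky = tmpdir / "skill's odd name.md"
tricky.write_bytes(b"hello")

# Passing the path via sys.argv (list form, no shell) sidesteps quoting entirely
code = ("import hashlib, sys; from pathlib import Path; "
        "print(hashlib.sha256(Path(sys.argv[1]).read_bytes()).hexdigest()[:12])")
out = subprocess.run([sys.executable, "-c", code, str(tricky)],
                     capture_output=True, text=True, check=True)
print(out.stdout.strip())  # → 2cf24dba5fb0 (sha256("hello")[:12])
```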
README.md (1)

7-7: Add language specifiers to CLI/diagram code blocks.

Multiple fenced blocks lack language hints. Use text for ASCII diagrams and bash for CLI examples to improve rendering and silence MD040.

Also applies to: 60-68, 79-79, 306-342

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 7, Several fenced code blocks in README.md are missing
language specifiers; update each ``` ... ``` block that contains an ASCII
diagram to use ```text and each block that contains CLI commands to use ```bash
so Markdown lint MD040 is satisfied and rendering improves; search for the
unannotated fenced code blocks (the plain ``` delimiters) and replace them with
the appropriate ```text or ```bash specifier depending on content.
skills/eval-analyze/references/eval-yaml-template.md (1)

88-101: Add language specifiers to fenced code blocks.

Lines 88, 99, and 120 use fenced code blocks without language hints. Adding yaml or text improves syntax highlighting and silences MD040 warnings.

Suggested fix
-```
+```yaml
 Each case directory contains:
-```
+```text
 Cases contain input files and reference outputs.
-```
+```text
 Score 1: Missing most requirements, major errors

Also applies to: 120-126

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/references/eval-yaml-template.md` around lines 88 - 101,
Add explicit language specifiers to the fenced code blocks in
eval-yaml-template.md: change the block that begins "Each case directory
contains:" to ```yaml and change the plain text example blocks (the "Bad —
vague, no specific field names:" example starting "Cases contain input files and
reference outputs." and the score/example block starting "Score 1: Missing most
requirements, major errors") to ```text; apply the same fix to the other similar
fenced block group around the later example (the block containing "Score 1:
Missing most requirements, major errors") so all code fences include a language
hint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent_eval/config.py`:
- Around line 126-133: Validate and sanitize any file paths provided for
JudgeConfig.prompt_file and each entry of JudgeConfig.context when constructing
the JudgeConfig: ensure values are strings, resolve them via a canonicalizing
API (e.g., Path.resolve()) and verify they are inside an allowed base directory
(compare resolved path prefixes) and not absolute / containing traversal that
escapes the base; if validation fails, reject the config or raise a clear error.
Implement this check in the code path that instantiates JudgeConfig (where
prompt_file and context are read) so every prompt_file and each context list
element is normalized and compared against the project/data base before being
stored in the JudgeConfig object.

In `@agent_eval/mlflow/experiment.py`:
- Around line 59-74: The current log_feedback function swallows all exceptions,
hiding failures; update it to catch exceptions as e and log the error instead of
silently passing so issues with mlflow, AssessmentSource, or invalid trace_id
become visible—use the module logger (e.g., logging.getLogger(__name__)) and
call logger.exception or logger.error including context (trace_id, name,
source_type/source_id) when mlflow.log_feedback or AssessmentSource construction
fails; keep the existing behavior of sending rationale=None when empty and do
not rework the function signature (function: log_feedback, symbols: mlflow,
AssessmentSource, mlflow.log_feedback).

In `@skills/eval-analyze/scripts/find_skills.py`:
- Around line 40-42: The current use of lstrip("./") on skills_field (in the
branch returning [skills_field.lstrip("./")] and the list comprehension
[s.lstrip("./") for s in skills_field]) wrongly strips any leading '.' or '/'
characters instead of removing the "./" or "../" prefix; update both branches to
explicitly remove only the "./" or "../" prefix (e.g. use str.removeprefix("./")
/ str.removeprefix("../") or check startswith and slice) so "../foo" is reduced
to "foo" only by removing its actual "../" prefix rather than by stripping
arbitrary leading characters, and apply the same fix to the list comprehension
handling each s in skills_field.

In `@skills/eval-analyze/scripts/validate_eval.py`:
- Around line 77-82: The existence checks for prompt_file, each ctx_file from
j.get("context", []), and settings currently allow path traversal (e.g.,
../../file); update validation so each path is rejected if
Path(path).is_absolute() is True or if ".." in Path(path).parts, and
additionally resolve the path and ensure it is within the intended project root
(e.g., compare Path(path).resolve() against project_root.resolve() or check that
project_root in Path(path).resolve().parents) before appending errors/warnings;
apply this same validation logic to prompt_file, the loop over j.get("context",
[]), and the settings variable referenced in the nearby block.
- Around line 83-89: The current dynamic import in validate_eval.py (using
importlib.import_module(module) inside the block handling j.get("module"))
allows arbitrary modules to be imported from untrusted eval.yaml; instead,
replace the dynamic import with a safe validation: either check the module
string against a maintained allowlist of permitted module names or verify the
module's spec/origin is located inside the project package before importing
(e.g., use importlib.util.find_spec and ensure spec.loader and spec.origin point
to project files and not builtins/system paths); if the module is not allowed or
not under the project, append the same error to errors and do not import. Ensure
this logic is applied where module is set and avoid any unconditional
importlib.import_module(module) calls.

In `@skills/eval-run/scripts/score.py`:
- Line 557: The print call using an unnecessary f-string causes F541; locate the
print statement that currently reads print(f"No summary found. Run judges
first.", file=sys.stderr) in skills/eval-run/scripts/score.py and remove the f
prefix so it becomes a plain string literal; no other behavior changes are
needed.

In `@skills/eval-run/scripts/workspace.py`:
- Around line 116-118: The printed case count uses len(case_dirs) which can be
higher than the actual processed cases because _read_input filters out
directories without valid inputs; update the status line to print the actual
number of entries produced by _read_input by using len(batch_entries) (reference
variables: case_dirs, batch_entries and function _read_input) so the output
reflects the true number of cases that will be included in batch.yaml.

---

Duplicate comments:
In `@agent_eval/agent/claude_code.py`:
- Around line 125-136: The timeout handling incorrectly uses a hardcoded 30s in
the proc.wait call and then reports timeout_s in the error; update the proc.wait
call to use the timeout_s parameter (instead of 30) so the TimeoutExpired branch
and the RunResult(stderr=...) message reflect the actual timeout value; locate
the proc.wait(...) call and the except subprocess.TimeoutExpired block in
claude_code.py (symbols: proc.wait, timeout_s, start, RunResult) and replace the
hardcoded literal with the timeout_s variable so the duration and stderr string
are consistent.
- Around line 137-141: The except Exception as e block that returns a RunResult
with stderr=str(e) is too broad; replace it by catching only the expected
exception types (for example RuntimeError, ValueError, OSError,
asyncio.TimeoutError or asyncio.CancelledError as appropriate for this function)
and convert those to the RunResult, while re-raising any other/unexpected
exceptions (and always let BaseExceptions like KeyboardInterrupt and SystemExit
propagate). Locate the except block that builds RunResult using start and only
handle the specific error classes your code can reasonably produce, using
"raise" to bubble up everything else.

In `@skills/eval-run/scripts/score.py`:
- Line 327: The code computes parallelism as min(len(case_ids), os.cpu_count()
or 4) which can be zero when case_ids is empty and will crash
ThreadPoolExecutor(max_workers=0); in the compare_runs flow either return early
when case_ids is empty (similar to cmd_judges) or ensure parallelism is at least
1 (e.g., set parallelism = max(1, min(len(case_ids), os.cpu_count() or 4)));
update the variable where it’s defined and any
ThreadPoolExecutor(max_workers=parallelism) usage accordingly to avoid creating
a pool with zero workers.

---

Nitpick comments:
In `@agent_eval/state.py`:
- Around line 89-93: Replace the ambiguous single-letter variable `l` in the
list comprehension inside the "read-ids" branch (the code that builds `ids` when
`cmd == "read-ids"`) with a clearer name like `line`; update the comprehension
to use `line.strip()` and the filter `if line.strip()` so the behavior of `ids =
[ ... ]` remains identical but is more readable and avoids confusion with `1`.

In `@CLAUDE.md`:
- Line 11: Add language specifiers to the two fenced code blocks: update the
architecture diagram fenced block (the "architecture diagram" block) to start
with ```text and update the CLI commands fenced block (the "CLI usage" block) to
start with ```bash so the diagram and shell commands render correctly with
syntax highlighting.

In `@README.md`:
- Line 7: Several fenced code blocks in README.md are missing language
specifiers; update each ``` ... ``` block that contains an ASCII diagram to use
```text and each block that contains CLI commands to use ```bash so Markdown
lint MD040 is satisfied and rendering improves; search for the unannotated
fenced code blocks (the plain ``` delimiters) and replace them with the
appropriate ```text or ```bash specifier depending on content.

In `@skills/eval-analyze/references/eval-yaml-template.md`:
- Around line 88-101: Add explicit language specifiers to the fenced code blocks
in eval-yaml-template.md: change the block that begins "Each case directory
contains:" to ```yaml and change the plain text example blocks (the "Bad —
vague, no specific field names:" example starting "Cases contain input files and
reference outputs." and the score/example block starting "Score 1: Missing most
requirements, major errors") to ```text; apply the same fix to the other similar
fenced block group around the later example (the block containing "Score 1:
Missing most requirements, major errors") so all code fences include a language
hint.

In `@skills/eval-analyze/scripts/find_skills.py`:
- Around line 43-44: Replace the silent "except Exception: pass" in
find_skills.py (the try/except around reading/parsing each plugin.json) with
targeted error handling: catch json.JSONDecodeError and log a warning that
includes the plugin file path and the exception details, and for other
unexpected exceptions catch Exception as e and log an error with context (file
name and exception) before continuing; ensure you use the module logger (or
logging.warning/logging.error) instead of swallowing errors so malformed
plugin.json is visible during debugging.

In `@skills/eval-analyze/SKILL.md`:
- Line 139: The one-liner using python3 -c with a raw '<skill-path>/SKILL.md'
can break if the path contains quotes or shell metacharacters; replace the
inline command with a safe approach such as passing the path as an argument
(e.g. python3 -c "import hashlib,sys; from pathlib import Path;
print(hashlib.sha256(Path(sys.argv[1]).read_bytes()).hexdigest()[:12])" --
"<skill-path>/SKILL.md") or recommend using a short helper script that accepts
the path as argv to avoid shell quoting issues; update the SKILL.md example to
show the argument-based command and note that callers should quote the path.

In `@skills/eval-run/scripts/collect.py`:
- Around line 106-127: Remove the redundant local import of the regex module
inside _group_files; instead rely on the existing top-level import (re) already
present, so delete the line "import re" within the _group_files function and
keep the rest of the logic unchanged; reference the function name _group_files
to locate where to remove the duplicate import.

In `@skills/eval-run/scripts/score.py`:
- Line 217: The ImportError raise in the mlflow import failure should chain the
original exception to preserve traceback; update the except block around the
mlflow[genai] import in score.py (the block that currently does `raise
ImportError("mlflow[genai] required for LLM judges")`) to capture the caught
exception (e.g., `except Exception as err:`) and re-raise using `raise
ImportError("mlflow[genai] required for LLM judges") from err` so the original
error is chained.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 0973d12b-5cf6-4278-9982-232f6153f859

📥 Commits

Reviewing files that changed from the base of the PR and between 9219ee1 and f797c06.

📒 Files selected for processing (31)
  • .claude-plugin/plugin.json
  • .gitignore
  • CLAUDE.md
  • README.md
  • agent_eval/__init__.py
  • agent_eval/agent/__init__.py
  • agent_eval/agent/base.py
  • agent_eval/agent/claude_code.py
  • agent_eval/config.py
  • agent_eval/mlflow/__init__.py
  • agent_eval/mlflow/experiment.py
  • agent_eval/state.py
  • eval.yaml
  • pyproject.toml
  • skills/eval-analyze/SKILL.md
  • skills/eval-analyze/prompts/analyze-skill.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • skills/eval-analyze/references/eval-yaml-template.md
  • skills/eval-analyze/scripts/find_skills.py
  • skills/eval-analyze/scripts/validate_eval.py
  • skills/eval-mlflow/SKILL.md
  • skills/eval-optimize/SKILL.md
  • skills/eval-run/SKILL.md
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-run/scripts/collect.py
  • skills/eval-run/scripts/execute.py
  • skills/eval-run/scripts/score.py
  • skills/eval-run/scripts/workspace.py
  • skills/eval-setup/SKILL.md
  • skills/eval-setup/scripts/check_env.py
✅ Files skipped from review due to trivial changes (13)
  • agent_eval/__init__.py
  • agent_eval/mlflow/__init__.py
  • .gitignore
  • .claude-plugin/plugin.json
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-analyze/prompts/analyze-skill.md
  • eval.yaml
  • skills/eval-run/scripts/execute.py
  • agent_eval/agent/base.py
  • skills/eval-mlflow/SKILL.md
  • skills/eval-run/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • pyproject.toml
  • skills/eval-optimize/SKILL.md
  • skills/eval-setup/SKILL.md

Comment on lines +126 to +133
                prompt=j.get("prompt", ""),
                prompt_file=j.get("prompt_file", ""),
                context=j.get("context", []),
                feedback_type=j.get("feedback_type", ""),
                model=j.get("model", ""),
                module=j.get("module", ""),
                function=j.get("function", ""),
            ))

⚠️ Potential issue | 🟠 Major

Missing path validation for prompt_file and context fields (CWE-22).

JudgeConfig.prompt_file and context accept path values from eval.yaml but aren't validated. If these are later used for file reads (per docstring, context is "File paths loaded as supplementary context"), a malicious config could read files outside the project via ../../etc/passwd.

Proposed fix
         for j in raw.get("judges", []):
+            prompt_file = j.get("prompt_file", "")
+            if prompt_file:
+                _validate_relative_path(prompt_file, "judges[].prompt_file")
+            context_paths = j.get("context", [])
+            for idx, ctx_path in enumerate(context_paths):
+                _validate_relative_path(ctx_path, f"judges[].context[{idx}]")
             config.judges.append(JudgeConfig(
                 name=j.get("name", ""),
                 description=j.get("description", ""),
                 check=j.get("check", ""),
                 prompt=j.get("prompt", ""),
-                prompt_file=j.get("prompt_file", ""),
-                context=j.get("context", []),
+                prompt_file=prompt_file,
+                context=context_paths,
                 feedback_type=j.get("feedback_type", ""),
                 model=j.get("model", ""),
                 module=j.get("module", ""),
                 function=j.get("function", ""),
             ))

As per coding guidelines, **/*.py: Validate file paths (prevent path traversal).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/config.py` around lines 126 - 133, Validate and sanitize any file
paths provided for JudgeConfig.prompt_file and each entry of JudgeConfig.context
when constructing the JudgeConfig: ensure values are strings, resolve them via a
canonicalizing API (e.g., Path.resolve()) and verify they are inside an allowed
base directory (compare resolved path prefixes) and not absolute / containing
traversal that escapes the base; if validation fails, reject the config or raise
a clear error. Implement this check in the code path that instantiates
JudgeConfig (where prompt_file and context are read) so every prompt_file and
each context list element is normalized and compared against the project/data
base before being stored in the JudgeConfig object.

Comment on lines +59 to +74
def log_feedback(trace_id: str, name: str, value, source_type: str = "CODE",
                 source_id: str = "agent-eval", rationale: str = ""):
    """Log feedback to a trace."""
    try:
        import mlflow
        from mlflow.entities.assessment import AssessmentSource

        mlflow.log_feedback(
            trace_id=trace_id,
            name=name,
            value=value,
            source=AssessmentSource(source_type=source_type, source_id=source_id),
            rationale=rationale if rationale else None,
        )
    except Exception:
        pass

⚠️ Potential issue | 🟠 Major

Silent exception suppression hides failures (CWE-755).

log_feedback catches all exceptions and does nothing. If mlflow is misconfigured, the trace ID is invalid, or the API fails, callers have no indication. Per context snippet 1, trace IDs may be mismatched—silent failures here make debugging impossible.

Proposed fix: log the exception
     try:
         import mlflow
         from mlflow.entities.assessment import AssessmentSource

         mlflow.log_feedback(
             trace_id=trace_id,
             name=name,
             value=value,
             source=AssessmentSource(source_type=source_type, source_id=source_id),
             rationale=rationale if rationale else None,
         )
-    except Exception:
-        pass
+    except ImportError:
+        pass  # MLflow not installed, skip silently
+    except Exception as e:
+        print(f"WARNING: failed to log feedback for trace {trace_id}: {e}",
+              file=sys.stderr)
🧰 Tools
🪛 Ruff (0.15.9)

[error] 73-74: try-except-pass detected, consider logging the exception

(S110)


[warning] 73-73: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/mlflow/experiment.py` around lines 59 - 74, The current
log_feedback function swallows all exceptions, hiding failures; update it to
catch exceptions as e and log the error instead of silently passing so issues
with mlflow, AssessmentSource, or invalid trace_id become visible—use the module
logger (e.g., logging.getLogger(__name__)) and call logger.exception or
logger.error including context (trace_id, name, source_type/source_id) when
mlflow.log_feedback or AssessmentSource construction fails; keep the existing
behavior of sending rationale=None when empty and do not rework the function
signature (function: log_feedback, symbols: mlflow, AssessmentSource,
mlflow.log_feedback).
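The logging pattern the prompt describes, shown in a generic form (the failing mlflow call is simulated by a stub that raises; names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.WARNING)

def flaky_backend_call(trace_id):
    raise RuntimeError("trace not found")  # stand-in for a failing mlflow call

def log_feedback(trace_id, name, value):
    try:
        flaky_backend_call(trace_id)
    except Exception:
        # logger.exception records the message at ERROR level plus the full
        # traceback, so failures stay visible without propagating to callers
        logger.exception("failed to log feedback for trace %s (%s=%r)",
                         trace_id, name, value)

log_feedback("tr-123", "accuracy", 0.9)  # logs the error, does not raise
```

Callers keep the same best-effort semantics as the original `except: pass`, but a misconfigured tracking server or mismatched trace ID now leaves evidence in the logs.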

Comment on lines +40 to +42
                    return [skills_field.lstrip("./")]
                elif isinstance(skills_field, list):
                    return [s.lstrip("./") for s in skills_field]

⚠️ Potential issue | 🟡 Minor

lstrip("./") removes individual characters, not the prefix.

lstrip strips any character in the given set, not the substring. "../foo".lstrip("./") yields "foo", incorrectly removing "../". Use removeprefix or explicit handling.

Fix
-                    return [skills_field.lstrip("./")]
+                    return [skills_field.removeprefix("./")]
                 elif isinstance(skills_field, list):
-                    return [s.lstrip("./") for s in skills_field]
+                    return [s.removeprefix("./") for s in skills_field]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/find_skills.py` around lines 40 - 42, The current
use of lstrip("./") on skills_field (in the branch returning
[skills_field.lstrip("./")] and the list comprehension [s.lstrip("./") for s in
skills_field]) wrongly strips any leading '.' or '/' characters instead of
removing the "./" or "../" prefix; update both branches to explicitly remove
only the "./" or "../" prefix (e.g. use str.removeprefix("./") /
str.removeprefix("../") or check startswith and slice) so "../foo" is reduced to
"foo" only by removing its actual "../" prefix rather than by stripping arbitrary
leading characters, and apply the same fix to the list comprehension handling
each s in skills_field.
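The difference is easy to reproduce in a REPL (`str.removeprefix` requires Python ≥ 3.9):

```python
# lstrip treats its argument as a character set, not a prefix string:
# it strips any leading '.' or '/' characters greedily
print("../foo".lstrip("./"))         # "foo"  -- "../" wrongly consumed
print("...hidden/./x".lstrip("./"))  # "hidden/./x"

# removeprefix removes the exact substring, and only when present
print("./foo".removeprefix("./"))    # "foo"
print("../foo".removeprefix("./"))   # "../foo" -- untouched, no false match
```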

Comment on lines +77 to +82
        prompt_file = j.get("prompt_file", "")
        if prompt_file and not Path(prompt_file).exists():
            errors.append(f"judges.{name}.prompt_file '{prompt_file}' not found")
        for ctx_file in j.get("context", []):
            if not Path(ctx_file).exists():
                warnings.append(f"judges.{name}.context '{ctx_file}' not found")

⚠️ Potential issue | 🟡 Minor

[CWE-22] Missing path traversal validation for prompt_file, context, and settings.

These paths are checked for existence but not constrained to the project root. prompt_file: ../../.env would pass validation. Apply the same .is_absolute() and ".." in parts checks used for output paths.

+    def _is_safe_relative_path(p):
+        pp = Path(p)
+        return not pp.is_absolute() and ".." not in pp.parts
+
     for j in judges:
         name = j.get("name", "unnamed")
         prompt_file = j.get("prompt_file", "")
-        if prompt_file and not Path(prompt_file).exists():
+        if prompt_file:
+            if not _is_safe_relative_path(prompt_file):
+                errors.append(f"judges.{name}.prompt_file must be relative without parent traversal")
+            elif not Path(prompt_file).exists():
                 errors.append(f"judges.{name}.prompt_file '{prompt_file}' not found")

As per coding guidelines, "Validate file paths (prevent path traversal)".

Also applies to: 91-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/validate_eval.py` around lines 77 - 82, The
existence checks for prompt_file, each ctx_file from j.get("context", []), and
settings currently allow path traversal (e.g., ../../file); update validation so
each path is rejected if Path(path).is_absolute() is True or if ".." in
Path(path).parts, and additionally resolve the path and ensure it is within the
intended project root (e.g., compare Path(path).resolve() against
project_root.resolve() or check that project_root in
Path(path).resolve().parents) before appending errors/warnings; apply this same
validation logic to prompt_file, the loop over j.get("context", []), and the
settings variable referenced in the nearby block.
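A minimal sketch of the combined check the prompt describes (`project_root` is a stand-in for the eval project directory; `Path.is_relative_to` requires Python ≥ 3.9, and absolute-path detection for paths like "/etc/passwd" assumes POSIX semantics):

```python
from pathlib import Path

def is_safe_relative_path(path_str, project_root):
    """Reject absolute paths, parent traversal, and anything resolving outside root."""
    p = Path(path_str)
    if p.is_absolute() or ".." in p.parts:
        return False
    # Belt and suspenders: resolve symlinks and confirm containment
    resolved = (project_root / p).resolve()
    return resolved.is_relative_to(project_root.resolve())

root = Path(".").resolve()
print(is_safe_relative_path("prompts/judge.md", root))  # True
print(is_safe_relative_path("../../.env", root))        # False
print(is_safe_relative_path("/etc/passwd", root))       # False
```

The lexical check (`is_absolute()` / `".."` in parts) catches the obvious cases cheaply; the `resolve()` step additionally defends against symlinks pointing outside the root.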

Comment on lines +83 to +89
        module = j.get("module", "")
        if module:
            try:
                import importlib
                importlib.import_module(module)
            except ImportError:
                errors.append(f"judges.{name}.module '{module}' not importable")

⚠️ Potential issue | 🟠 Major

[CWE-94] importlib.import_module with config-controlled module enables arbitrary code import.

A malicious eval.yaml can set module: "os" to import arbitrary modules. Import itself can execute __init__.py code. Restrict to an allowlist or validate the module path is within the project.

Suggested mitigation
         module = j.get("module", "")
         if module:
+            # Only allow project-local modules (e.g., "eval.judges.foo")
+            if not module.startswith("eval."):
+                errors.append(f"judges.{name}.module '{module}' must be a project-local module (e.g., 'eval.judges.foo')")
+                continue
             try:
                 import importlib
                 importlib.import_module(module)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/validate_eval.py` around lines 83 - 89, The
current dynamic import in validate_eval.py (using
importlib.import_module(module) inside the block handling j.get("module"))
allows arbitrary modules to be imported from untrusted eval.yaml; instead,
replace the dynamic import with a safe validation: either check the module
string against a maintained allowlist of permitted module names or verify the
module's spec/origin is located inside the project package before importing
(e.g., use importlib.util.find_spec and ensure spec.loader and spec.origin point
to project files and not builtins/system paths); if the module is not allowed or
not under the project, append the same error to errors and do not import. Ensure
this logic is applied where module is set and avoid any unconditional
importlib.import_module(module) calls.

Comment on lines +116 to +118
print(f"WORKSPACE: {workspace}")
print(f"CASES: {len(case_dirs)}")
print(f"BATCH: {workspace / 'batch.yaml'}")

⚠️ Potential issue | 🟡 Minor

Case count may be misleading.

Line 117 prints len(case_dirs) but some cases may be skipped in _read_input (no valid input file). Consider printing len(batch_entries) for accuracy.

Proposed fix
     print(f"WORKSPACE: {workspace}")
-    print(f"CASES: {len(case_dirs)}")
+    print(f"CASES: {len(batch_entries)}")
     print(f"BATCH: {workspace / 'batch.yaml'}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/workspace.py` around lines 116 - 118, The printed
case count uses len(case_dirs) which can be higher than the actual processed
cases because _read_input filters out directories without valid inputs; update
the status line to print the actual number of entries produced by _read_input by
using len(batch_entries) (reference variables: case_dirs, batch_entries and
function _read_input) so the output reflects the true number of cases that will
be included in batch.yaml.

@astefanutti astefanutti force-pushed the pr-01 branch 2 times, most recently from d53362b to b7615d1 Compare April 5, 2026 14:27

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

♻️ Duplicate comments (5)
agent_eval/agent/claude_code.py (2)

137-141: ⚠️ Potential issue | 🟡 Minor

Broad except Exception obscures root causes and swallows unexpected errors (CWE-755).

Narrow to expected exceptions (OSError, subprocess.SubprocessError, TimeoutExpired). This was flagged in a prior review and remains unaddressed.

Proposed fix
-        except Exception as e:
+        except (OSError, subprocess.SubprocessError) as e:
             duration = time.monotonic() - start
             return RunResult(
                 exit_code=-1, stdout="", stderr=str(e), duration_s=duration,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 137 - 141, The current broad
"except Exception as e" in the block that computes duration with
time.monotonic() and returns a RunResult should be narrowed to only the expected
error types (e.g., OSError, subprocess.SubprocessError, and
subprocess.TimeoutExpired) so KeyboardInterrupt and other unexpected exceptions
propagate; update the except clause(s) around the subprocess
invocation/RunResult return to catch those specific exceptions, keep the
duration calculation and the RunResult(exit_code=-1, stdout="", stderr=str(e),
duration_s=duration) behavior, and do not swallow KeyboardInterrupt or other
non-subprocess errors.

51-53: ⚠️ Potential issue | 🔴 Critical

timeout_s parameter declared but ignored; hardcoded 30s timeout used instead.

Line 125 uses proc.wait(timeout=30) while the method signature accepts timeout_s=600. The error message at line 134 references timeout_s but the actual timeout is 30 seconds.

Proposed fix
-            proc.wait(timeout=30)
+            proc.wait(timeout=timeout_s)

Also applies to: 125-136

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 51 - 53, The function currently
ignores the timeout_s parameter by calling proc.wait(timeout=30) and then
referencing timeout_s in the error text; update the code to use the timeout_s
variable (seconds) when waiting on the subprocess (proc.wait(timeout=timeout_s))
and adjust the timeout error handling/log message to use that same timeout_s
value so the behavior matches the signature and RunResult uses the correct
timeout duration.
skills/eval-run/scripts/score.py (3)

261-273: ⚠️ Potential issue | 🔴 Critical

[CWE-94] exec() with config-sourced code enables arbitrary code execution.

_make_inline_check compiles and executes jc.check from eval.yaml with full __builtins__. A malicious config can run os.system('rm -rf /'), exfiltrate secrets via network, or compromise the host. This was flagged in prior review and remains unaddressed.

Either remove inline check support, require explicit module:function references, or execute in a sandboxed subprocess with no filesystem/network access.

As per coding guidelines, "No eval() or exec() with untrusted input (CWE-94)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 261 - 273, The
`_make_inline_check` implementation executes untrusted `jc.check` code via
compile/exec with full `__builtins__`, enabling arbitrary code execution;
replace inline execution with a safe alternative by rejecting/disabled inline
scripts and requiring `jc.check` to be a `module:function` reference instead:
update `_make_inline_check` to parse `jc.check` as "module:callable", import the
module via importlib, fetch the callable with getattr (or raise a clear error if
format/callable is invalid), and return a scorer that invokes that callable (use
`scorer` and `check_fn` names to preserve call sites). If inline support must
remain, instead run the compiled code inside a sandboxed subprocess with no
filesystem/network access and minimal env, capturing outputs and errors—do not
use exec with provided `ns` or full `__builtins__`.

627-627: ⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix (F541).

f"No summary found. Run judges first." has no placeholders.

-        print(f"No summary found. Run judges first.", file=sys.stderr)
+        print("No summary found. Run judges first.", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` at line 627, The print call in score.py
uses an unnecessary f-string which triggers F541; change the statement that
prints "No summary found. Run judges first." from using f"No summary found. Run
judges first." to a regular string literal without the f prefix (locate the
print(...) invocation in the scoring/summary check and remove the leading 'f').

397-402: ⚠️ Potential issue | 🟡 Minor

Empty case_ids causes parallelism=0, failing ThreadPoolExecutor.

compare_runs doesn't guard against empty case_ids. When empty, min(0, cpu_count) yields 0, and ThreadPoolExecutor(max_workers=0) raises ValueError.

Proposed fix
 def compare_runs(run_a_dir, run_b_dir, config, case_ids,
                  prompt=None, prompt_file=None, model="claude-sonnet-4-6"):
     """Compare two runs using position-swapped LLM judge."""
+    if not case_ids:
+        return {"run_a": run_a_dir.name, "run_b": run_b_dir.name,
+                "cases_compared": 0, "wins_a": 0, "wins_b": 0,
+                "ties": 0, "errors": 0, "per_case": []}
     comparison_prompt = prompt
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 397 - 402, The code in
compare_runs computes parallelism = min(len(case_ids), os.cpu_count() or 4)
which becomes 0 for empty case_ids and causes ThreadPoolExecutor(max_workers=0)
to raise; update compare_runs to guard empty case_ids by checking if not
case_ids and returning an empty results list (or otherwise set parallelism =
max(1, min(len(case_ids), os.cpu_count() or 4))) before creating the
ThreadPoolExecutor to ensure max_workers is at least 1; locate the parallelism
variable and ThreadPoolExecutor usage in compare_runs and implement the
early-return or min-one fix there.
🧹 Nitpick comments (12)
agent_eval/state.py (1)

92-92: Ambiguous variable name l.

Per PEP8/E741, l is easily confused with 1. Rename to line.

-            ids = [l.strip() for l in path.read_text().splitlines() if l.strip()]
+            ids = [line.strip() for line in path.read_text().splitlines() if line.strip()]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/state.py` at line 92, The list comprehension assigning to ids uses
the ambiguous loop variable `l`; rename it to `line` in the expression `ids =
[l.strip() for l in path.read_text().splitlines() if l.strip()]` so it becomes
`ids = [line.strip() for line in path.read_text().splitlines() if
line.strip()]`—update any matching occurrences in the same statement to use
`line` to satisfy PEP8/E741 and avoid confusion with `1`.
skills/eval-dataset/SKILL.md (1)

79-85: Missing language specifier on fenced code block.

Add language hint for syntax highlighting and linter compliance.

-```
+```text
 case-001-simple-basic-input/
 case-002-complex-multi-requirement/
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-dataset/SKILL.md` around lines 79 - 85, The fenced code block
containing the test case list (lines showing "case-001-simple-basic-input/"
through "case-005-ambiguous-phrasing/") is missing a language specifier; update
the opening triple backticks to include a language hint (e.g., ```text) so the
block becomes a labeled code fence (affects the block that contains the case-00x
entries in SKILL.md).
skills/eval-setup/scripts/check_env.py (1)

56-58: Consider redacting infrastructure identifiers in output.

Logging the ANTHROPIC_VERTEX_PROJECT_ID value exposes the GCP project ID in terminal output or logs. While not a secret, it's an infrastructure identifier that could aid reconnaissance. Consider showing only a truncated/masked value or just "set".

Suggestion
     if has_vertex:
-        api_detail.append(f"ANTHROPIC_VERTEX_PROJECT_ID={os.environ['ANTHROPIC_VERTEX_PROJECT_ID']}")
+        api_detail.append("ANTHROPIC_VERTEX_PROJECT_ID set")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-setup/scripts/check_env.py` around lines 56 - 58, The code
currently appends the full ANTHROPIC_VERTEX_PROJECT_ID to api_detail (see the
has_vertex branch and the api_detail.append call), which leaks infra
identifiers; change that append to either a redacted/truncated form (e.g., show
only first 4 and last 4 chars with ellipsis) or simply indicate
"ANTHROPIC_VERTEX_PROJECT_ID set" instead of the full environment value so the
identifier is not printed in logs.
skills/eval-run/scripts/collect.py (1)

112-112: Redundant import statement.

import re on line 112 is unnecessary; re is already imported at module level (line 16).

 def _group_files(files, num_cases):
     """Group files into per-case bundles.
 
     Tries to detect a common ID prefix pattern (e.g., "RFE-001", "TASK-002").
     If found, groups by prefix. Otherwise, distributes one file per case.
     """
-    import re
 
     # Try to find a common prefix pattern: WORD-NNN at start of filename
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/collect.py` at line 112, There is a duplicate import
of the regex module 're' in the same module; remove the redundant "import re"
statement (the second occurrence in skills/eval-run/scripts/collect.py) so only
the single module-level import remains, leaving any other references to 're'
unchanged.
skills/eval-run/scripts/workspace.py (1)

144-146: Broad exception catch could mask unexpected errors.

Line 144 catches all Exception types. Consider narrowing to (yaml.YAMLError, json.JSONDecodeError, OSError) to avoid masking programming errors.

-        except Exception as e:
+        except (yaml.YAMLError, json.JSONDecodeError, OSError) as e:
             print(f"WARNING: failed to parse {name}: {e}", file=sys.stderr)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/workspace.py` around lines 144 - 146, The except
block that currently does "except Exception as e:" when parsing files (printing
"WARNING: failed to parse {name}") is too broad; replace it with a narrower
catch for parsing/IO errors such as (yaml.YAMLError, json.JSONDecodeError,
OSError) so programming errors aren't masked, and ensure those exception classes
are imported or referenced by module (e.g., import yaml, import json) before
use; keep the existing warning print and the continue behavior unchanged.
skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.yaml (1)

180-200: Missing explicit encoding in file open calls.

Lines 181, 217, 310, 318, 324 open files without specifying encoding='utf-8'. On systems with non-UTF8 default locale, this can cause UnicodeDecodeError or misread content.

Suggested fix pattern
-          with open(f) as fh:
+          with open(f, encoding='utf-8') as fh:
               content = fh.read()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.yaml`
around lines 180 - 200, The file opens files using plain with open(f) as fh:
(inside the task_files loop that reads frontmatter and other similar open calls)
which can fail on non-UTF8 locales; update each plain open(...) call to
explicitly specify encoding='utf-8' (e.g., with open(f, encoding='utf-8') as
fh:) wherever files are read (the open used to read task_files contents and the
other open calls noted in the review) so YAML parsing and string checks behave
consistently across environments.
skills/eval-run/scripts/tools.py (1)

23-26: No error handling for malformed JSON on stdin.

If stdin is empty or contains invalid JSON, json.load() raises an exception. For a hook script, consider catching json.JSONDecodeError and exiting cleanly to avoid breaking the agent flow.

Suggested fix
 def main():
-    input_data = json.load(sys.stdin)
+    try:
+        input_data = json.load(sys.stdin)
+    except json.JSONDecodeError:
+        sys.exit(0)  # Malformed input — pass through
     tool_name = input_data.get("tool_name", "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/tools.py` around lines 23 - 26, The main function
currently calls json.load(sys.stdin) without handling malformed or empty stdin;
wrap the json.load call in a try/except that catches json.JSONDecodeError (and
optionally ValueError) around the json.load invocation in main and on decode
failure log a clear error (e.g., to stderr or processLogger) and exit cleanly
(os._exit or sys.exit with non-zero code) so the hook script doesn't raise an
unhandled exception; refer to main and the json.load(sys.stdin) call when making
the change.
skills/eval-analyze/references/eval-yaml-template.md (2)

149-155: Add language specifier.

Line 149 code block should specify a language (e.g., text or markdown).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/references/eval-yaml-template.md` around lines 149 - 155,
The fenced code block that lists the score descriptions (the block containing
"Score 1: Missing most requirements..." through "Score 5: Comprehensive,
accurate, well-written") is missing a language specifier; update the opening
fence from ``` to something like ```text or ```markdown so the block is
explicitly labeled (modify the code block that contains the "Score 1"–"Score 5"
lines).

117-130: Add language specifiers to fenced code blocks.

Lines 117-125 and 128-130 use fenced code blocks without language identifiers. Specify yaml for proper syntax highlighting and linting.

Suggested fix
-```
+```yaml
 Each case directory contains:
 - input.yaml: YAML file with 'prompt' (the problem statement to send
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/references/eval-yaml-template.md` around lines 117 - 130,
Update the two fenced code blocks in eval-yaml-template.md (the positive example
block showing input.yaml/reference.md/annotations.yaml and the "Bad" example
block) to include language specifiers so syntax highlighting/linting works;
replace the opening triple-backticks with triple-backticks plus "yaml" (e.g.,
```yaml) for the positive example and add a language tag appropriate for the
"Bad" snippet (e.g., ```text or ```yaml) so both fenced blocks declare a
language. Locate the fenced blocks by the sample text mentioning "input.yaml",
"reference.md", and "annotations.yaml" and the following "Bad — vague" block and
add the language identifiers to their opening fences.
agent_eval/agent/claude_code.py (1)

178-183: Class attribute _SAFE_ENV_KEYS should be a ClassVar or frozenset.

Ruff RUF012 flags mutable default for class attribute. Since this set is never mutated, declare as frozenset for correctness and to silence the lint.

Proposed fix
-    _SAFE_ENV_KEYS = {
+    _SAFE_ENV_KEYS: frozenset[str] = frozenset({
         "PATH", "HOME", "USER", "SHELL", "LANG", "LC_ALL", "TERM",
         "ANTHROPIC_API_KEY", "ANTHROPIC_VERTEX_PROJECT_ID", "CLOUD_ML_REGION",
         "MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME",
         "CLAUDE_CODE_SUBAGENT_MODEL",
-    }
+    })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/agent/claude_code.py` around lines 178 - 183, The class attribute
_SAFE_ENV_KEYS is defined as a mutable set which triggers RUF012; change its
declaration to an immutable frozenset (or annotate it as
ClassVar[frozenset[str]]) so the value cannot be mutated and the linter warning
is silenced — update the _SAFE_ENV_KEYS symbol to use frozenset({...}) (and add
a ClassVar annotation if your typing style requires it) in claude_code.py where
the constant is defined.
skills/eval-run/scripts/score.py (1)

284-287: Use raise ... from err for exception chaining (B904).

When re-raising within an except block, chain the original exception to preserve traceback context.

-    except ImportError:
-        raise ImportError("mlflow[genai] required for LLM judges")
+    except ImportError as err:
+        raise ImportError("mlflow[genai] required for LLM judges") from err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/score.py` around lines 284 - 287, The ImportError
handler around the import of make_judge should preserve the original exception
chain: change the except clause to capture the exception (e.g., except
ImportError as err) and re-raise the new ImportError("mlflow[genai] required for
LLM judges") using "from err" so the original traceback is retained; update the
try/except that imports from mlflow.genai.judges (make_judge) accordingly.
README.md (1)

7-7: Add language identifiers to fenced code blocks (MD040).

Multiple code blocks lack language specifiers. Use text for ASCII diagrams and command examples, bash for shell commands.

Example fixes:

-```
+```text
                                              ┌──────────────────┐
-```
+```text
 /eval-setup

Also applies to: 68-70, 76-78, 87-89, 95-97

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 7, Several fenced code blocks in README.md are missing
language identifiers (MD040); update each triple-backtick block to include an
appropriate language tag (use "text" for ASCII diagrams and path examples and
"bash" for shell/command examples). Locate the plain ``` blocks shown in the
diff and at the other reported ranges (around lines 68-70, 76-78, 87-89, 95-97)
and change them to ```text or ```bash as appropriate so syntax highlighters and
linters recognize the content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent_eval/config.py`:
- Around line 164-168: The config is appending unvalidated paths into
ToolInputConfig.prompt_file; before creating or appending ToolInputConfig in the
block that calls config.inputs.tools.append, validate prompt_file (when
non-empty) by normalizing it (reject absolute paths and remove/resolve .. using
os.path.abspath or equivalent) and ensure its absolute path is within an allowed
base directory (e.g., the project config directory) using os.path.commonpath; if
the resolved path is outside the base or contains traversal, raise/return a
config validation error and do not create the ToolInputConfig instance. Ensure
the validation logic is applied where ToolInputConfig(prompt_file=...) is
constructed so all inputs[].prompt_file values are checked.

In
`@skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.md`:
- Around line 65-75: The fenced code block containing the rfe.speedrun pipeline
(the block that starts with ``` and lists Phase 1: rfe.create, Phase 2:
rfe.auto-fix, etc.) must include a language identifier to satisfy MD040; change
the opening fence from ``` to ```text (or another appropriate language) so the
block becomes ```text ... ```.

In
`@skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/outputs/eval.yaml`:
- Around line 1-5: The file contains duplicated line-number artifacts before
each line (e.g., "1\tname: rfe-creator" or "1~1\t..."); remove the leading
numeric annotations so the YAML uses plain keys (e.g., "name: rfe-creator",
"description: ...", "skill: rfe.speedrun", "runner: claude-code") ensuring no
stray tabs or markers remain and the file parses as valid YAML; update the
content around the "name", "description", "skill", and "runner" entries
accordingly.
- Line 46: Replace the mojibake sequence "â" in the comment containing "Dataset
� each case is a derived problem statement from an existing Jira RFE" with a
proper em dash so the line reads "Dataset — each case is a derived problem
statement from an existing Jira RFE"; search for the exact corrupted substring
"â��" in eval.yaml and update it to "—" preserving the rest of the comment text
and encoding (UTF-8).

In `@skills/eval-analyze/scripts/find_skills.py`:
- Around line 75-77: The code assumes a closing YAML delimiter when handling
frontmatter and uses content.split("---")[1], which raises IndexError for
malformed frontmatter; change the logic in the block that checks
content.startswith("---") to safely extract the frontmatter (e.g., use
content.split("---", 2) and verify the result has at least 3 parts or use
content.partition("---")/find to locate the closing delimiter) before calling
yaml.safe_load, and only then set desc from fm to avoid the IndexError and
produce a clearer error path; update the variables fm and desc in that branch
accordingly.

In `@skills/eval-analyze/SKILL.md`:
- Around line 139-141: The inline Python command that computes the SHA-256 of
SKILL.md uses unsafe string interpolation of '<skill-path>' and can break with
quotes/backslashes; change the step to safely handle arbitrary paths by either
(a) quoting the path with shlex.quote before embedding or (b) passing the path
as a separate argument to the Python -c invocation, or (c) replace the one-liner
with the safe Path-based approach used in validate_eval.py (e.g. use
find_skill() to construct a Path and compute the hash), updating the SKILL.md
hash step accordingly.
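Options (a) and (b) can be combined in one sketch; the script body and path below are illustrative, not the actual SKILL.md step:

```python
import shlex

# A hypothetical skill path containing a quote character.
skill_path = "skills/o'brien/SKILL.md"

# Pass the path as a separate argv entry so the Python source never
# interpolates it, and shell-quote both pieces with shlex.quote.
script = (
    "import hashlib, sys; "
    "print(hashlib.sha256(open(sys.argv[1], 'rb').read()).hexdigest())"
)
cmd = f"python3 -c {shlex.quote(script)} {shlex.quote(skill_path)}"
```

The resulting `cmd` round-trips cleanly through a shell even when the path contains quotes or spaces.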

In `@skills/eval-run/scripts/tools.py`:
- Around line 93-95: The code assumes options[0]["label"] exists and can raise
KeyError; update the logic where options and answer are set (the block using
variables options, answer and assigning to answers[text]) to safely access the
label (e.g., use options[0].get("label", ...) or check for a "label" key) and
provide a sensible default ("yes") when missing, ensuring no exception is thrown
if options[0] is missing or lacks "label".
- Around line 28-34: The code currently uses a relative
Path("tool_handlers.yaml") (symbol: config_path) which can fail if the current
working directory isn't the workspace root; change the lookup to derive the
config file location from the script file location (use
Path(__file__).resolve().parent combined with "tool_handlers.yaml") so the
exists check and open(...) for yaml.safe_load(...) always target the config next
to tools.py rather than the CWD; update the config_path assignment in tools.py
accordingly and keep the subsequent exists check and file open logic as-is.

---

Duplicate comments:
In `@agent_eval/agent/claude_code.py`:
- Around line 137-141: The current broad "except Exception as e" in the block
that computes duration with time.monotonic() and returns a RunResult should be
narrowed to only the expected error types (e.g., OSError,
subprocess.SubprocessError, and subprocess.TimeoutExpired) so KeyboardInterrupt
and other unexpected exceptions propagate; update the except clause(s) around
the subprocess invocation/RunResult return to catch those specific exceptions,
keep the duration calculation and the RunResult(exit_code=-1, stdout="",
stderr=str(e), duration_s=duration) behavior, and do not swallow
KeyboardInterrupt or other non-subprocess errors.
- Around line 51-53: The function currently ignores the timeout_s parameter by
calling proc.wait(timeout=30) and then referencing timeout_s in the error text;
update the code to use the timeout_s variable (seconds) when waiting on the
subprocess (proc.wait(timeout=timeout_s)) and adjust the timeout error
handling/log message to use that same timeout_s value so the behavior matches
the signature and RunResult uses the correct timeout duration.

In `@skills/eval-run/scripts/score.py`:
- Around line 261-273: The `_make_inline_check` implementation executes
untrusted `jc.check` code via compile/exec with full `__builtins__`, enabling
arbitrary code execution; replace inline execution with a safe alternative by
rejecting/disabled inline scripts and requiring `jc.check` to be a
`module:function` reference instead: update `_make_inline_check` to parse
`jc.check` as "module:callable", import the module via importlib, fetch the
callable with getattr (or raise a clear error if format/callable is invalid),
and return a scorer that invokes that callable (use `scorer` and `check_fn`
names to preserve call sites). If inline support must remain, instead run the
compiled code inside a sandboxed subprocess with no filesystem/network access
and minimal env, capturing outputs and errors—do not use exec with provided `ns`
or full `__builtins__`.
- Line 627: The print call in score.py uses an unnecessary f-string which
triggers F541; change the statement that prints "No summary found. Run judges
first." from using f"No summary found. Run judges first." to a regular string
literal without the f prefix (locate the print(...) invocation in the
scoring/summary check and remove the leading 'f').
- Around line 397-402: The code in compare_runs computes parallelism =
min(len(case_ids), os.cpu_count() or 4) which becomes 0 for empty case_ids and
causes ThreadPoolExecutor(max_workers=0) to raise; update compare_runs to guard
empty case_ids by checking if not case_ids and returning an empty results list
(or otherwise set parallelism = max(1, min(len(case_ids), os.cpu_count() or 4)))
before creating the ThreadPoolExecutor to ensure max_workers is at least 1;
locate the parallelism variable and ThreadPoolExecutor usage in compare_runs and
implement the early-return or min-one fix there.

---

Nitpick comments:
In `@agent_eval/agent/claude_code.py`:
- Around line 178-183: The class attribute _SAFE_ENV_KEYS is defined as a
mutable set which triggers RUF012; change its declaration to an immutable
frozenset (or annotate it as ClassVar[frozenset[str]]) so the value cannot be
mutated and the linter warning is silenced — update the _SAFE_ENV_KEYS symbol to
use frozenset({...}) (and add a ClassVar annotation if your typing style
requires it) in claude_code.py where the constant is defined.

In `@agent_eval/state.py`:
- Line 92: The list comprehension assigning to ids uses the ambiguous loop
variable `l`; rename it to `line` in the expression `ids = [l.strip() for l in
path.read_text().splitlines() if l.strip()]` so it becomes `ids = [line.strip()
for line in path.read_text().splitlines() if line.strip()]`—update any matching
occurrences in the same statement to use `line` to satisfy PEP8/E741 and avoid
confusion with `1`.

In `@README.md`:
- Line 7: Several fenced code blocks in README.md are missing language
identifiers (MD040); update each triple-backtick block to include an appropriate
language tag (use "text" for ASCII diagrams and path examples and "bash" for
shell/command examples). Locate the plain ``` blocks shown in the diff and at
the other reported ranges (around lines 68-70, 76-78, 87-89, 95-97) and change
them to ```text or ```bash as appropriate so syntax highlighters and linters
recognize the content.

In
`@skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.yaml`:
- Around line 180-200: The file opens files using plain with open(f) as fh:
(inside the task_files loop that reads frontmatter and other similar open calls)
which can fail on non-UTF8 locales; update each plain open(...) call to
explicitly specify encoding='utf-8' (e.g., with open(f, encoding='utf-8') as
fh:) wherever files are read (the open used to read task_files contents and the
other open calls noted in the review) so YAML parsing and string checks behave
consistently across environments.

In `@skills/eval-analyze/references/eval-yaml-template.md`:
- Around line 149-155: The fenced code block that lists the score descriptions
(the block containing "Score 1: Missing most requirements..." through "Score 5:
Comprehensive, accurate, well-written") is missing a language specifier; update
the opening fence from ``` to something like ```text or ```markdown so the block
is explicitly labeled (modify the code block that contains the "Score 1"–"Score
5" lines).
- Around line 117-130: Update the two fenced code blocks in
eval-yaml-template.md (the positive example block showing
input.yaml/reference.md/annotations.yaml and the "Bad" example block) to include
language specifiers so syntax highlighting/linting works; replace the opening
triple-backticks with triple-backticks plus "yaml" (e.g., ```yaml) for the
positive example and add a language tag appropriate for the "Bad" snippet (e.g.,
```text or ```yaml) so both fenced blocks declare a language. Locate the fenced
blocks by the sample text mentioning "input.yaml", "reference.md", and
"annotations.yaml" and the following "Bad — vague" block and add the language
identifiers to their opening fences.

In `@skills/eval-dataset/SKILL.md`:
- Around line 79-85: The fenced code block containing the test case list (lines
showing "case-001-simple-basic-input/" through "case-005-ambiguous-phrasing/")
is missing a language specifier; update the opening triple backticks to include
a language hint (e.g., ```text) so the block becomes a labeled code fence
(affects the block that contains the case-00x entries in SKILL.md).

In `@skills/eval-run/scripts/collect.py`:
- Line 112: There is a duplicate import of the regex module 're' in the same
module; remove the redundant "import re" statement (the second occurrence in
skills/eval-run/scripts/collect.py) so only the single module-level import
remains, leaving any other references to 're' unchanged.

In `@skills/eval-run/scripts/score.py`:
- Around line 284-287: The ImportError handler around the import of make_judge
should preserve the original exception chain: change the except clause to
capture the exception (e.g., except ImportError as err) and re-raise the new
ImportError("mlflow[genai] required for LLM judges") using "from err" so the
original traceback is retained; update the try/except that imports from
mlflow.genai.judges (make_judge) accordingly.
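The chaining the reviewer asks for can be sketched generically; `require` and the module names below are illustrative, not the actual score.py code:

```python
import importlib


def require(module_name, hint):
    """Import a module, re-raising ImportError with 'from err' so the
    original traceback is preserved as __cause__ (PEP 3134)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(hint) from err
```

With `from err`, the replacement message is shown but the root-cause traceback (e.g. which transitive import actually failed) still appears below it.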

In `@skills/eval-run/scripts/tools.py`:
- Around line 23-26: The main function currently calls json.load(sys.stdin)
without handling malformed or empty stdin; wrap the json.load call in a
try/except that catches json.JSONDecodeError (and optionally ValueError) around
the json.load invocation in main and on decode failure log a clear error (e.g.,
to stderr or processLogger) and exit cleanly (os._exit or sys.exit with non-zero
code) so the hook script doesn't raise an unhandled exception; refer to main and
the json.load(sys.stdin) call when making the change.
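A minimal sketch of that guard, assuming the hook reads its payload from a stream (the function name is hypothetical, not from tools.py):

```python
import json
import sys


def read_hook_payload(stream):
    """Parse the hook's JSON payload; on empty or malformed input, warn
    on stderr and return None so the caller can exit cleanly."""
    try:
        return json.load(stream)
    except json.JSONDecodeError:
        print("tools.py: malformed or empty JSON on stdin", file=sys.stderr)
        return None
```

main() would then check for None and sys.exit with a non-zero code instead of raising an unhandled exception.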

In `@skills/eval-run/scripts/workspace.py`:
- Around line 144-146: The except block that currently does "except Exception as
e:" when parsing files (printing "WARNING: failed to parse {name}") is too
broad; replace it with a narrower catch for parsing/IO errors such as
(yaml.YAMLError, json.JSONDecodeError, OSError) so programming errors aren't
masked, and ensure those exception classes are imported or referenced by module
(e.g., import yaml, import json) before use; keep the existing warning print and
the continue behavior unchanged.
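The narrowed catch can be sketched as follows — a self-contained example using stdlib json only; workspace.py would additionally catch yaml.YAMLError:

```python
import json


def load_case_file(path):
    """Parse a case file, warning only on parse/IO errors; programming
    errors (AttributeError, TypeError, ...) still propagate."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (json.JSONDecodeError, OSError) as e:
        print(f"WARNING: failed to parse {path}: {e}")
        return None
```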

In `@skills/eval-setup/scripts/check_env.py`:
- Around line 56-58: The code currently appends the full
ANTHROPIC_VERTEX_PROJECT_ID to api_detail (see the has_vertex branch and the
api_detail.append call), which leaks infra identifiers; change that append to
either a redacted/truncated form (e.g., show only first 4 and last 4 chars with
ellipsis) or simply indicate "ANTHROPIC_VERTEX_PROJECT_ID set" instead of the
full environment value so the identifier is not printed in logs.
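A hypothetical redaction helper (not the actual check_env.py code) illustrating the first-4/last-4 form:

```python
def redact(value, keep=4):
    """Show only the first and last few characters of an identifier so
    logs don't leak full project IDs; short values are fully masked."""
    if len(value) <= keep * 2:
        return "*" * len(value)
    return f"{value[:keep]}…{value[-keep:]}"
```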
ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 156a6f30-30a8-4297-aae7-0d2649b60780

📥 Commits

Reviewing files that changed from the base of the PR and between f797c06 and d53362b.

📒 Files selected for processing (49)
  • .claude-plugin/plugin.json
  • .gitignore
  • CLAUDE.md
  • README.md
  • agent_eval/__init__.py
  • agent_eval/agent/__init__.py
  • agent_eval/agent/base.py
  • agent_eval/agent/claude_code.py
  • agent_eval/config.py
  • agent_eval/mlflow/__init__.py
  • agent_eval/mlflow/experiment.py
  • agent_eval/state.py
  • eval.yaml
  • pyproject.toml
  • skills/eval-analyze-workspace/evals/evals.json
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/eval_metadata.json
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/with_skill/outputs/eval.md
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/with_skill/outputs/eval.yaml
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/with_skill/timing.json
  • skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/eval_metadata.json
  • skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.md
  • skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.yaml
  • skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/timing.json
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/eval_metadata.json
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/outputs/eval.md
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/outputs/eval.yaml
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/outputs/eval.yaml.before
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/timing.json
  • skills/eval-analyze/SKILL.md
  • skills/eval-analyze/prompts/analyze-skill.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • skills/eval-analyze/references/eval-yaml-template.md
  • skills/eval-analyze/scripts/find_skills.py
  • skills/eval-analyze/scripts/validate_eval.py
  • skills/eval-dataset/SKILL.md
  • skills/eval-mlflow/SKILL.md
  • skills/eval-optimize/SKILL.md
  • skills/eval-review/SKILL.md
  • skills/eval-review/prompts/review-results.md
  • skills/eval-run/SKILL.md
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-run/scripts/collect.py
  • skills/eval-run/scripts/execute.py
  • skills/eval-run/scripts/score.py
  • skills/eval-run/scripts/tools.py
  • skills/eval-run/scripts/workspace.py
  • skills/eval-setup/SKILL.md
  • skills/eval-setup/scripts/check_env.py
✅ Files skipped from review due to trivial changes (21)
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/timing.json
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/with_skill/timing.json
  • agent_eval/mlflow/__init__.py
  • agent_eval/__init__.py
  • skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/timing.json
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/eval_metadata.json
  • skills/eval-analyze-workspace/evals/evals.json
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/eval_metadata.json
  • skills/eval-run/prompts/comparison-judge.md
  • skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/eval_metadata.json
  • .gitignore
  • .claude-plugin/plugin.json
  • skills/eval-run/prompts/analyze-results.md
  • skills/eval-analyze-workspace/iteration-1/test-3-update-existing/with_skill/outputs/eval.md
  • skills/eval-review/prompts/review-results.md
  • skills/eval-analyze/prompts/generate-eval-md.md
  • pyproject.toml
  • eval.yaml
  • skills/eval-analyze/prompts/analyze-skill.md
  • skills/eval-mlflow/SKILL.md
  • skills/eval-analyze-workspace/iteration-1/test-1-simple-skill/with_skill/outputs/eval.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • skills/eval-run/scripts/execute.py
  • agent_eval/agent/base.py
  • skills/eval-setup/SKILL.md
  • skills/eval-run/SKILL.md
  • skills/eval-analyze/scripts/validate_eval.py
  • CLAUDE.md

Comment on lines +164 to +168
config.inputs.tools.append(ToolInputConfig(
    match=t.get("match", ""),
    prompt=t.get("prompt", ""),
    prompt_file=t.get("prompt_file", ""),
))

⚠️ Potential issue | 🟠 Major

Missing path validation for inputs[].prompt_file (CWE-22).

ToolInputConfig.prompt_file accepts a path from YAML without validation. If later used for file reads, a malicious config could traverse outside the project.

Fix
         for t in (inputs_raw.get("tools") or []):
+            prompt_file = t.get("prompt_file", "")
+            if prompt_file:
+                _validate_relative_path(prompt_file, "inputs.tools[].prompt_file")
             config.inputs.tools.append(ToolInputConfig(
                 match=t.get("match", ""),
                 prompt=t.get("prompt", ""),
-                prompt_file=t.get("prompt_file", ""),
+                prompt_file=prompt_file,
             ))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent_eval/config.py` around lines 164 - 168, The config is appending
unvalidated paths into ToolInputConfig.prompt_file; before creating or appending
ToolInputConfig in the block that calls config.inputs.tools.append, validate
prompt_file (when non-empty) by normalizing it (reject absolute paths and
remove/resolve .. using os.path.abspath or equivalent) and ensure its absolute
path is within an allowed base directory (e.g., the project config directory)
using os.path.commonpath; if the resolved path is outside the base or contains
traversal, raise/return a config validation error and do not create the
ToolInputConfig instance. Ensure the validation logic is applied where
ToolInputConfig(prompt_file=...) is constructed so all inputs[].prompt_file
values are checked.
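The `_validate_relative_path` helper is assumed by the suggested fix but not shown; a sketch of what it could look like:

```python
import os


def validate_relative_path(path, base_dir):
    """Reject absolute paths and any path resolving outside base_dir.
    Hypothetical implementation of the _validate_relative_path helper
    the patch assumes; returns the resolved absolute path."""
    if os.path.isabs(path):
        raise ValueError(f"absolute path not allowed: {path}")
    base = os.path.abspath(base_dir)
    resolved = os.path.abspath(os.path.join(base, path))
    if os.path.commonpath([resolved, base]) != base:
        raise ValueError(f"path escapes base directory: {path}")
    return resolved
```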

Comment on lines +65 to +75
```
rfe.speedrun
  Phase 1: rfe.create — generate RFEs from prompts using rfe-template.md
  Phase 2: rfe.auto-fix — batch review + revision loop
    → rfe.review — orchestrates per-RFE review
      → assess-rfe — 5-criterion rubric (what, why, open_to_how, not_a_task, right_sized)
      → rfe-feasibility-review — technical feasibility assessment
      → auto-revision (up to 2 cycles if score < 7 or any criterion = 0)
    → rfe.split — decompose oversized RFEs (up to 3 self-correction cycles)
  Phase 3: rfe.submit — create/update Jira tickets (skipped with --dry-run)
```

⚠️ Potential issue | 🟡 Minor

Add a language identifier to the fenced code block.

The code fence opened at Line 65 uses bare ``` and triggers MD040. Use an explicit language (e.g., text) to avoid lint failures.

Suggested patch
-```
+```text
 rfe.speedrun
   Phase 1: rfe.create — generate RFEs from prompts using rfe-template.md
   Phase 2: rfe.auto-fix — batch review + revision loop
     → rfe.review — orchestrates per-RFE review
       → assess-rfe — 5-criterion rubric (what, why, open_to_how, not_a_task, right_sized)
       → rfe-feasibility-review — technical feasibility assessment
       → auto-revision (up to 2 cycles if score < 7 or any criterion = 0)
     → rfe.split — decompose oversized RFEs (up to 3 self-correction cycles)
   Phase 3: rfe.submit — create/update Jira tickets (skipped with --dry-run)
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 68-68: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@skills/eval-analyze-workspace/iteration-1/test-2-recursive-pipeline/with_skill/outputs/eval.md`
around lines 65 - 75, The fenced code block containing the rfe.speedrun pipeline
(the block that starts with ``` and lists Phase 1: rfe.create, Phase 2:
rfe.auto-fix, etc.) must include a language identifier to satisfy MD040; change
the opening fence from ``` to ```text (or another appropriate language) so the
block becomes ```text ... ```.

Comment on lines +75 to +77
if content.startswith("---"):
    fm = yaml.safe_load(content.split("---")[1])
    desc = (fm or {}).get("description", "")[:80]

⚠️ Potential issue | 🟡 Minor

IndexError if YAML frontmatter delimiter is malformed.

content.split("---")[1] raises IndexError if the file starts with --- but has no closing delimiter. The outer except catches it, but the error message will be misleading.

Safer parsing
             if content.startswith("---"):
-                fm = yaml.safe_load(content.split("---")[1])
-                desc = (fm or {}).get("description", "")[:80]
+                parts = content.split("---", 2)
+                if len(parts) >= 2:
+                    fm = yaml.safe_load(parts[1])
+                    desc = (fm or {}).get("description", "")[:80]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-analyze/scripts/find_skills.py` around lines 75 - 77, The code
assumes a closing YAML delimiter when handling frontmatter and uses
content.split("---")[1], which raises IndexError for malformed frontmatter;
change the logic in the block that checks content.startswith("---") to safely
extract the frontmatter (e.g., use content.split("---", 2) and verify the result
has at least 3 parts or use content.partition("---")/find to locate the closing
delimiter) before calling yaml.safe_load, and only then set desc from fm to
avoid the IndexError and produce a clearer error path; update the variables fm
and desc in that branch accordingly.
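The safe extraction can be sketched as a small helper (illustrative, not the actual find_skills.py code); the caller would pass the returned text to yaml.safe_load:

```python
def extract_frontmatter(content):
    """Return the YAML frontmatter text between the opening and closing
    '---' delimiters, or None when either delimiter is missing — avoiding
    the IndexError that bare content.split("---")[1] can raise."""
    if not content.startswith("---"):
        return None
    parts = content.split("---", 2)
    if len(parts) < 3:  # no closing delimiter
        return None
    return parts[1]
```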

Comment on lines +28 to +34
# Load handler config from workspace
config_path = Path("tool_handlers.yaml")
if not config_path.exists():
    sys.exit(0)

with open(config_path) as f:
    config = yaml.safe_load(f) or {}

⚠️ Potential issue | 🟠 Major

Relative path lookup may fail depending on CWD.

Path("tool_handlers.yaml") assumes CWD is the workspace root. Per workspace.py:218, the hook command is python3 {workspace}/hooks/tools.py with an absolute script path, but this script reads config via relative path. If Claude Code doesn't set CWD to workspace when executing PreToolUse hooks, tool_handlers.yaml won't be found and all interceptions silently pass through.

Suggested fix: derive config path from script location
     # Load handler config from workspace
-    config_path = Path("tool_handlers.yaml")
+    # Workspace root is parent of hooks/ directory where this script lives
+    script_dir = Path(__file__).resolve().parent
+    workspace_root = script_dir.parent
+    config_path = workspace_root / "tool_handlers.yaml"
     if not config_path.exists():
         sys.exit(0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/tools.py` around lines 28 - 34, The code currently
uses a relative Path("tool_handlers.yaml") (symbol: config_path) which can fail
if the current working directory isn't the workspace root; change the lookup to
derive the config file location from the script file location (use
Path(__file__).resolve().parent combined with "tool_handlers.yaml") so the
exists check and open(...) for yaml.safe_load(...) always target the config next
to tools.py rather than the CWD; update the config_path assignment in tools.py
accordingly and keep the subsequent exists check and file open logic as-is.

Comment on lines +93 to +95
options = q.get("options", [])
answer = options[0]["label"] if options else "yes"
answers[text] = answer

⚠️ Potential issue | 🟡 Minor

Potential KeyError if option dict lacks "label" key.

Line 94 assumes options[0] contains a "label" key. If the dict structure differs, this raises KeyError.

Suggested fix
             options = q.get("options", [])
-            answer = options[0]["label"] if options else "yes"
+            answer = options[0].get("label", "yes") if options else "yes"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/eval-run/scripts/tools.py` around lines 93 - 95, The code assumes
options[0]["label"] exists and can raise KeyError; update the logic where
options and answer are set (the block using variables options, answer and
assigning to answers[text]) to safely access the label (e.g., use
options[0].get("label", ...) or check for a "label" key) and provide a sensible
default ("yes") when missing, ensuring no exception is thrown if options[0] is
missing or lacks "label".

astefanutti and others added 14 commits April 6, 2026 14:12
setuptools auto-discovery fails with "Multiple top-level packages"
because the flat layout includes tmp/, eval/, skills/, and agent_eval/.
Add explicit [tool.setuptools.packages.find] with include=["agent_eval*"]
so editable installs work.

Also move mlflow from required to optional dependencies — the harness
core (config, state, workspace) works without it. Users who want MLflow
tracing or make_judge can install with `pip install -e '.[mlflow]'`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Claude Code runner strips the subprocess environment to a safe
allowlist. When using Vertex AI, CLAUDE_CODE_USE_VERTEX must be
forwarded or the CLI falls back to direct API auth and fails with
"Your organization does not have access to Claude."

Add CLAUDE_CODE_USE_VERTEX, GOOGLE_APPLICATION_CREDENTIALS,
GOOGLE_CLOUD_PROJECT, CLOUDSDK_CONFIG, and
CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE to _SAFE_ENV_KEYS so Vertex AI
authentication works in headless eval runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_read_input() only accepted dict inputs (isinstance check), so YAML
files containing a list of entries were silently skipped, producing an
empty batch.yaml. Skills like rfe.speedrun use list-of-dicts input
format (each entry has prompt, priority, labels).

Fix _read_input to accept any non-None parsed data. When the parsed
content is a list, extend batch_entries (flattening) instead of
appending a nested list. Track entry_count per case in case_order.yaml
so the collection step can map outputs back to cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When inputs.tools is configured, .claude is added to skip_symlinks to
avoid overwriting the generated settings.json with a symlink to the
project's .claude/ directory. But this also drops .claude/skills/,
which contains the skill definitions the evaluated skill needs to
invoke sub-skills.

After symlinking, iterate .claude/ subdirectories and symlink each one
into the workspace's .claude/ (which already has settings.json from
_setup_tool_hooks). This preserves both the generated hooks config and
the project's skill definitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MLflow's make_judge uses litellm which requires OPENAI_API_KEY by
default. When using Anthropic (direct or Vertex AI), LLM judges fail
at call time with an auth error.

Add _make_anthropic_llm_judge() that uses the Anthropic client directly
(same client the pairwise comparison already uses). When
ANTHROPIC_VERTEX_PROJECT_ID or ANTHROPIC_API_KEY is set, prefer this
over make_judge. The judge model defaults to claude-3-5-haiku and is
configurable via EVAL_JUDGE_MODEL env var.

The fallback handles {{ outputs }} template rendering by building a
text summary from the outputs["files"] dict, and parses the LLM
response as JSON with score/rationale fields.

Also fix detect_regressions() to handle None mean values — LLM judges
that error out produce mean=None, which crashed the < comparison
against the threshold float.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New report.py script generates a self-contained HTML report from
eval run results (summary.yaml, run_result.json, eval.yaml). Includes
scoring summary, regressions, pairwise comparison, per-case details
with judge results, output files, and baseline diffs.

The eval-run skill now saves its analysis (key findings, failure
patterns, recommendations) to analysis.md, which report.py renders
as a styled section in the HTML report — preserving the agent's
insights beyond the conversation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When background agents complete, Claude Code emits multiple result
events in stream-json mode. The previous code only read usage from
the last result object, missing tokens from earlier completions.

Now iterates all stdout lines to sum input/output/cache tokens across
all result events. Also adds cache_read and cache_create token counts.

Additionally reads system_prompt from runner_options for appending
custom system prompts via config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When eval.yaml arguments contain {prompt}, execute.py now replaces it
with the actual prompt text from the first entry in batch.yaml. This
lets skills that accept a prompt string directly (rather than a file
path) work with the batch execution model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
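The substitution described above amounts to a simple literal replacement over the argument list; a sketch under that assumption (function name hypothetical, not the actual execute.py code):

```python
def render_args(arguments, batch_entries):
    """Replace a literal '{prompt}' placeholder in each CLI argument
    with the prompt text of the first batch entry."""
    prompt = batch_entries[0].get("prompt", "") if batch_entries else ""
    return [a.replace("{prompt}", prompt) for a in arguments]
```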
Three improvements to the HTML report:

- Replace the basic markdown-to-HTML converter with a proper parser
  that handles tables, fenced code blocks, ordered/unordered lists,
  headers (h2-h4), bold, italic, inline code, and links. Tables get
  automatic PASS/FAIL color coding.

- Render HTML output files inline via sandboxed iframes (srcdoc) with
  auto-height instead of showing raw HTML source in a pre block.

- Remove max-height/overflow-y from file previews so they display
  fully inline without competing with page scroll.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a skill processes all cases in a single invocation (batch mode),
the auto-detect prefix grouping in collect.py can misassign files.

New batch_pattern field in OutputConfig lets eval.yaml specify how
output files map to cases:
- "RFE-{n:03d}" → RFE-001 matches case 1, RFE-002 matches case 2
- "*" → shared directory, files copied to every case
- Empty (default) → existing auto-detect behavior

The _collect_batch() function also respects entry_count from
case_order.yaml for cases with multiple batch entries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
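The batch_pattern rules above can be sketched as a file-to-case mapping (illustrative only, not the actual collect.py logic; the auto-detect branch is out of scope):

```python
def files_for_case(pattern, case_index, filenames):
    """Map output files to a 1-based case index: a '{n:...}' pattern
    matches files whose name starts with the formatted prefix, '*'
    shares every file, and an empty pattern defers to auto-detect."""
    if pattern == "*":
        return list(filenames)
    if not pattern:
        return []  # auto-detect behavior, not sketched here
    prefix = pattern.format(n=case_index)
    return [f for f in filenames if f.startswith(prefix)]
```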
MLflow's autolog claude command generates a Stop hook with bare
"python -c ..." which fails on macOS where only python3 exists.
After running autolog, replace with the absolute path to python3
(preserves virtualenv if active at setup time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When tool interception hooks create a new .claude/settings.json in the
workspace, the project's permission allowlist was lost. Headless
--print mode requires explicit Bash permissions, so skills would fail
with permission denials in the workspace.

Now copies the allow list from the project's .claude/settings.json
into the generated workspace settings alongside the hook config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Long-running skill executions (minutes to hours) need progress
monitoring. Added a Monitoring Progress section to Step 3 with
guidance on tailing output, recognizing phase markers, and detecting
hangs or errors early.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Document {{ outputs }} template variable for LLM judges — without
  it, judges can't see skill output files
- Add batch_pattern to outputs section
- Add outputs dict field reference (files, case_dir, metadata, etc.)
- Add concrete check judge examples (file existence, YAML validation)
- Clarify that check judges must use outputs dict, not filesystem paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
astefanutti and others added 13 commits April 7, 2026 21:07
Use claude-sonnet-4-6 without @20250514 pinning so the API resolves
to the latest version automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract _extract_usage() as a standalone function that parses all
  stream-json result events in one pass (tokens, cost, turns)
- Capture the full model ID from the system/init event (resolved_model)
  so run_result.json shows e.g. claude-opus-4-6-20250514 not just opus
- Return partial usage data on timeout instead of losing it
- Default subagent_model to the resolved full model when not specified

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move comparison instructions to system prompt for clearer separation
- Increase max_tokens to 4096 (pairwise reasoning can be long)
- Add regex fallback to find {"preferred": ...} when JSON parsing fails
- Import re at module level

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add pairwise summary row to scoring table (wins/losses/ties with
  WIN/LOSS/TIE status) instead of a separate section
- Show per-case pairwise badges (WIN/LOSS/TIE) in case headers
- Color case backgrounds by pairwise result (green=win, red=loss,
  yellow=tie)
- Remove standalone _render_pairwise section (now integrated)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assistant events each represent one API turn with per-turn usage
(input_tokens, output_tokens, cache tokens). Counting these gives
accurate totals including inline Skill calls and subagent processes.

Result events only carry cumulative cost, not per-turn breakdowns,
and subagent result events don't always include full usage. The cost
is still taken from the last result event (cumulative in Claude Code).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collect model IDs from assistant events in stream-json to discover
which models were actually used (main + subagents). The subagent_model
field in run_result.json now shows the actual subagent models observed
rather than just the CLI flag value.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
score.py judges now exits with code 1 when any threshold regression
is detected, allowing the eval-run agent to report the failure and
stop before proceeding to further steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The result event's modelUsage field contains per-model token totals
that include all subagent usage. This is more accurate than summing
assistant events, which can miss tokens from background agents that
complete after the main process.

Falls back to assistant event accumulation when modelUsage is absent
(older Claude Code versions or non-stream-json output).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Display the cache hit rate (cache_read / total_input) as a percentage
alongside the raw cache read/write counts. Helps assess prompt caching
efficiency at a glance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When outputs[].path points to a file rather than a directory,
collect.py now handles it correctly instead of failing silently.
This supports eval.yaml configs where an output is a specific
file (e.g. a generated report) rather than a directory of artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Color all case backgrounds green/red based on pass/fail status
  (previously only pairwise cases had colored backgrounds)
- Use fixed 80vh height for HTML iframes instead of JS onload
  auto-sizing (more reliable across browsers)
- Handle single-file shared output paths in _render_shared_outputs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Piping execute.py through tail/grep/head buffers all output and
prevents progress monitoring. The SKILL.md now explicitly instructs
to run the bare command with run_in_background and read the output
file separately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
astefanutti and others added 11 commits April 10, 2026 18:00
New inject_tracing_hook() writes the MLflow Stop hook directly into
the eval workspace's .claude/settings.json before skill execution.
This replaces the manual setup_autolog step in /eval-setup:

- Tracing is now automatic when mlflow_experiment is configured
- The outer project's settings are never modified
- Handles symlinked .claude dirs by replacing with real dir + re-symlinking subdirs
- Resolves python3 path for the hook command

The eval-setup Step 6 is simplified to document that tracing is automatic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mlflow.log_table() expects a dict mapping column names to lists of
values, not a list of row dicts. Convert the row-oriented table_rows
to column-oriented format before logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
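The row-to-column conversion can be sketched as follows (a minimal version of what the commit describes; missing keys are padded with None so all column lists have equal length):

```python
def rows_to_columns(rows):
    """Convert row-oriented dicts to the column-oriented dict shape
    that mlflow.log_table() expects."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order
    return {c: [row.get(c) for row in rows] for c in columns}
```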
claude --print doesn't emit the stdin prompt as a stream-json event,
so downstream consumers (trace builder, MLflow) have no record of
the input. Now injects a synthetic "user" event with the prompt text
at the start of the stream.

Also adds receive timestamps to assistant events that lack them, so
traces have wall-clock timing for every event. Lines are appended
after JSON processing to ensure timestamps are in the stored output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New inject_tracing_env() sets MLFLOW_TRACKING_URI and experiment name
in the workspace settings without adding the Stop hook. The Stop hook
fires for every subagent session, creating fragmented traces. Instead,
traces are created post-hoc by /eval-mlflow from the stream-json log.

Also ensures all MLflow scripts call set_tracking_uri at import time
so they connect to the configured server (not local mlruns/).

inject_tracing_hook is kept for use cases that want per-session traces.
Both functions now accept experiment_name to tag traces correctly.
ensure_server uses /version endpoint (lighter than /experiments/search).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New trace builder in log_results.py creates a single hierarchical
MLflow trace from the stdout.log stream-json events:

- Root AGENT span covering the entire execution
- LLM spans for each assistant response (with model, tokens, cost)
- TOOL spans for tool calls (Bash, Read, Write, Skill, etc.)
- RETRIEVER spans for file reads
- Proper parent-child nesting and wall-clock timing

This replaces the Stop hook approach which created fragmented traces
(one per subagent session). The consolidated trace shows the full
execution flow in MLflow's trace viewer.

Also links experiment traces to the MLflow run via tags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New Step 2 verifies project artifact directories are clean before
workspace setup. Stale artifacts from previous runs (wrong IDs, old
run reports) contaminate results. The preflight script checks all
output paths plus tmp/ state files and the target run directory.

Steps renumbered: old 2-7 become 3-8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
On macOS /tmp symlinks to /private/tmp. Claude Code resolves file
paths to canonical form before matching permission patterns, so
Write(/tmp/foo/**) won't match /private/tmp/foo/bar. Now detects
symlinked prefixes in permission patterns and adds resolved variants.

Also resolves additionalDirectories paths and handles file-extension
output paths (create parent dir instead of treating as directory).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
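The prefix-resolution idea can be sketched roughly as below — an illustrative approximation, not the actual workspace.py code; it only handles patterns whose path part is absolute and appears before the first wildcard:

```python
import os


def add_resolved_variants(patterns):
    """For permission patterns like 'Write(/tmp/foo/**)', append a
    variant whose path prefix is resolved through symlinks (e.g.
    /private/tmp on macOS), since paths match in canonical form."""
    out = list(patterns)
    for pat in patterns:
        if "(" not in pat or not pat.endswith(")"):
            continue
        tool, inner = pat[:-1].split("(", 1)
        prefix = inner.split("*", 1)[0]  # path part before any wildcard
        if not os.path.isabs(prefix):
            continue
        resolved = os.path.realpath(prefix)
        if prefix.endswith(os.sep) and not resolved.endswith(os.sep):
            resolved += os.sep
        if resolved != prefix:
            out.append(f"{tool}({resolved}{inner[len(prefix):]})")
    return out
```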
… parsing

Report: when a baseline is provided, output files under output_paths
were shown twice — once in "Output files" and again in "Baseline diff".
Now excludes output_path files from the output listing when a baseline
diff is rendered. Also handles missing dirs in either side of the diff.

Score: pairwise JSON extraction now uses balanced brace matching instead
of a simple regex, handling nested JSON objects in the LLM response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
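Balanced brace matching for JSON extraction can be sketched like this (not the actual score code; a string-and-escape-aware depth counter that a simple regex can't replicate):

```python
def extract_json_object(text):
    """Return the first balanced {...} object in text, or None.

    Tracks string literals and backslash escapes so braces inside
    JSON strings don't throw off the depth count (sketch).
    """
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text[start:], start):
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return text[start:i + 1]
    return None  # unbalanced or no object found
```

For example, `extract_json_object('verdict: {"a": {"b": 1}} done')` returns the nested object that a non-greedy `\{.*?\}` regex would cut short at the first `}`.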
Background agents write their conversation to .jsonl files that
Claude Code deletes when the session ends. The runner now:

- Tracks output file paths from tool_result events (agentId + path)
- Reads files in-flight on task_notification (while process is alive)
- Does a final sweep after process exit via resolved symlinks and
  .claude/projects/ session directory
- Saves captured outputs to <run_dir>/subagents/<agentId>.jsonl

Session persistence is kept on (--no-session-persistence removed)
so subagent files survive long enough to be captured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
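The path-tracking step can be sketched as a small accumulator; the event field names (`agentId`, `outputFile`) are assumptions for illustration, not the actual tool_result schema:

```python
def track_subagent_outputs(events):
    """Map agentId -> .jsonl output paths seen in tool_result events,
    so the runner can read them before the session cleans them up.

    Sketch only: "agentId" and "outputFile" are hypothetical field
    names standing in for the real event shape.
    """
    outputs = {}
    for event in events:
        if event.get("type") != "tool_result":
            continue
        agent_id = event.get("agentId")
        path = event.get("outputFile")
        if agent_id and path and path.endswith(".jsonl"):
            outputs.setdefault(agent_id, []).append(path)
    return outputs
```

The final sweep would then iterate this map, trying the original path, the saved copy under <run_dir>/subagents/, and the session directory in turn.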
Trace builder improvements:

- Resolve subagent output files from multiple sources: original path,
  saved copies in <run_dir>/subagents/, resolved symlinks, and
  .claude/projects/ session directory
- Include LLM reasoning text blocks as LLM-type spans alongside tool
  call spans in subagent children (previously only tool calls were
  captured)
- Fix child timestamp iteration to handle mixed span types (tool spans
  have tuids for timing, LLM spans don't)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Capture tool output text (truncated to 500 chars) and attach it to
TOOL/AGENT spans as outputs. Move LLM reasoning text from inputs to
outputs and inject recent tool context into LLM span inputs so the
trace shows what the model was responding to.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kami619

kami619 commented Apr 13, 2026

@astefanutti

I ran the eval harness against the https://github.com/fege/test-plan/tree/main/.claude/skills/test-plan.create skill. The end-to-end run showed the framework works and provides useful scoring/regression signals, but it also surfaced 9 pipeline-skill gaps (notably per-case execution, case file provisioning, args templating, dynamic output collection, and {{ inputs }} resolution).

Based on our offline chat, I think it's best to merge and fix the needed issues by priority.

@astefanutti
Contributor Author

@kami619 thanks, this is very useful feedback. I agree this PR has already grown quite large, so it's better to merge it and iterate on the issues your analysis has identified.

@astefanutti astefanutti merged commit 3fccefa into main Apr 13, 2026
1 check passed
@astefanutti astefanutti deleted the pr-01 branch April 13, 2026 13:26