
Agent Readiness Self-Assessment Worksheet

Use this worksheet before adding new features to your agent.
It takes 5–10 minutes. The goal is to find your weakest link — not to score well.


How to Use This

  1. Read each dimension.
  2. Circle or mark the description that best fits your system right now.
  3. Be honest. "Partial" is a real answer.
  4. Find the dimension where you marked Low — that's where to focus next.

If you have more than one Low, fix the one that would hurt most in production.


The Dimensions


1. Mental Model

Do you treat the agent as a smart prompt, or as a system that will evolve?

🔴 Low One large prompt doing everything. Behavior changes are confusing or surprising.
🟡 Partial Some separation of concerns, but it's informal or inconsistent.
🟢 Ready Clear separation between reasoning, tools, and state. Change is expected and planned for.

My rating: _______________

Notes / what's missing:
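One hypothetical shape for the "Ready" end of this dimension, sketched in Python. Every name here is illustrative, not a prescribed architecture; the point is only that reasoning, tools, and state live in separate layers, so a change to one does not surprise the others.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """State lives in one place, separate from reasoning and tools."""
    history: list = field(default_factory=list)

def reason(state, user_input):
    """Reasoning decides *what* to do; it never touches tools directly.
    (A real system would call a model here; this just returns a plan.)"""
    return {"tool": "echo", "args": {"text": user_input}}

def act(plan, tools):
    """Tools are a separate layer, looked up by name from a registry."""
    return tools[plan["tool"]](**plan["args"])

def step(state, user_input, tools):
    """One agent step: plan, act, then record the outcome in state."""
    plan = reason(state, user_input)
    result = act(plan, tools)
    state.history.append((user_input, result))
    return result
```

With this split, "behavior changed" narrows to one of three questions: did the plan change, did a tool change, or did the state?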


2. Replaceability

Can you swap parts of the agent faster than you can debug them?

🔴 Low Model, prompt, and logic are tightly coupled. Touching "working" parts feels risky.
🟡 Partial Some parts are swappable, but others are deeply entangled.
🟢 Ready Prompts and models are treated as swappable. Replacement is cheaper than repair.

My rating: _______________

Notes / what's missing:
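A minimal sketch of what "swappable" can mean in practice, assuming nothing about your stack: the model client and the prompt are injected as parameters rather than hard-coded, so either can be replaced without touching agent logic. `model_call` here is a stand-in for any function from prompt text to response text.

```python
def make_agent(model_call, prompt_template):
    """Build an agent from an injected model and prompt.
    Swapping either one means passing a different argument,
    not editing 'working' code."""
    def agent(question):
        return model_call(prompt_template.format(question=question))
    return agent
```

Usage: `make_agent(fake_model, "Q: {question}")` and `make_agent(fake_model, "Please answer: {question}")` are two agents that differ only in configuration, which is what makes replacement cheaper than repair.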


3. Failure Legibility

When the agent fails, do you know why?

🔴 Low Failures are silent. Explanations are "it just didn't work."
🟡 Partial Some failures are visible, but others disappear without a trace.
🟢 Ready Failures leave artifacts (logs, traces, outputs). You can tell whether the model failed or the environment did.

My rating: _______________

Notes / what's missing:
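"Failures leave artifacts" can be as small as this sketch: wrap each agent step so an exception produces a structured record instead of vanishing. The names are hypothetical, and `artifact_log` is any sink (a file or trace store would work the same way).

```python
import traceback

def run_step(step_name, fn, *args, artifact_log=None):
    """Run one agent step; on failure, record a structured artifact
    instead of letting the error disappear without a trace."""
    if artifact_log is None:
        artifact_log = []
    try:
        result = fn(*args)
        artifact_log.append({"step": step_name, "status": "ok"})
        return result, artifact_log
    except Exception as exc:
        artifact_log.append({
            "step": step_name,
            "status": "failed",
            "error": repr(exc),
            "traceback": traceback.format_exc(),
        })
        return None, artifact_log
```

After a bad run, the log answers "which step, and why" instead of "it just didn't work."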


4. Observability

Can you see what the agent actually did, step by step?

🔴 Low Only final outputs are visible. Debugging relies on intuition and guesswork.
🟡 Partial Some steps are logged, but intermediate decisions are opaque.
🟢 Ready Intermediate decisions and tool calls are visible. You can compare runs over time.

My rating: _______________

Notes / what's missing:
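A toy version of "intermediate decisions and tool calls are visible", assuming an in-memory trace; a real system would persist this to a trace store. The event shape is illustrative, but the key property is that two runs produce comparable summaries.

```python
import time

class RunTrace:
    """Minimal run trace: records each decision and tool call so a run
    can be inspected step by step and compared against other runs."""
    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def record(self, kind, name, detail=""):
        """Append one event: kind is e.g. 'decision' or 'tool_call'."""
        self.events.append({"kind": kind, "name": name,
                            "detail": detail, "ts": time.time()})

    def summary(self):
        """Compact (kind, name) sequence, suitable for diffing runs."""
        return [(e["kind"], e["name"]) for e in self.events]
```

Comparing `summary()` output across runs turns "debugging by intuition" into "diff the two traces."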


5. Tool Boundaries

What happens when a tool changes or misbehaves?

🔴 Low Assumption that tools always work. Agent confidently reports success even when tools fail.
🟡 Partial Some tool failures are caught, but not consistently handled.
🟢 Ready Tool failures are explicitly handled. "Agent logic failed" is distinguishable from "environment failed."

My rating: _______________

Notes / what's missing:
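One way to make "agent logic failed" distinguishable from "environment failed" is two exception classes at the tool boundary. This is a sketch with hypothetical names, and the heuristic (a `TypeError` at the call site means bad arguments) is deliberately crude:

```python
class ToolError(Exception):
    """The environment failed: the tool itself raised."""

class AgentLogicError(Exception):
    """The agent failed: unknown tool or invalid arguments."""

def call_tool(tool, args, registry):
    """Dispatch a tool call, keeping the two failure classes separate.
    `registry` maps tool names to callables."""
    if tool not in registry:
        raise AgentLogicError(f"unknown tool: {tool}")
    try:
        return registry[tool](**args)
    except TypeError as exc:
        # Wrong or missing arguments are the agent's fault, not the tool's.
        raise AgentLogicError(f"bad arguments for {tool}: {exc}") from exc
    except Exception as exc:
        # Anything else is the environment misbehaving.
        raise ToolError(f"{tool} failed: {exc}") from exc
```

The payoff is downstream: a retry policy can treat `ToolError` as transient while `AgentLogicError` goes straight to a human or a prompt fix.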


6. Cost Awareness

Do you know what this agent costs to run — and why?

🔴 Low Costs are only noticed after surprise bills or problems. No limits in place.
🟡 Partial Rough sense of cost, but no explicit limits or per-behavior tracking.
🟢 Ready Explicit cost limits exist. You know which behaviors are expensive and why.

My rating: _______________

Notes / what's missing:
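"Explicit cost limits" can start as simply as this sketch: a per-run budget that tracks spend by behavior and refuses to continue past a cap. The per-token prices in the test are illustrative, not real rates.

```python
class CostBudget:
    """Explicit per-run cost cap with per-behavior tracking,
    so overruns surface during the run rather than on the bill."""
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.by_behavior = {}

    def charge(self, behavior, tokens, usd_per_1k_tokens):
        """Record the cost of one call; raise once the cap is exceeded."""
        cost = tokens / 1000 * usd_per_1k_tokens
        self.spent_usd += cost
        self.by_behavior[behavior] = self.by_behavior.get(behavior, 0.0) + cost
        if self.spent_usd > self.limit_usd:
            raise RuntimeError(
                f"budget exceeded: ${self.spent_usd:.4f} > ${self.limit_usd:.2f}")
        return cost
```

The `by_behavior` breakdown is what answers "which behaviors are expensive and why", not just "how much did we spend."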


7. Drift Awareness

Do you expect behavior to change over time?

🔴 Low "It used to work" is a common explanation. Changes are patched reactively.
🟡 Partial Drift is acknowledged, but there's no systematic way to detect or explain it.
🟢 Ready Behavior change is expected. You can detect drift and explain what changed.

My rating: _______________

Notes / what's missing:
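A toy stand-in for systematic drift detection: compare current behavior rates against a stored baseline and flag anything that moved more than a tolerance. Real drift detection is statistical; this only illustrates the shape of "detect drift and explain what changed."

```python
def detect_drift(baseline, current, tolerance=0.1):
    """Flag behaviors whose observed rate moved more than `tolerance`
    from the stored baseline, returning (baseline, current) pairs
    so the report says *what* changed, not just *that* it changed."""
    drifted = {}
    for key, base_rate in baseline.items():
        cur = current.get(key, 0.0)
        if abs(cur - base_rate) > tolerance:
            drifted[key] = (base_rate, cur)
    return drifted
```

Even this crude check replaces "it used to work" with "refusal rate went from 5% to 30% between these two snapshots."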


8. Human-in-the-Loop

Where do humans step in — and why?

🔴 Low Full autonomy by default. Humans only get involved when things break badly.
🟡 Partial Some handoff points exist, but they're informal or inconsistent.
🟢 Ready Clear handoff points are defined. Humans act as a stabilizing mechanism, not a last resort.

My rating: _______________

Notes / what's missing:
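A defined handoff point can be one function, as in this sketch: actions above a risk threshold go to a human approver instead of running autonomously. `approve` here is any callable returning True or False, a stand-in for a real review step; the threshold and risk scores are whatever your system defines.

```python
def execute(action, risk, approve, risk_threshold=0.5):
    """Run low-risk actions autonomously; escalate risky ones to a
    human before executing. The human is a designed-in gate, not a
    last resort after something breaks."""
    if risk >= risk_threshold and not approve(action):
        return ("escalated", action)
    return ("executed", action)
```

The important property is that the gate is in the code path, so "where do humans step in" has one answer instead of many informal ones.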


Summary

Fill this in after rating all dimensions.

Dimension — Rating
1. Mental Model: _______________
2. Replaceability: _______________
3. Failure Legibility: _______________
4. Observability: _______________
5. Tool Boundaries: _______________
6. Cost Awareness: _______________
7. Drift Awareness: _______________
8. Human-in-the-Loop: _______________

My lowest-readiness dimension: _______________

What I'll address before the next feature: _______________


A Note on Scoring

There's no passing score.
There's no certification.

A single Low on the right dimension — especially Failure Legibility or Observability — will cause more pain than five Partials elsewhere.

Fix your weakest link first. Then reassess.


Part of the Agent Readiness Rubric.