Use this worksheet before adding new features to your agent.
It takes 5–10 minutes. The goal is to find your weakest link — not to score well.
- Read each dimension.
- Circle or mark the description that best fits your system right now.
- Be honest. "Partial" is a real answer.
- Find the dimension where you marked Low — that's where to focus next.
If you have more than one Low, fix the one that would hurt most in production.
Do you treat the agent as a smart prompt, or as a system that will evolve?
| Rating | Description |
|---|---|
| 🔴 Low | One large prompt doing everything. Behavior changes are confusing or surprising. |
| 🟡 Partial | Some separation of concerns, but it's informal or inconsistent. |
| 🟢 Ready | Clear separation between reasoning, tools, and state. Change is expected and planned for. |
My rating: _______________
Notes / what's missing:
Can you swap parts of the agent faster than you can debug them?
| Rating | Description |
|---|---|
| 🔴 Low | Model, prompt, and logic are tightly coupled. Touching "working" parts feels risky. |
| 🟡 Partial | Some parts are swappable, but others are deeply entangled. |
| 🟢 Ready | Prompts and models are treated as swappable. Replacement is cheaper than repair. |
My rating: _______________
Notes / what's missing:
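One way to approach the "Ready" row is to treat the model and prompt as data rather than code, so swapping either is a configuration change. A minimal sketch; `AgentConfig` and `build_prompt` are illustrative names, not from any particular framework:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Everything that should be swappable without touching agent logic."""
    model_name: str
    prompt_template: str


def build_prompt(config: AgentConfig, task: str) -> str:
    # Agent logic depends only on the config interface, so a new model
    # or prompt is a new AgentConfig value, not a code change.
    return config.prompt_template.format(task=task)


v1 = AgentConfig(model_name="model-a",
                 prompt_template="You are a helpful agent. Task: {task}")
v2 = AgentConfig(model_name="model-b",
                 prompt_template="Plan first, then act. Task: {task}")

# Swapping configs changes behavior without editing build_prompt.
print(build_prompt(v1, "summarize the report"))
print(build_prompt(v2, "summarize the report"))
```

The test of replaceability is whether `v2` can go live without anyone re-reading the code that consumes it.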
When the agent fails, do you know why?
| Rating | Description |
|---|---|
| 🔴 Low | Failures are silent. Explanations are "it just didn't work." |
| 🟡 Partial | Some failures are visible, but others disappear without a trace. |
| 🟢 Ready | Failures leave artifacts (logs, traces, outputs). You can tell whether the model failed or the environment did. |
My rating: _______________
Notes / what's missing:
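"Failures leave artifacts" can be as small as a wrapper that writes a structured record before re-raising. A minimal sketch, assuming a JSONL log file; `run_step` and the field names are illustrative:

```python
import json
import time
import traceback


def run_step(step_name, fn, log_path="agent_failures.jsonl"):
    """Run one agent step; on failure, write an artifact instead of failing silently."""
    try:
        return fn()
    except Exception as exc:
        record = {
            "step": step_name,
            "time": time.time(),
            "error_type": type(exc).__name__,
            "error": str(exc),
            "trace": traceback.format_exc(),
        }
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        raise  # still fail, but now with something to inspect
```

In the "Ready" state, a record like this is what lets you answer "which step failed, and how" after the fact instead of saying "it just didn't work."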
Can you see what the agent actually did, step by step?
| Rating | Description |
|---|---|
| 🔴 Low | Only final outputs are visible. Debugging relies on intuition and guesswork. |
| 🟡 Partial | Some steps are logged, but intermediate decisions are opaque. |
| 🟢 Ready | Intermediate decisions and tool calls are visible. You can compare runs over time. |
My rating: _______________
Notes / what's missing:
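Step-by-step visibility doesn't require heavy tooling; an append-only trace per run is enough to start. A minimal sketch, with `RunTrace` and the event shape as illustrative assumptions:

```python
import json
import time


class RunTrace:
    """Append-only record of what an agent actually did, step by step."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def record(self, kind, **detail):
        self.events.append({"run_id": self.run_id, "kind": kind,
                            "time": time.time(), **detail})

    def dump(self):
        return json.dumps(self.events, indent=2)


trace = RunTrace("run-001")
trace.record("decision", thought="need current data, call search tool")
trace.record("tool_call", tool="search", args={"query": "q3 revenue"})
trace.record("tool_result", tool="search", ok=True)
# Persist trace.dump() per run so runs can be diffed over time.
```

The point of keeping traces, not just final outputs, is that "compare runs over time" becomes a diff rather than a guess.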
What happens when a tool changes or misbehaves?
| Rating | Description |
|---|---|
| 🔴 Low | Tools are assumed to always work. The agent confidently reports success even when tools fail. |
| 🟡 Partial | Some tool failures are caught, but not consistently handled. |
| 🟢 Ready | Tool failures are explicitly handled. "Agent logic failed" is distinguishable from "environment failed." |
My rating: _______________
Notes / what's missing:
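The "Ready" distinction between agent failure and environment failure can be made mechanical by splitting the error types at the tool boundary. A minimal sketch; the exception names and the `required_args`/`run` tool shape are assumptions, not a real tool protocol:

```python
class ToolError(Exception):
    """The environment failed: API down, timeout, malformed response."""


class AgentLogicError(Exception):
    """The agent failed: bad arguments, impossible request."""


def call_tool(tool, args: dict):
    # Validate the agent's request first, so a bad call is blamed on
    # agent logic rather than on the tool.
    required = getattr(tool, "required_args", set())
    missing = required - args.keys()
    if missing:
        raise AgentLogicError(f"missing arguments: {sorted(missing)}")
    try:
        return tool.run(**args)
    except Exception as exc:
        # Never report success when the environment failed.
        name = getattr(tool, "name", type(tool).__name__)
        raise ToolError(f"{name} failed: {exc}") from exc
```

With this split, a spike in `AgentLogicError` means the prompt or planning changed; a spike in `ToolError` means the environment did.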
Do you know what this agent costs to run — and why?
| Rating | Description |
|---|---|
| 🔴 Low | Costs are only noticed after surprise bills or problems. No limits in place. |
| 🟡 Partial | Rough sense of cost, but no explicit limits or per-behavior tracking. |
| 🟢 Ready | Explicit cost limits exist. You know which behaviors are expensive and why. |
My rating: _______________
Notes / what's missing:
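"Explicit limits" and "per-behavior tracking" can live in one small object. A minimal sketch, assuming you can attribute a dollar cost to each behavior; `CostBudget` is a hypothetical name:

```python
class CostBudget:
    """Track spend per behavior and stop before the surprise bill."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent = {}  # behavior name -> USD spent so far

    def charge(self, behavior, usd):
        self.spent[behavior] = self.spent.get(behavior, 0.0) + usd
        if sum(self.spent.values()) > self.limit_usd:
            raise RuntimeError(
                f"cost limit ${self.limit_usd} exceeded after '{behavior}'")

    def most_expensive(self):
        return max(self.spent, key=self.spent.get)


budget = CostBudget(limit_usd=1.00)
budget.charge("planning", 0.10)
budget.charge("web_search", 0.40)
print(budget.most_expensive())  # prints "web_search"
```

The per-behavior breakdown is what answers the "and why" half of the question: it names which behaviors are expensive, not just that the total is high.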
Do you expect behavior to change over time?
| Rating | Description |
|---|---|
| 🔴 Low | "It used to work" is a common explanation. Changes are patched reactively. |
| 🟡 Partial | Drift is acknowledged, but there's no systematic way to detect or explain it. |
| 🟢 Ready | Behavior change is expected. You can detect drift and explain what changed. |
My rating: _______________
Notes / what's missing:
Where do humans step in — and why?
| Rating | Description |
|---|---|
| 🔴 Low | Full autonomy by default. Humans only get involved when things break badly. |
| 🟡 Partial | Some handoff points exist, but they're informal or inconsistent. |
| 🟢 Ready | Clear handoff points are defined. Humans act as a stabilizing mechanism, not a last resort. |
My rating: _______________
Notes / what's missing:
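A "clear handoff point" can be a single defined rule rather than an ad hoc rescue. A minimal sketch, assuming the agent can report a confidence score and whether an action is reversible; the function and threshold are illustrative:

```python
def next_action(proposed_action, confidence, reversible, threshold=0.8):
    """Decide whether the agent proceeds or hands off to a human.

    Handoff is a designed mechanism, not a last resort: low confidence
    or irreversible actions route to a person on purpose.
    """
    if confidence < threshold or not reversible:
        return ("handoff_to_human", proposed_action)
    return ("execute", proposed_action)


# High confidence, reversible: the agent proceeds.
print(next_action("send draft reply", confidence=0.95, reversible=True))
# Irreversible action: a human reviews it, regardless of confidence.
print(next_action("delete customer record", confidence=0.99, reversible=False))
```

The value is less in the specific rule than in the fact that the handoff condition is written down and testable.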
Fill this in after rating all dimensions.
| Dimension | Rating |
|---|---|
| 1. Mental Model | |
| 2. Replaceability | |
| 3. Failure Legibility | |
| 4. Observability | |
| 5. Tool Boundaries | |
| 6. Cost Awareness | |
| 7. Drift Awareness | |
| 8. Human-in-the-Loop | |
My lowest-readiness dimension: _______________
What I'll address before the next feature: _______________
There's no passing score.
There's no certification.
A single Low on the right dimension — especially Failure Legibility or Observability — will cause more pain than five Partials elsewhere.
Fix your weakest link first. Then reassess.
Part of the Agent Readiness Rubric.