afterdemo is a diagnostic rubric for AI agent builders.
It helps you find what breaks after the demo works — before your users do.
Start the Worksheet → · Browse Failure Archetypes → · Read the Rubric ↓
Your agent works in the demo. Behavior is plausible, outputs look right, the stakeholders are impressed. You ship it.
Three weeks later: a support ticket, a surprise bill, a behavior change nobody can explain, a task that completed "successfully" but produced the wrong result. You dig in. The logs are clean. The model didn't crash. Something subtler went wrong — and you're not sure where to look.
This is the demo-to-production gap. It's not a model problem. It's a system problem.
Most builders respond by tweaking prompts. This rarely helps. The real issues are structural — no visibility into what the agent actually did, no boundary between "agent failed" and "tool failed," no expectation that behavior drifts over time. Prompts don't fix structure.
afterdemo gives you a vocabulary and a checklist for the structural issues — before they teach you the hard way.
| File | What it does | Time |
|---|---|---|
| README.md | The rubric — 8 diagnostic dimensions | 10 min read |
| WORKSHEET.md | Self-assessment you fill out before adding features | 5–10 min |
| FAILURE_ARCHETYPES.md | 7 named failure patterns with first actions | Reference |
| CONTRIBUTING.md | How to contribute — small, opinionated, practical | 3 min read |
Something just broke? Go to FAILURE_ARCHETYPES.md first.
Planning your next feature? Run the WORKSHEET.md before you start.
Eight dimensions. For each one, the low-readiness description is what to watch for — it's usually where the trouble is coming from.
**Mental Model:** Do you treat the agent as a smart prompt, or as a system that will evolve?
🔴 Low — One large prompt doing everything. Behavior changes are confusing or surprising when they happen.
🟢 Ready — Clear separation between reasoning, tools, and state. Change is expected and planned for, not reacted to.
**Replaceability:** Can you swap parts of the agent faster than you can debug them?
🔴 Low — Model, prompt, and logic are tightly coupled. Nobody wants to touch the parts that are "working."
🟢 Ready — Prompts and models are treated as swappable components. Replacement is cheaper than repair.
**Failure Legibility:** When the agent fails, do you know why?
🔴 Low — Failures are silent or opaque. You can't tell if the model reasoned poorly or if something in the environment went wrong.
🟢 Ready — Failures leave clear artifacts — logs, traces, outputs. You can distinguish a reasoning error from a tool error from an environment error.
**Observability:** Can you see what the agent actually did, step by step?
🔴 Low — Only final outputs are visible. Debugging means staring at the result and guessing how it got there.
🟢 Ready — Intermediate decisions and tool calls are logged. You can compare runs over time and see exactly where a run went sideways.
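Step-level logging doesn't need a platform. A minimal sketch, assuming nothing about your stack: append one structured record per decision or tool call to a JSONL file. The field names and file path here are illustrative, not a prescribed schema.

```python
import json
import time


def trace_step(run_id, step, action, args, result, path="trace.jsonl"):
    """Append one structured record per agent decision or tool call.

    JSONL is append-only and greppable, so comparing runs over time
    is just diffing two files.
    """
    record = {
        "run": run_id,
        "step": step,
        "t": time.time(),
        "action": action,
        "args": args,
        # Truncate results so the trace stays readable.
        "result": repr(result)[:200],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

With a trace like this, "where did the run go sideways?" becomes a question you can answer by reading, not guessing.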
**Tool Boundaries:** What happens when a tool changes or misbehaves?
🔴 Low — Tools are assumed to always work. When they don't, the agent often reports success anyway.
🟢 Ready — Tool failures are explicitly handled. There is a clear distinction between "the agent's reasoning failed" and "the environment failed."
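One way to make that distinction explicit in code, sketched here with illustrative names (these error classes aren't from any particular framework): wrap every tool call so an environment failure surfaces as its own error type, separate from the model reasoning badly.

```python
class ToolError(Exception):
    """The environment failed: API down, timeout, malformed response."""


class ReasoningError(Exception):
    """The model produced an invalid plan or a malformed tool call."""


def call_tool(tool, args):
    """Run a tool, converting any failure into an explicit ToolError.

    Past this boundary, the agent can never silently report success
    when the environment actually failed.
    """
    try:
        return tool(**args)
    except Exception as exc:
        raise ToolError(f"{tool.__name__} failed: {exc}") from exc
```

Now your error handler can branch on type: retry or alert on `ToolError`, re-prompt or escalate on `ReasoningError`.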
**Cost Awareness:** Do you know what this agent costs to run — and why?
🔴 Low — Cost is only noticed after a surprise bill. No per-run budgets or limits exist.
🟢 Ready — Explicit cost limits exist in code. You know which behaviors are expensive and can see cost growing in real time, not after the fact.
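A hard per-run budget can be a few lines. A minimal sketch (`RunBudget` is an illustrative name, and the cap values are examples, not recommendations):

```python
class BudgetExceeded(Exception):
    """Raised the moment a run crosses its spend cap."""


class RunBudget:
    """Hard per-run spend cap, enforced in code rather than hoped for."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        """Record spend for one model or tool call; fail loudly over cap."""
        self.spent += usd
        if self.spent > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.4f} of ${self.max_usd:.2f} cap"
            )
```

Call `charge()` after every metered operation. The point isn't the accounting precision; it's that overspend becomes an error you see during the run, not on the invoice.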
**Drift Awareness:** Do you expect behavior to change over time?
🔴 Low — "It used to work" is a common explanation. Changes are patched reactively without understanding the root cause.
🟢 Ready — Behavior change is treated as an expected property of the system. You have a baseline and can detect when it shifts.
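A crude but workable baseline: fingerprint the agent's outputs on a pinned set of test inputs and compare across runs. A sketch with illustrative names — and a real caveat: exact-match hashing only works if runs are deterministic (e.g. temperature 0); otherwise compare structured fields or scores instead of raw text.

```python
import hashlib
import json


def fingerprint(outputs: list) -> str:
    """Stable hash of agent outputs on a pinned eval set."""
    return hashlib.sha256(
        json.dumps(outputs, sort_keys=True).encode()
    ).hexdigest()


def check_drift(baseline_fp: str, current_outputs: list) -> bool:
    """True if behavior has shifted from the recorded baseline."""
    return fingerprint(current_outputs) != baseline_fp
```

Record the fingerprint when you ship; rerun the pinned inputs on a schedule. "It just started behaving differently" becomes a dated, diffable event.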
**Human-in-the-Loop:** Where do humans step in — and why?
🔴 Low — Full autonomy by default. Humans are only involved when something has already broken badly.
🟢 Ready — Clear handoff points are defined and designed. Humans act as a stabilizing mechanism — not a last resort.
1. Pick one agent you're building or maintaining right now.
2. Read each dimension above.
3. For each one, ask honestly: does the low-readiness description fit?
4. Find your lowest-readiness dimension.
5. Fix that before adding any new features.
Most agent pain comes from ignoring the weakest link. The full worksheet takes 5–10 minutes and ends with a single action item: WORKSHEET.md →
These are the patterns that appear most often in production agent systems. If something just broke, start here.
| Archetype | The symptom | Rubric signal |
|---|---|---|
| 🔁 The Ghost Loop | Running for ten minutes. Nothing completing. | Failure Legibility · Cost Awareness |
| 🤥 The Confident Liar | The agent said it worked. It didn't. | Tool Boundaries · Observability |
| 📚 The Prompt Avalanche | Every fix adds more instructions. Nobody understands it anymore. | Mental Model · Replaceability |
| 👻 The Phantom Success | Logs are clean. The output is completely wrong. | Observability · Tool Boundaries |
| 🌀 The Drifter | Nothing changed. It just started behaving differently. | Drift Awareness · Observability |
| 💸 The Runaway Tab | The bill was enormous. Nobody noticed until the invoice. | Cost Awareness · Human-in-the-Loop |
| 🧊 The Frozen Handoff | It worked in testing. In production, humans ignore it. | Human-in-the-Loop · Failure Legibility |
Each archetype includes a rubric dimension mapping and a concrete "what to do first" — something actionable today, without a new platform or framework.
"Our agent keeps looping and burning tokens."
Run it through the rubric:
- Failure Legibility 🔴 — the loop produces no error signal
- Observability 🔴 — no step-level view of what the agent is doing
- Cost Awareness 🔴 — spend grows silently with every iteration
The problem isn't the model's reasoning. It's the missing system visibility and boundaries.
This is The Ghost Loop. The first action: add a hard iteration cap in code — not in the prompt — and log every step. That turns a mystery into a recoverable error.
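That first action can be a minimal sketch like this, assuming a simple step-based agent loop (`run_agent`, `MAX_STEPS`, and the `step_fn` contract are illustrative, not a prescribed API):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

MAX_STEPS = 20  # hard cap lives in code, not in the prompt


def run_agent(step_fn, state):
    """Drive an agent loop with a hard iteration cap and per-step logging.

    step_fn takes the current state and returns (new_state, done).
    """
    for i in range(MAX_STEPS):
        log.info("step %d: state=%r", i, state)
        state, done = step_fn(state)
        if done:
            return state
    # The loop is now a visible, recoverable error instead of a silent burn.
    raise RuntimeError(f"agent exceeded {MAX_STEPS} steps without finishing")
```

A prompt-level instruction like "stop after 20 steps" can be ignored; a `range(MAX_STEPS)` cannot.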
Diagnosis over prescription. The goal is to help you see the structural problem clearly. Solutions depend on your stack and constraints — afterdemo doesn't pretend to know those.
No scoring math. Weighted averages create false precision and invite gaming. The weakest link matters more than the average.
No vendor guidance. Tool recommendations go stale, create bias, and distract from structural thinking. afterdemo is tool-agnostic by design.
Readable in one sitting. If it takes more than 15 minutes to absorb the whole thing, it's too complex to be useful at the moment it's needed.
Builder-first. A solo builder should be able to apply every part of this without a team, a platform, or a budget.
Contributions are welcome — but only if they make the rubric more useful to builders trying to understand why their agents feel brittle.
Good fit: clearer wording, better examples, new failure archetypes grounded in real patterns, sharper diagnostic signals for existing dimensions.
Not a fit: scoring systems, vendor-specific guidance, governance frameworks, anything that makes this take longer than 15 minutes to absorb.
See CONTRIBUTING.md for full guidelines.
Apache 2.0 — free to use, modify, and share, including in commercial and educational contexts, with attribution.
You don't "solve" agents. You build systems that can fail visibly, change safely, and improve over time.
afterdemo is a tool to help you do that — before the system teaches you the hard way.