
afterdemo is a diagnostic rubric for AI agent builders.
It helps you find what breaks after the demo works — before your users do.




Start the Worksheet →  ·  Browse Failure Archetypes →  ·  Read the Rubric ↓


The Problem

Your agent works in the demo. Behavior is plausible, outputs look right, the stakeholders are impressed. You ship it.

Three weeks later: a support ticket, a surprise bill, a behavior change nobody can explain, a task that completed "successfully" but produced the wrong result. You dig in. The logs are clean. The model didn't crash. Something subtler went wrong — and you're not sure where to look.

This is the demo-to-production gap. It's not a model problem. It's a system problem.

Most builders respond by tweaking prompts. This rarely helps. The real issues are structural — no visibility into what the agent actually did, no boundary between "agent failed" and "tool failed," no expectation that behavior drifts over time. Prompts don't fix structure.

afterdemo gives you a vocabulary and a checklist for the structural issues — before they teach you the hard way.


What's In This Repo

| File | What it does | Time |
| --- | --- | --- |
| README.md | The rubric — 8 diagnostic dimensions | 10 min read |
| WORKSHEET.md | Self-assessment you fill out before adding features | 5–10 min |
| FAILURE_ARCHETYPES.md | 7 named failure patterns with first actions | Reference |
| CONTRIBUTING.md | How to contribute — small, opinionated, practical | 3 min read |

Something just broke? Go to FAILURE_ARCHETYPES.md first.
Planning your next feature? Run the WORKSHEET.md before you start.


The Rubric

Eight dimensions. For each, the low-readiness description is what to watch for — it is usually where trouble starts.


1 · Mental Model

Do you treat the agent as a smart prompt, or as a system that will evolve?

🔴 Low — One large prompt doing everything. Behavior changes are confusing or surprising when they happen.

🟢 Ready — Clear separation between reasoning, tools, and state. Change is expected and planned for, not reacted to.


2 · Replaceability

Can you swap parts of the agent faster than you can debug them?

🔴 Low — Model, prompt, and logic are tightly coupled. Nobody wants to touch the parts that are "working."

🟢 Ready — Prompts and models are treated as swappable components. Replacement is cheaper than repair.


3 · Failure Legibility

When the agent fails, do you know why?

🔴 Low — Failures are silent or opaque. You can't tell if the model reasoned poorly or if something in the environment went wrong.

🟢 Ready — Failures leave clear artifacts — logs, traces, outputs. You can distinguish a reasoning error from a tool error from an environment error.


4 · Observability

Can you see what the agent actually did, step by step?

🔴 Low — Only final outputs are visible. Debugging means staring at the result and guessing how it got there.

🟢 Ready — Intermediate decisions and tool calls are logged. You can compare runs over time and see exactly where a run went sideways.
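At the "Ready" end, logging can be as simple as appending one structured record per decision or tool call. A minimal sketch — the `log_step` helper and the JSONL file layout are illustrative assumptions, not part of afterdemo:

```python
import json
import time


def log_step(run_id, step, kind, payload, path="agent_runs.jsonl"):
    """Append one structured record per agent decision or tool call."""
    record = {
        "run_id": run_id,
        "step": step,
        "kind": kind,        # e.g. "reasoning", "tool_call", "tool_result"
        "payload": payload,
        "ts": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

One append-only file per environment is enough to diff two runs and see exactly where they diverged.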


5 · Tool Boundaries

What happens when a tool changes or misbehaves?

🔴 Low — Tools are assumed to always work. When they don't, the agent often reports success anyway.

🟢 Ready — Tool failures are explicitly handled. There is a clear distinction between "the agent's reasoning failed" and "the environment failed."
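One way to make that distinction concrete is a thin wrapper around every tool call, so environment failures surface under their own exception type instead of blending into model behavior. A sketch — `ToolError` and `call_tool` are hypothetical names, not an afterdemo API:

```python
class ToolError(Exception):
    """The environment failed: network, API, permissions. Not the model."""


def call_tool(tool, *args, **kwargs):
    """Wrap a tool call so environment failures can never be
    mistaken for (or reported as) agent success."""
    try:
        return tool(*args, **kwargs)
    except Exception as exc:
        raise ToolError(f"{tool.__name__} failed: {exc}") from exc
```

With this boundary in place, a `ToolError` in your logs means "fix the environment," while everything else means "look at the reasoning."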


6 · Cost Awareness

Do you know what this agent costs to run — and why?

🔴 Low — Cost is only noticed after a surprise bill. No per-run budgets or limits exist.

🟢 Ready — Explicit cost limits exist in code. You know which behaviors are expensive and can see cost growing in real time, not after the fact.
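An explicit in-code limit can be as small as a counter that fails fast. A minimal sketch, assuming you can estimate per-call cost — the class and method names are hypothetical:

```python
class BudgetExceeded(Exception):
    """Raised the moment a run crosses its spending cap."""


class RunBudget:
    """Track spend for a single run and fail fast at the limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd):
        """Record one model or tool call's estimated cost."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"run spent ${self.spent_usd:.4f}, limit ${self.limit_usd:.2f}"
            )
```

The point is not the accounting precision. It is that the cap lives in code, so a runaway run dies in seconds rather than on next month's invoice.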


7 · Drift Awareness

Do you expect behavior to change over time?

🔴 Low — "It used to work" is a common explanation. Changes are patched reactively without understanding the root cause.

🟢 Ready — Behavior change is treated as an expected property of the system. You have a baseline and can detect when it shifts.
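A baseline can be as simple as pass/fail results on a fixed eval set, re-run periodically. A rough sketch — the tolerance value is an arbitrary assumption you would tune for your system:

```python
def detect_drift(baseline, current, tolerance=0.05):
    """Flag when the pass rate on a fixed eval set shifts beyond tolerance.

    baseline, current: lists of 0/1 results for the same fixed prompts,
    recorded at different points in time.
    """
    base_rate = sum(baseline) / len(baseline)
    cur_rate = sum(current) / len(current)
    return abs(cur_rate - base_rate) > tolerance
```

Crude, but it converts "it just started behaving differently" into a measured shift with a date attached.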


8 · Human-in-the-Loop

Where do humans step in — and why?

🔴 Low — Full autonomy by default. Humans are only involved when something has already broken badly.

🟢 Ready — Clear handoff points are defined and designed. Humans act as a stabilizing mechanism — not a last resort.
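A designed handoff point can be a single gate in code rather than an afterthought. An illustrative sketch — the confidence threshold and the `escalate` callback are assumptions about your system, not a prescribed design:

```python
def route_action(action, confidence, escalate, threshold=0.7):
    """Execute high-confidence actions; hand low-confidence ones to a human.

    escalate: a callback that queues the action for human review and
    returns whatever the reviewer decides.
    """
    if confidence < threshold:
        # human as stabilizing mechanism, not last resort
        return escalate(action)
    return action
```

Where the confidence signal comes from is system-specific; what matters is that the handoff point exists and is deliberate.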


How to Use It

1. Pick one agent you're building or maintaining right now.
2. Read each dimension above.
3. For each one, ask honestly: does the low-readiness description fit?
4. Find your lowest-readiness dimension.
5. Fix that before adding any new features.

Most agent pain comes from ignoring the weakest link. The full worksheet takes 5–10 minutes and ends with a single action item: WORKSHEET.md →


Failure Archetypes

These are the patterns that appear most often in production agent systems. If something just broke, start here.

| Archetype | The symptom | Rubric signal |
| --- | --- | --- |
| 🔁 The Ghost Loop | Running for ten minutes. Nothing completing. | Failure Legibility · Cost Awareness |
| 🤥 The Confident Liar | The agent said it worked. It didn't. | Tool Boundaries · Observability |
| 📚 The Prompt Avalanche | Every fix adds more instructions. Nobody understands it anymore. | Mental Model · Replaceability |
| 👻 The Phantom Success | Logs are clean. The output is completely wrong. | Observability · Tool Boundaries |
| 🌀 The Drifter | Nothing changed. It just started behaving differently. | Drift Awareness · Observability |
| 💸 The Runaway Tab | The bill was enormous. Nobody noticed until the invoice. | Cost Awareness · Human-in-the-Loop |
| 🧊 The Frozen Handoff | It worked in testing. In production, humans ignore it. | Human-in-the-Loop · Failure Legibility |

Each archetype includes a rubric dimension mapping and a concrete "what to do first" — something actionable today, without a new platform or framework.

Browse all archetypes →


A Quick Example

"Our agent keeps looping and burning tokens."

Run it through the rubric:

  • Failure Legibility 🔴 — the loop produces no error signal
  • Observability 🔴 — no step-level view of what the agent is doing
  • Cost Awareness 🔴 — spend grows silently with every iteration

The problem isn't the model's reasoning. It's missing system visibility and boundaries.

This is The Ghost Loop. The first action: add a hard iteration cap in code — not in the prompt — and log every step. That turns a mystery into a recoverable error.
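That first action fits in a few lines, assuming your own code drives the agent loop — the `agent_step` signature here is hypothetical:

```python
MAX_STEPS = 20  # hard cap lives in code, not in the prompt


def run_agent(agent_step, state):
    """Drive the loop; turn a silent infinite loop into a visible, capped error.

    agent_step: callable returning (new_state, done) for each iteration.
    """
    for step in range(MAX_STEPS):
        state, done = agent_step(state)
        print(f"step={step} state={state!r}")  # or your structured logger
        if done:
            return state
    raise RuntimeError(f"exceeded {MAX_STEPS} steps: Ghost Loop caught")
```

A prompt instruction like "stop after 20 steps" can be ignored; a `range(MAX_STEPS)` cannot.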


Design Principles

Diagnosis over prescription. The goal is to help you see the structural problem clearly. Solutions depend on your stack and constraints — afterdemo doesn't pretend to know those.

No scoring math. Weighted averages create false precision and invite gaming. The weakest link matters more than the average.

No vendor guidance. Tool recommendations go stale, create bias, and distract from structural thinking. afterdemo is tool-agnostic by design.

Readable in one sitting. If it takes more than 15 minutes to absorb the whole thing, it's too complex to be useful at the moment it's needed.

Builder-first. A solo builder should be able to apply every part of this without a team, a platform, or a budget.


Contributing

Contributions are welcome — but only if they make the rubric more useful to builders trying to understand why their agents feel brittle.

Good fit: clearer wording, better examples, new failure archetypes grounded in real patterns, sharper diagnostic signals for existing dimensions.

Not a fit: scoring systems, vendor-specific guidance, governance frameworks, anything that makes this take longer than 15 minutes to absorb.

See CONTRIBUTING.md for full guidelines.


License

Apache 2.0 — free to use, modify, and share, including in commercial and educational contexts, with attribution.


You don't "solve" agents. You build systems that can fail visibly, change safely, and improve over time.
afterdemo is a tool to help you do that — before the system teaches you the hard way.
