This roadmap outlines multiple iterations for ABCA. Each iteration adds features incrementally and builds on the previous one. Delivering a working slice at the end of each iteration is the goal. Non–backward-compatible changes between iterations are acceptable (e.g. switching CLI auth from IAM to Cognito, or changing the orchestration model) when they simplify the design or align with the target architecture.
The order and scope of items may shift as we learn; the list below reflects current design docs (ARCHITECTURE.md and component docs in docs/design/).
These practices apply continuously across iterations and are not treated as one-time feature milestones.
- Property-based correctness testing for orchestration invariants — Complement example-based tests (Jest/pytest) with property-based testing (`fast-check` for TypeScript and `hypothesis` for Python) so randomized inputs and interleavings validate invariants over many runs. The goal is to verify safety properties that are timing-sensitive or hard to cover with scenario tests alone (for example, concurrent state transitions and lock/contention behavior).
- Machine-readable property catalog — Maintain a versioned property set with explicit mapping from each property to enforcing code paths and tests. Initial properties include:
  - `P-ABCA-1` terminal-state immutability: tasks in `COMPLETED` / `FAILED` / `CANCELLED` / `TIMED_OUT` cannot transition further.
  - `P-ABCA-2` concurrency counter consistency: for each user, `active_count` equals the number of tasks in active states (`SUBMITTED`, `HYDRATING`, `RUNNING`, `FINALIZING`).
  - `P-ABCA-3` event ordering: `TaskEvents` are strictly monotonic by `event_id` (ULID order).
  - `P-ABCA-4` memory fallback guarantee: if task finalization sees `memory_written = false`, a fallback episode write is attempted and its result is observable.
  - `P-ABCA-5` branch-name uniqueness: simultaneous tasks for the same repo generate distinct branch names (ULID-based suffix).
- Definition-of-done hook — New orchestrator/concurrency changes should include: updated property mappings, at least one property-based test where applicable, and invariant notes in `ORCHESTRATOR.md` to keep docs and executable checks aligned.
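The terminal-state property (P-ABCA-1) can be exercised with a randomized check even before fast-check/hypothesis are wired in. A minimal stdlib sketch, assuming an illustrative `VALID_TRANSITIONS` table (the authoritative table lives in the orchestrator code, not here):

```python
import random

# Illustrative transition table; statuses mirror the roadmap's state machine.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}
VALID_TRANSITIONS = {
    "SUBMITTED": {"HYDRATING", "FAILED", "CANCELLED"},
    "HYDRATING": {"RUNNING", "FAILED", "CANCELLED"},
    "RUNNING": {"FINALIZING", "FAILED", "CANCELLED", "TIMED_OUT"},
    "FINALIZING": {"COMPLETED", "FAILED"},
}

def can_transition(src: str, dst: str) -> bool:
    """Terminal states have no entry in the table, so no outgoing edges."""
    return dst in VALID_TRANSITIONS.get(src, set())

def check_terminal_immutability(runs: int = 1000) -> None:
    """Randomized check standing in for a hypothesis/fast-check property."""
    statuses = list(VALID_TRANSITIONS) + list(TERMINAL)
    for _ in range(runs):
        src, dst = random.choice(statuses), random.choice(statuses)
        if src in TERMINAL:
            assert not can_transition(src, dst), f"{src} -> {dst} must be rejected"

check_terminal_immutability()
```

A real hypothesis property would replace the hand-rolled loop with generated status pairs, but the invariant under test is the same.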
Goal: An agent runs on AWS in an isolated environment; user submits a task from the CLI and gets a PR when done.
- Agent on AWS — Agent runs in a sandboxed compute environment (AgentCore Runtime MicroVM or equivalent). Each task gets an isolated session (compute, memory, filesystem). Container/image has shell, filesystem, dev tooling; session isolation is built-in.
- CLI trigger — User can submit a task via CLI (script or simple CLI): provide repo + task description (text and/or GitHub issue ref). Single entry path; no multi-channel yet.
- Autonomous agent loop — Agent SDK runs with full tool access in headless mode (read, write, edit, bash, glob, grep; `permissionMode: "bypassPermissions"` or equivalent). No human prompts during execution.
- Git workflow — Agent creates a branch, commits incrementally, pushes to GitHub, and creates a pull request when done. Branch naming convention: e.g. `bgagent/<task-id>/<short-desc>`.
- GitHub only — Single git provider (GitHub). Agent clones the repo, works on a branch, opens the PR via the GitHub API (OAuth or token via AgentCore Identity).
- Minimal orchestration — Task is created, execution is triggered (e.g. Lambda or direct invoke), agent runs to completion or failure. Platform infers outcome from GitHub (PR created or not) or from session end. No durable orchestration (e.g. no Step Functions / Durable Functions) required for this slice if we accept "fire-and-forget" plus polling.
- Task state (minimal) — At least: task id, status (e.g. running / completed / failed), repo, and a way to poll or wait for completion. Persistence can be minimal (e.g. a single DynamoDB table).
- API authentication — CLI authenticates to the API (e.g. IAM SigV4 or Cognito JWT). Prevents unauthorized task submission.
- Scaling — Each task runs in its own isolated session; no shared mutable state so the system can scale with concurrent tasks (within runtime quotas).
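The branch naming convention above can be sketched in a few lines; the `slugify` helper here is a hypothetical stand-in, not the platform's implementation:

```python
import re

def slugify(text: str, max_len: int = 40) -> str:
    # Hypothetical helper: lowercase, keep alphanumerics, collapse the rest to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_len].rstrip("-")

def branch_name(task_id: str, description: str) -> str:
    """Apply the bgagent/<task-id>/<short-desc> convention."""
    return f"bgagent/{task_id}/{slugify(description)}"

print(branch_name("01J8ZX5T9Q", "Fix flaky login test"))
# bgagent/01J8ZX5T9Q/fix-flaky-login-test
```

Using the task id in the branch name is what later makes branch-name uniqueness (P-ABCA-5) hold for free when ids are ULIDs.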
Out of scope for Iteration 1: Repo onboarding (any repo the credentials can access is allowed), multiple channels, durable execution with checkpoint/resume, rich observability, memory/code attribution, webhook, Slack.
Goal: Robust task lifecycle, durable execution, security foundations, basic cost guardrails, and visibility into what's running. This iteration makes the platform production-grade for single-channel (CLI) usage.
- Task management — Submit, list (e.g. my tasks), get status (per task), cancel (stop a running task). Clear task state machine (SUBMITTED → HYDRATING → RUNNING → FINALIZING → COMPLETED / FAILED / CANCELLED / TIMED_OUT). See ORCHESTRATOR.md.
- API contract — Implement the external API: `POST /v1/tasks`, `GET /v1/tasks`, `GET /v1/tasks/{id}`, `DELETE /v1/tasks/{id}`, `GET /v1/tasks/{id}/events`. Consistent error format, pagination, idempotency. See API_CONTRACT.md.
- Input gateway (single entry point) — All requests go through one gateway: verify auth, normalize payload to an internal message schema, validate (required fields, repo/issue refs), then dispatch to the task pipeline. The gateway is designed for extensibility — adding new channels later requires only new adapters, not core changes. In this iteration, CLI is the only channel; the gateway architecture is established so future channels (webhook, Slack) plug in cleanly. See INPUT_GATEWAY.md.
- Idempotency — Task submit accepts an idempotency key (e.g. `Idempotency-Key` header); duplicate submits with the same key do not create a second task. Prevents duplicate work on retries. Keys are stored with a 24-hour TTL.
- Improve CLI — Dedicated CLI package (`@abca/cli` in `cli/`) with commands: `configure`, `login`, `submit`, `list`, `status`, `cancel`, `events`. Cognito auth with token caching and auto-refresh, a `--wait` mode that polls until completion, `--output json` for scripting, and `--verbose` for debugging.
- Durable execution — Orchestrator on top of the agent using Lambda Durable Functions: checkpoint/resume, async session monitoring via DynamoDB polling, timeout recovery, idempotent step execution. Long-running sessions (hours) survive transient failures; agent commits regularly so work is not lost. See ORCHESTRATOR.md for the task state machine, execution model, failure modes, concurrency management, data model, and implementation strategy.
- Storage — (1) Task and event storage — Tasks table, TaskEvents (audit log), UserConcurrency counters in DynamoDB. (2) Durable execution state — Lambda Durable Functions checkpoints (managed by the service). (3) Artifact storage (optional) — S3 bucket for future screenshot/video uploads.
- Threat model — Document the threat model for the current architecture using threat-composer. Cover: input validation, agent isolation, credential management, data flow, and trust boundaries. Update the threat model as new features land in future iterations. Threat modeling informs the security controls built in this and subsequent iterations — it must come before, not after, the production gateway and orchestrator.
- Network isolation (basic) — Deploy the agent compute environment within a VPC. Restrict outbound egress to allowlisted endpoints: GitHub API, Amazon Bedrock, AgentCore services, and necessary AWS service endpoints (DynamoDB, CloudWatch, S3). No open internet access by default. This prevents a compromised or confused agent from reaching arbitrary endpoints. Fine-grained per-repo allowlisting and egress logging are deferred to Iteration 3a.
- Observability — Metrics: task duration, token usage (from agent SDK result), cold start, error rate, active task counts, and submitted backlog. Dashboards: active tasks, submitted backlog, completion rate, basic task list. Alarms: stuck tasks (e.g. RUNNING > 9 hours), sustained submitted backlog over threshold, orchestration failures, counter drift. Logs: Agent/runtime logs (e.g. CloudWatch) tied to task id. See OBSERVABILITY.md.
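The idempotency rule earlier in this list can be sketched as follows. Production would use a DynamoDB conditional put with a TTL attribute rather than this in-memory map; the class and method names are illustrative:

```python
import time
import uuid

class IdempotentSubmitter:
    """In-memory sketch of Idempotency-Key handling with a 24-hour TTL."""
    TTL_SECONDS = 24 * 3600

    def __init__(self):
        self._seen: dict[str, tuple[str, float]] = {}

    def submit(self, idempotency_key, payload: dict) -> str:
        now = time.time()
        if idempotency_key:
            hit = self._seen.get(idempotency_key)
            if hit and now - hit[1] < self.TTL_SECONDS:
                return hit[0]  # duplicate submit: return the original task id
        task_id = str(uuid.uuid4())
        # ... create the task record and dispatch to the pipeline here ...
        if idempotency_key:
            self._seen[idempotency_key] = (task_id, now)
        return task_id
```

The key property is that a retry with the same key observes the first task's id rather than creating a second task; in DynamoDB that is enforced with `attribute_not_exists` on the key item.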
Builds on Iteration 1: Same agent + git workflow; adds orchestrator, gateway, task CRUD, API contract, observability, security foundations, and cost guardrails.
Out of scope for Iteration 2: Webhook trigger (no second channel yet), multi-modal input (text-based tasks are sufficient), repo onboarding, memory, customization.
Goal: Only onboarded repos can receive tasks; per-repo credentials replace the single shared OAuth token; agent environment is customizable per repo.
- Repository onboarding pipeline — Repos must be onboarded before tasks can target them. Onboarding registers a repo with the platform and produces a per-repo agent configuration (workload, security, customization). Submitting a task for a non-onboarded repo returns an error (`REPO_NOT_ONBOARDED`). The pipeline can discover static config (e.g. rules, README) and optionally generate dynamic artifacts (summaries, dependency graphs). See REPO_ONBOARDING.md.
- Basic customization: prompt from repo — The full project-level configuration scope is loaded at runtime via the Claude Agent SDK's `setting_sources=["project"]` parameter. This includes `CLAUDE.md` / `.claude/CLAUDE.md` (instructions), `.claude/rules/*.md` (path-scoped rules), `.claude/settings.json` (project settings, hooks, env), `.claude/agents/` (custom subagents), and `.mcp.json` (MCP servers). The CLI natively discovers and injects these — no custom file parsing needed. Additionally, Blueprint `system_prompt_overrides` from DynamoDB are wired through `server.py` → `entrypoint.py` and appended after template substitution. Composable prompt model: platform default + Blueprint overrides (appended) + repo-level project configuration (loaded by the CLI).
- Network isolation (fine-grained) — Route 53 Resolver DNS Firewall enforces a platform-wide domain allowlist. Per-repo `networking.egressAllowlist` feeds the aggregate policy (VPC-wide, not per-session). DNS query logging provides an egress audit trail. Deployed in observation mode (ALERT) with a path to enforcement mode (BLOCK). See NETWORK_ARCHITECTURE.md and SECURITY.md.
- Webhook / API trigger — Expose task submission as a webhook (HMAC-authenticated) so external systems can create tasks programmatically. Same API contract as the CLI; the gateway normalizes and validates. This is the foundation for GitHub Actions integration and CI-triggered tasks. Webhook management API (create/list/revoke) protected by Cognito; per-integration secrets stored in Secrets Manager; HMAC-SHA256 REQUEST authorizer on the webhook endpoint.
- Better context hydration — Dedicated pre-processing step before the agent runs: gather relevant context (user message, GitHub issue body/comments, optionally recent commits or related paths). Assemble into a structured prompt. Basic version for this iteration: user message + issue body + system prompt template. Advanced sources (related code, linked issues, memory) are added in later iterations.
- Data retention and cleanup — Define and implement retention policies: task record TTL in DynamoDB (e.g. 90 days for completed tasks, configurable), CloudWatch log retention (e.g. 30 days).
- Turn / iteration caps — Complement time-based timeouts with configurable per-task turn limits (default 100, range 1–500). Users can set `max_turns` via the API or CLI (`--max-turns`). The value is validated, persisted in the task record, passed through the orchestrator payload, and consumed by the agent's `server.py` → `ClaudeAgentOptions(max_turns=...)`. The `MAX_TURNS` env var on the AgentCore Runtime provides a defense-in-depth fallback. Per-repo overrides via `blueprint_config` are supported. See ORCHESTRATOR.md.
- Cost budget caps — Complement turn limits with a configurable per-task cost budget (`max_budget_usd`, range $0.01–$100). When the budget is reached, the agent stops regardless of remaining turns. Users can set it via the API (`max_budget_usd`) or CLI (`--max-budget`). Per-repo defaults are configurable via `blueprint_config.max_budget_usd`. Follows a 2-tier override: per-task → Blueprint config; if neither is set, no budget limit is applied. See ORCHESTRATOR.md and COST_MODEL.md.
- User prompt guide and anti-patterns — Publish a best-practices guide for writing effective task descriptions. Common anti-patterns are: (1) overly generic prompts that expect the agent to infer intent, and (2) overly specific prompts that break when encountering unexpected scenarios. The guide should include concrete examples of good vs. bad prompts, guidance on when to use issue references vs. free-text descriptions, and tips for defining verifiable goals (e.g. "add tests for X" rather than "make this better"). Can be part of onboarding docs or a standalone user guide. See REPO_ONBOARDING.md and PROMPT_GUIDE.md.
- Agent turn budget awareness — The system prompt now includes the `max_turns` value so the agent can prioritize effectively. An agent that knows it has 20 turns left behaves differently from one that doesn't — it avoids excessive exploration and focuses on impactful changes first. Injected via the `{max_turns}` placeholder in `agent/system_prompt.py`.
- Default branch detection — Replaced all hardcoded `main` references in the agent harness with dynamic detection via `gh repo view --json defaultBranchRef`. The system prompt now includes `{default_branch}`, and `ensure_pr()` targets the detected default branch. Repos using `master`, `develop`, or `trunk` now work correctly.
- Uncommitted work safety net — Added `ensure_committed()` as a deterministic post-hook before PR creation. If the agent left uncommitted tracked-file changes (e.g. due to a turn limit or timeout), the harness stages them with `git add -u` and creates a safety-net commit. Prevents silent loss of agent work.
- Pre-agent lint baseline — Added `mise run lint` during `setup_repo()` alongside the existing `mise run build` baseline. Records lint state before agent changes so post-agent lint failures can be attributed to the agent (same pattern as `build_before`).
- Post-agent lint verification — Added `verify_lint()` alongside `verify_build()` in post-hooks. Lint pass/fail is recorded in the task result, persisted to DynamoDB, emitted as a span attribute (`lint.passed`), and included in the PR body's verification section.
- Softened commit/PR conventions — The system prompt now instructs the agent to follow the repo's commit conventions if discoverable (from CONTRIBUTING.md, CLAUDE.md, or prior commits), defaulting to conventional commit format only when no repo convention is apparent. Reduces review friction for repos with non-standard commit styles.
- Operator metrics dashboard — CloudWatch Dashboard (`BackgroundAgent-Tasks`) providing immediate operator visibility: task success rate, cost per task, turns per task, duration distribution, build/lint pass rates, and AgentCore invocations/errors/latency. Lightweight alternative to the full web control panel (Iteration 4). See `src/constructs/task-dashboard.ts`.
- WAF on API Gateway — AWS WAFv2 Web ACL protects the Task API with AWS managed rule groups (`AWSManagedRulesCommonRuleSet`, `AWSManagedRulesKnownBadInputsRuleSet`) and a rate-based rule (1,000 requests per 5-minute window per IP). Provides edge-layer protection against common web exploits, known bad inputs, and volumetric abuse. See SECURITY.md.
- Bedrock model invocation logging — Account-level Bedrock model invocation logging enabled via a custom resource, sending prompt and response text to CloudWatch (`/aws/bedrock/model-invocation-logs`, 90-day retention). Provides full auditability of model inputs and outputs for prompt injection investigation, compliance, and debugging.
- Task description length limit — Task descriptions capped at 2,000 characters (as recommended by the threat model) to bound the prompt injection attack surface and prevent oversized payloads.
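The turn-cap and budget-cap resolution rules described earlier in this list can be sketched together. This is a simplified model with assumed function names; the real validation lives in the API layer and ORCHESTRATOR.md is authoritative:

```python
DEFAULT_MAX_TURNS = 100
TURN_RANGE = (1, 500)
BUDGET_RANGE = (0.01, 100.0)

def resolve_max_turns(per_task, blueprint) -> int:
    """per-task -> Blueprint config -> platform default; validated to 1-500."""
    value = per_task if per_task is not None else (
        blueprint if blueprint is not None else DEFAULT_MAX_TURNS)
    lo, hi = TURN_RANGE
    if not lo <= value <= hi:
        raise ValueError(f"max_turns must be in [{lo}, {hi}]")
    return value

def resolve_budget(per_task, blueprint):
    """per-task -> Blueprint config; if neither is set, no budget limit (None)."""
    value = per_task if per_task is not None else blueprint
    if value is None:
        return None
    lo, hi = BUDGET_RANGE
    if not lo <= value <= hi:
        raise ValueError(f"max_budget_usd must be in [{lo}, {hi}]")
    return value
```

Note the asymmetry the roadmap specifies: turns always resolve to a number (platform default as the floor of the cascade), while budget resolves to "no limit" when neither tier sets one.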
Builds on Iteration 2: Gateway and orchestration stay; adds onboarding gate, webhook channel, DNS Firewall, better context hydration, turn caps, cost budget caps, prompt guide, data lifecycle, agent harness improvements (turn budget, default branch, safety net, lint verification), operator dashboard, WAF, model invocation logging, and input length limits.
Goal: Agents learn from past interactions; memory Tier 1 (repository knowledge + task execution history) is operational; prompt versioning and commit attribution provide traceability.
- Interaction memory / code attribution (Tier 1) — AgentCore Memory resource provisioned via the CDK L2 construct (`@aws-cdk/aws-bedrock-agentcore-alpha`) with named semantic (`SemanticKnowledge`) and episodic (`TaskEpisodes`) extraction strategies using explicit namespace templates: `/{actorId}/knowledge/` for semantic records, `/{actorId}/episodes/{sessionId}/` for per-task episodes, and `/{actorId}/episodes/` for episodic reflection (cross-task summaries). Events are written with `actorId = repo` (`"owner/repo"`) and `sessionId = taskId`, so the extraction pipeline places records at `/{repo}/knowledge/` and `/{repo}/episodes/{taskId}/`. Memory is loaded at task start during context hydration (two parallel `RetrieveMemoryRecordsCommand` calls using repo-derived namespace prefixes — `/{repo}/knowledge/` for semantic, `/{repo}/episodes/` for episodic) with a 5-second timeout and a 2,000-token budget. Memory is written at task end by the agent (`agent/memory.py`: `write_task_episode` and `write_repo_learnings` via `create_event`). An orchestrator fallback (`writeMinimalEpisode` in `orchestrator.ts`) writes a minimal episode if the agent container crashes or times out. All memory operations are fail-open — failures never block task execution. See MEMORY.md and OBSERVABILITY.md (Code attribution). Implementation: `src/constructs/agent-memory.ts`, `src/handlers/shared/memory.ts`, `agent/memory.py`.
- Insights and agent self-feedback — The agent writes structured summaries at the end of each task via `write_task_episode` (status, PR URL, cost, duration) and `write_repo_learnings` (codebase patterns and conventions). Agent self-feedback is captured via an "## Agent notes" section in the PR body, extracted post-task by the entrypoint (`_extract_agent_notes` in `agent/entrypoint.py`) and stored as part of the task episode. See MEMORY.md (Extraction prompts) and EVALUATION.md.
- Prompt versioning — System prompts are hashed (SHA-256 of deterministic prompt parts, excluding memory context, which varies per run) via `computePromptVersion` in `src/handlers/shared/prompt-version.ts`. The `prompt_version` is stored on the task record in DynamoDB during hydration, enabling future A/B comparison of prompt changes against task outcomes. See EVALUATION.md and ORCHESTRATOR.md (data model).
- Per-prompt commit attribution — A `prepare-commit-msg` git hook (`agent/prepare-commit-msg.sh`) is installed during repo setup and appends `Task-Id: <task_id>` and `Prompt-Version: <hash>` trailers to every agent commit. The hook gracefully skips the trailers when `TASK_ID` is unset (e.g. during manual commits). See MEMORY.md.
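The prompt-versioning scheme can be sketched in a few lines. This mirrors the intent of `computePromptVersion`, not its exact implementation; the key names and the 12-character truncation are assumptions:

```python
import hashlib
import json

def compute_prompt_version(parts: dict) -> str:
    """SHA-256 over deterministic prompt parts. The memory-context part is
    excluded because it varies per run and would break comparability."""
    stable = {k: v for k, v in sorted(parts.items()) if k != "memory_context"}
    digest = hashlib.sha256(json.dumps(stable, sort_keys=True).encode()).hexdigest()
    return digest[:12]  # short form suitable for commit trailers / task records
```

Because the hash ignores memory context, two runs with identical templates but different retrieved memories share a `prompt_version`, which is exactly what A/B comparison of prompt changes needs.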
Builds on Iteration 3a: Onboarding and per-repo config are in place; adds memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, and commit attribution. These are all write-at-end / read-at-start additions that do not change the orchestrator blueprint.
Goal: Address architectural risks identified by external review before moving to new features. These are fixes to existing code, not new capabilities.
- Conditional writes in agent task_state.py — Added `ConditionExpression` guards to `write_running()` (requires status IN SUBMITTED, HYDRATING) and `write_terminal()` (requires status IN RUNNING, HYDRATING, FINALIZING). `ConditionalCheckFailedException` is caught by `type(e).__name__` (avoids a botocore import) and logged as a skip. Prevents the agent from silently overwriting the orchestrator-managed CANCELLED status. See `agent/task_state.py`.
- Orchestrator Lambda error alarm — Added a CloudWatch alarm on `fn.metricErrors()` (threshold: 3, evaluation: 2 periods of 5 min, treatMissingData: NOT_BREACHING). Skipped an SQS DLQ since durable execution (`withDurableExecution`, 14-day retention) manages its own retries; a DLQ would conflict. Added `retryAttempts: 0` on the alias async invoke config to prevent Lambda-level duplicate invocations. The alarm is exported as the `errorAlarm` public property for dashboard/SNS wiring. See `src/constructs/task-orchestrator.ts`.
- Concurrency counter reconciliation — Implemented a `ConcurrencyReconciler` construct with a scheduled Lambda (EventBridge rate 15 min). The handler scans the concurrency table, queries the task table's `UserStatusIndex` GSI per user with a `FilterExpression` on active statuses (SUBMITTED, HYDRATING, RUNNING, FINALIZING), compares the actual count with the stored `active_count`, and corrects drift. See `src/constructs/concurrency-reconciler.ts`, `src/handlers/reconcile-concurrency.ts`.
- Multi-AZ NAT for production — Already configurable via `AgentVpcProps.natGateways` (default: 1) at `src/constructs/agent-vpc.ts:60`. Deployers can set `natGateways: 2` or higher for multi-AZ redundancy. No code changes needed — documentation-only update.
- Orchestrator IAM grant for Memory — The orchestrator Lambda had `MEMORY_ID` in its env vars and called `loadMemoryContext` / `writeMinimalEpisode`, but was never granted `bedrock-agentcore:RetrieveMemoryRecords` or `bedrock-agentcore:CreateEvent` permissions. The fail-open pattern silently swallowed `AccessDeniedException`, making memory appear empty. Fixed by adding `agentMemory.grantReadWrite(orchestrator.fn)` in `agent.ts`, with a new stack test asserting the grant. See `src/stacks/agent.ts:255`.
- Memory schema versioning — Added a `schema_version: "2"` metadata field to all memory write operations (Python agent `memory.py` and TypeScript `memory.ts`). Enables distinguishing records written under the old namespace scheme (v1, `repos/` prefix) from the new namespace-template scheme (v2, `/{actorId}/knowledge/`). Supports future migration tooling and debugging.
- Python repo format validation — Added `_validate_repo()` in `agent/memory.py` that asserts the `repo` parameter matches `^[a-zA-Z0-9._-]+/[a-zA-Z0-9._-]+$` (mirrors the TypeScript `isValidRepo`). Catches format mismatches (full URLs, extra whitespace, wrong casing) that would cause namespace divergence between write and read paths.
- Severity-aware error logging in Python memory — Replaced bare `except Exception` blocks with a `_log_error()` helper that distinguishes infrastructure errors (network, auth, throttling → WARN) from programming errors (`TypeError`, `ValueError`, `AttributeError`, `KeyError` → ERROR). All exceptions are still caught (fail-open preserved), but bugs surface as ERROR-level logs instead of being hidden at WARN.
- Narrowed entrypoint try-catch — Separated `_extract_agent_notes()` extraction from memory writes in `agent/entrypoint.py`. Agent notes parsing failure now logs "Agent notes extraction failed" (specific) instead of "Memory write failed" (misleading). Memory writes (`write_task_episode`, `write_repo_learnings`) are no longer nested inside the same try-catch, since they are individually fail-open.
- Orchestrator fallback episode observability — The `writeMinimalEpisode` return value is now checked and logged: `logger.warn('Fallback episode write returned false')` when the inner function reports failure via its return value (previously discarded). A new test, `logs warning when writeMinimalEpisode returns false`, covers this path.
- Python unit tests — Added pytest-based unit tests (`agent/tests/`) for pure functions: `slugify()`, `redact_secrets()`, `format_bytes()`, `truncate()`, `build_config()`, `assemble_prompt()`, `_discover_project_config()`, `_build_system_prompt()` (entrypoint), `_validate_repo()` (memory), `_now_iso()`, `_build_logs_url()` (task_state). Added pytest to the dev dependency group with a `pythonpath` config for in-tree imports.
- Decompose entrypoint.py — Extracted four named subfunctions from `run_task()` and `run_agent()`: `_build_system_prompt()` (system prompt assembly + memory context), `_discover_project_config()` (repo config scanning), `_write_memory()` (episode + learnings writes), `_setup_agent_env()` (Bedrock/OTEL env var setup). All functions stay in `entrypoint.py` (no import changes). `run_task()` and `run_agent()` now call the extracted functions.
- Deprecate dual prompt assembly — Added a deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `hydrated_context["user_prompt"]`. The Python version is retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow.
- Graceful thread drain in server.py — Added an `_active_threads` list for tracking background threads and a `_drain_threads(timeout=300)` function that joins all alive threads, registered via `@app.on_event("shutdown")` (FastAPI lifecycle — uvicorn translates SIGTERM) and `atexit.register()` as backup. The thread list is cleaned on each new invocation.
- Remove dead QUEUED state — Removed `QUEUED` from `TaskStatus`, `VALID_TRANSITIONS`, and `ACTIVE_STATUSES` in `task-status.ts`. Updated SUBMITTED transitions to `[HYDRATING, FAILED, CANCELLED]`. Removed QUEUED from all tests (count assertions, cancel test, validation test) and documentation (ORCHESTRATOR.md, OBSERVABILITY.md, API_CONTRACT.md, ARCHITECTURE.md).
- Hardening fixes (review round) — Thread race in `server.py` (track the thread before `start()`), defensive `.get()` on `ClientError.response` in `task_state.py`, wired `fallback_error` through `orchestrator.ts` (warning log + event metadata), TOCTOU `ConditionExpression` on the reconciler update, per-user error isolation in the reconciler, `TaskStatusType` propagation across types/orchestrator/memory, graduated trajectory-writer failure handling, subprocess timeouts, FastAPI lifespan pattern, and `decrementConcurrency` CCF distinction.
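For illustration, the repo-format guard described in this list amounts to the following. This sketch mirrors `_validate_repo`'s regex but returns a bool rather than asserting, which is an assumption about call style:

```python
import re

# owner/name: one slash, each side limited to alphanumerics, '.', '_', '-'.
REPO_RE = re.compile(r"^[a-zA-Z0-9._-]+/[a-zA-Z0-9._-]+$")

def validate_repo(repo: str) -> bool:
    """Reject full URLs, extra path segments, and stray whitespace so the
    write and read paths derive identical memory namespaces."""
    return bool(REPO_RE.match(repo))
```

The same pattern on both the Python write path and the TypeScript read path is what prevents silent namespace divergence.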
Follow-ups (identified during review, not blocking):
- Reconciler batch error tracking — Added an `errors` counter to `reconcile-concurrency.ts`, incremented in the per-user catch block. The final log line now includes `{ scanned, corrected, errors }`. Logs at ERROR if `errors === scanned && scanned > 0` (systemic failure).
- Test: `decrementConcurrency` CCF path — Added two tests in `orchestrate-task.test.ts`: one for `ConditionalCheckFailedException` (best-effort, no throw) and one for non-CCF errors (swallowed with a warn log, no throw).
- Test: reconciler non-CCF update failure — Added a test in `reconcile-concurrency.test.ts`: two users with drift, user-1's `UpdateItemCommand` fails with a non-CCF error, user-2 is still corrected (per-user error isolation).
- Consistent error serialization — Replaced all `String(err)` in error/warn log contexts with `err instanceof Error ? err.message : String(err)` across `context-hydration.ts`, `orchestrator.ts`, `memory.ts`, and `repo-config.ts`.
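The per-user isolation and batch error accounting described above can be sketched as follows; the callables stand in for the DynamoDB query/update, and all names are illustrative:

```python
def reconcile(users, count_active, update_counter, log):
    """One failing user does not abort the sweep; a sweep where every user
    fails is logged at ERROR (systemic failure), otherwise at INFO."""
    scanned = corrected = errors = 0
    for user in users:
        scanned += 1
        try:
            actual = count_active(user)          # query active tasks for user
            if update_counter(user, actual):     # True if drift was corrected
                corrected += 1
        except Exception as exc:                 # per-user error isolation
            errors += 1
            log("warn", f"reconcile failed for {user}: {exc}")
    level = "error" if errors == scanned and scanned > 0 else "info"
    log(level, f"reconcile done: scanned={scanned} corrected={corrected} errors={errors}")
    return scanned, corrected, errors
```

This is the invariant P-ABCA-2 repair loop in miniature: drift is detected by recomputing the active count and corrected without trusting the stored counter.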
Goal: Multi-layered validation catches errors, enforces code quality, and assesses change risk before PRs are created; the platform supports more than one task type; multi-modal input broadens what users can express.
- Per-repo GitHub credentials (GitHub App) — Replace the single shared OAuth token with a GitHub App installed per-organization or per-repository. Each onboarded repo is associated with a GitHub App installation that grants fine-grained permissions (read/write to that repo only). This eliminates the security gap where any authenticated user can trigger agent work against any repo the shared token can access. Token management (installation token generation, rotation) is handled by the platform, not by the agent. AgentCore Identity's token vault can store and refresh installation tokens. This is a prerequisite for any multi-user or multi-team deployment.
- Orchestrator pre-flight checks (fail-closed) — Add a `pre-flight` step before `start-session` so doomed tasks fail fast without consuming AgentCore runtime. The orchestrator performs lightweight readiness checks with strict timeouts (for example, 5 seconds): verify GitHub API reachability, verify repository existence and credential access (`GET /repos/{owner}/{repo}` or equivalent), and optionally verify AgentCore Runtime availability when a status probe exists. If pre-flight fails, the task transitions to `FAILED` immediately with a clear terminal reason (`GITHUB_UNREACHABLE`, `REPO_NOT_FOUND_OR_NO_ACCESS`, `RUNTIME_UNAVAILABLE`), releases the concurrency slot, emits an event/notification, and does not invoke the agent. Unlike memory/context hydration (fail-open), pre-flight is explicitly fail-closed: inability to verify repo access blocks execution by design.
- Pre-execution task risk classification — Add a lightweight risk classifier at task submission (before orchestration starts) to drive proportional controls for agent execution. The initial implementation can be rule-based and Blueprint-configurable: prompt keywords (for example, `database`, `auth`, `security`, `infrastructure`), metadata from issue labels, and file/path signals when available (for example, `**/migrations/**`, `**/.github/**`, infra directories). Persist `risk_level` (low/medium/high/critical) on the task record and use it to set defaults and policy: model tier/cascade, turn and budget defaults, prompt strictness/conservatism, approval requirements before merge, and optional autonomous-execution blocks for `critical` tasks. This is intentionally pre-execution and complements (does not replace) post-execution PR risk/blast-radius analysis.
- Tiered validation pipeline — Three tiers of post-agent validation run sequentially after the agent finishes but before finalization. Each tier can fail the PR independently, and failure output is fed back to the agent for a fix cycle (capped at 2 retries per tier to bound cost). If the agent still fails, the PR is created with a validation report (labels, comments, and a risk summary) so the reviewer knows. All three tiers are implemented via the blueprint framework's Layer 2 custom steps (`phase: 'post-agent'`). See REPO_ONBOARDING.md for the 3-layer customization model, ORCHESTRATOR.md for the step execution contract, and EVALUATION.md for the full design.
- Tier 1 — Tool validation (build, test, lint) — Run deterministic tooling: test suites, linters, type checkers, SAST scanners, or a custom script. This is the existing "deterministic validation" concept. Binary pass/fail; failures are concrete (test output, lint errors) and actionable by the agent in a fix cycle. Already partially implemented via the system prompt instructing the agent to run tests.
- Tier 2 — Code quality analysis — Static analysis of the agent's diff against code quality principles: DRY (duplicated code detection), SOLID violations, design pattern adherence, complexity metrics (cyclomatic, cognitive), naming conventions, and repo-specific style rules (from onboarding config). Implemented as an LLM-based review step or a combination of static analysis tools (e.g. SonarQube rules, custom linters) and LLM judgment. Produces structured findings (severity, location, rule, suggestion) that the agent can act on in a fix cycle. Findings below a configurable severity threshold are advisory (included in the PR as comments) rather than blocking.
- Tier 3 — Risk and blast radius analysis — Analyze the scope and impact of the agent's changes to detect unintended side effects in other parts of the codebase. Includes: dependency graph analysis (what modules/functions consume the changed code), change surface area (number of files, lines, and modules touched), semantic impact assessment (does the change alter public APIs, shared types, configuration, or database schemas), and regression risk scoring. Produces a risk level (low / medium / high / critical) attached to the PR as a label and included in the validation report. High-risk changes may require explicit human approval before merge (foundation for the HITL approval mode in Iteration 6). The risk level considers: number of downstream dependents affected, whether the change touches shared infrastructure or core abstractions, test coverage of the affected area, and whether the change introduces new external dependencies.
- PR risk level and validation report — Every agent-created PR includes a structured validation report (as a PR comment or check run) summarizing: Tier 1 results (pass/fail per tool), Tier 2 findings (code quality issues by severity), and Tier 3 risk assessment (risk level, blast radius summary, affected modules). The PR is labeled with the computed risk level (`risk:low`, `risk:medium`, `risk:high`, `risk:critical`). The risk level is persisted in the task record for evaluation and trending. See EVALUATION.md.
- Other task types: PR review — Support at least one additional task type beyond "implement from issue": review a pull request (read-only or comment-only). The agent reads the PR diff, runs analysis (tests, lint, code review heuristics), and posts review comments. This uses a different blueprint (no branch creation, no PR creation — just analysis and comments) and a different system prompt. It validates that the platform is not hardwired to a single task type.
- Multi-modal input — Accept text and images (or other modalities) in the task payload; pass through to the agent. Gateway and schema support it; agent harness supports it where available. Primary use case: screenshots of bugs, UI mockups, or design specs attached to issues.
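As a concrete illustration of the Tier 3 scoring, a risk level could be derived from a weighted sum of change signals. Everything below (signal names, weights, thresholds) is a hypothetical sketch, not the shipped heuristic:

```typescript
// Hypothetical Tier 3 risk scorer. RiskSignals, computeRiskLevel, the
// weights, and the thresholds are illustrative assumptions for this sketch.
type RiskLevel = "low" | "medium" | "high" | "critical";

interface RiskSignals {
  downstreamDependents: number; // modules/functions consuming the changed code
  filesTouched: number;         // change surface area
  touchesSharedInfra: boolean;  // core abstractions, config, schemas
  changesPublicApi: boolean;    // exported types, endpoints, contracts
  testCoveragePct: number;      // coverage of the affected area (0-100)
  newExternalDeps: number;      // newly introduced external dependencies
}

function computeRiskLevel(s: RiskSignals): RiskLevel {
  let score = 0;
  score += Math.min(s.downstreamDependents, 20); // cap fan-out contribution
  score += Math.min(s.filesTouched, 10);
  if (s.touchesSharedInfra) score += 15;
  if (s.changesPublicApi) score += 10;
  if (s.testCoveragePct < 50) score += 10;       // poorly covered area
  score += s.newExternalDeps * 5;
  if (score >= 40) return "critical";
  if (score >= 25) return "high";
  if (score >= 10) return "medium";
  return "low";
}
```

The actual signals and thresholds would need calibration against real PR outcomes once outcome tracking (Iteration 3d) is in place.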
Builds on Iteration 3b: Memory is operational; this iteration changes the orchestrator blueprint (tiered validation pipeline, new task type) and broadens the input schema. These are independently testable from memory.
Goal: The primary feedback loop (PR reviews → memory → future tasks) is operational; automated evaluation provides measurable quality signals; PR outcomes are tracked as feedback.
- Review feedback memory loop (Tier 2) — Capture PR review comments via GitHub webhook, extract actionable rules via LLM, and persist them as searchable memory so the agent internalizes reviewer preferences over time. This is the primary feedback loop between human reviewers and the agent — no shipping coding agent does this today. Requires a GitHub webhook → API Gateway → Lambda pipeline (separate from agent execution). Two types of extracted knowledge: repo-level rules ("don't use `any` types") and task-specific corrections. See MEMORY.md (Review feedback memory) and SECURITY.md (prompt injection via review comments).
- PR outcome tracking — Track whether agent-created PRs are merged, revised, or rejected via GitHub webhooks (`pull_request.closed` events). A merged PR is a positive signal; closed-without-merge is a negative signal. These outcome signals feed into the evaluation pipeline and enable the episodic memory to learn which approaches succeed. See MEMORY.md (PR outcome signals) and EVALUATION.md.
- Evaluation pipeline (basic) — Automated evaluation of agent runs: failure categorization (reasoning errors, missed instructions, missing tests, timeouts, tool failures). Results are stored and surfaced in observability dashboards. Basic version: rules-based analysis of task outcomes and agent responses. Track memory effectiveness metrics: first-review merge rate, revision cycles, CI pass rate on first push, review comment density, and repeated mistakes. The advanced version (ML-based trace analysis, A/B prompt comparison, feedback loop into prompts) is deferred to Iteration 5. See EVALUATION.md and OBSERVABILITY.md.
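The core of PR outcome tracking is a small classification over the `pull_request.closed` payload. A minimal sketch — the function name and `OutcomeSignal` type are illustrative, while `pull_request.merged` is the standard field in GitHub's webhook schema:

```typescript
// Subset of GitHub's pull_request.closed webhook payload that the
// classifier needs; field names match the webhook schema.
interface PullRequestClosedEvent {
  action: "closed";
  pull_request: { merged: boolean; number: number };
  repository: { full_name: string };
}

// Illustrative signal type; the real pipeline would persist this with the
// task record and feed it to the evaluation pipeline.
type OutcomeSignal = "positive" | "negative";

function classifyPrOutcome(event: PullRequestClosedEvent): OutcomeSignal {
  // A merged PR is a positive signal; closed-without-merge is negative.
  return event.pull_request.merged ? "positive" : "negative";
}
```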
Builds on Iteration 3c: Validation and PR review task type are in place; this iteration adds new infrastructure (webhook → Lambda → LLM extraction pipeline) and connects the feedback loop. Review feedback requires prompt injection mitigations (see SECURITY.md).
Goal: Harden the memory system against both adversarial corruption (prompt injection into memory, poisoned tool outputs, experience grafting) and emergent corruption (hallucination crystallization, feedback loops, stale context accumulation). OWASP classifies this as ASI06 — Memory & Context Poisoning in the 2026 Top 10 for Agentic Applications.
Deep research identified 9 memory-layer security gaps in the current architecture (see the Memory Security Analysis section in MEMORY.md). The platform has strong network-layer security (VPC isolation, DNS Firewall, HTTPS-only egress) but lacks memory content validation, provenance tracking, trust scoring, anomaly detection, and rollback capabilities. Research shows that MINJA-style attacks achieve 95%+ injection success rates against undefended agent memory systems, and that emergent self-corruption (hallucination crystallization, error compounding feedback loops) is equally dangerous because it lacks an external attacker signature.
- Memory content sanitization — Add content validation in `loadMemoryContext()` (src/handlers/shared/memory.ts). Scan retrieved memory records for injection patterns (embedded instructions, system prompt overrides, command injection payloads) before including them in the agent's context. Implement a `sanitizeMemoryContent()` function that strips or flags suspicious patterns while preserving legitimate repository knowledge.
- GitHub issue input sanitization — Add trust-boundary-aware sanitization in `context-hydration.ts` for GitHub issue bodies and comments. These are attacker-controlled inputs that currently flow into the agent's context without differentiation. Strip control characters, embedded instruction patterns, and known injection payloads. Tag the content source as `untrusted-external` in the hydrated context.
- Source provenance on memory writes — Tag all memory writes with source provenance metadata. In `memory.ts` (`writeMinimalEpisode`) and `agent/memory.py` (`write_task_episode`, `write_repo_learnings`), add a `source_type` field to event metadata: `agent_episode`, `agent_learning`, `orchestrator_fallback`, `github_issue`, or `review_feedback`. This enables trust-differentiated retrieval in Phase 2.
- Content integrity hashing — Add SHA-256 content hashing on all memory writes. Store the hash in event metadata. At read time, verify that content has not been modified between write and read. Implementation: compute the hash before `CreateEventCommand`, store it as `content_hash` metadata, and verify it on `RetrieveMemoryRecordsCommand` results.
- Trust scoring at retrieval — Modify `loadMemoryContext()` to weight retrieved memories by temporal freshness, source type reliability, and pattern consistency with other memories. Memories from `orchestrator_fallback` and `agent_episode` sources receive higher trust than memories derived from external inputs. Entries below a configurable trust threshold are deprioritized or excluded from the 2,000-token budget.
- Configurable temporal decay — Implement per-entry TTL with configurable decay rates. Unverified or externally sourced memory entries decay faster (e.g., 30-day default) than agent-generated or human-confirmed entries (e.g., 365-day default). Add `trust_tier` and `decay_rate` to the memory metadata schema.
- Memory validation Lambda — Add a lightweight validation function triggered on `CreateEventCommand` (via an EventBridge rule on AgentCore events or as a post-write hook). The validator runs a classifier that checks whether new memory content looks like legitimate repository knowledge or could influence future agent behavior in unintended ways (the "guardian pattern"). Flag suspicious entries for operator review.
- Memory write anomaly detection — Instrument memory write operations with CloudWatch custom metrics: write frequency per repo, average content length, and source type distribution. Add CloudWatch Alarms for anomalous patterns (e.g., a burst of writes from a single task, unusually long content, or writes with the `untrusted-external` source type exceeding a threshold).
- Circuit breaker in orchestrator — Add circuit breaker logic in `orchestrator.ts`: if the agent's tool invocation patterns or memory write patterns deviate from a baseline (e.g., a sudden increase in memory writes, or writes containing instruction-like patterns), pause the task and emit an alert. The circuit breaker transitions the task to a new `MEMORY_REVIEW` state that requires operator intervention.
- Memory quarantine API — Expose operator API endpoints (`POST /v1/memory/quarantine`, `GET /v1/memory/quarantine`) for flagging and isolating suspicious memory entries. Quarantined entries are excluded from retrieval but preserved for forensic analysis.
- Memory rollback capability — Implement point-in-time memory snapshots. Before each task starts, snapshot the current memory state for the target repo (via the existing `loadMemoryContext` path, persisted to S3). If poisoning is detected post-task, operators can restore the repo's memory to the pre-task snapshot. Add a `POST /v1/memory/rollback` endpoint.
- Write-ahead validation (guardian model) — Route proposed memory writes through a smaller, cheaper model (e.g., Haiku) that evaluates whether the content is legitimate learned context or could be adversarial. Adds latency (~100-500ms per write) but catches sophisticated attacks that evade pattern-based sanitization. Configurable per-repo via Blueprint.
- Cross-task behavioral drift detection — Compare agent reasoning patterns and tool invocation sequences across tasks for the same repo. Detect drift from established baselines that could indicate memory-influenced behavioral manipulation. Implemented as a post-task analysis step in the evaluation pipeline.
- Cryptographic provenance chain — Implement Merkle tree-based provenance for memory entry chains per repo. Each new entry includes a hash of the previous entry, creating an append-only, tamper-evident chain. Enables cryptographic verification that no entries have been inserted, modified, or deleted between known-good checkpoints.
- Red team validation — Red team the memory system using published attack methodologies: MINJA (query-based memory injection), AgentPoison (RAG retrieval poisoning), and experience grafting. Document results and adjust defenses. Add automated red team tests to the evaluation pipeline using the DeepTeam framework (OWASP ASI06 attack categories).
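The write-time hashing and read-time verification described in the input-hardening items above can be sketched as follows. The function names are illustrative; the real code paths would wrap `CreateEventCommand` and `RetrieveMemoryRecordsCommand`:

```typescript
import { createHash } from "node:crypto";

// Compute the SHA-256 digest stored as content_hash metadata at write time.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// At read time, a mismatch means the record was modified between write and
// read and should be excluded from the agent's context (and flagged).
function verifyIntegrity(content: string, storedHash: string): boolean {
  return contentHash(content) === storedHash;
}
```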
- Memory metadata schema changes (`source_type`, `content_hash`, `trust_tier`, `decay_rate`) require `schema_version: "3"` and are not readable by v2 code paths without migration.
- The `MEMORY_REVIEW` task state is a new addition to the state machine (requires orchestrator, API contract, and observability updates).
- Trust-scored retrieval changes the memory context budget allocation, which may affect prompt version hashing.
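The `trust_tier` / `decay_rate` metadata could drive retrieval weighting with a simple half-life function. A sketch: the 30-day and 365-day defaults come from the temporal-decay item above, while the tier names and the 180-day agent half-life are illustrative assumptions:

```typescript
// Hypothetical trust-tier metadata on a memory entry.
interface MemoryEntryMeta {
  trustTier: "external" | "agent" | "human-confirmed";
  ageDays: number;
}

// Externally sourced entries decay fastest; human-confirmed slowest.
// The 180-day "agent" half-life is an assumption for this sketch.
const HALF_LIFE_DAYS: Record<MemoryEntryMeta["trustTier"], number> = {
  external: 30,
  agent: 180,
  "human-confirmed": 365,
};

// Retrieval weight halves every half-life; entries below a configurable
// cutoff would be excluded from the token budget.
function decayWeight(meta: MemoryEntryMeta): number {
  return Math.pow(0.5, meta.ageDays / HALF_LIFE_DAYS[meta.trustTier]);
}
```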
Builds on Iteration 3d: Review feedback memory and PR outcome tracking are in place; this iteration hardens the memory system that those components write to. The 4-phase approach allows incremental deployment with measurable security improvement at each phase.
Goal: Additional git providers; agent can run the app and attach visual proof; Slack integration; web dashboard for operators and users; real-time streaming.
- Additional git providers — Support GitLab (and optionally Bitbucket or others). Same workflow (clone, branch, commit, push, PR/MR). Provider-specific APIs, auth, and webhook adapters. The gateway and task schema are already channel-agnostic (repo is `owner/repo`); this iteration adds a `git_provider` field and provider-specific adapters. Onboarding (Iter 3a) must support non-GitHub repos.
- Live execution and visual proof — Agent can execute the application after build/tests, capture screenshots or videos as proof that changes work, and upload them (e.g. as PR attachments or to an S3 artifact store linked from the PR). Requires compute support: virtual display (Xvfb) or headless browser (Playwright/Puppeteer), capture scripts, and outbound upload. See COMPUTE.md (Visual proof). This may require a larger compute profile (more CPU/RAM/disk) or a dedicated "visual proof" step in the blueprint.
- Slack channel — Slack adapter for the input gateway: users can submit tasks, check status, and receive notifications from Slack. Inbound: verify Slack signing secret, normalize Slack payload to the internal message schema. Outbound: render internal notifications as Slack Block Kit messages, post to the originating channel/thread. Requires a Slack→platform user mapping. See INPUT_GATEWAY.md.
- Automated skills creation pipeline — Pipeline that creates or updates agent skills (or similar artifacts) from repo interaction or from onboarding. For example: the pipeline observes that a repo always requires `npm run lint:fix` before tests pass, and generates a skill or rule that the agent uses automatically. Builds on customization (Iter 3a) and memory (Iter 3b–3d).
- User preference memory (Tier 3) — Per-user memory for PR style, commit conventions, test coverage expectations, and other execution preferences. Extracted from task descriptions (explicit) and review feedback patterns (implicit). Lower priority than repo-level and review feedback memory, but enables personalization when multiple users submit tasks. See MEMORY.md (User preference memory, Tier 3).
- Control panel (web dashboard) — Web UI for operators and users: list tasks (with filters by status, repo, user), view task detail and status history, cancel tasks, link to agent logs, and show basic metrics (active tasks, submitted backlog, completion rate, error rate). Optional: submit a task from the UI (the panel becomes another channel via the input gateway). See CONTROL_PANEL.md. Tech stack TBD (e.g. React + AppSync or REST).
- Real-time event streaming (WebSocket) — Replace or supplement the polling-based `GET /v1/tasks/{id}/events` with an API Gateway WebSocket API for real-time task status updates. WebSocket is chosen over SSE because multiplayer sessions (Iteration 6) and iterative feedback require bidirectional communication. This improves the experience for the control panel, Slack integration, and the CLI `--wait` mode. Requires connection management (a DynamoDB connection table). See API_CONTRACT.md (OQ1).
- Live session replay and mid-task nudge — Extend WebSocket streaming with structured trajectory events (thinking steps, tool calls, cost, timing) for real-time session observation and post-hoc replay with timeline scrubbing. Add a "nudge" mechanism to inject one-shot course corrections between agent turns (via a `TaskNudges` table and mid-session message injection). Structured streaming with cost telemetry provides better debugging and operational visibility than raw terminal logs. Requires bidirectional WebSocket (same as real-time streaming) plus agent harness support for consuming nudge messages.
- Browser extension client — A lightweight Chrome/Firefox extension that lets users trigger tasks directly from the browser (e.g. while viewing a GitHub issue, click a button to submit it as a task). The extension calls the existing webhook API (Iteration 3a) with the current page's issue URL, requiring minimal new infrastructure — just a small client-side wrapper over the webhook endpoint. See INPUT_GATEWAY.md.
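Inbound Slack verification follows Slack's documented v0 signing scheme (HMAC-SHA256 over `v0:{timestamp}:{body}`). A minimal sketch — the timestamp-freshness check (rejecting requests older than about five minutes) is omitted here for brevity:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a Slack request against the X-Slack-Signature header using the
// app's signing secret, per Slack's v0 signing scheme.
function verifySlackSignature(
  signingSecret: string,
  timestamp: string, // X-Slack-Request-Timestamp header
  rawBody: string,   // unparsed request body
  signature: string  // X-Slack-Signature header, e.g. "v0=ab12..."
): boolean {
  const baseString = `v0:${timestamp}:${rawBody}`;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(baseString).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual requires equal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}
```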
Builds on Iteration 3d: Onboarding, memory (Tiers 1–2), evaluation, and validation are in place; adds git providers, visual proof, Slack, skills pipeline, user preference memory, control panel, real-time streaming, and browser extension.
Goal: Faster cold start, multi-user/team, full cost management, guardrails, and alternative runtime support.
- Automated container (devbox) from repo — Optionally derive or customize the agent container image from the repo (e.g. Dockerfile, dev container config, language-specific base images). Tied to onboarding: per-repo workload config. Reduces cold start for repos with known environments and ensures the agent has the right tools (compilers, SDKs, linters) pre-installed.
- CI/CD pipeline — Automated deployment pipeline for the platform itself: source → build → test → synth → deploy to staging → deploy to production. Use CDK Pipelines or equivalent. The current `npx projen deploy` workflow is not sufficient for a production orchestrator managing long-running tasks — deployments need to be safe (canary, rollback), auditable, and repeatable.
- Environment pre-warming (snapshot-on-schedule) — Pre-build container layers or repo snapshots (code + deps pre-installed) per repo; store in ECR or equivalent. Reduces cold start from minutes to seconds for known repos. The onboarding pipeline (Iter 3a) can trigger pre-warming as part of repo setup or on a schedule. Periodically snapshot the onboarded repo's container image (code + deps) to ECR, rebuild on push to the default branch (via webhook or EventBridge), and use that as the base for new sessions. Optionally begin sandbox warming when a user starts composing a task (proactive warming). Snapshot-based session starts (if AgentCore supports it) further reduce startup time. See COMPUTE.md.
- Multi-user / team support — Multiple users with shared task history, team-level visibility, and optionally shared approval queues or budgets. Adds a `team_id` or `org_id` to the task model. Team admins can view all tasks for their team, set team-level concurrency limits, and configure team-wide cost budgets. Builds on the existing task model (`user_id`, filters) and adds authorization rules (team members can view each other's tasks).
- Memory isolation for multi-tenancy — AgentCore Memory has no per-namespace IAM isolation. For multi-tenant deployments, private repo knowledge could leak cross-repo unless isolation is enforced. Options: silo model (separate memory resource per org — strongest), pool model (single resource with strict application-layer namespace scoping — sufficient for single-org), or shared model (intentional cross-repo learning — only for same-org repos). The onboarding pipeline should create or assign memory resources based on the isolation model. See SECURITY.md and MEMORY.md.
- Full cost management — per-user and per-team monthly budgets, cost attribution dashboards (cost per task, per repo, per user), alerts when budgets are approaching limits. Token usage and compute cost are tracked per task and aggregated. The control panel (Iter 4) displays cost dashboards.
- Adaptive model router with cost-aware cascade — Per-turn model selection via a lightweight heuristic engine. File reads and simple edits use a cheaper model (Haiku); multi-file refactors use Sonnet; complex reasoning escalates to Opus. Error escalation: if the agent fails twice on the same step, upgrade the model for the retry. As the cost budget ceiling approaches, cascade down to cheaper models. A Blueprint `modelCascade` config enables per-repo tuning. Potential 30–40% cost reduction on inference-dominated workloads. Requires agent harness changes to support mid-session model switching.
- Advanced evaluation and feedback loop — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), an A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding. Optional patterns from adaptive teaching research (e.g. plan → targeted critique → execution; separate evaluator vs prompt/reflection roles; fitness from LLM judging plus efficiency metrics; evolution of teaching templates from failed trajectories with Pareto-style candidate sets for diverse failure modes) can inform offline or scheduled improvement of Blueprint prompts and checklists without replacing ABCA's core orchestrator.
- Formal orchestrator verification (TLA+) — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one active `RUNNING` task per repo when configured). Keep the spec aligned with `src/constructs/task-status.ts` and orchestrator docs so regressions surface as model-check counterexamples before production.
- Guardrails — Natural-language or policy-based guardrails on agent tool calls using Amazon Bedrock Guardrails. Defends against prompt injection, restricts sensitive content generation, and enforces organizational policies (e.g. "do not modify files in `/infrastructure`"). See SECURITY.md. Guardrails configuration can be per-repo (via onboarding) or platform-wide.
- Capability-based security model — Fine-grained enforcement beyond Bedrock Guardrails, operating at three levels: (1) Tool-level capabilities — Bash command allowlist (git, npm, make permitted; curl, wget blocked), configurable per capability tier (standard / elevated / read-only). (2) File-system scope — Blueprint declares include/exclude path patterns; Write/Edit/Read tools are filtered to the declared scope. (3) Input trust scoring — authenticated user input = trusted; external GitHub issues = untrusted; PR review comments entering memory = adversarial. Trust level selects the capability set. Essential once review feedback memory (Iter 3d) introduces attacker-controlled content into the agent's context. A Blueprint `security` prop configures the capability profile per repo.
- Additional execution environment — Support an alternative to AgentCore Runtime (e.g. ECS/Fargate, EKS) behind the ComputeStrategy interface (see REPO_ONBOARDING.md). The orchestrator calls abstract methods (`startSession`, `stopSession`, `pollSession`); the implementation maps to AgentCore, Fargate, or EKS. Repos select the strategy via `compute_type` in their blueprint configuration. Reduces vendor lock-in and enables workloads that exceed AgentCore limits (e.g. GPU, larger images, longer sessions). The ComputeStrategy interface contract is defined in Iteration 3a; Iteration 5 adds alternative implementations.
- Full web dashboard — Extend the control panel from Iteration 4: detailed dashboards (cost, performance, evaluation), reasoning trace viewer or log explorer (linked to OpenTelemetry traces from AgentCore), task submit/cancel from the UI, and admin views (system health, capacity, user management).
- Customization (advanced) with tiered tool access — Agent can be extended with MCP servers, plugins, and skills beyond the basic prompt-from-repo customization in Iteration 3a. Composable tool sets per repo. MCP server discovery and lifecycle management. More tools increase behavioral unpredictability, so use a tiered tool access model: a minimal default tool set (bash allowlist, git, verify/lint/test) that all repos get, with MCP servers and plugins as opt-in per repo during onboarding. Per-repo tool profiles are stored in the onboarding config and loaded by the orchestrator. This balances flexibility with predictability. See SECURITY.md and REPO_ONBOARDING.md.
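The adaptive model router above could start as a small pure function. Model names, turn categories, and thresholds below are illustrative assumptions, not the actual cascade config:

```typescript
// Hypothetical per-turn model router sketch; categories, thresholds, and
// model names are assumptions for illustration.
type Model = "haiku" | "sonnet" | "opus";

interface TurnContext {
  kind: "read" | "simple-edit" | "multi-file-refactor" | "complex-reasoning";
  consecutiveFailures: number; // retries on the same step so far
  budgetUsedPct: number;       // share of the task's cost ceiling consumed
}

function selectModel(turn: TurnContext): Model {
  let model: Model =
    turn.kind === "complex-reasoning" ? "opus" :
    turn.kind === "multi-file-refactor" ? "sonnet" : "haiku";
  // Error escalation: two failures on the same step upgrade the model.
  if (turn.consecutiveFailures >= 2 && model !== "opus") {
    model = model === "haiku" ? "sonnet" : "opus";
  }
  // Cost-aware cascade: near the budget ceiling, step down a tier.
  if (turn.budgetUsedPct >= 90 && model !== "haiku") {
    model = model === "opus" ? "sonnet" : "haiku";
  }
  return model;
}
```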
Builds on Iteration 4: Adds pre-warming, multi-user, cost management, guardrails, alternate runtime, and advanced customization with tiered tool access.
Goal: Skills learned from repo interaction; multi-repo tasks; iterative human-agent collaboration; reusable CDK constructs.
- GitHub Actions integration — Publish a GitHub Action that triggers an ABCA task (e.g. on an issue label like `agent:fix`, on flaky-test detection, or on a PR comment command). The Action calls the webhook endpoint from Iteration 3a. A natural integration for GitHub-centric workflows.
- Automated pipeline for learning skills from repo interaction — Pipeline that observes agent interactions with repositories and produces reusable skills (rules, prompts, tools) that improve future runs. Builds on memory, code attribution, and evaluation. Example: the pipeline notices that tasks on repo X frequently fail because of a missing environment variable, and generates a rule so the agent always sets it.
- Agent swarm orchestration — Planner-worker architecture for complex, multi-file tasks that overwhelm a single agent session. A lightweight planner decomposes the task into a DAG of subtasks with scope boundaries and interface contracts. Each subtask runs as an independent child task in its own AgentCore session. A merge orchestrator cherry-picks commits, resolves conflicts, and runs the full test suite before opening one consolidated PR. New DynamoDB fields: `parent_task_id`, `child_task_ids[]`, `subtask_contract`. New blueprint steps: `decompose-task`, fan-out + wait-all, `merge-and-verify`. Naturally bounds PR size and enables work that no single-session agent can handle (large features, cross-cutting refactors, migrations).
- Multi-repo support — Tasks that span multiple repositories (e.g. change an API in repo A and update the consumer in repo B). Requires: multi-branch orchestration (one branch per repo), coordinated PR creation (linked PRs), cross-repo auth (GitHub App installations for both repos), and cross-repo testing. This is architecturally significant and needs a dedicated design doc before implementation.
- Iterative feedback and multiplayer sessions — User can send follow-up instructions to a completed or running task (e.g. "also add tests for X" or "change the approach to use library Y"). For completed tasks, the platform starts a new session on the same branch with the follow-up context. For running tasks, this requires message injection into a live session — which depends on agent harness support for session persistence and message channels. Design the interaction model carefully: what happens to in-flight work when instructions change? Multiplayer extension: allow multiple authorized users to inject context into a running or follow-up session (e.g. team code reviews or collaborative debugging with the agent). Per-prompt commit attribution (Iter 3b) supports tracking which user's input led to which changes.
- HITL approval mode — Optional mid-task approval gates for high-risk operations (e.g. "agent wants to delete 50 files — approve?"). The orchestrator pauses the task, emits a notification, and waits for user approval before continuing. Requires changes to the agent harness (pause/resume) and the orchestrator (a new `AWAITING_APPROVAL` state in the state machine).
- Scheduled triggers — Cron or schedule-based task creation (e.g. "run dependency update every Monday", "check for flaky tests nightly"). Implemented as EventBridge Scheduler rules that call the task creation API. Schedules are configured per repo during onboarding or via the control panel.
- CDK constructs — Publish reusable CDK constructs (e.g. `BackgroundAgentStack`, `OnboardingPipelineStack`, `TaskOrchestrator`) so other teams can compose the platform into their own CDK apps. Document construct APIs, publish to a construct library (e.g. Construct Hub), and version following semver.
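A transition guard for the extended state machine (including the proposed `AWAITING_APPROVAL` state) might look like the sketch below. The transition map itself is an illustrative assumption; only the terminal-state immutability rule (property P-ABCA-1) comes from the property catalog:

```typescript
// State names from the roadmap; the ALLOWED transition map is a sketch,
// not the actual orchestrator state machine.
type TaskState =
  | "SUBMITTED" | "HYDRATING" | "RUNNING" | "AWAITING_APPROVAL" | "FINALIZING"
  | "COMPLETED" | "FAILED" | "CANCELLED" | "TIMED_OUT";

const TERMINAL: ReadonlySet<TaskState> =
  new Set<TaskState>(["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"]);

const ALLOWED: Record<TaskState, TaskState[]> = {
  SUBMITTED: ["HYDRATING", "CANCELLED", "FAILED"],
  HYDRATING: ["RUNNING", "CANCELLED", "FAILED"],
  RUNNING: ["AWAITING_APPROVAL", "FINALIZING", "CANCELLED", "FAILED", "TIMED_OUT"],
  AWAITING_APPROVAL: ["RUNNING", "CANCELLED", "TIMED_OUT"],
  FINALIZING: ["COMPLETED", "FAILED"],
  COMPLETED: [], FAILED: [], CANCELLED: [], TIMED_OUT: [],
};

function canTransition(from: TaskState, to: TaskState): boolean {
  // P-ABCA-1: terminal states are immutable — no further transitions.
  if (TERMINAL.has(from)) return false;
  return ALLOWED[from].includes(to);
}
```

Expressing the map as data keeps it easy to mirror in a TLA+ spec (Iteration 5) and to exercise with property-based tests.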
Builds on Iteration 5: Leverages memory, evaluation, and customization to close the loop (learn → improve); adds advanced workflows and exposes the platform as constructs.
- Iteration 1 — Core agent + git (isolated run, CLI submit, branch + PR, minimal task state).
- Iteration 2 — Production orchestrator, API contract, task management (list/status/cancel), durable execution, observability, threat model, network isolation, basic cost guardrails, CI/CD.
- Iteration 3a — Repo onboarding, DNS Firewall (domain-level egress filtering), webhook trigger, GitHub Actions, per-repo customization (prompt from repo), data retention, turn/iteration caps, cost budget caps, user prompt guide, agent harness improvements (turn budget, default branch, safety net, lint, softened conventions), operator dashboard, WAF, model invocation logging, input length limits.
- Iteration 3b ✅ — Memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, per-prompt commit attribution. CDK L2 construct with named semantic + episodic strategies using namespace templates (`/{actorId}/knowledge/`, `/{actorId}/episodes/{sessionId}/`), fail-open memory load/write, orchestrator fallback episode, SHA-256 prompt hashing, git trailer attribution.
- Iteration 3c — Per-repo GitHub App credentials, orchestrator pre-flight checks (fail-closed before session start), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type, multi-modal input.
- Iteration 3d — Review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic).
- Iteration 3e — Memory security and integrity: input hardening (content sanitization, provenance tagging, integrity hashing), trust-aware retrieval (trust scoring, temporal decay, guardian validation), detection and response (anomaly detection, circuit breaker, quarantine, rollback), advanced protections (write-ahead validation, behavioral drift detection, cryptographic provenance, red teaming). Addresses OWASP ASI06 (Memory & Context Poisoning).
- Iteration 3bis (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in the agent's task_state.py (`ConditionExpression` guards), orchestrator Lambda error alarm (CloudWatch, `retryAttempts: 0`), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition (4 extracted subfunctions), dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active).
- Iteration 4 — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production.
- Iteration 5 — Snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, full Bedrock Guardrails (PII, denied topics, output filters), capability-based security model, alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules.
- Iteration 6 — Agent swarm orchestration, skills learning, multi-repo, iterative feedback and multiplayer sessions, HITL approval, scheduled triggers, CDK constructs.
Design docs to keep in sync: ARCHITECTURE.md, ORCHESTRATOR.md, API_CONTRACT.md, INPUT_GATEWAY.md, REPO_ONBOARDING.md, MEMORY.md, OBSERVABILITY.md, COMPUTE.md, CONTROL_PANEL.md, SECURITY.md, EVALUATION.md.