This roadmap outlines multiple iterations for ABCA. Each iteration adds features incrementally and builds on the previous one. Delivering a working slice at the end of each iteration is the goal. Non–backward-compatible changes between iterations are acceptable (e.g. switching CLI auth from IAM to Cognito, or changing the orchestration model) when they simplify the design or align with the target architecture.
The order and scope of items may shift as we learn; the list below reflects current design docs (ARCHITECTURE.md and component docs in docs/design/).
These practices apply continuously across iterations and are not treated as one-time feature milestones.
- Property-based correctness testing for orchestration invariants — Complement example-based tests (Jest/pytest) with property-based testing (`fast-check` for TypeScript and `hypothesis` for Python) so randomized inputs and interleavings validate invariants over many runs. The goal is to verify safety properties that are timing-sensitive or hard to cover with scenario tests alone (for example, concurrent state transitions and lock/contention behavior).
- Machine-readable property catalog — Maintain a versioned property set with explicit mapping from each property to enforcing code paths and tests. Initial properties include:
  - `P-ABCA-1` terminal-state immutability: tasks in `COMPLETED` / `FAILED` / `CANCELLED` / `TIMED_OUT` cannot transition further.
  - `P-ABCA-2` concurrency counter consistency: for each user, `active_count` equals the number of tasks in active states (`SUBMITTED`, `HYDRATING`, `RUNNING`, `FINALIZING`).
  - `P-ABCA-3` event ordering: `TaskEvents` are strictly monotonic by `event_id` (ULID order).
  - `P-ABCA-4` memory fallback guarantee: if task finalization sees `memory_written = false`, a fallback episode write is attempted and its result is observable.
  - `P-ABCA-5` branch-name uniqueness: simultaneous tasks for the same repo generate distinct branch names (ULID-based suffix).
- Definition-of-done hook — New orchestrator/concurrency changes should include: updated property mappings, at least one property-based test where applicable, and invariant notes in `ORCHESTRATOR.md` to keep docs and executable checks aligned.
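The terminal-state property (P-ABCA-1) can be exercised with a randomized check even before fast-check/hypothesis are wired in. A minimal stdlib sketch, assuming an illustrative `VALID_TRANSITIONS` table (the authoritative table lives in the orchestrator code, not here):

```python
import random

# Illustrative transition table; statuses mirror the roadmap's state machine.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}
VALID_TRANSITIONS = {
    "SUBMITTED": {"HYDRATING", "FAILED", "CANCELLED"},
    "HYDRATING": {"RUNNING", "FAILED", "CANCELLED"},
    "RUNNING": {"FINALIZING", "FAILED", "CANCELLED", "TIMED_OUT"},
    "FINALIZING": {"COMPLETED", "FAILED"},
}

def can_transition(src: str, dst: str) -> bool:
    """Terminal states have no entry in the table, so no outgoing edges."""
    return dst in VALID_TRANSITIONS.get(src, set())

def check_terminal_immutability(runs: int = 1000) -> None:
    """Randomized check standing in for a hypothesis/fast-check property."""
    statuses = list(VALID_TRANSITIONS) + list(TERMINAL)
    for _ in range(runs):
        src, dst = random.choice(statuses), random.choice(statuses)
        if src in TERMINAL:
            assert not can_transition(src, dst), f"{src} -> {dst} must be rejected"

check_terminal_immutability()
```

A real hypothesis property would replace the hand-rolled loop with generated status pairs, but the invariant under test is the same.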
Goal: An agent runs on AWS in an isolated environment; user submits a task from the CLI and gets a PR when done.
- Agent on AWS — Agent runs in a sandboxed compute environment (AgentCore Runtime MicroVM or equivalent). Each task gets an isolated session (compute, memory, filesystem). Container/image has shell, filesystem, dev tooling; session isolation is built-in.
- CLI trigger — User can submit a task via CLI (script or simple CLI): provide repo + task description (text and/or GitHub issue ref). Single entry path; no multi-channel yet.
- Autonomous agent loop — Agent SDK runs with full tool access in headless mode (read, write, edit, bash, glob, grep; `permissionMode: "bypassPermissions"` or equivalent). No human prompts during execution.
- Git workflow — Agent creates a branch, commits incrementally, pushes to GitHub, and creates a pull request when done. Branch naming convention: e.g. `bgagent/<task-id>/<short-desc>`.
- GitHub only — Single git provider (GitHub). Agent clones the repo, works on a branch, opens the PR via the GitHub API (OAuth or token via AgentCore Identity).
- Minimal orchestration — Task is created, execution is triggered (e.g. Lambda or direct invoke), agent runs to completion or failure. Platform infers outcome from GitHub (PR created or not) or from session end. No durable orchestration (e.g. no Step Functions / Durable Functions) required for this slice if we accept "fire-and-forget" plus polling.
- Task state (minimal) — At least: task id, status (e.g. running / completed / failed), repo, and a way to poll or wait for completion. Persistence can be minimal (e.g. a single DynamoDB table).
- API authentication — CLI authenticates to the API (e.g. IAM SigV4 or Cognito JWT). Prevents unauthorized task submission.
- Scaling — Each task runs in its own isolated session; no shared mutable state so the system can scale with concurrent tasks (within runtime quotas).
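The branch naming convention above can be sketched in a few lines; the `slugify` helper here is a hypothetical stand-in, not the platform's implementation:

```python
import re

def slugify(text: str, max_len: int = 40) -> str:
    # Hypothetical helper: lowercase, keep alphanumerics, collapse the rest to hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_len].rstrip("-")

def branch_name(task_id: str, description: str) -> str:
    """Apply the bgagent/<task-id>/<short-desc> convention."""
    return f"bgagent/{task_id}/{slugify(description)}"

print(branch_name("01J8ZX5T9Q", "Fix flaky login test"))
# bgagent/01J8ZX5T9Q/fix-flaky-login-test
```

Using the task id in the branch name is what later makes branch-name uniqueness (P-ABCA-5) hold for free when ids are ULIDs.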
Out of scope for Iteration 1: Repo onboarding (any repo the credentials can access is allowed), multiple channels, durable execution with checkpoint/resume, rich observability, memory/code attribution, webhook, Slack.
Goal: Robust task lifecycle, durable execution, security foundations, basic cost guardrails, and visibility into what's running. This iteration makes the platform production-grade for single-channel (CLI) usage.
- Task management — Submit, list (e.g. my tasks), get status (per task), cancel (stop a running task). Clear task state machine (SUBMITTED → HYDRATING → RUNNING → FINALIZING → COMPLETED / FAILED / CANCELLED / TIMED_OUT). See ORCHESTRATOR.md.
- API contract — Implement the external API: `POST /v1/tasks`, `GET /v1/tasks`, `GET /v1/tasks/{id}`, `DELETE /v1/tasks/{id}`, `GET /v1/tasks/{id}/events`. Consistent error format, pagination, idempotency. See API_CONTRACT.md.
- Input gateway (single entry point) — All requests go through one gateway: verify auth, normalize payload to an internal message schema, validate (required fields, repo/issue refs), then dispatch to the task pipeline. The gateway is designed for extensibility — adding new channels later requires only new adapters, not core changes. In this iteration, CLI is the only channel; the gateway architecture is established so future channels (webhook, Slack) plug in cleanly. See INPUT_GATEWAY.md.
- Idempotency — Task submit accepts an idempotency key (e.g. `Idempotency-Key` header); duplicate submits with the same key do not create a second task. Prevents duplicate work on retries. Keys are stored with a 24-hour TTL.
- Improve CLI — Dedicated CLI package (`@abca/cli` in `cli/`) with commands: `configure`, `login`, `submit`, `list`, `status`, `cancel`, `events`. Cognito auth with token caching and auto-refresh, a `--wait` mode that polls until completion, `--output json` for scripting, and `--verbose` for debugging.
- Durable execution — Orchestrator on top of the agent using Lambda Durable Functions: checkpoint/resume, async session monitoring via DynamoDB polling, timeout recovery, idempotent step execution. Long-running sessions (hours) survive transient failures; agent commits regularly so work is not lost. See ORCHESTRATOR.md for the task state machine, execution model, failure modes, concurrency management, data model, and implementation strategy.
- Storage — (1) Task and event storage — Tasks table, TaskEvents (audit log), UserConcurrency counters in DynamoDB. (2) Durable execution state — Lambda Durable Functions checkpoints (managed by the service). (3) Artifact storage (optional) — S3 bucket for future screenshot/video uploads.
- Threat model — Document the threat model for the current architecture using threat-composer. Cover: input validation, agent isolation, credential management, data flow, and trust boundaries. Update the threat model as new features land in future iterations. Threat modeling informs the security controls built in this and subsequent iterations — it must come before, not after, the production gateway and orchestrator.
- Network isolation (basic) — Deploy the agent compute environment within a VPC. Restrict outbound egress to allowlisted endpoints: GitHub API, Amazon Bedrock, AgentCore services, and necessary AWS service endpoints (DynamoDB, CloudWatch, S3). No open internet access by default. This prevents a compromised or confused agent from reaching arbitrary endpoints. Fine-grained per-repo allowlisting and egress logging are deferred to Iteration 3a.
- Observability — Metrics: task duration, token usage (from agent SDK result), cold start, error rate, active task counts, and submitted backlog. Dashboards: active tasks, submitted backlog, completion rate, basic task list. Alarms: stuck tasks (e.g. RUNNING > 9 hours), sustained submitted backlog over threshold, orchestration failures, counter drift. Logs: Agent/runtime logs (e.g. CloudWatch) tied to task id. See OBSERVABILITY.md.
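The idempotency rule earlier in this list can be sketched as follows. Production would use a DynamoDB conditional put with a TTL attribute rather than this in-memory map; the class and method names are illustrative:

```python
import time
import uuid

class IdempotentSubmitter:
    """In-memory sketch of Idempotency-Key handling with a 24-hour TTL."""
    TTL_SECONDS = 24 * 3600

    def __init__(self):
        self._seen: dict[str, tuple[str, float]] = {}

    def submit(self, idempotency_key, payload: dict) -> str:
        now = time.time()
        if idempotency_key:
            hit = self._seen.get(idempotency_key)
            if hit and now - hit[1] < self.TTL_SECONDS:
                return hit[0]  # duplicate submit: return the original task id
        task_id = str(uuid.uuid4())
        # ... create the task record and dispatch to the pipeline here ...
        if idempotency_key:
            self._seen[idempotency_key] = (task_id, now)
        return task_id
```

The key property is that a retry with the same key observes the first task's id rather than creating a second task; in DynamoDB that is enforced with `attribute_not_exists` on the key item.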
Builds on Iteration 1: Same agent + git workflow; adds orchestrator, gateway, task CRUD, API contract, observability, security foundations, and cost guardrails.
Out of scope for Iteration 2: Webhook trigger (no second channel yet), multi-modal input (text-based tasks are sufficient), repo onboarding, memory, customization.
Goal: Only onboarded repos can receive tasks; per-repo credentials replace the single shared OAuth token; agent environment is customizable per repo.
- Repository onboarding pipeline — Repos must be onboarded before tasks can target them. Onboarding registers a repo with the platform and produces a per-repo agent configuration (workload, security, customization). Submitting a task for a non-onboarded repo returns an error (`REPO_NOT_ONBOARDED`). The pipeline can discover static config (e.g. rules, README) and optionally generate dynamic artifacts (summaries, dependency graphs). See REPO_ONBOARDING.md.
- Basic customization: prompt from repo — The full project-level configuration scope is loaded at runtime via the Claude Agent SDK's `setting_sources=["project"]` parameter. This includes `CLAUDE.md` / `.claude/CLAUDE.md` (instructions), `.claude/rules/*.md` (path-scoped rules), `.claude/settings.json` (project settings, hooks, env), `.claude/agents/` (custom subagents), and `.mcp.json` (MCP servers). The CLI natively discovers and injects these — no custom file parsing needed. Additionally, Blueprint `system_prompt_overrides` from DynamoDB are wired through `server.py` → `entrypoint.py` and appended after template substitution. Composable prompt model: platform default + Blueprint overrides (appended) + repo-level project configuration (loaded by the CLI).
- Network isolation (fine-grained) — Route 53 Resolver DNS Firewall enforces a platform-wide domain allowlist. Per-repo `networking.egressAllowlist` feeds the aggregate policy (VPC-wide, not per-session). DNS query logging provides an egress audit trail. Deployed in observation mode (ALERT) with a path to enforcement mode (BLOCK). See NETWORK_ARCHITECTURE.md and SECURITY.md.
- Webhook / API trigger — Expose task submission as a webhook (HMAC-authenticated) so external systems can create tasks programmatically. Same API contract as the CLI; the gateway normalizes and validates. This is the foundation for GitHub Actions integration and CI-triggered tasks. Webhook management API (create/list/revoke) protected by Cognito; per-integration secrets stored in Secrets Manager; HMAC-SHA256 REQUEST authorizer on the webhook endpoint.
- Better context hydration — Dedicated pre-processing step before the agent runs: gather relevant context (user message, GitHub issue body/comments, optionally recent commits or related paths). Assemble into a structured prompt. Basic version for this iteration: user message + issue body + system prompt template. Advanced sources (related code, linked issues, memory) are added in later iterations.
- Data retention and cleanup — Define and implement retention policies: task record TTL in DynamoDB (e.g. 90 days for completed tasks, configurable), CloudWatch log retention (e.g. 30 days).
- Turn / iteration caps — Complement time-based timeouts with configurable per-task turn limits (default 100, range 1–500). Users can set `max_turns` via the API or CLI (`--max-turns`). The value is validated, persisted in the task record, passed through the orchestrator payload, and consumed by the agent's `server.py` → `ClaudeAgentOptions(max_turns=...)`. The `MAX_TURNS` env var on the AgentCore Runtime provides a defense-in-depth fallback. Per-repo overrides via `blueprint_config` are supported. See ORCHESTRATOR.md.
- Cost budget caps — Complement turn limits with a configurable per-task cost budget (`max_budget_usd`, range $0.01–$100). When the budget is reached, the agent stops regardless of remaining turns. Users can set it via the API (`max_budget_usd`) or CLI (`--max-budget`). Per-repo defaults are configurable via `blueprint_config.max_budget_usd`. Follows a 2-tier override: per-task → Blueprint config; if neither is set, no budget limit is applied. See ORCHESTRATOR.md and COST_MODEL.md.
- User prompt guide and anti-patterns — Publish a best-practices guide for writing effective task descriptions. Common anti-patterns are: (1) overly generic prompts that expect the agent to infer intent, and (2) overly specific prompts that break when encountering unexpected scenarios. The guide should include concrete examples of good vs. bad prompts, guidance on when to use issue references vs. free-text descriptions, and tips for defining verifiable goals (e.g. "add tests for X" rather than "make this better"). Can be part of onboarding docs or a standalone user guide. See REPO_ONBOARDING.md and PROMPT_GUIDE.md.
- Agent turn budget awareness — The system prompt now includes the `max_turns` value so the agent can prioritize effectively. An agent that knows it has 20 turns left behaves differently from one that doesn't — it avoids excessive exploration and focuses on impactful changes first. Injected via the `{max_turns}` placeholder in `agent/system_prompt.py`.
- Default branch detection — Replaced all hardcoded `main` references in the agent harness with dynamic detection via `gh repo view --json defaultBranchRef`. The system prompt now includes `{default_branch}`, and `ensure_pr()` targets the detected default branch. Repos using `master`, `develop`, or `trunk` now work correctly.
- Uncommitted work safety net — Added `ensure_committed()` as a deterministic post-hook before PR creation. If the agent left uncommitted tracked-file changes (e.g. due to a turn limit or timeout), the harness stages them with `git add -u` and creates a safety-net commit. Prevents silent loss of agent work.
- Pre-agent lint baseline — Added `mise run lint` during `setup_repo()` alongside the existing `mise run build` baseline. Records lint state before agent changes so post-agent lint failures can be attributed to the agent (same pattern as `build_before`).
- Post-agent lint verification — Added `verify_lint()` alongside `verify_build()` in post-hooks. Lint pass/fail is recorded in the task result, persisted to DynamoDB, emitted as a span attribute (`lint.passed`), and included in the PR body's verification section.
- Softened commit/PR conventions — The system prompt now instructs the agent to follow the repo's commit conventions if discoverable (from CONTRIBUTING.md, CLAUDE.md, or prior commits), defaulting to conventional commit format only when no repo convention is apparent. Reduces review friction for repos with non-standard commit styles.
- Operator metrics dashboard — CloudWatch Dashboard (`BackgroundAgent-Tasks`) providing immediate operator visibility: task success rate, cost per task, turns per task, duration distribution, build/lint pass rates, and AgentCore invocations/errors/latency. Lightweight alternative to the full web control panel (Iteration 4). See `src/constructs/task-dashboard.ts`.
- WAF on API Gateway — AWS WAFv2 Web ACL protects the Task API with AWS managed rule groups (`AWSManagedRulesCommonRuleSet`, `AWSManagedRulesKnownBadInputsRuleSet`) and a rate-based rule (1,000 requests per 5-minute window per IP). Provides edge-layer protection against common web exploits, known bad inputs, and volumetric abuse. See SECURITY.md.
- Bedrock model invocation logging — Account-level Bedrock model invocation logging enabled via a custom resource, sending prompt and response text to CloudWatch (`/aws/bedrock/model-invocation-logs`, 90-day retention). Provides full auditability of model inputs and outputs for prompt injection investigation, compliance, and debugging.
- Task description length limit — Task descriptions capped at 2,000 characters (as recommended by the threat model) to bound the prompt injection attack surface and prevent oversized payloads.
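The turn-cap and budget-cap resolution rules described earlier in this list can be sketched together. This is a simplified model with assumed function names; the real validation lives in the API layer and ORCHESTRATOR.md is authoritative:

```python
DEFAULT_MAX_TURNS = 100
TURN_RANGE = (1, 500)
BUDGET_RANGE = (0.01, 100.0)

def resolve_max_turns(per_task, blueprint) -> int:
    """per-task -> Blueprint config -> platform default; validated to 1-500."""
    value = per_task if per_task is not None else (
        blueprint if blueprint is not None else DEFAULT_MAX_TURNS)
    lo, hi = TURN_RANGE
    if not lo <= value <= hi:
        raise ValueError(f"max_turns must be in [{lo}, {hi}]")
    return value

def resolve_budget(per_task, blueprint):
    """per-task -> Blueprint config; if neither is set, no budget limit (None)."""
    value = per_task if per_task is not None else blueprint
    if value is None:
        return None
    lo, hi = BUDGET_RANGE
    if not lo <= value <= hi:
        raise ValueError(f"max_budget_usd must be in [{lo}, {hi}]")
    return value
```

Note the asymmetry the roadmap specifies: turns always resolve to a number (platform default as the floor of the cascade), while budget resolves to "no limit" when neither tier sets one.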
Builds on Iteration 2: Gateway and orchestration stay; adds onboarding gate, webhook channel, DNS Firewall, better context hydration, turn caps, cost budget caps, prompt guide, data lifecycle, agent harness improvements (turn budget, default branch, safety net, lint verification), operator dashboard, WAF, model invocation logging, and input length limits.
Goal: Agents learn from past interactions; memory Tier 1 (repository knowledge + task execution history) is operational; prompt versioning and commit attribution provide traceability.
- Interaction memory / code attribution (Tier 1) — AgentCore Memory resource provisioned via the CDK L2 construct (`@aws-cdk/aws-bedrock-agentcore-alpha`) with named semantic (`SemanticKnowledge`) and episodic (`TaskEpisodes`) extraction strategies using explicit namespace templates: `/{actorId}/knowledge/` for semantic records, `/{actorId}/episodes/{sessionId}/` for per-task episodes, and `/{actorId}/episodes/` for episodic reflection (cross-task summaries). Events are written with `actorId = repo` (`"owner/repo"`) and `sessionId = taskId`, so the extraction pipeline places records at `/{repo}/knowledge/` and `/{repo}/episodes/{taskId}/`. Memory is loaded at task start during context hydration (two parallel `RetrieveMemoryRecordsCommand` calls using repo-derived namespace prefixes — `/{repo}/knowledge/` for semantic, `/{repo}/episodes/` for episodic) with a 5-second timeout and a 2,000-token budget. Memory is written at task end by the agent (`agent/memory.py`: `write_task_episode` and `write_repo_learnings` via `create_event`). An orchestrator fallback (`writeMinimalEpisode` in `orchestrator.ts`) writes a minimal episode if the agent container crashes or times out. All memory operations are fail-open — failures never block task execution. See MEMORY.md and OBSERVABILITY.md (Code attribution). Implementation: `src/constructs/agent-memory.ts`, `src/handlers/shared/memory.ts`, `agent/memory.py`.
- Insights and agent self-feedback — The agent writes structured summaries at the end of each task via `write_task_episode` (status, PR URL, cost, duration) and `write_repo_learnings` (codebase patterns and conventions). Agent self-feedback is captured via an "## Agent notes" section in the PR body, extracted post-task by the entrypoint (`_extract_agent_notes` in `agent/entrypoint.py`) and stored as part of the task episode. See MEMORY.md (Extraction prompts) and EVALUATION.md.
- Prompt versioning — System prompts are hashed (SHA-256 of deterministic prompt parts, excluding memory context, which varies per run) via `computePromptVersion` in `src/handlers/shared/prompt-version.ts`. The `prompt_version` is stored on the task record in DynamoDB during hydration, enabling future A/B comparison of prompt changes against task outcomes. See EVALUATION.md and ORCHESTRATOR.md (data model).
- Per-prompt commit attribution — A `prepare-commit-msg` git hook (`agent/prepare-commit-msg.sh`) is installed during repo setup and appends `Task-Id: <task_id>` and `Prompt-Version: <hash>` trailers to every agent commit. The hook gracefully skips the trailers when `TASK_ID` is unset (e.g. during manual commits). See MEMORY.md.
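The prompt-versioning scheme can be sketched in a few lines. This mirrors the intent of `computePromptVersion`, not its exact implementation; the key names and the 12-character truncation are assumptions:

```python
import hashlib
import json

def compute_prompt_version(parts: dict) -> str:
    """SHA-256 over deterministic prompt parts. The memory-context part is
    excluded because it varies per run and would break comparability."""
    stable = {k: v for k, v in sorted(parts.items()) if k != "memory_context"}
    digest = hashlib.sha256(json.dumps(stable, sort_keys=True).encode()).hexdigest()
    return digest[:12]  # short form suitable for commit trailers / task records
```

Because the hash ignores memory context, two runs with identical templates but different retrieved memories share a `prompt_version`, which is exactly what A/B comparison of prompt changes needs.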
Builds on Iteration 3a: Onboarding and per-repo config are in place; adds memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, and commit attribution. These are all write-at-end / read-at-start additions that do not change the orchestrator blueprint.
Goal: Address architectural risks identified by external review before moving to new features. These are fixes to existing code, not new capabilities.
- Conditional writes in agent task_state.py — Added `ConditionExpression` guards to `write_running()` (requires status IN SUBMITTED, HYDRATING) and `write_terminal()` (requires status IN RUNNING, HYDRATING, FINALIZING). `ConditionalCheckFailedException` is caught by `type(e).__name__` (avoids a botocore import) and logged as a skip. Prevents the agent from silently overwriting the orchestrator-managed CANCELLED status. See `agent/task_state.py`.
- Orchestrator Lambda error alarm — Added a CloudWatch alarm on `fn.metricErrors()` (threshold: 3, evaluation: 2 periods of 5 min, treatMissingData: NOT_BREACHING). Skipped an SQS DLQ since durable execution (`withDurableExecution`, 14-day retention) manages its own retries; a DLQ would conflict. Added `retryAttempts: 0` on the alias async invoke config to prevent Lambda-level duplicate invocations. The alarm is exported as the `errorAlarm` public property for dashboard/SNS wiring. See `src/constructs/task-orchestrator.ts`.
- Concurrency counter reconciliation — Implemented a `ConcurrencyReconciler` construct with a scheduled Lambda (EventBridge rate 15 min). The handler scans the concurrency table, queries the task table's `UserStatusIndex` GSI per user with a `FilterExpression` on active statuses (SUBMITTED, HYDRATING, RUNNING, FINALIZING), compares the actual count with the stored `active_count`, and corrects drift. See `src/constructs/concurrency-reconciler.ts`, `src/handlers/reconcile-concurrency.ts`.
- Multi-AZ NAT for production — Already configurable via `AgentVpcProps.natGateways` (default: 1) at `src/constructs/agent-vpc.ts:60`. Deployers can set `natGateways: 2` or higher for multi-AZ redundancy. No code changes needed — documentation-only update.
- Orchestrator IAM grant for Memory — The orchestrator Lambda had `MEMORY_ID` in its env vars and called `loadMemoryContext` / `writeMinimalEpisode`, but was never granted `bedrock-agentcore:RetrieveMemoryRecords` or `bedrock-agentcore:CreateEvent` permissions. The fail-open pattern silently swallowed `AccessDeniedException`, making memory appear empty. Fixed by adding `agentMemory.grantReadWrite(orchestrator.fn)` in `agent.ts`, with a new stack test asserting the grant. See `src/stacks/agent.ts:255`.
- Memory schema versioning — Added a `schema_version: "2"` metadata field to all memory write operations (Python agent `memory.py` and TypeScript `memory.ts`). Enables distinguishing records written under the old namespace scheme (v1, `repos/` prefix) from the new namespace-template scheme (v2, `/{actorId}/knowledge/`). Supports future migration tooling and debugging.
- Python repo format validation — Added `_validate_repo()` in `agent/memory.py` that asserts the `repo` parameter matches `^[a-zA-Z0-9._-]+/[a-zA-Z0-9._-]+$` (mirrors the TypeScript `isValidRepo`). Catches format mismatches (full URLs, extra whitespace, wrong casing) that would cause namespace divergence between write and read paths.
- Severity-aware error logging in Python memory — Replaced bare `except Exception` blocks with a `_log_error()` helper that distinguishes infrastructure errors (network, auth, throttling → WARN) from programming errors (`TypeError`, `ValueError`, `AttributeError`, `KeyError` → ERROR). All exceptions are still caught (fail-open preserved), but bugs surface as ERROR-level logs instead of being hidden at WARN.
- Narrowed entrypoint try-catch — Separated `_extract_agent_notes()` extraction from memory writes in `agent/entrypoint.py`. Agent notes parsing failure now logs "Agent notes extraction failed" (specific) instead of "Memory write failed" (misleading). Memory writes (`write_task_episode`, `write_repo_learnings`) are no longer nested inside the same try-catch, since they are individually fail-open.
- Orchestrator fallback episode observability — The `writeMinimalEpisode` return value is now checked and logged: `logger.warn('Fallback episode write returned false')` when the inner function reports failure via its return value (previously discarded). A new test, `logs warning when writeMinimalEpisode returns false`, covers this path.
- Python unit tests — Added pytest-based unit tests (`agent/tests/`) for pure functions: `slugify()`, `redact_secrets()`, `format_bytes()`, `truncate()`, `build_config()`, `assemble_prompt()`, `_discover_project_config()`, `_build_system_prompt()` (entrypoint), `_validate_repo()` (memory), `_now_iso()`, `_build_logs_url()` (task_state). Added pytest to the dev dependency group with a `pythonpath` config for in-tree imports.
- Decompose entrypoint.py — Extracted four named subfunctions from `run_task()` and `run_agent()`: `_build_system_prompt()` (system prompt assembly + memory context), `_discover_project_config()` (repo config scanning), `_write_memory()` (episode + learnings writes), `_setup_agent_env()` (Bedrock/OTEL env var setup). All functions stay in `entrypoint.py` (no import changes). `run_task()` and `run_agent()` now call the extracted functions.
- Deprecate dual prompt assembly — Added a deprecation docstring to `assemble_prompt()` clarifying that production uses the orchestrator's `assembleUserPrompt()` via `hydrated_context["user_prompt"]`. The Python version is retained only for local batch mode and dry-run mode. No code deletion — just documentation of the intended flow.
- Graceful thread drain in server.py — Added an `_active_threads` list for tracking background threads and a `_drain_threads(timeout=300)` function that joins all alive threads, registered via `@app.on_event("shutdown")` (FastAPI lifecycle — uvicorn translates SIGTERM) and `atexit.register()` as backup. The thread list is cleaned on each new invocation.
- Remove dead QUEUED state — Removed `QUEUED` from `TaskStatus`, `VALID_TRANSITIONS`, and `ACTIVE_STATUSES` in `task-status.ts`. Updated SUBMITTED transitions to `[HYDRATING, FAILED, CANCELLED]`. Removed QUEUED from all tests (count assertions, cancel test, validation test) and documentation (ORCHESTRATOR.md, OBSERVABILITY.md, API_CONTRACT.md, ARCHITECTURE.md).
- Hardening fixes (review round) — Thread race in `server.py` (track the thread before `start()`), defensive `.get()` on `ClientError.response` in `task_state.py`, wired `fallback_error` through `orchestrator.ts` (warning log + event metadata), TOCTOU `ConditionExpression` on the reconciler update, per-user error isolation in the reconciler, `TaskStatusType` propagation across types/orchestrator/memory, graduated trajectory-writer failure handling, subprocess timeouts, FastAPI lifespan pattern, and `decrementConcurrency` CCF distinction.
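For illustration, the repo-format guard described in this list amounts to the following. This sketch mirrors `_validate_repo`'s regex but returns a bool rather than asserting, which is an assumption about call style:

```python
import re

# owner/name: one slash, each side limited to alphanumerics, '.', '_', '-'.
REPO_RE = re.compile(r"^[a-zA-Z0-9._-]+/[a-zA-Z0-9._-]+$")

def validate_repo(repo: str) -> bool:
    """Reject full URLs, extra path segments, and stray whitespace so the
    write and read paths derive identical memory namespaces."""
    return bool(REPO_RE.match(repo))
```

The same pattern on both the Python write path and the TypeScript read path is what prevents silent namespace divergence.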
Follow-ups (identified during review, not blocking):
- Reconciler batch error tracking — Added an `errors` counter to `reconcile-concurrency.ts`, incremented in the per-user catch block. The final log line now includes `{ scanned, corrected, errors }`. Logs at ERROR if `errors === scanned && scanned > 0` (systemic failure).
- Test: `decrementConcurrency` CCF path — Added two tests in `orchestrate-task.test.ts`: one for `ConditionalCheckFailedException` (best-effort, no throw) and one for non-CCF errors (swallowed with a warn log, no throw).
- Test: reconciler non-CCF update failure — Added a test in `reconcile-concurrency.test.ts`: two users with drift, user-1's `UpdateItemCommand` fails with a non-CCF error, user-2 is still corrected (per-user error isolation).
- Consistent error serialization — Replaced all `String(err)` in error/warn log contexts with `err instanceof Error ? err.message : String(err)` across `context-hydration.ts`, `orchestrator.ts`, `memory.ts`, and `repo-config.ts`.
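The per-user isolation and batch error accounting described above can be sketched as follows; the callables stand in for the DynamoDB query/update, and all names are illustrative:

```python
def reconcile(users, count_active, update_counter, log):
    """One failing user does not abort the sweep; a sweep where every user
    fails is logged at ERROR (systemic failure), otherwise at INFO."""
    scanned = corrected = errors = 0
    for user in users:
        scanned += 1
        try:
            actual = count_active(user)          # query active tasks for user
            if update_counter(user, actual):     # True if drift was corrected
                corrected += 1
        except Exception as exc:                 # per-user error isolation
            errors += 1
            log("warn", f"reconcile failed for {user}: {exc}")
    level = "error" if errors == scanned and scanned > 0 else "info"
    log(level, f"reconcile done: scanned={scanned} corrected={corrected} errors={errors}")
    return scanned, corrected, errors
```

This is the invariant P-ABCA-2 repair loop in miniature: drift is detected by recomputing the active count and corrected without trusting the stored counter.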
Goal: Multi-layered validation catches errors, enforces code quality, and assesses change risk before PRs are created; the platform supports more than one task type; multi-modal input broadens what users can express.
- Per-repo GitHub credentials (GitHub App) — Replace the single shared OAuth token with a GitHub App installed per-organization or per-repository. Each onboarded repo is associated with a GitHub App installation that grants fine-grained permissions (read/write to that repo only). This eliminates the security gap where any authenticated user can trigger agent work against any repo the shared token can access. Token management (installation token generation, rotation) is handled by the platform, not by the agent. AgentCore Identity's token vault can store and refresh installation tokens. This is a prerequisite for any multi-user or multi-team deployment.
- Orchestrator pre-flight checks (fail-closed) — Add a `pre-flight` step before `start-session` so doomed tasks fail fast without consuming AgentCore runtime. The orchestrator performs lightweight readiness checks with strict timeouts (for example, 5 seconds): verify GitHub API reachability, verify repository existence and credential access (`GET /repos/{owner}/{repo}` or equivalent), and optionally verify AgentCore Runtime availability when a status probe exists. If pre-flight fails, the task transitions to `FAILED` immediately with a clear terminal reason (`GITHUB_UNREACHABLE`, `REPO_NOT_FOUND_OR_NO_ACCESS`, `RUNTIME_UNAVAILABLE`), releases the concurrency slot, emits an event/notification, and does not invoke the agent. Unlike memory/context hydration (fail-open), pre-flight is explicitly fail-closed: inability to verify repo access blocks execution by design.
- Pre-execution task risk classification — Add a lightweight risk classifier at task submission (before orchestration starts) to drive proportional controls for agent execution. The initial implementation can be rule-based and Blueprint-configurable: prompt keywords (for example, `database`, `auth`, `security`, `infrastructure`), metadata from issue labels, and file/path signals when available (for example, `**/migrations/**`, `**/.github/**`, infra directories). Persist `risk_level` (low/medium/high/critical) on the task record and use it to set defaults and policy: model tier/cascade, turn and budget defaults, prompt strictness/conservatism, approval requirements before merge, and optional autonomous-execution blocks for `critical` tasks. This is intentionally pre-execution and complements (does not replace) post-execution PR risk/blast-radius analysis.
- Tiered validation pipeline — Three tiers of post-agent validation run sequentially after the agent finishes but before finalization. Each tier can fail the PR independently, and failure output is fed back to the agent for a fix cycle (capped at 2 retries per tier to bound cost). If the agent still fails, the PR is created with a validation report (labels, comments, and a risk summary) so the reviewer knows. All three tiers are implemented via the blueprint framework's Layer 2 custom steps (`phase: 'post-agent'`). See REPO_ONBOARDING.md for the 3-layer customization model, ORCHESTRATOR.md for the step execution contract, and EVALUATION.md for the full design.
- Tier 1 — Tool validation (build, test, lint) — Run deterministic tooling: test suites, linters, type checkers, SAST scanners, or a custom script. This is the existing "deterministic validation" concept. Binary pass/fail; failures are concrete (test output, lint errors) and actionable by the agent in a fix cycle. Already partially implemented via the system prompt instructing the agent to run tests.
- Tier 2 — Code quality analysis — Static analysis of the agent's diff against code quality principles: DRY (duplicated code detection), SOLID violations, design pattern adherence, complexity metrics (cyclomatic, cognitive), naming conventions, and repo-specific style rules (from onboarding config). Implemented as an LLM-based review step or a combination of static analysis tools (e.g. SonarQube rules, custom linters) and LLM judgment. Produces structured findings (severity, location, rule, suggestion) that the agent can act on in a fix cycle. Findings below a configurable severity threshold are advisory (included in the PR as comments) rather than blocking.
- Tier 3 — Risk and blast radius analysis — Analyze the scope and impact of the agent's changes to detect unintended side effects in other parts of the codebase. Includes: dependency graph analysis (what modules/functions consume the changed code), change surface area (number of files, lines, and modules touched), semantic impact assessment (does the change alter public APIs, shared types, configuration, or database schemas), and regression risk scoring. Produces a risk level (low / medium / high / critical) attached to the PR as a label and included in the validation report. High-risk changes may require explicit human approval before merge (foundation for the HITL approval mode in Iteration 6). The risk level considers: number of downstream dependents affected, whether the change touches shared infrastructure or core abstractions, test coverage of the affected area, and whether the change introduces new external dependencies.
- PR risk level and validation report — Every agent-created PR includes a structured validation report (as a PR comment or check run) summarizing: Tier 1 results (pass/fail per tool), Tier 2 findings (code quality issues by severity), and Tier 3 risk assessment (risk level, blast radius summary, affected modules). The PR is labeled with the computed risk level (`risk:low`, `risk:medium`, `risk:high`, `risk:critical`). The risk level is persisted in the task record for evaluation and trending. See EVALUATION.md.
- Other task types: PR review — Support at least one additional task type beyond "implement from issue": review a pull request (read-only or comment-only). The agent reads the PR diff, runs analysis (tests, lint, code review heuristics), and posts review comments. This uses a different blueprint (no branch creation, no PR creation — just analysis and comments) and a different system prompt. It validates that the platform is not hardwired to a single task type.
- Multi-modal input — Accept text and images (or other modalities) in the task payload; pass through to the agent. Gateway and schema support it; agent harness supports it where available. Primary use case: screenshots of bugs, UI mockups, or design specs attached to issues.
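As a concrete illustration of the Tier 3 scoring, a risk level could be derived from a weighted sum of change signals. Everything below (signal names, weights, thresholds) is a hypothetical sketch, not the shipped heuristic:

```typescript
// Hypothetical Tier 3 risk scorer. RiskSignals, computeRiskLevel, the
// weights, and the thresholds are illustrative assumptions for this sketch.
type RiskLevel = "low" | "medium" | "high" | "critical";

interface RiskSignals {
  downstreamDependents: number; // modules/functions consuming the changed code
  filesTouched: number;         // change surface area
  touchesSharedInfra: boolean;  // core abstractions, config, schemas
  changesPublicApi: boolean;    // exported types, endpoints, contracts
  testCoveragePct: number;      // coverage of the affected area (0-100)
  newExternalDeps: number;      // newly introduced external dependencies
}

function computeRiskLevel(s: RiskSignals): RiskLevel {
  let score = 0;
  score += Math.min(s.downstreamDependents, 20); // cap fan-out contribution
  score += Math.min(s.filesTouched, 10);
  if (s.touchesSharedInfra) score += 15;
  if (s.changesPublicApi) score += 10;
  if (s.testCoveragePct < 50) score += 10;       // poorly covered area
  score += s.newExternalDeps * 5;
  if (score >= 40) return "critical";
  if (score >= 25) return "high";
  if (score >= 10) return "medium";
  return "low";
}
```

The actual signals and thresholds would need calibration against real PR outcomes once outcome tracking (Iteration 3d) is in place.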
Builds on Iteration 3b: Memory is operational; this iteration changes the orchestrator blueprint (tiered validation pipeline, new task type) and broadens the input schema. These are independently testable from memory.
Goal: The primary feedback loop (PR reviews → memory → future tasks) is operational; automated evaluation provides measurable quality signals; PR outcomes are tracked as feedback.
- Review feedback memory loop (Tier 2) — Capture PR review comments via GitHub webhook, extract actionable rules via LLM, and persist them as searchable memory so the agent internalizes reviewer preferences over time. This is the primary feedback loop between human reviewers and the agent — no shipping coding agent does this today. Requires a GitHub webhook → API Gateway → Lambda pipeline (separate from agent execution). Two types of extracted knowledge: repo-level rules ("don't use `any` types") and task-specific corrections. See MEMORY.md (Review feedback memory) and SECURITY.md (prompt injection via review comments).
- PR outcome tracking — Track whether agent-created PRs are merged, revised, or rejected via GitHub webhooks (`pull_request.closed` events). A merged PR is a positive signal; closed-without-merge is a negative signal. These outcome signals feed into the evaluation pipeline and enable the episodic memory to learn which approaches succeed. See MEMORY.md (PR outcome signals) and EVALUATION.md.
- Evaluation pipeline (basic) — Automated evaluation of agent runs: failure categorization (reasoning errors, missed instructions, missing tests, timeouts, tool failures). Results are stored and surfaced in observability dashboards. Basic version: rules-based analysis of task outcomes and agent responses. Track memory effectiveness metrics: first-review merge rate, revision cycles, CI pass rate on first push, review comment density, and repeated mistakes. The advanced version (ML-based trace analysis, A/B prompt comparison, feedback loop into prompts) is deferred to Iteration 5. See EVALUATION.md and OBSERVABILITY.md.
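The core of PR outcome tracking is a small classification over the `pull_request.closed` payload. A minimal sketch — the function name and `OutcomeSignal` type are illustrative, while `pull_request.merged` is the standard field in GitHub's webhook schema:

```typescript
// Subset of GitHub's pull_request.closed webhook payload that the
// classifier needs; field names match the webhook schema.
interface PullRequestClosedEvent {
  action: "closed";
  pull_request: { merged: boolean; number: number };
  repository: { full_name: string };
}

// Illustrative signal type; the real pipeline would persist this with the
// task record and feed it to the evaluation pipeline.
type OutcomeSignal = "positive" | "negative";

function classifyPrOutcome(event: PullRequestClosedEvent): OutcomeSignal {
  // A merged PR is a positive signal; closed-without-merge is negative.
  return event.pull_request.merged ? "positive" : "negative";
}
```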
Builds on Iteration 3c: Validation and PR review task type are in place; this iteration adds new infrastructure (webhook → Lambda → LLM extraction pipeline) and connects the feedback loop. Review feedback requires prompt injection mitigations (see SECURITY.md).
Goal: Harden the memory system against both adversarial corruption (prompt injection into memory, poisoned tool outputs, experience grafting) and emergent corruption (hallucination crystallization, feedback loops, stale context accumulation). OWASP classifies this as ASI06 — Memory & Context Poisoning in the 2026 Top 10 for Agentic Applications.
Deep research identified 9 memory-layer security gaps in the current architecture (see the Memory Security Analysis section in MEMORY.md). The platform has strong network-layer security (VPC isolation, DNS Firewall, HTTPS-only egress) but lacks memory content validation, provenance tracking, trust scoring, anomaly detection, and rollback capabilities. Research shows that MINJA-style attacks achieve 95%+ injection success rates against undefended agent memory systems, and that emergent self-corruption (hallucination crystallization, error compounding feedback loops) is equally dangerous because it lacks an external attacker signature.
- Memory content sanitization — Add content validation in `loadMemoryContext()` (src/handlers/shared/memory.ts). Scan retrieved memory records for injection patterns (embedded instructions, system prompt overrides, command injection payloads) before including them in the agent's context. Implement a `sanitizeMemoryContent()` function that strips or flags suspicious patterns while preserving legitimate repository knowledge.
- GitHub issue input sanitization — Add trust-boundary-aware sanitization in `context-hydration.ts` for GitHub issue bodies and comments. These are attacker-controlled inputs that currently flow into the agent's context without differentiation. Strip control characters, embedded instruction patterns, and known injection payloads. Tag the content source as `untrusted-external` in the hydrated context.
- Source provenance on memory writes — Tag all memory writes with source provenance metadata. In `memory.ts` (`writeMinimalEpisode`) and `agent/memory.py` (`write_task_episode`, `write_repo_learnings`), add a `source_type` field to event metadata: `agent_episode`, `agent_learning`, `orchestrator_fallback`, `github_issue`, or `review_feedback`. This enables trust-differentiated retrieval in Phase 2.
- Content integrity hashing — Add SHA-256 content hashing on all memory writes. Store the hash in event metadata. At read time, verify that content has not been modified between write and read. Implementation: compute the hash before `CreateEventCommand`, store it as `content_hash` metadata, and verify it on `RetrieveMemoryRecordsCommand` results.
- Trust scoring at retrieval — Modify `loadMemoryContext()` to weight retrieved memories by temporal freshness, source type reliability, and pattern consistency with other memories. Memories from `orchestrator_fallback` and `agent_episode` sources receive higher trust than memories derived from external inputs. Entries below a configurable trust threshold are deprioritized or excluded from the 2,000-token budget.
- Configurable temporal decay — Implement per-entry TTL with configurable decay rates. Unverified or externally sourced memory entries decay faster (e.g., 30-day default) than agent-generated or human-confirmed entries (e.g., 365-day default). Add `trust_tier` and `decay_rate` to the memory metadata schema.
- Memory validation Lambda — Add a lightweight validation function triggered on `CreateEventCommand` (via an EventBridge rule on AgentCore events or as a post-write hook). The validator runs a classifier that checks whether new memory content looks like legitimate repository knowledge or could influence future agent behavior in unintended ways (the "guardian pattern"). Flag suspicious entries for operator review.
- Memory write anomaly detection — Instrument memory write operations with CloudWatch custom metrics: write frequency per repo, average content length, and source type distribution. Add CloudWatch Alarms for anomalous patterns (e.g., a burst of writes from a single task, unusually long content, or writes with the `untrusted-external` source type exceeding a threshold).
- Circuit breaker in orchestrator — Add circuit breaker logic in `orchestrator.ts`: if the agent's tool invocation patterns or memory write patterns deviate from a baseline (e.g., a sudden increase in memory writes, or writes containing instruction-like patterns), pause the task and emit an alert. The circuit breaker transitions the task to a new `MEMORY_REVIEW` state that requires operator intervention.
- Memory quarantine API — Expose operator API endpoints (`POST /v1/memory/quarantine`, `GET /v1/memory/quarantine`) for flagging and isolating suspicious memory entries. Quarantined entries are excluded from retrieval but preserved for forensic analysis.
- Memory rollback capability — Implement point-in-time memory snapshots. Before each task starts, snapshot the current memory state for the target repo (via the existing `loadMemoryContext` path, persisted to S3). If poisoning is detected post-task, operators can restore the repo's memory to the pre-task snapshot. Add a `POST /v1/memory/rollback` endpoint.
- Write-ahead validation (guardian model) — Route proposed memory writes through a smaller, cheaper model (e.g., Haiku) that evaluates whether the content is legitimate learned context or could be adversarial. Adds latency (~100-500ms per write) but catches sophisticated attacks that evade pattern-based sanitization. Configurable per-repo via Blueprint.
- Cross-task behavioral drift detection — Compare agent reasoning patterns and tool invocation sequences across tasks for the same repo. Detect drift from established baselines that could indicate memory-influenced behavioral manipulation. Implemented as a post-task analysis step in the evaluation pipeline.
- Cryptographic provenance chain — Implement Merkle tree-based provenance for memory entry chains per repo. Each new entry includes a hash of the previous entry, creating an append-only, tamper-evident chain. Enables cryptographic verification that no entries have been inserted, modified, or deleted between known-good checkpoints.
- Red team validation — Red team the memory system using published attack methodologies: MINJA (query-based memory injection), AgentPoison (RAG retrieval poisoning), and experience grafting. Document results and adjust defenses. Add automated red team tests to the evaluation pipeline using the DeepTeam framework (OWASP ASI06 attack categories).
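The write-time hashing and read-time verification described in the input-hardening items above can be sketched as follows. The function names are illustrative; the real code paths would wrap `CreateEventCommand` and `RetrieveMemoryRecordsCommand`:

```typescript
import { createHash } from "node:crypto";

// Compute the SHA-256 digest stored as content_hash metadata at write time.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// At read time, a mismatch means the record was modified between write and
// read and should be excluded from the agent's context (and flagged).
function verifyIntegrity(content: string, storedHash: string): boolean {
  return contentHash(content) === storedHash;
}
```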
- Memory metadata schema changes (`source_type`, `content_hash`, `trust_tier`, `decay_rate`) require `schema_version: "3"` and are not readable by v2 code paths without migration.
- The `MEMORY_REVIEW` task state is a new addition to the state machine (requires orchestrator, API contract, and observability updates).
- Trust-scored retrieval changes the memory context budget allocation, which may affect prompt version hashing.
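The `trust_tier` / `decay_rate` metadata could drive retrieval weighting with a simple half-life function. A sketch: the 30-day and 365-day defaults come from the temporal-decay item above, while the tier names and the 180-day agent half-life are illustrative assumptions:

```typescript
// Hypothetical trust-tier metadata on a memory entry.
interface MemoryEntryMeta {
  trustTier: "external" | "agent" | "human-confirmed";
  ageDays: number;
}

// Externally sourced entries decay fastest; human-confirmed slowest.
// The 180-day "agent" half-life is an assumption for this sketch.
const HALF_LIFE_DAYS: Record<MemoryEntryMeta["trustTier"], number> = {
  external: 30,
  agent: 180,
  "human-confirmed": 365,
};

// Retrieval weight halves every half-life; entries below a configurable
// cutoff would be excluded from the token budget.
function decayWeight(meta: MemoryEntryMeta): number {
  return Math.pow(0.5, meta.ageDays / HALF_LIFE_DAYS[meta.trustTier]);
}
```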
Builds on Iteration 3d: Review feedback memory and PR outcome tracking are in place; this iteration hardens the memory system that those components write to. The 4-phase approach allows incremental deployment with measurable security improvement at each phase.
Goal: Additional git providers; agent can run the app and attach visual proof; Slack integration; web dashboard for operators and users; real-time streaming.
- Additional git providers — Support GitLab (and optionally Bitbucket or others). Same workflow (clone, branch, commit, push, PR/MR). Provider-specific APIs, auth, and webhook adapters. The gateway and task schema are already channel-agnostic (repo is `owner/repo`); this iteration adds a `git_provider` field and provider-specific adapters. Onboarding (Iter 3a) must support non-GitHub repos.
- Live execution and visual proof — Agent can execute the application after build/tests, capture screenshots or videos as proof that changes work, and upload them (e.g. as PR attachments or to an S3 artifact store linked from the PR). Requires compute support: virtual display (Xvfb) or headless browser (Playwright/Puppeteer), capture scripts, and outbound upload. See COMPUTE.md (Visual proof). This may require a larger compute profile (more CPU/RAM/disk) or a dedicated "visual proof" step in the blueprint.
- Slack channel — Slack adapter for the input gateway: users can submit tasks, check status, and receive notifications from Slack. Inbound: verify Slack signing secret, normalize Slack payload to the internal message schema. Outbound: render internal notifications as Slack Block Kit messages, post to the originating channel/thread. Requires a Slack→platform user mapping. See INPUT_GATEWAY.md.
- Automated skills creation pipeline — Pipeline that creates or updates agent skills (or similar artifacts) from repo interaction or from onboarding. For example: the pipeline observes that a repo always requires `npm run lint:fix` before tests pass, and generates a skill or rule that the agent uses automatically. Builds on customization (Iter 3a) and memory (Iter 3b–3d).
- User preference memory (Tier 3) — Per-user memory for PR style, commit conventions, test coverage expectations, and other execution preferences. Extracted from task descriptions (explicit) and review feedback patterns (implicit). Lower priority than repo-level and review feedback memory, but enables personalization when multiple users submit tasks. See MEMORY.md (User preference memory, Tier 3).
- Control panel (web dashboard) — Web UI for operators and users: list tasks (with filters by status, repo, user), view task detail and status history, cancel tasks, link to agent logs, and show basic metrics (active tasks, submitted backlog, completion rate, error rate). Optional: submit a task from the UI (the panel becomes another channel via the input gateway). See CONTROL_PANEL.md. Tech stack TBD (e.g. React + AppSync or REST).
- Real-time event streaming (WebSocket) — Replace or supplement the polling-based `GET /v1/tasks/{id}/events` with an API Gateway WebSocket API for real-time task status updates. WebSocket is chosen over SSE because multiplayer sessions (Iteration 6) and iterative feedback require bidirectional communication. This improves the experience for the control panel, Slack integration, and the CLI `--wait` mode. Requires connection management (a DynamoDB connection table). See API_CONTRACT.md (OQ1).
- Live session replay and mid-task nudge — Extend WebSocket streaming with structured trajectory events (thinking steps, tool calls, cost, timing) for real-time session observation and post-hoc replay with timeline scrubbing. Add a "nudge" mechanism to inject one-shot course corrections between agent turns (via a `TaskNudges` table and mid-session message injection). Structured streaming with cost telemetry provides better debugging and operational visibility than raw terminal logs. Requires bidirectional WebSocket (same as real-time streaming) plus agent harness support for consuming nudge messages.
- Browser extension client — A lightweight Chrome/Firefox extension that lets users trigger tasks directly from the browser (e.g. while viewing a GitHub issue, click a button to submit it as a task). The extension calls the existing webhook API (Iteration 3a) with the current page's issue URL, requiring minimal new infrastructure — just a small client-side wrapper over the webhook endpoint. See INPUT_GATEWAY.md.
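Inbound Slack verification follows Slack's documented v0 signing scheme (HMAC-SHA256 over `v0:{timestamp}:{body}`). A minimal sketch — the timestamp-freshness check (rejecting requests older than about five minutes) is omitted here for brevity:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify a Slack request against the X-Slack-Signature header using the
// app's signing secret, per Slack's v0 signing scheme.
function verifySlackSignature(
  signingSecret: string,
  timestamp: string, // X-Slack-Request-Timestamp header
  rawBody: string,   // unparsed request body
  signature: string  // X-Slack-Signature header, e.g. "v0=ab12..."
): boolean {
  const baseString = `v0:${timestamp}:${rawBody}`;
  const expected =
    "v0=" + createHmac("sha256", signingSecret).update(baseString).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual requires equal lengths.
  return a.length === b.length && timingSafeEqual(a, b);
}
```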
Builds on Iteration 3d: Onboarding, memory (Tiers 1–2), evaluation, and validation are in place; adds git providers, visual proof, Slack, skills pipeline, user preference memory, control panel, real-time streaming, and browser extension.
Goal: Faster cold start, multi-user/team, full cost management, guardrails, and alternative runtime support.
- Automated container (devbox) from repo — Optionally derive or customize the agent container image from the repo (e.g. Dockerfile, dev container config, language-specific base images). Tied to onboarding: per-repo workload config. Reduces cold start for repos with known environments and ensures the agent has the right tools (compilers, SDKs, linters) pre-installed.
- CI/CD pipeline — Automated deployment pipeline for the platform itself: source → build → test → synth → deploy to staging → deploy to production. Use CDK Pipelines or equivalent. The current `npx projen deploy` workflow is not sufficient for a production orchestrator managing long-running tasks — deployments need to be safe (canary, rollback), auditable, and repeatable.
- Environment pre-warming (snapshot-on-schedule) — Pre-build container layers or repo snapshots (code + deps pre-installed) per repo; store in ECR or equivalent. Reduces cold start from minutes to seconds for known repos. The onboarding pipeline (Iter 3a) can trigger pre-warming as part of repo setup or on a schedule. Periodically snapshot the onboarded repo's container image (code + deps) to ECR, rebuild on push to the default branch (via webhook or EventBridge), and use that as the base for new sessions. Optionally begin sandbox warming when a user starts composing a task (proactive warming). Snapshot-based session starts (if AgentCore supports it) further reduce startup time. See COMPUTE.md.
- Multi-user / team support — Multiple users with shared task history, team-level visibility, and optionally shared approval queues or budgets. Adds a `team_id` or `org_id` to the task model. Team admins can view all tasks for their team, set team-level concurrency limits, and configure team-wide cost budgets. Builds on the existing task model (`user_id`, filters) and adds authorization rules (team members can view each other's tasks).
- Memory isolation for multi-tenancy — AgentCore Memory has no per-namespace IAM isolation. For multi-tenant deployments, private repo knowledge could leak cross-repo unless isolation is enforced. Options: silo model (separate memory resource per org — strongest), pool model (single resource with strict application-layer namespace scoping — sufficient for single-org), or shared model (intentional cross-repo learning — only for same-org repos). The onboarding pipeline should create or assign memory resources based on the isolation model. See SECURITY.md and MEMORY.md.
- Full cost management — per-user and per-team monthly budgets, cost attribution dashboards (cost per task, per repo, per user), alerts when budgets are approaching limits. Token usage and compute cost are tracked per task and aggregated. The control panel (Iter 4) displays cost dashboards.
- Adaptive model router with cost-aware cascade — Per-turn model selection via a lightweight heuristic engine. File reads and simple edits use a cheaper model (Haiku); multi-file refactors use Sonnet; complex reasoning escalates to Opus. Error escalation: if the agent fails twice on the same step, upgrade the model for the retry. As the cost budget ceiling approaches, cascade down to cheaper models. A Blueprint `modelCascade` config enables per-repo tuning. Potential 30–40% cost reduction on inference-dominated workloads. Requires agent harness changes to support mid-session model switching.
- Advanced evaluation and feedback loop — Extend the basic evaluation pipeline from Iteration 3d: ML-based or LLM-based trace analysis (not just rules), an A/B prompt comparison framework, automated feedback into prompt templates (e.g. "for repo X, always run tests before opening PR"), and per-repo or per-failure-type improvement tracking. Evaluation results can update the repo's agent configuration stored during onboarding. Optional patterns from adaptive teaching research (e.g. plan → targeted critique → execution; separate evaluator vs prompt/reflection roles; fitness from LLM judging plus efficiency metrics; evolution of teaching templates from failed trajectories with Pareto-style candidate sets for diverse failure modes) can inform offline or scheduled improvement of Blueprint prompts and checklists without replacing ABCA's core orchestrator.
- Formal orchestrator verification (TLA+) — Add a formal specification of the orchestrator in TLA+ and verify it with TLC model checking. Scope includes the task state machine (8 states, valid transitions, terminal states), concurrency admission control (atomic increment + max check), cancellation races (cancel arriving during any orchestration step), reconciler/orchestrator interleavings (counter drift correction while tasks are active), and the polling loop (agent writes terminal status, orchestrator observes and finalizes). Define invariants such as valid-state progression, no illegal transitions, and repo-level safety constraints (for example, at most one active `RUNNING` task per repo when configured). Keep the spec aligned with `src/constructs/task-status.ts` and orchestrator docs so regressions surface as model-check counterexamples before production.
- Guardrails — Natural-language or policy-based guardrails on agent tool calls using Amazon Bedrock Guardrails. Defends against prompt injection, restricts sensitive content generation, and enforces organizational policies (e.g. "do not modify files in `/infrastructure`"). See SECURITY.md. Guardrails configuration can be per-repo (via onboarding) or platform-wide.
- Capability-based security model — Fine-grained enforcement beyond Bedrock Guardrails, operating at three levels: (1) Tool-level capabilities — Bash command allowlist (git, npm, make permitted; curl, wget blocked), configurable per capability tier (standard / elevated / read-only). (2) File-system scope — Blueprint declares include/exclude path patterns; Write/Edit/Read tools are filtered to the declared scope. (3) Input trust scoring — authenticated user input = trusted; external GitHub issues = untrusted; PR review comments entering memory = adversarial. Trust level selects the capability set. Essential once review feedback memory (Iter 3d) introduces attacker-controlled content into the agent's context. A Blueprint `security` prop configures the capability profile per repo.
- Additional execution environment — Support an alternative to AgentCore Runtime (e.g. ECS/Fargate, EKS) behind the ComputeStrategy interface (see REPO_ONBOARDING.md). The orchestrator calls abstract methods (`startSession`, `stopSession`, `pollSession`); the implementation maps to AgentCore, Fargate, or EKS. Repos select the strategy via `compute_type` in their blueprint configuration. Reduces vendor lock-in and enables workloads that exceed AgentCore limits (e.g. GPU, larger images, longer sessions). The ComputeStrategy interface contract is defined in Iteration 3a; Iteration 5 adds alternative implementations.
- Full web dashboard — Extend the control panel from Iteration 4: detailed dashboards (cost, performance, evaluation), reasoning trace viewer or log explorer (linked to OpenTelemetry traces from AgentCore), task submit/cancel from the UI, and admin views (system health, capacity, user management).
- Customization (advanced) with tiered tool access — Agent can be extended with MCP servers, plugins, and skills beyond the basic prompt-from-repo customization in Iteration 3a. Composable tool sets per repo. MCP server discovery and lifecycle management. More tools increase behavioral unpredictability, so use a tiered tool access model: a minimal default tool set (bash allowlist, git, verify/lint/test) that all repos get, with MCP servers and plugins as opt-in per repo during onboarding. Per-repo tool profiles are stored in the onboarding config and loaded by the orchestrator. This balances flexibility with predictability. See SECURITY.md and REPO_ONBOARDING.md.
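The adaptive model router above could start as a small pure function. Model names, turn categories, and thresholds below are illustrative assumptions, not the actual cascade config:

```typescript
// Hypothetical per-turn model router sketch; categories, thresholds, and
// model names are assumptions for illustration.
type Model = "haiku" | "sonnet" | "opus";

interface TurnContext {
  kind: "read" | "simple-edit" | "multi-file-refactor" | "complex-reasoning";
  consecutiveFailures: number; // retries on the same step so far
  budgetUsedPct: number;       // share of the task's cost ceiling consumed
}

function selectModel(turn: TurnContext): Model {
  let model: Model =
    turn.kind === "complex-reasoning" ? "opus" :
    turn.kind === "multi-file-refactor" ? "sonnet" : "haiku";
  // Error escalation: two failures on the same step upgrade the model.
  if (turn.consecutiveFailures >= 2 && model !== "opus") {
    model = model === "haiku" ? "sonnet" : "opus";
  }
  // Cost-aware cascade: near the budget ceiling, step down a tier.
  if (turn.budgetUsedPct >= 90 && model !== "haiku") {
    model = model === "opus" ? "sonnet" : "haiku";
  }
  return model;
}
```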
Builds on Iteration 4: Adds pre-warming, multi-user, cost management, guardrails, alternate runtime, and advanced customization with tiered tool access.
Goal: Skills learned from repo interaction; multi-repo tasks; iterative human-agent collaboration; reusable CDK constructs.
- GitHub Actions integration — Publish a GitHub Action that triggers an ABCA task (e.g. on an issue label like `agent:fix`, on flaky-test detection, or on a PR comment command). The Action calls the webhook endpoint from Iteration 3a. A natural integration for GitHub-centric workflows.
- Automated pipeline for learning skills from repo interaction — Pipeline that observes agent interactions with repositories and produces reusable skills (rules, prompts, tools) that improve future runs. Builds on memory, code attribution, and evaluation. Example: the pipeline notices that tasks on repo X frequently fail because of a missing environment variable, and generates a rule so the agent always sets it.
- Agent swarm orchestration — Planner-worker architecture for complex, multi-file tasks that overwhelm a single agent session. A lightweight planner decomposes the task into a DAG of subtasks with scope boundaries and interface contracts. Each subtask runs as an independent child task in its own AgentCore session. A merge orchestrator cherry-picks commits, resolves conflicts, and runs the full test suite before opening one consolidated PR. New DynamoDB fields: `parent_task_id`, `child_task_ids[]`, `subtask_contract`. New blueprint steps: `decompose-task`, fan-out + wait-all, `merge-and-verify`. Naturally bounds PR size and enables work that no single-session agent can handle (large features, cross-cutting refactors, migrations).
- Multi-repo support — Tasks that span multiple repositories (e.g. change an API in repo A and update the consumer in repo B). Requires: multi-branch orchestration (one branch per repo), coordinated PR creation (linked PRs), cross-repo auth (GitHub App installations for both repos), and cross-repo testing. This is architecturally significant and needs a dedicated design doc before implementation.
- Iterative feedback and multiplayer sessions — User can send follow-up instructions to a completed or running task (e.g. "also add tests for X" or "change the approach to use library Y"). For completed tasks, the platform starts a new session on the same branch with the follow-up context. For running tasks, this requires message injection into a live session — which depends on agent harness support for session persistence and message channels. Design the interaction model carefully: what happens to in-flight work when instructions change? Multiplayer extension: allow multiple authorized users to inject context into a running or follow-up session (e.g. team code reviews or collaborative debugging with the agent). Per-prompt commit attribution (Iter 3b) supports tracking which user's input led to which changes.
- HITL approval mode — Optional mid-task approval gates for high-risk operations (e.g. "agent wants to delete 50 files — approve?"). The orchestrator pauses the task, emits a notification, and waits for user approval before continuing. Requires changes to the agent harness (pause/resume) and the orchestrator (a new `AWAITING_APPROVAL` state in the state machine).
- Scheduled triggers — Cron or schedule-based task creation (e.g. "run dependency update every Monday", "check for flaky tests nightly"). Implemented as EventBridge Scheduler rules that call the task creation API. Schedules are configured per repo during onboarding or via the control panel.
- CDK constructs — Publish reusable CDK constructs (e.g. `BackgroundAgentStack`, `OnboardingPipelineStack`, `TaskOrchestrator`) so other teams can compose the platform into their own CDK apps. Document construct APIs, publish to a construct library (e.g. Construct Hub), and version following semver.
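A transition guard for the extended state machine (including the proposed `AWAITING_APPROVAL` state) might look like the sketch below. The transition map itself is an illustrative assumption; only the terminal-state immutability rule (property P-ABCA-1) comes from the property catalog:

```typescript
// State names from the roadmap; the ALLOWED transition map is a sketch,
// not the actual orchestrator state machine.
type TaskState =
  | "SUBMITTED" | "HYDRATING" | "RUNNING" | "AWAITING_APPROVAL" | "FINALIZING"
  | "COMPLETED" | "FAILED" | "CANCELLED" | "TIMED_OUT";

const TERMINAL: ReadonlySet<TaskState> =
  new Set<TaskState>(["COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"]);

const ALLOWED: Record<TaskState, TaskState[]> = {
  SUBMITTED: ["HYDRATING", "CANCELLED", "FAILED"],
  HYDRATING: ["RUNNING", "CANCELLED", "FAILED"],
  RUNNING: ["AWAITING_APPROVAL", "FINALIZING", "CANCELLED", "FAILED", "TIMED_OUT"],
  AWAITING_APPROVAL: ["RUNNING", "CANCELLED", "TIMED_OUT"],
  FINALIZING: ["COMPLETED", "FAILED"],
  COMPLETED: [], FAILED: [], CANCELLED: [], TIMED_OUT: [],
};

function canTransition(from: TaskState, to: TaskState): boolean {
  // P-ABCA-1: terminal states are immutable — no further transitions.
  if (TERMINAL.has(from)) return false;
  return ALLOWED[from].includes(to);
}
```

Expressing the map as data keeps it easy to mirror in a TLA+ spec (Iteration 5) and to exercise with property-based tests.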
Builds on Iteration 5: Leverages memory, evaluation, and customization to close the loop (learn → improve); adds advanced workflows and exposes the platform as constructs.
- Iteration 1 — Core agent + git (isolated run, CLI submit, branch + PR, minimal task state).
- Iteration 2 — Production orchestrator, API contract, task management (list/status/cancel), durable execution, observability, threat model, network isolation, basic cost guardrails, CI/CD.
- Iteration 3a — Repo onboarding, DNS Firewall (domain-level egress filtering), webhook trigger, GitHub Actions, per-repo customization (prompt from repo), data retention, turn/iteration caps, cost budget caps, user prompt guide, agent harness improvements (turn budget, default branch, safety net, lint, softened conventions), operator dashboard, WAF, model invocation logging, input length limits.
- Iteration 3b ✅ — Memory Tier 1 (repo knowledge, task episodes), insights, agent self-feedback, prompt versioning, per-prompt commit attribution. CDK L2 construct with named semantic + episodic strategies using namespace templates (`/{actorId}/knowledge/`, `/{actorId}/episodes/{sessionId}/`), fail-open memory load/write, orchestrator fallback episode, SHA-256 prompt hashing, git trailer attribution.
- Iteration 3c — Per-repo GitHub App credentials, orchestrator pre-flight checks (fail-closed before session start), pre-execution task risk classification (model/limits/approval policy selection), tiered validation pipeline (tool validation, code quality analysis, post-execution risk/blast radius analysis), PR risk level, PR review task type, multi-modal input.
- Iteration 3d — Review feedback memory loop (Tier 2), PR outcome tracking, evaluation pipeline (basic).
- Iteration 3e — Memory security and integrity: input hardening (content sanitization, provenance tagging, integrity hashing), trust-aware retrieval (trust scoring, temporal decay, guardian validation), detection and response (anomaly detection, circuit breaker, quarantine, rollback), advanced protections (write-ahead validation, behavioral drift detection, cryptographic provenance, red teaming). Addresses OWASP ASI06 (Memory & Context Poisoning).
- Iteration 3bis (hardening) — Orchestrator IAM grant for Memory (was silently AccessDenied), memory schema versioning (`schema_version: "2"`), Python repo format validation, severity-aware error logging in Python memory, narrowed entrypoint try-catch, orchestrator fallback episode observability, conditional writes in the agent's task_state.py (`ConditionExpression` guards), orchestrator Lambda error alarm (CloudWatch, `retryAttempts: 0`), concurrency counter reconciliation (scheduled Lambda, drift correction), multi-AZ NAT documentation (already configurable), Python unit tests (pytest), entrypoint decomposition (4 extracted subfunctions), dual prompt assembly deprecation docstring, graceful thread drain in server.py (shutdown hook + atexit), dead QUEUED state removal (8 states, 4 active).
- Iteration 4 — Additional git providers, visual proof (screenshots/videos), Slack channel, skills pipeline, user preference memory (Tier 3), control panel (restrict CORS to dashboard origin), real-time event streaming (WebSocket), live session replay and mid-task nudge, browser extension client, MFA for production.
- Iteration 5 — Snapshot-on-schedule pre-warming, multi-user/team, memory isolation for multi-tenancy, full cost management, adaptive model router with cost-aware cascade, advanced evaluation (optional adaptive-teaching / trajectory-driven prompt patterns), formal orchestrator verification with TLA+/TLC, full Bedrock Guardrails (PII, denied topics, output filters), capability-based security model, alternate runtime, advanced customization with tiered tool access (MCP/plugins via AgentCore Gateway), full dashboard, AI-specific WAF rules.
- Iteration 6 — Agent swarm orchestration, skills learning, multi-repo, iterative feedback and multiplayer sessions, HITL approval, scheduled triggers, CDK constructs.
Design docs to keep in sync: ARCHITECTURE.md, ORCHESTRATOR.md, API_CONTRACT.md, INPUT_GATEWAY.md, REPO_ONBOARDING.md, MEMORY.md, OBSERVABILITY.md, COMPUTE.md, CONTROL_PANEL.md, SECURITY.md, EVALUATION.md.