AWS Resilience Skills

A collection of AI-powered Agent Skills for comprehensive AWS system resilience — from maturity assessment through risk analysis to chaos engineering validation. Built for Claude Code, Kiro, and any AI coding assistant that supports the skill/prompt framework.

How the Three Skills Fit Together

These skills map to the AWS Resilience Lifecycle Framework, forming a complete resilience improvement pipeline:

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                        AWS Resilience Lifecycle Framework                            │
│                                                                                     │
│  Stage 1: Set Objectives    Stage 2: Design & Implement    Stage 3: Evaluate & Test │
│  ┌───────────────────┐      ┌───────────────────────┐      ┌─────────────────────┐  │
│  │  aws-rma-          │      │  resilience-            │      │  chaos-engineering-  │  │
│  │  assessment        │─────►│  modeling               │─────►│  on-aws              │  │
│  │                    │      │                        │      │                      │  │
│  │  "Where are we?"   │      │  "What could go wrong?"│      │  "Does it actually   │  │
│  │                    │      │                        │      │   break?"             │  │
│  └───────────────────┘      └───────────────────────┘      └──────────┬───────────┘  │
│                                        ▲                              │              │
│                                        └──────── Feedback Loop ───────┘              │
└─────────────────────────────────────────────────────────────────────────────────────┘

| # | Skill | Lifecycle Stage | Input | Output |
| --- | --- | --- | --- | --- |
| 1 | `aws-rma-assessment` | Stage 1: Set Objectives | Guided Q&A with stakeholders | Resilience maturity score + improvement roadmap |
| 2 | `aws-resilience-modeling` | Stage 2: Design & Implement | AWS account access or architecture docs | Risk inventory + resource scan + mitigation strategies |
| 3 | `chaos-engineering-on-aws` | Stage 3: Evaluate & Test | Assessment report from Skill #2 | Experiment results + validation report + updated resilience score |

Recommended Workflow

1. Start with RMA (`aws-rma-assessment`): understand your organization's resilience maturity level and set improvement objectives.
2. Run the resilience assessment (`aws-resilience-modeling`): deep-dive into your AWS infrastructure to identify specific risks and failure modes.
3. Execute chaos engineering (`chaos-engineering-on-aws`): validate the findings through controlled fault injection experiments on real infrastructure.
4. Close the loop: feed experiment results back into the assessment to update risk scores and track improvement.

Skills Overview

1. RMA Assessment Assistant (`aws-rma-assessment`)

What it does: Interactive Resilience Maturity Assessment through guided Q&A, based on the AWS Resilience Maturity Assessment methodology.

Best for: Initial engagement — understanding where your organization stands on the resilience maturity spectrum.

Key features:

- Structured questionnaire covering resilience dimensions
- Maturity scoring aligned with AWS Well-Architected Framework
- Improvement roadmap with prioritized recommendations
- Interactive HTML report with visualizations

Invoke: Mention "RMA assessment" or "resilience maturity" in conversation.

2. Resilience Modeling (`aws-resilience-modeling`)

What it does: Comprehensive technical resilience analysis of AWS infrastructure — maps components, identifies failure modes, rates risks, and generates actionable mitigation strategies.

Best for: Deep technical analysis — finding specific vulnerabilities in your AWS architecture.

Key features:

- Automated AWS resource scanning via CLI/MCP
- Failure mode identification and classification (SPOF, latency, load, misconfiguration, shared fate)
- 9-dimension resilience scoring (5-star rating)
- Risk-prioritized inventory with mitigation strategies
- Structured output consumed by the Chaos Engineering skill
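
As an illustration of what a risk-prioritized inventory might look like, here is a minimal Python sketch. The `Risk` fields, the 1-5 likelihood/impact scales, and the `likelihood * impact` score are assumptions made for this example; they are not the schema or scoring the skill actually emits (see the skill's own `SKILL.md` and references for that).

```python
from dataclasses import dataclass

# Failure-mode categories named in the feature list above.
FAILURE_MODES = {"SPOF", "latency", "load", "misconfiguration", "shared fate"}


@dataclass
class Risk:
    """One inventory entry. All field names are hypothetical."""
    resource: str       # free-form label for the affected resource
    failure_mode: str   # must be one of FAILURE_MODES
    likelihood: int     # assumed scale: 1 (rare) to 5 (frequent)
    impact: int         # assumed scale: 1 (minor) to 5 (severe)
    mitigation: str     # short description of the proposed fix

    def __post_init__(self):
        if self.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {self.failure_mode}")

    @property
    def score(self) -> int:
        # Assumed scoring rule for this sketch: likelihood times impact.
        return self.likelihood * self.impact


def prioritize(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)
```

Sorting by a single numeric score keeps the example short; a real inventory would likely weigh more dimensions than these two.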

Invoke: Mention "AWS resilience assessment" or "韧性评估" in conversation.

3. Chaos Engineering on AWS (`chaos-engineering-on-aws`)

What it does: Executes the complete chaos engineering lifecycle — from experiment design through controlled fault injection to results analysis — using AWS FIS and optional Chaos Mesh.

Best for: Validation through action — proving (or disproving) that your system handles failures as expected.

Key features:

- Six-step workflow: Target → Resources → Hypothesis → Pre-flight → Execute → Report
- Dual engine: AWS FIS for infrastructure faults (node termination, AZ isolation, DB failover) + Chaos Mesh for Pod/container faults
- Hybrid monitoring: background metric collection + agent-driven FIS status polling
- State persistence across long-running experiments
- Markdown + HTML dual-format reports with MTTR analysis
- Game Day mode for team exercises
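
The FIS status polling mentioned above can be sketched roughly as follows. This is not the skill's own implementation: `wait_for_experiment`, its parameters, and the default intervals are hypothetical. The only API assumed is the FIS `get_experiment(id=...)` call as exposed by a boto3 `fis` client, whose response carries the status under `experiment.state.status`.

```python
import time

# Experiment states after which FIS makes no further progress.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(fis_client, experiment_id,
                        poll_seconds=15.0, timeout_seconds=1800.0,
                        sleep=time.sleep):
    """Poll an FIS experiment until it reaches a terminal state.

    fis_client is any object with a get_experiment(id=...) method,
    for example boto3.client("fis"). Returns the final status string.
    """
    waited = 0.0
    while True:
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"experiment {experiment_id} still '{status}' "
                f"after {waited:.0f}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing the client and the `sleep` function in as arguments keeps the loop testable without AWS credentials; in real use you would pass `boto3.client("fis")` and leave `sleep` at its default.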

Invoke: Mention "chaos engineering", "fault injection", or "混沌工程" in conversation.

Fault Injection Tool Selection

Based on E2E testing, the chaos engineering skill enforces a clear division:

| Layer | Tool | Examples |
| --- | --- | --- |
| Infrastructure (nodes, network, databases) | AWS FIS | `aws:eks:terminate-nodegroup-instances`, `aws:network:disrupt-connectivity`, `aws:rds:failover-db-cluster` |
| Pod/Container (application-level) | Chaos Mesh | `PodChaos`, `NetworkChaos`, `HTTPChaos`, `StressChaos` |

⚠️ FIS `aws:eks:pod-*` actions are not recommended for Pod-level faults: they require additional Kubernetes ServiceAccount/RBAC setup and initialize slowly (>2 min). Use Chaos Mesh instead.
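
The division above amounts to a simple routing rule. The sketch below expresses it in Python for illustration only: the fault labels (`"node-termination"`, `"pod-kill"`, and so on) and the `select_tool` helper are hypothetical names invented here, while the FIS action IDs and Chaos Mesh kinds are the ones listed in the table.

```python
# Infrastructure-level faults, routed to AWS FIS actions.
FIS_ACTIONS = {
    "node-termination": "aws:eks:terminate-nodegroup-instances",
    "network-disruption": "aws:network:disrupt-connectivity",
    "db-failover": "aws:rds:failover-db-cluster",
}

# Pod/container-level faults, routed to Chaos Mesh resource kinds.
CHAOS_MESH_KINDS = {
    "pod-kill": "PodChaos",
    "pod-network": "NetworkChaos",
    "http-fault": "HTTPChaos",
    "resource-stress": "StressChaos",
}


def select_tool(fault: str) -> tuple[str, str]:
    """Map a (hypothetical) fault label to (tool, action ID or kind)."""
    if fault in FIS_ACTIONS:
        return ("AWS FIS", FIS_ACTIONS[fault])
    if fault in CHAOS_MESH_KINDS:
        return ("Chaos Mesh", CHAOS_MESH_KINDS[fault])
    raise ValueError(f"no tool mapping for fault: {fault}")
```

For example, `select_tool("pod-kill")` yields a Chaos Mesh `PodChaos` experiment rather than an FIS `aws:eks:pod-*` action, in line with the warning above.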

Features

- Based on AWS Well-Architected Framework Reliability Pillar (2025)
- Integrates AWS Resilience Analysis Framework (Error Budgets, SLO/SLI/SLA)
- Full Chaos Engineering lifecycle (AWS FIS + Chaos Mesh)
- AWS Observability Best Practices (CloudWatch, X-Ray, Distributed Tracing)
- Cloud Design Patterns (Circuit Breaker, Bulkhead, Retry)
- Interactive HTML reports with Chart.js visualizations and Mermaid architecture diagrams

Prerequisites

1. AI Coding Assistant

Any AI coding assistant that supports custom skills: Claude Code, Kiro, Cursor, or similar.

2. Setup

git clone https://github.com/aws-samples/sample-gcr-resilience-skill.git

Copy the skill directories into your project's skills folder, or reference them directly.
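
If you prefer to script the copy step, a minimal Python sketch might look like the following. The `install_skills` helper is hypothetical, and the destination folder is an assumption: where an assistant looks for skills differs by tool, so check its documentation before choosing a path.

```python
import shutil
from pathlib import Path

# The three skill directories at the root of this repository.
SKILLS = (
    "aws-rma-assessment",
    "aws-resilience-modeling",
    "chaos-engineering-on-aws",
)


def install_skills(repo_dir, skills_dir):
    """Copy each skill directory from the cloned repo into skills_dir.

    Directories without a SKILL.md are skipped. Returns the names copied.
    """
    repo, dest = Path(repo_dir), Path(skills_dir)
    dest.mkdir(parents=True, exist_ok=True)
    installed = []
    for name in SKILLS:
        source = repo / name
        if not (source / "SKILL.md").is_file():
            continue  # not a skill directory, or the clone is incomplete
        # dirs_exist_ok lets a re-run refresh an earlier copy (Python 3.8+).
        shutil.copytree(source, dest / name, dirs_exist_ok=True)
        installed.append(name)
    return installed
```

A call such as `install_skills("sample-gcr-resilience-skill", ".claude/skills")` would copy all three skills; the `.claude/skills` destination is only an example path.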

3. AWS Access (Recommended)

- AWS account with read-only access (for assessment) or experiment permissions (for chaos engineering)
- AWS CLI configured with appropriate credentials
- Optional: MCP servers for enhanced automation (see MCP_SETUP_GUIDE.md in the `aws-resilience-modeling` and `chaos-engineering-on-aws` skill folders)

Project Structure

.
├── aws-rma-assessment/                # Resilience Maturity Assessment
│   ├── SKILL.md                       # Skill definition
│   ├── README.md                      # Skill documentation
│   └── references/                    # Reference documents
│       ├── questions-data.json        # 80 assessment questions (JSON)
│       ├── questions-priority.md      # Priority classification (P0-P3)
│       ├── question-groups.md         # Batch Q&A grouping strategy
│       └── report-template.md         # Report generation template
│
├── aws-resilience-modeling/           # Technical Resilience Assessment
│   ├── SKILL.md                       # Skill definition
│   ├── README.md                      # Skill documentation
│   ├── references/                    # Reference documents
│   │   ├── resilience-framework.md    # AWS best practices reference
│   │   ├── common-risks-reference.md  # 50+ common AWS risk patterns
│   │   ├── report-generation.md       # Report generation guide
│   │   ├── MCP_SETUP_GUIDE.md         # MCP server configuration
│   │   └── ...
│   ├── scripts/
│   │   └── generate-html-report.py    # HTML report generation script
│   └── assets/
│       ├── html-report-template.html  # Interactive HTML report template
│       └── example-report-template.md # Markdown report example
│
├── chaos-engineering-on-aws/          # Chaos Engineering Experiments
│   ├── SKILL.md                       # Skill definition (6-step workflow)
│   ├── MCP_SETUP_GUIDE.md             # MCP server configuration
│   ├── references/                    # Progressive-disclosure reference docs
│   │   ├── fis-actions.md             # AWS FIS actions reference
│   │   ├── chaosmesh-crds.md          # Chaos Mesh CRD reference
│   │   ├── report-templates.md        # Report templates (MD + HTML)
│   │   └── gameday.md                 # Game Day execution guide
│   ├── examples/                      # Experiment scenario examples
│   │   ├── 01-ec2-terminate.md        # EC2 instance termination
│   │   ├── 02-rds-failover.md         # RDS Aurora failover
│   │   ├── 03-eks-pod-kill.md         # EKS Pod kill (Chaos Mesh)
│   │   └── 04-az-network-disrupt.md   # AZ network isolation
│   ├── scripts/
│   │   └── monitor.sh                 # CloudWatch metric collection script
│   └── doc/                           # Design documents (PRD, decisions)
│
├── README.md                          # This file
└── README_zh.md                       # Chinese version

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
