Priority: Medium
Background:
Following the October 3rd MSK connectivity incident, we need enhanced monitoring to detect connectivity issues faster. The incident was caused by the removal of the broad private network access rule (10.0.0.0/8) that allows connectors to reach MSK.
Current Issue:
The testConnectors Lambda runs every minute, but it failed to detect the 3-day MSK outage that followed the removal of the private network access rule.
Requirements:
Acceptance Criteria:
Technical Implementation:
- Test actual socket connections to MSK brokers from connector container
- Validate presence of MSK security group ingress rule for private networks
- Enhance alerting to distinguish between connector vs MSK accessibility issues
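The three items above could be sketched roughly as follows. This is a minimal illustration, not the existing testConnectors code. The function names, the broker port (9092 is MSK's plaintext port; TLS listeners use 9094) and the alert labels are all assumptions. The security group argument is assumed to be one entry from the EC2 DescribeSecurityGroups response, fetched separately.

```python
import ipaddress
import socket

PRIVATE_CIDR = "10.0.0.0/8"  # the rule removed in the incident
BROKER_PORT = 9092           # assumption: plaintext listener; adjust for TLS (9094)


def can_connect(host, port=BROKER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def has_private_ingress(security_group, port=BROKER_PORT, cidr=PRIVATE_CIDR):
    """Return True if an ingress rule allows `cidr` to reach `port` over TCP.

    `security_group` is a dict shaped like one item of the
    "SecurityGroups" list returned by EC2 DescribeSecurityGroups.
    """
    wanted = ipaddress.ip_network(cidr)
    for perm in security_group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "tcp":
            if not perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535):
                continue
        elif protocol != "-1":  # "-1" means all protocols and ports
            continue
        for ip_range in perm.get("IpRanges", []):
            allowed = ipaddress.ip_network(ip_range["CidrIp"], strict=False)
            if wanted.subnet_of(allowed):
                return True
    return False


def classify(rule_present, broker_reachable):
    """Map the two check results to a hypothetical alert label."""
    if not rule_present:
        return "msk_access_rule_missing"
    if not broker_reachable:
        return "msk_broker_unreachable"
    return "ok"
```

For the socket test to be meaningful, can_connect would need to run from the connector container (or the same subnets and security groups), since a check from elsewhere says nothing about the connector's network path. Reporting the rule check separately from the socket check is what lets an alert name the missing rule directly rather than a generic connectivity failure.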
Expected Outcome: Detect missing MSK access rules within minutes instead of days
Related: MSK Connectivity Incident - Private network access rule removal