Enhanced Health Monitoring for MSK Connectivity #183

@benjaminpaige

Priority: Medium

Background:
Following the October 3rd MSK connectivity incident, we need enhanced monitoring to detect connectivity issues faster. The incident was caused by removal of the broad private network access rule (10.0.0.0/8) that allowed connectors to reach MSK.

Current Issue:
The testConnectors Lambda runs every minute, but it failed to detect the 3-day MSK outage that followed removal of the private network access rule.

Requirements:

  • Enhance testConnectors to validate MSK broker connectivity from the connector security group
  • Add validation that the broad private network access rule (10.0.0.0/8) exists
  • Implement graduated alerting (warn at 2 failures, critical at 5 failures)
  • Cross-reference with BigMAC platform health metrics
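
For discussion, the graduated alerting requirement could look roughly like the sketch below. It assumes "failures" means consecutive failed checks, which the issue does not specify. `AlertState` and `record_result` are hypothetical names, and because a Lambda keeps no state between invocations, the counter would have to be persisted somewhere (not shown).

```python
from dataclasses import dataclass

WARN_THRESHOLD = 2      # from the requirement: warn at 2 failures
CRITICAL_THRESHOLD = 5  # from the requirement: critical at 5 failures


@dataclass
class AlertState:
    """Hypothetical holder for the failure streak; persistence is out of scope here."""
    consecutive_failures: int = 0


def record_result(state: AlertState, check_passed: bool) -> str:
    """Update the streak and return the alert level: 'none', 'warn', or 'critical'.

    Assumption: thresholds apply to consecutive failures, and one success resets them.
    """
    if check_passed:
        state.consecutive_failures = 0
        return "none"
    state.consecutive_failures += 1
    if state.consecutive_failures >= CRITICAL_THRESHOLD:
        return "critical"
    if state.consecutive_failures >= WARN_THRESHOLD:
        return "warn"
    return "none"
```

With a one-minute schedule, this would warn about two minutes into an outage and escalate at about five minutes.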

Acceptance Criteria:

  • Health checks validate actual MSK broker connectivity (not just connector status)
  • Alert if private network access to MSK is blocked
  • Enhanced logging for connection attempts with broker-specific details
  • Integration with BigMAC platform monitoring dashboard
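
A minimal sketch of a broker-level check with per-broker logging, using only the Python standard library. The function names, the logger name, and the comma-separated `host:port` bootstrap string are assumptions for illustration; the existing testConnectors code may be structured differently. A plain TCP connect only shows that the network path is open and does not exercise TLS or authentication.

```python
import logging
import socket
import time

logger = logging.getLogger("testConnectors.msk")  # hypothetical logger name


def check_broker(host: str, port: int, timeout_s: float = 3.0) -> dict:
    """Attempt one TCP connection to a broker and return a per-broker result."""
    started = time.monotonic()
    result = {"broker": f"{host}:{port}", "reachable": False, "error": None}
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            result["reachable"] = True
    except OSError as exc:  # covers timeouts, refused connections, DNS failures
        result["error"] = f"{type(exc).__name__}: {exc}"
    result["elapsed_ms"] = round((time.monotonic() - started) * 1000, 1)
    log = logger.info if result["reachable"] else logger.warning
    log("msk broker check: %s", result)
    return result


def check_brokers(bootstrap_servers: str, timeout_s: float = 3.0) -> list[dict]:
    """Check every 'host:port' entry in a comma-separated bootstrap string."""
    results = []
    for entry in bootstrap_servers.split(","):
        host, _, port = entry.strip().rpartition(":")
        results.append(check_broker(host, int(port), timeout_s))
    return results
```

The check only reflects the connector's network path if it runs from the same subnets and security group as the connectors.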

Technical Implementation:

  • Test actual socket connections to MSK brokers from connector container
  • Validate presence of MSK security group ingress rule for private networks
  • Enhance alerting to distinguish connector issues from MSK accessibility issues
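
The rule validation and the connector-vs-MSK distinction might be sketched as below. `has_private_ingress` takes the `IpPermissions` list of a security group in the shape returned by EC2 `describe_security_groups`; fetching it is not shown. The function names, the status strings, and the broker port passed in are assumptions. Rules that reference other security groups rather than CIDR ranges are not considered.

```python
import ipaddress

REQUIRED_CIDR = "10.0.0.0/8"  # the rule named in this issue


def has_private_ingress(ip_permissions: list[dict], port: int,
                        required_cidr: str = REQUIRED_CIDR) -> bool:
    """True if some ingress rule admits all of required_cidr on the given TCP port."""
    required = ipaddress.ip_network(required_cidr)
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto not in ("tcp", "-1"):  # "-1" means all protocols
            continue
        if proto == "tcp" and not (
            perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        ):
            continue
        for ip_range in perm.get("IpRanges", []):
            allowed = ipaddress.ip_network(ip_range["CidrIp"], strict=False)
            if required.subnet_of(allowed):
                return True
    return False


def classify(connector_running: bool, brokers_reachable: bool,
             rule_present: bool) -> str:
    """Hypothetical status labels so alerts can name the likely cause."""
    if not rule_present:
        return "msk_access_rule_missing"
    if not brokers_reachable:
        return "msk_unreachable"
    if not connector_running:
        return "connector_failure"
    return "healthy"
```

Checking the rule first means that a repeat of the October 3rd scenario would be reported as a missing access rule rather than as a generic connector failure.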

Expected Outcome: Detect missing MSK access rules within minutes instead of days

Related: MSK Connectivity Incident - Private network access rule removal

Labels: enhancement (New feature or request)
