Support multi-endpoint failover #103

@killme2008

Description

Background

The current multi-endpoint support performs only client-side endpoint selection. It maintains no endpoint health state and does not automatically avoid unhealthy nodes.

Today:

  • Unary calls (Write, Delete, HealthCheck, BulkWrite) pick one endpoint per call.
  • If the selected endpoint is unhealthy or a request fails due to a retryable transport failure, the client does not automatically retry on another healthy endpoint.
  • Streaming calls bind to one endpoint for the lifetime of the stream.

For HA deployments, this is incomplete. Multi-endpoint support should include failover behavior, not just endpoint selection.
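
As a sketch, the per-call selection described above behaves like a stateless round-robin picker (the struct and method names here are illustrative, not the client's real API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobinPicker mirrors the current behavior: each unary call picks
// the next endpoint in turn, with no health state and no retry on failure.
type roundRobinPicker struct {
	endpoints []string
	next      atomic.Uint64
}

// pick returns the next endpoint, regardless of whether it is reachable.
func (p *roundRobinPicker) pick() string {
	n := p.next.Add(1) - 1
	return p.endpoints[n%uint64(len(p.endpoints))]
}

func main() {
	p := &roundRobinPicker{endpoints: []string{"a:4001", "b:4001"}}
	// Even if "a:4001" is down, it keeps re-entering rotation.
	fmt.Println(p.pick(), p.pick(), p.pick()) // a:4001 b:4001 a:4001
}
```

Because the picker holds no health state, a down endpoint keeps being selected on subsequent calls, which is exactly the gap this issue describes.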

Problem

When one endpoint becomes unavailable or intermittently unhealthy:

  • the picker may continue selecting it
  • requests can repeatedly fail even if other endpoints are healthy
  • callers have to absorb transient endpoint failures themselves

This is especially problematic when the failure is clearly retryable, for example:

  • connection refused
  • connection reset
  • deadline exceeded while establishing or using the transport
  • transient gRPC Unavailable / similar transport-level failures

In those cases, the client should be able to avoid the unhealthy endpoint temporarily and retry on another endpoint.

Expected behavior

We should support multi-endpoint failover with endpoint health awareness.

At minimum:

  • detect retryable endpoint failures
  • temporarily mark the failed endpoint as unhealthy
  • retry the request on another healthy endpoint when the operation is safe to retry
  • periodically probe or expire unhealthy state so recovered endpoints can rejoin rotation

Design points

Open questions to define explicitly:

  • Which error classes are considered retryable?
  • Which APIs are safe to retry automatically?
  • How many retries should be attempted across endpoints?
  • How long should an endpoint stay out of rotation after failure?
  • Should recovery use passive expiration, active health check, or both?

A reasonable first scope would be:

  • unary APIs only
  • transport-level retryable failures only
  • bounded retries across remaining healthy endpoints
  • simple unhealthy TTL / backoff before the endpoint is eligible again

Non-goals for the first step

  • transparent mid-stream failover for an existing streaming session
  • retrying non-idempotent operations without explicit policy
  • complex circuit-breaker tuning from day one

Acceptance criteria

  • A failed retryable request can be retried on another healthy endpoint.
  • Endpoints that recently failed are temporarily removed from normal selection.
  • Recovered endpoints can re-enter rotation after a health check or timeout.
  • Tests cover endpoint failure, failover retry, and endpoint recovery.
  • README/API docs describe the failover behavior and retry boundaries clearly.
