Background
Current multi-endpoint support performs only client-side endpoint selection. It does not maintain endpoint health state or automatically avoid unhealthy nodes.
Today:
- Unary calls (Write, Delete, HealthCheck, BulkWrite) pick one endpoint per call.
- If the selected endpoint is unhealthy or a request fails due to a retryable transport failure, the client does not automatically retry on another healthy endpoint.
- Streaming calls bind to one endpoint for the lifetime of the stream.
For HA deployments, this is incomplete. Multi-endpoint support should include failover behavior, not just endpoint selection.
Problem
When one endpoint becomes unavailable or intermittently unhealthy:
- the picker may continue selecting it
- requests can repeatedly fail even if other endpoints are healthy
- callers have to absorb transient endpoint failures themselves
This is especially problematic when the failure is clearly retryable, for example:
- connection refused
- connection reset
- deadline exceeded while establishing or using the transport
- transient gRPC Unavailable / similar transport-level failures
In those cases, the client should be able to avoid the unhealthy endpoint temporarily and retry on another endpoint.
Expected behavior
We should support multi-endpoint failover with endpoint health awareness.
At minimum:
- detect retryable endpoint failures
- temporarily mark the failed endpoint as unhealthy
- retry the request on another healthy endpoint when the operation is safe to retry
- periodically probe or expire unhealthy state so recovered endpoints can rejoin rotation
Design points
Open questions to define explicitly:
- Which error classes are considered retryable?
- Which APIs are safe to retry automatically?
- How many retries should be attempted across endpoints?
- How long should an endpoint stay out of rotation after failure?
- Should recovery use passive expiration, active health check, or both?
A reasonable first scope would be:
- unary APIs only
- transport-level retryable failures only
- bounded retries across remaining healthy endpoints
- simple unhealthy TTL / backoff before the endpoint is eligible again
Non-goals for the first step
- transparent mid-stream failover for an existing streaming session
- retrying non-idempotent operations without explicit policy
- complex circuit-breaker tuning from day one
Acceptance criteria
- A failed retryable request can be retried on another healthy endpoint.
- Endpoints that recently failed are temporarily removed from normal selection.
- Recovered endpoints can re-enter rotation after a health check or timeout.
- Tests cover endpoint failure, failover retry, and endpoint recovery.
- README/API docs describe the failover behavior and retry boundaries clearly.