Support multi-endpoint failover #103

@killme2008

Description

Background

The current multi-endpoint support performs only client-side endpoint selection. It maintains no endpoint health state and does not automatically avoid unhealthy nodes.

Today:

  • Unary calls (Write, Delete, HealthCheck, BulkWrite) pick one endpoint per call.
  • If the selected endpoint is unhealthy or a request fails due to a retryable transport failure, the client does not automatically retry on another healthy endpoint.
  • Streaming calls bind to one endpoint for the lifetime of the stream.

For HA deployments, this is incomplete. Multi-endpoint support should include failover behavior, not just endpoint selection.
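
As a sketch, the per-call selection described above behaves like a stateless round-robin picker (the struct and method names here are illustrative, not the client's real API):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobinPicker mirrors the current behavior: each unary call picks
// the next endpoint in turn, with no health state and no retry on failure.
type roundRobinPicker struct {
	endpoints []string
	next      atomic.Uint64
}

// pick returns the next endpoint, regardless of whether it is reachable.
func (p *roundRobinPicker) pick() string {
	n := p.next.Add(1) - 1
	return p.endpoints[n%uint64(len(p.endpoints))]
}

func main() {
	p := &roundRobinPicker{endpoints: []string{"a:4001", "b:4001"}}
	// Even if "a:4001" is down, it keeps re-entering rotation.
	fmt.Println(p.pick(), p.pick(), p.pick()) // a:4001 b:4001 a:4001
}
```

Because the picker holds no health state, a down endpoint keeps being selected on subsequent calls, which is exactly the gap this issue describes.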

Problem

When one endpoint becomes unavailable or intermittently unhealthy:

  • the picker may continue selecting it
  • requests can repeatedly fail even if other endpoints are healthy
  • callers have to absorb transient endpoint failures themselves

This is especially problematic when the failure is clearly retryable, for example:

  • connection refused
  • connection reset
  • deadline exceeded while establishing or using the transport
  • transient gRPC Unavailable / similar transport-level failures

In those cases, the client should be able to avoid the unhealthy endpoint temporarily and retry on another endpoint.

Expected behavior

We should support multi-endpoint failover with endpoint health awareness.

At minimum:

  • detect retryable endpoint failures
  • temporarily mark the failed endpoint as unhealthy
  • retry the request on another healthy endpoint when the operation is safe to retry
  • periodically probe or expire unhealthy state so recovered endpoints can rejoin rotation

Design points

Open questions to define explicitly:

  • Which error classes are considered retryable?
  • Which APIs are safe to retry automatically?
  • How many retries should be attempted across endpoints?
  • How long should an endpoint stay out of rotation after failure?
  • Should recovery use passive expiration, active health check, or both?

A reasonable first scope would be:

  • unary APIs only
  • transport-level retryable failures only
  • bounded retries across remaining healthy endpoints
  • simple unhealthy TTL / backoff before the endpoint is eligible again

Non-goals for the first step

  • transparent mid-stream failover for an existing streaming session
  • retrying non-idempotent operations without explicit policy
  • complex circuit-breaker tuning from day one

Acceptance criteria

  • A failed retryable request can be retried on another healthy endpoint.
  • Endpoints that recently failed are temporarily removed from normal selection.
  • Recovered endpoints can re-enter rotation after a health check or timeout.
  • Tests cover endpoint failure, failover retry, and endpoint recovery.
  • README/API docs describe the failover behavior and retry boundaries clearly.
