
Single Deco node failure causes all device trackers to go unavailable #482

@johnib

Description

Problem

In a multi-node Deco mesh setup, TplinkDecoClientUpdateCoordinator._async_update_data() uses a bare asyncio.gather() to fetch client lists from all nodes in parallel. If any single node returns an error (HTTP 500, timeout, connection refused), the entire gather fails with an unhandled exception, and all device trackers go to unavailable — not just the ones connected to the failing node.

This causes false "away" presence detection events, which in turn trigger automations incorrectly (e.g. arming alarms or turning off lights while people are actually home).

How to reproduce

  1. Set up a Deco mesh with 2+ nodes
  2. Use device_tracker entities for presence detection
  3. Wait for one node to return an intermittent HTTP 500 (this happens regularly on Deco BE63 firmware 1.2.10, and has been reported across many Deco models)

Expected: Only clients on the failing node lose tracking; clients on healthy nodes continue updating normally.

Actual: All clients across all nodes go unavailable simultaneously.

Root cause

The current code in coordinator.py (line ~252):

```python
deco_client_responses = await asyncio.gather(
    *[
        async_call_and_propagate_config_error(
            self.api.async_list_clients, deco_mac
        )
        for deco_mac in deco_macs
    ]
)
```

asyncio.gather() propagates the first exception from any task, so a single node failure aborts the entire update cycle.

Proposed fix

I have a fix ready in PR form that:

  1. Wraps each per-node fetch in _async_fetch_clients_for_deco() which catches errors per-node while still propagating ConfigEntryAuthFailed
  2. Tracks which nodes failed (failed_deco_macs) vs succeeded
  3. Preserves last-known client state for devices on unreachable nodes instead of marking them offline
  4. Only raises UpdateFailed when all nodes fail simultaneously

I've been running this patch in production for over a month on a 2-node Deco BE63 mesh. It handles the common case (one flaky node returning 500s) gracefully — presence detection stays stable even with hundreds of per-node errors per week.
