Problem
In a multi-node Deco mesh setup, `TplinkDecoClientUpdateCoordinator._async_update_data()` uses a bare `asyncio.gather()` to fetch client lists from all nodes in parallel. If any single node returns an error (HTTP 500, timeout, connection refused), the entire gather fails with an unhandled exception, and all device trackers go to `unavailable` — not just the ones connected to the failing node.
This causes false "away" presence detection events, which in turn trigger automations incorrectly (e.g. arming alarms, turning off lights while people are actually home).
How to reproduce
- Set up a Deco mesh with 2+ nodes
- Use `device_tracker` entities for presence detection
- Wait for one node to return an intermittent HTTP 500 (this happens regularly on Deco BE63 firmware 1.2.10, and has been reported across many Deco models)
Expected: Only clients on the failing node lose tracking; clients on healthy nodes continue updating normally.
Actual: All clients across all nodes go unavailable simultaneously.
Root cause
The current code in `coordinator.py` (line ~252):
```python
deco_client_responses = await asyncio.gather(
    *[
        async_call_and_propagate_config_error(
            self.api.async_list_clients, deco_mac
        )
        for deco_mac in deco_macs
    ]
)
```
`asyncio.gather()` propagates the first exception from any task, so a single node failure aborts the entire update cycle.
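A standalone script makes the failure mode easy to see, along with the standard stdlib mitigation (`return_exceptions=True`); the node names and error here are invented for the demo:

```python
import asyncio


async def fetch_clients(node: str) -> str:
    """Simulate one node's client-list fetch; node-b is the flaky one."""
    if node == "node-b":
        raise ConnectionError("HTTP 500 from node-b")
    return f"clients from {node}"


async def main() -> None:
    nodes = ["node-a", "node-b", "node-c"]

    # Bare gather: the first exception propagates and the combined
    # result is lost, even though node-a and node-c succeeded.
    try:
        await asyncio.gather(*(fetch_clients(n) for n in nodes))
    except ConnectionError as err:
        print(f"entire update lost: {err}")

    # With return_exceptions=True, each task's result or exception
    # comes back individually, so healthy nodes still produce data.
    results = await asyncio.gather(
        *(fetch_clients(n) for n in nodes), return_exceptions=True
    )
    for node, result in zip(nodes, results):
        print(node, "->", result)


asyncio.run(main())
```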
Proposed fix
I have a fix ready in PR form that:
- Wraps each per-node fetch in `_async_fetch_clients_for_deco()`, which catches errors per node while still propagating `ConfigEntryAuthFailed`
- Tracks which nodes failed (`failed_deco_macs`) vs. succeeded
- Preserves last-known client state for devices on unreachable nodes instead of marking them offline
- Only raises `UpdateFailed` when all nodes fail simultaneously (a minimal sketch follows this list)
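In outline, the per-node handling looks like the following minimal sketch (not the PR itself — these would be methods on the coordinator class; `self.api`, `self.logger`, and how the node list is stored on `self` are assumptions here):

```python
import asyncio

from homeassistant.exceptions import ConfigEntryAuthFailed
from homeassistant.helpers.update_coordinator import UpdateFailed


async def _async_fetch_clients_for_deco(self, deco_mac):
    """Fetch one node's client list, isolating per-node failures."""
    try:
        return await self.api.async_list_clients(deco_mac)
    except ConfigEntryAuthFailed:
        raise  # auth failures must still bubble up to trigger reauth
    except Exception as err:  # flaky node: log it, don't abort the cycle
        self.logger.warning("Deco %s failed to update: %s", deco_mac, err)
        return None


async def _async_update_data(self):
    deco_macs = list(self._deco_macs)  # assumption: node list kept on self
    responses = await asyncio.gather(
        *(self._async_fetch_clients_for_deco(mac) for mac in deco_macs)
    )
    failed_deco_macs = {
        mac for mac, resp in zip(deco_macs, responses) if resp is None
    }
    if failed_deco_macs == set(deco_macs):
        raise UpdateFailed("All Deco nodes failed to return clients")
    # For nodes in failed_deco_macs, keep each device's last-known
    # state (from the previous self.data) instead of marking it offline.
    ...
```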
I've been running this patch in production for over a month on a 2-node Deco BE63 mesh. It handles the common case (one flaky node returning 500s) gracefully — presence detection stays stable even with hundreds of per-node errors per week.
Related issues
- False `not_home` presence reports
- `asyncio.gather` → `TimeoutException` failure path