
Single Deco node failure causes all device trackers to go unavailable #482

@johnib

Description

Problem

In a multi-node Deco mesh setup, TplinkDecoClientUpdateCoordinator._async_update_data() uses a bare asyncio.gather() to fetch client lists from all nodes in parallel. If any single node returns an error (HTTP 500, timeout, connection refused), the entire gather fails with an unhandled exception, and all device trackers go to unavailable — not just the ones connected to the failing node.

This causes false "away" presence detection events, which in turn trigger automations incorrectly (e.g. arming alarms or turning off lights while people are actually home).

How to reproduce

  1. Set up a Deco mesh with 2+ nodes
  2. Use device_tracker entities for presence detection
  3. Wait for one node to return an intermittent HTTP 500 (this happens regularly on Deco BE63 firmware 1.2.10, and has been reported across many Deco models)

Expected: Only clients on the failing node lose tracking; clients on healthy nodes continue updating normally.

Actual: All clients across all nodes go unavailable simultaneously.

Root cause

The current code in coordinator.py (line ~252):

```python
deco_client_responses = await asyncio.gather(
    *[
        async_call_and_propagate_config_error(
            self.api.async_list_clients, deco_mac
        )
        for deco_mac in deco_macs
    ]
)
```

asyncio.gather() propagates the first exception from any task, so a single node failure aborts the entire update cycle.

Proposed fix

I have a fix ready in PR form that:

  1. Wraps each per-node fetch in _async_fetch_clients_for_deco() which catches errors per-node while still propagating ConfigEntryAuthFailed
  2. Tracks which nodes failed (failed_deco_macs) vs succeeded
  3. Preserves last-known client state for devices on unreachable nodes instead of marking them offline
  4. Only raises UpdateFailed when all nodes fail simultaneously

I've been running this patch in production for over a month on a 2-node Deco BE63 mesh. It handles the common case (one flaky node returning 500s) gracefully — presence detection stays stable even with hundreds of per-node errors per week.
