
Flaky E2E: exporter deadlocks when lease ends before before_lease_hook is set (no-hook exporters) #567

@ambient-code


Summary

Flaky E2E test "can lease and connect to exporters" fails ~10% of the time with Error: Connection to exporter lost. The exporter remains stuck in Available status and never transitions to LeaseReady, causing all client Dial retries to time out.

Example failure: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/24248226337/job/70800799224#step:10:842

Related (red herring): #414 — focused on listenQueues cleanup race; closed as not the real cause.

Root Cause Analysis

The deadlock

In exporter.py, for exporters without a hook_executor, the before_lease_hook event is only set at line 740, inside the conn_tg task group:

# exporter.py lines 695-747
try:
    async with create_task_group() as conn_tg:
        conn_tg.start_soon(self._retry_stream, ...)   # line 698
        conn_tg.start_soon(wait_for_lease_end)          # line 732
        conn_tg.start_soon(process_connections)          # line 733

        # This is the ONLY place before_lease_hook is set for no-hook exporters
        if not self.hook_executor:                       # line 738
            await self._report_status(...)               # line 739
            lease_scope.before_lease_hook.set()           # line 740  <-- NEVER REACHED
finally:
    await listen_tx.aclose()
    await self._cleanup_after_lease(lease_scope)          # line 747

When the lease ends quickly (within the same second it was assigned), wait_for_lease_end() fires and calls conn_tg.cancel_scope.cancel(), which cancels the host task before line 740 executes. The before_lease_hook event is never set.

Then in _cleanup_after_lease() (lines 590-629):

with CancelScope(shield=True):
    await lease_scope.before_lease_hook.wait()  # line 600 — BLOCKS FOREVER
    ...
    lease_scope.after_lease_hook_done.set()      # line 626 — NEVER REACHED

And in serve() (line 821):

await lease_ctx.after_lease_hook_done.wait()    # BLOCKS FOREVER — serve() is stuck

Result: serve() can never process the next status update. The exporter is permanently stuck.
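
The chain above can be modelled in a few lines. This is a minimal sketch, not the exporter code: it uses asyncio from the standard library instead of anyio so it runs without dependencies, and LeaseScope, handle_lease, cleanup_after_lease, and serve_is_stuck are hypothetical stand-ins named after the functions in this issue. asyncio delivers a cancellation only once, so the wait in the finally block behaves much like the shielded wait at line 600.

```python
import asyncio


class LeaseScope:
    """Hypothetical stand-in holding the two events named in this issue."""

    def __init__(self) -> None:
        self.before_lease_hook = asyncio.Event()
        self.after_lease_hook_done = asyncio.Event()


async def cleanup_after_lease(scope: LeaseScope) -> None:
    # Models the wait at line 600: blocks until before_lease_hook is set.
    await scope.before_lease_hook.wait()
    scope.after_lease_hook_done.set()


async def handle_lease(scope: LeaseScope) -> None:
    try:
        # Stand-in for the work before line 740; the early lease end
        # cancels the task while it is suspended here.
        await asyncio.sleep(3600)
        scope.before_lease_hook.set()  # never reached when cancelled early
    finally:
        await cleanup_after_lease(scope)


async def serve_is_stuck(probe_timeout: float = 0.2) -> bool:
    scope = LeaseScope()
    task = asyncio.create_task(handle_lease(scope))
    await asyncio.sleep(0)  # let handle_lease reach its first await
    task.cancel()  # the lease ends before the event is set
    try:
        # serve() waits on after_lease_hook_done; probe it with a timeout
        # instead of hanging forever.
        await asyncio.wait_for(scope.after_lease_hook_done.wait(), probe_timeout)
        stuck = False
    except asyncio.TimeoutError:
        stuck = True
    task.cancel()  # unblock the model so the event loop can shut down
    await asyncio.gather(task, return_exceptions=True)
    return stuck


if __name__ == "__main__":
    print("serve() stuck:", asyncio.run(serve_is_stuck()))
```

In this model the probe always times out, which corresponds to serve() never seeing after_lease_hook_done.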

How the E2E test triggers it

The test sequence creates rapid lease/unlease cycles on the same exporter:

  1. "can operate on leases" — creates a lease on test-exporter-oidc, then deletes it (no Dial)
  2. "paginated lease listing" — creates 101 leases on the same exporter, then deletes all
  3. "lease listing shows expires at" — creates lease, deletes it
  4. "can transfer lease" — creates lease, transfers client, deletes it
  5. "can lease and connect" — the failing test: tries to shell into the exporter

Each create lease / delete leases cycle sends leased=true followed by leased=false via the Status stream. If any of these cycles ends before handle_lease() reaches line 740, the deadlock occurs and serve() is permanently blocked.

Evidence from CI logs

The exporter logs for test-exporter-oidc end abruptly:

Lease ended event received, stopping connection handling

There is no subsequent "Updated status to AVAILABLE" or "afterLease hook completed" — confirming _cleanup_after_lease is blocked at line 600.

The controller logs show the new lease was assigned and status update sent at 14:50:31, but the exporter never processes it. The Dial retries loop from 14:50:31 to 14:50:51 (20s), all seeing Available status:

Exporter in Available status, waiting for lease setup  attempt=1  retryDelay=500ms
Exporter in Available status, waiting for lease setup  attempt=2  retryDelay=1s
...
Dial rejected due to exporter status  status=Available  error="exporter is not ready (status: Available)"

Why prior mitigations didn't help

Commits 2264a0d, f400b21, and f473ede added server-side Dial retry with exponential backoff (up to 30s). These help when the exporter is slow to transition. They do not help when the exporter is deadlocked and will never transition.
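
To illustrate the difference, here is a rough simulation on a virtual clock. It is not the controller's retry code: dial_with_retry and status_at are hypothetical names, the 500ms first delay and doubling follow the log lines above, the 30s budget follows the text, and the absence of a delay cap is an assumption.

```python
from typing import Callable


def dial_with_retry(
    status_at: Callable[[float], str],
    budget: float = 30.0,
    first_delay: float = 0.5,
) -> tuple[bool, int]:
    """Simulate Dial retries; returns (connected, attempts)."""
    now, delay, attempts = 0.0, first_delay, 0
    while now <= budget:
        attempts += 1
        if status_at(now) == "LeaseReady":
            return True, attempts
        now += delay  # virtual sleep, no real waiting
        delay *= 2  # exponential backoff
    return False, attempts


def slow_exporter(t: float) -> str:
    # Becomes ready three seconds after the lease is assigned.
    return "LeaseReady" if t >= 3.0 else "Available"


def deadlocked_exporter(t: float) -> str:
    # Never leaves Available, as in the failure described in this issue.
    return "Available"


if __name__ == "__main__":
    print(dial_with_retry(slow_exporter))
    print(dial_with_retry(deadlocked_exporter))
```

A slow exporter is picked up by a later attempt, while the deadlocked one exhausts the whole budget regardless of how the backoff is tuned.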

Proposed Fix

Set before_lease_hook unconditionally in the finally block, before calling _cleanup_after_lease:

# exporter.py, replace lines 741-747 with:
finally:
    # CRITICAL: Always set before_lease_hook to prevent deadlock in
    # _cleanup_after_lease(). When conn_tg is cancelled before line 740
    # (e.g., lease ends during session setup), this event is never set,
    # causing _cleanup_after_lease to block forever at line 600.
    if not lease_scope.before_lease_hook.is_set():
        lease_scope.before_lease_hook.set()
    await listen_tx.aclose()
    await self._cleanup_after_lease(lease_scope)
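
The effect of the fix can be checked with a small model. This is a sketch under assumptions rather than a test of the real exporter: it uses asyncio from the standard library instead of anyio, handle_lease_fixed and cleanup_completes are hypothetical names, and the sleeps stand in for session setup and lease serving. Event.set() is idempotent in asyncio, so the is_set() guard is left out here.

```python
import asyncio


async def handle_lease_fixed(
    before_hook: asyncio.Event, done: asyncio.Event, setup_time: float
) -> None:
    try:
        await asyncio.sleep(setup_time)  # stand-in for session setup
        before_hook.set()  # normal path (line 740 in this issue)
        await asyncio.sleep(3600)  # stand-in for serving the lease
    finally:
        before_hook.set()  # proposed fix: always set before cleanup
        await before_hook.wait()  # cleanup wait now returns immediately
        done.set()


async def cleanup_completes(cancel_after: float, setup_time: float = 0.05) -> bool:
    before_hook, done = asyncio.Event(), asyncio.Event()
    task = asyncio.create_task(handle_lease_fixed(before_hook, done, setup_time))
    await asyncio.sleep(cancel_after)
    task.cancel()  # the lease ends
    try:
        await asyncio.wait_for(done.wait(), timeout=0.5)
        return True
    except asyncio.TimeoutError:
        return False
    finally:
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)


if __name__ == "__main__":
    # Cancel during setup (the failing case) and after setup (the normal case).
    print(asyncio.run(cleanup_completes(0)))
    print(asyncio.run(cleanup_completes(0.1)))
```

Both the early and the late cancellation finish cleanup in this model, which is the property a regression test for the real fix would want to assert.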

Additional recommended improvements

  1. Add a safety timeout in _cleanup_after_lease as defense-in-depth:

    with move_on_after(30) as scope:
        await lease_scope.before_lease_hook.wait()
    if scope.cancel_called:
        logger.warning("Timed out waiting for before_lease_hook in cleanup; possible deadlock avoided")

  2. Enable DEBUG logging for exporters in E2E CI to capture status transitions and task group lifecycle events. Currently the exporter logs are at INFO level, which misses critical state like "Starting to process connection requests" and task group enter/exit.

  3. Add structured trace logging around before_lease_hook.set() calls so we can always see which code path set the event and when:

    logger.debug("Setting before_lease_hook event (source=%s)", source)

  4. Add a watchdog/diagnostic log to _cleanup_after_lease that logs a warning if before_lease_hook.wait() takes more than 5 seconds. This would make the deadlock immediately visible in logs even without DEBUG level.
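
The watchdog in item 4 could look roughly like the sketch below. It is written against asyncio rather than anyio so that it runs standalone; wait_with_watchdog, demo, and the logger name are hypothetical, and an anyio version would need the equivalent timeout primitive. Unlike the safety timeout in item 1, it only reports and then keeps waiting.

```python
import asyncio
import logging

logger = logging.getLogger("exporter.watchdog")  # hypothetical logger name


async def wait_with_watchdog(
    event: asyncio.Event, name: str, warn_after: float = 5.0
) -> bool:
    """Wait for event; warn once if it takes longer than warn_after seconds.

    Returns True if the warning was emitted.
    """
    try:
        await asyncio.wait_for(event.wait(), timeout=warn_after)
        return False
    except asyncio.TimeoutError:
        logger.warning(
            "Still waiting for %s after %.1fs; possible deadlock", name, warn_after
        )
    await event.wait()  # keep waiting: the watchdog reports, it does not give up
    return True


async def demo() -> tuple[bool, bool]:
    fast, slow = asyncio.Event(), asyncio.Event()
    fast.set()  # already set: no warning expected
    asyncio.get_running_loop().call_later(0.05, slow.set)  # set late: warning expected
    return (
        await wait_with_watchdog(fast, "fast", warn_after=0.01),
        await wait_with_watchdog(slow, "slow", warn_after=0.01),
    )


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(asyncio.run(demo()))
```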

Affected Files

  • python/packages/jumpstarter/jumpstarter/exporter/exporter.py: handle_lease() finally block (primary fix)
  • python/packages/jumpstarter/jumpstarter/exporter/exporter.py: _cleanup_after_lease() (safety timeout)

Reproducing

The flake is triggered by rapid lease/unlease cycles on a no-hook exporter, which happens naturally in the E2E test suite. It occurs ~10% of the time on CI (GitHub Actions ubuntu-24.04).

The "paginated lease listing" test creating 101 rapid lease cycles is the most likely trigger, but any test that creates and immediately deletes a lease can trigger it.
