Summary
Flaky E2E test "can lease and connect to exporters" fails ~10% of the time with "Error: Connection to exporter lost". The exporter remains stuck in Available status and never transitions to LeaseReady, so every client Dial retry times out.
Example failure: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/24248226337/job/70800799224#step:10:842
Related (red herring): #414 — focused on listenQueues cleanup race; closed as not the real cause.
Root Cause Analysis
The deadlock
In exporter.py, for exporters without a hook_executor, the before_lease_hook event is only set at line 740, inside the conn_tg task group:
# exporter.py lines 695-747
try:
    async with create_task_group() as conn_tg:
        conn_tg.start_soon(self._retry_stream, ...)  # line 698
        conn_tg.start_soon(wait_for_lease_end)  # line 732
        conn_tg.start_soon(process_connections)  # line 733
        # This is the ONLY place before_lease_hook is set for no-hook exporters
        if not self.hook_executor:  # line 738
            await self._report_status(...)  # line 739
            lease_scope.before_lease_hook.set()  # line 740 <-- NEVER REACHED
finally:
    await listen_tx.aclose()
    await self._cleanup_after_lease(lease_scope)  # line 747
When the lease ends quickly (same second it was assigned), wait_for_lease_end() fires and calls conn_tg.cancel_scope.cancel(), which cancels the host task before line 740 executes. The before_lease_hook event is never set.
Then in _cleanup_after_lease() (lines 590-629):
with CancelScope(shield=True):
    await lease_scope.before_lease_hook.wait()  # line 600 <-- BLOCKS FOREVER
    ...
    lease_scope.after_lease_hook_done.set()  # line 626 <-- NEVER REACHED
And in serve() (line 821):
await lease_ctx.after_lease_hook_done.wait() # BLOCKS FOREVER — serve() is stuck
Result: serve() can never process the next status update. The exporter is permanently stuck.
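The hang can be reproduced in isolation. What follows is a minimal sketch, not the exporter's code: it uses stdlib asyncio as a stand-in for anyio, and `handle_lease_sketch`, `session_setup`, and the timings are hypothetical. The final wait is bounded only so the demo terminates; in the exporter the equivalent wait is shielded and has no timeout.

```python
import asyncio


async def handle_lease_sketch(setup_time: float, lease_time: float) -> bool:
    """Return True if cleanup could proceed, False if it would block forever."""
    before_lease_hook = asyncio.Event()

    async def session_setup() -> None:
        # Stand-in for the work that precedes before_lease_hook.set() (line 740).
        await asyncio.sleep(setup_time)
        before_lease_hook.set()

    setup = asyncio.create_task(session_setup())
    try:
        # Stand-in for wait_for_lease_end(): the lease ends after lease_time.
        await asyncio.sleep(lease_time)
    finally:
        # Lease ended: cancel whatever is still running, as the cancel scope does.
        setup.cancel()
        await asyncio.gather(setup, return_exceptions=True)

    # Stand-in for _cleanup_after_lease(): the real wait has no timeout.
    try:
        await asyncio.wait_for(before_lease_hook.wait(), timeout=0.1)
    except asyncio.TimeoutError:
        return False
    return True


if __name__ == "__main__":
    print("lease outlives setup:", asyncio.run(handle_lease_sketch(0.01, 0.05)))
    print("lease ends during setup:", asyncio.run(handle_lease_sketch(0.05, 0.01)))
```

When the lease outlives setup, the event is set and cleanup proceeds. When the lease ends during setup, nothing is left to set the event, which is the situation described above.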
How the E2E test triggers it
The test sequence creates rapid lease/unlease cycles on the same exporter:
- "can operate on leases": creates a lease on test-exporter-oidc, then deletes it (no Dial)
- "paginated lease listing": creates 101 leases on the same exporter, then deletes all
- "lease listing shows expires at": creates a lease, deletes it
- "can transfer lease": creates a lease, transfers the client, deletes it
- "can lease and connect": the failing test; tries to shell into the exporter
Each create lease / delete leases cycle sends leased=true → leased=false via the Status stream. If any of these cycles ends before handle_lease() reaches line 740, the deadlock occurs and serve() is permanently blocked.
Evidence from CI logs
The exporter logs for test-exporter-oidc end abruptly:
Lease ended event received, stopping connection handling
There is no subsequent "Updated status to AVAILABLE" or "afterLease hook completed" — confirming _cleanup_after_lease is blocked at line 600.
The controller logs show the new lease was assigned and status update sent at 14:50:31, but the exporter never processes it. The Dial retries loop from 14:50:31 to 14:50:51 (20s), all seeing Available status:
Exporter in Available status, waiting for lease setup attempt=1 retryDelay=500ms
Exporter in Available status, waiting for lease setup attempt=2 retryDelay=1s
...
Dial rejected due to exporter status status=Available error="exporter is not ready (status: Available)"
Why prior mitigations didn't help
Commits 2264a0d, f400b21, f473ede added server-side Dial retry with exponential backoff (up to 30s). These help when the exporter is slow to transition. They do not help when the exporter is deadlocked and will never transition.
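The difference between "slow" and "deadlocked" can be illustrated with a small simulation. This is a sketch under assumptions, not the controller's implementation: only the first two delays (500ms, 1s) come from the logs above, while continued doubling, the way the 30s budget is enforced, and the name `dial_with_backoff` are hypothetical.

```python
from typing import Optional


def dial_with_backoff(
    ready_at: Optional[float],
    budget: float = 30.0,
    first_delay: float = 0.5,
) -> Optional[float]:
    """Return the simulated time at which Dial succeeds, or None on timeout.

    ready_at is the time the exporter reaches LeaseReady;
    None models a deadlocked exporter that never gets there.
    """
    now, delay = 0.0, first_delay
    while now <= budget:
        if ready_at is not None and now >= ready_at:
            return now  # exporter became ready before this attempt
        now += delay    # wait, then retry
        delay *= 2      # assumed exponential backoff
    return None


if __name__ == "__main__":
    print("slow exporter:", dial_with_backoff(ready_at=3.0))
    print("deadlocked exporter:", dial_with_backoff(ready_at=None))
```

No retry budget changes the second outcome, because the readiness condition is never met.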
Proposed Fix
Set before_lease_hook unconditionally in the finally block, before calling _cleanup_after_lease:
# exporter.py, replace lines 741-747 with:
finally:
    # CRITICAL: Always set before_lease_hook to prevent deadlock in
    # _cleanup_after_lease(). When conn_tg is cancelled before line 740
    # (e.g., lease ends during session setup), this event is never set,
    # causing _cleanup_after_lease to block forever at line 600.
    if not lease_scope.before_lease_hook.is_set():
        lease_scope.before_lease_hook.set()
    await listen_tx.aclose()
    await self._cleanup_after_lease(lease_scope)
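The effect of setting the event in the finally block can be checked with a small timing sweep. As before, this is a hedged sketch rather than the exporter's code: stdlib asyncio stands in for anyio, and `handle_lease_fixed`, `sweep`, and the durations are hypothetical.

```python
import asyncio


async def handle_lease_fixed(setup_time: float, lease_time: float) -> bool:
    """Return True if cleanup could proceed after the lease ended."""
    before_lease_hook = asyncio.Event()

    async def session_setup() -> None:
        await asyncio.sleep(setup_time)
        before_lease_hook.set()  # normal path (line 740 in the source)

    setup = asyncio.create_task(session_setup())
    try:
        await asyncio.sleep(lease_time)  # the lease runs, then ends
    finally:
        setup.cancel()
        await asyncio.gather(setup, return_exceptions=True)
        # The fix: release any waiter regardless of how far setup got.
        # asyncio.Event.set() is idempotent, so this is harmless on the normal path.
        before_lease_hook.set()

    # Stand-in for _cleanup_after_lease(); bounded only to keep the demo finite.
    try:
        await asyncio.wait_for(before_lease_hook.wait(), timeout=0.1)
    except asyncio.TimeoutError:
        return False
    return True


async def sweep() -> list[bool]:
    # Lease durations shorter than, equal to, and longer than setup.
    return [await handle_lease_fixed(0.03, t) for t in (0.0, 0.01, 0.03, 0.06)]


if __name__ == "__main__":
    print(asyncio.run(sweep()))
```

Cleanup proceeds for every lease duration in the sweep, including a lease that ends immediately. If anyio's Event.set() is idempotent in the same way as asyncio's (an assumption worth confirming against the anyio version in use), the is_set() guard in the proposed fix is optional.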
Additional recommended improvements
- Add a safety timeout in _cleanup_after_lease as defense-in-depth:
  with move_on_after(30) as scope:
      await lease_scope.before_lease_hook.wait()
  if scope.cancel_called:
      logger.warning("Timed out waiting for before_lease_hook in cleanup; possible deadlock avoided")
- Enable DEBUG logging for exporters in E2E CI to capture status transitions and task group lifecycle events. Currently the exporter logs are at INFO level, which misses critical state like "Starting to process connection requests" and task group enter/exit.
- Add structured trace logging around before_lease_hook.set() calls so we can always see which code path set the event and when:
  logger.debug("Setting before_lease_hook event (source=%s)", source)
- Add a watchdog/diagnostic log to _cleanup_after_lease that logs a warning if before_lease_hook.wait() takes more than 5 seconds, which would make the deadlock immediately visible in logs even without DEBUG level.
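The watchdog from the last item could look roughly like this. It is a sketch using stdlib asyncio in place of anyio; `wait_with_watchdog`, the logger name, and the message text are hypothetical. Unlike the safety timeout, it only reports and then keeps waiting.

```python
import asyncio
import logging

logger = logging.getLogger("exporter.sketch")  # hypothetical logger name


async def wait_with_watchdog(
    event: asyncio.Event, name: str, warn_after: float = 5.0
) -> None:
    """Wait for event; log one warning if it takes longer than warn_after seconds."""
    try:
        await asyncio.wait_for(event.wait(), timeout=warn_after)
        return
    except asyncio.TimeoutError:
        logger.warning(
            "Still waiting for %s after %.1fs; possible deadlock", name, warn_after
        )
    # Keep waiting: the watchdog makes the stall visible without changing behavior.
    await event.wait()


async def demo() -> None:
    event = asyncio.Event()
    # Set the event late so that the warning fires once before the wait completes.
    asyncio.get_running_loop().call_later(0.05, event.set)
    await wait_with_watchdog(event, "before_lease_hook", warn_after=0.01)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    asyncio.run(demo())
```

Because the warning is emitted at WARNING level, it would show up in the current INFO-level CI logs.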
Affected Files
python/packages/jumpstarter/jumpstarter/exporter/exporter.py — handle_lease() finally block (primary fix)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py — _cleanup_after_lease() (safety timeout)
Reproducing
The flake is triggered by rapid lease/unlease cycles on a no-hook exporter, which happens naturally in the E2E test suite. It occurs ~10% of the time on CI (GitHub Actions ubuntu-24.04).
The "paginated lease listing" test creating 101 rapid lease cycles is the most likely trigger, but any test that creates and immediately deletes a lease can trigger it.