Summary
PR #397 introduced a race condition in `controller/internal/service/controller_service.go` that intermittently causes E2E test failures with `Error: Connection to exporter lost` (tests 47 & 48).
Root Cause
PR #397 added `defer s.cleanupListenQueue(leaseName)` to the `Listen()` gRPC handler (exporter side). This creates a race with `Dial()` (client side):
1. The exporter's `Listen()` stream fails transiently → the deferred `cleanupListenQueue` fires and deletes the queue from `listenQueues`.
2. Simultaneously (or just before), `Dial()` does a `LoadOrStore`, finds the old queue still in the map, and sends the router token into it.
3. `cleanupListenQueue` deletes the queue → the token is discarded.
4. The exporter reconnects; `Listen()` creates a fresh, empty queue that never receives the router address.
5. The client waits 20 seconds for a connection, then fails with `Error: Connection to exporter lost`.
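The interleaving above can be replayed deterministically against a plain `sync.Map`. This is a simplified sketch, not the controller's actual code: only the `listenQueues` name and the overall sequence come from the issue; the lease key, queue type, and helper are stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// demonstrateRace replays the lossy interleaving: Dial() delivers the
// router token into the old queue, the stale Listen() cleanup discards
// it, and the reconnected Listen()'s fresh queue stays empty.
func demonstrateRace() string {
	var listenQueues sync.Map

	// 1. Listen() registers a queue for the lease.
	oldQueue := make(chan string, 1)
	listenQueues.Store("lease-1", oldQueue)

	// 2. Dial() still finds the old queue and sends the router token into it.
	q, _ := listenQueues.LoadOrStore("lease-1", make(chan string, 1))
	q.(chan string) <- "router-token"

	// 3. Listen()'s stream fails: the deferred cleanup deletes the queue
	//    unconditionally, discarding the pending token along with it.
	listenQueues.Delete("lease-1")

	// 4. The exporter reconnects and the new Listen() registers a fresh queue.
	fresh, _ := listenQueues.LoadOrStore("lease-1", make(chan string, 1))

	// 5. The fresh queue never receives the router address.
	select {
	case tok := <-fresh.(chan string):
		return "received " + tok
	default:
		return "token lost"
	}
}

func main() {
	fmt.Println(demonstrateRace()) // token lost
}
```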
PR #396 compounds this by resetting the exporter's retry counter after any successful data exchange, causing more aggressive reconnects and widening the race window.
Observed Failure
```
INFO Waiting for ready connection at /run/user/1001/jumpstarter-xxx/socket
# (20 seconds later)
INFO Releasing Lease 019d65c4-f2c3-7b9e-a626-6f259bf280c8
Error: Connection to exporter lost
```
Example CI run: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/24060390906/job/70175302716?pr=254
Proposed Fix
In `Listen()`, instead of unconditionally deleting the queue on any exit, only delete it when the exporter is intentionally deregistering (i.e. lease released / context cancelled cleanly). On stream errors, the queue should be left intact so a reconnecting exporter can still receive the pending router token.
Alternatively, use `CompareAndDelete` to only delete the specific channel instance created by the current `Listen` invocation, preventing a reconnected `Listen` from having its newly-created queue deleted.
```go
// Instead of:
defer s.cleanupListenQueue(leaseName)

// Use CompareAndDelete to only delete the queue this invocation created:
defer func() {
	s.listenQueues.CompareAndDelete(leaseName, queue)
}()
```
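Under that scheme, the stale invocation's cleanup becomes a no-op once a reconnected `Listen()` has replaced the queue. A simplified sketch of the effect (requires Go 1.20+ for `sync.Map.CompareAndDelete`; names mirror the issue text, the rest is invented for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// demonstrateFix replays the reconnect interleaving, but the stale
// invocation cleans up with CompareAndDelete instead of Delete, so the
// reconnected queue survives and the router token is delivered.
func demonstrateFix() string {
	var listenQueues sync.Map

	// The old Listen() invocation registered its own queue...
	oldQueue := make(chan string, 1)
	listenQueues.Store("lease-1", oldQueue)

	// ...its stream failed, and a reconnected Listen() replaced the
	// queue before the old invocation's deferred cleanup ran.
	newQueue := make(chan string, 1)
	listenQueues.Store("lease-1", newQueue)

	// Deferred cleanup of the OLD invocation: a no-op, because the map
	// now holds newQueue, not the oldQueue this invocation created.
	listenQueues.CompareAndDelete("lease-1", oldQueue)

	// Dial() finds the live queue and the router token is delivered.
	q, _ := listenQueues.LoadOrStore("lease-1", make(chan string, 1))
	q.(chan string) <- "router-token"
	return <-newQueue
}

func main() {
	fmt.Println(demonstrateFix()) // router-token
}
```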
Affected Files
- `controller/internal/service/controller_service.go` — `Listen()` and `Dial()` handlers, `cleanupListenQueue`
- `python/packages/jumpstarter/jumpstarter/exporter/exporter.py` — retry reset logic (PR #396: reset retry counter after receiving data in exporter reconnect)