race condition in listenQueues cleanup causes intermittent 'Connection to exporter lost' #414

@ambient-code

Summary

PR #397 introduced a race condition in controller/internal/service/controller_service.go that intermittently causes E2E test failures with Error: Connection to exporter lost (tests 47 & 48).

Root Cause

PR #397 added defer s.cleanupListenQueue(leaseName) to the Listen() gRPC handler (exporter side). This creates a race with Dial() (client side):

  1. Exporter's Listen() stream fails transiently → the handler returns and the deferred cleanupListenQueue starts to run
  2. Before the cleanup removes the entry, Dial() does LoadOrStore → finds the old queue still in the map → sends the router token into it
  3. cleanupListenQueue deletes the queue → token is discarded
  4. Exporter reconnects, Listen() creates a fresh empty queue → never receives the router address
  5. Client waits 20 seconds for a connection → "Connection to exporter lost"
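
The interleaving above can be replayed deterministically outside the controller. The following is a minimal sketch, not the controller's code: it assumes listenQueues behaves like a sync.Map of buffered channels, and the names replayRace, leaseName value "lease-a" and the token string are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// replayRace runs steps 2-4 of the race in a fixed order and reports
// whether the reconnected listener ever sees the router token.
// The map and channel types are assumptions made for illustration.
func replayRace() (delivered bool) {
	var listenQueues sync.Map // stand-in for the controller's queue map
	const leaseName = "lease-a"

	// The first Listen() invocation has registered its queue.
	oldQ := make(chan string, 1)
	listenQueues.Store(leaseName, oldQ)

	// Step 2: Dial() finds the old queue and sends the token into it.
	v, _ := listenQueues.LoadOrStore(leaseName, make(chan string, 1))
	v.(chan string) <- "router-token"

	// Step 3: the deferred cleanup of the failed Listen() deletes the entry.
	listenQueues.Delete(leaseName)

	// Step 4: the exporter reconnects and gets a fresh, empty queue.
	v, _ = listenQueues.LoadOrStore(leaseName, make(chan string, 1))
	newQ := v.(chan string)

	// Non-blocking check: is there a token waiting for the new listener?
	select {
	case <-newQ:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("token delivered:", replayRace()) // token delivered: false
}
```

The token stays in the orphaned channel, which is why the client in step 5 only sees a timeout.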

PR #396 compounds this by resetting the exporter's retry counter after any successful data exchange, causing more aggressive reconnects and widening the race window.

Observed Failure

INFO     Waiting for ready connection at /run/user/1001/jumpstarter-xxx/socket
# (20 seconds later)
INFO     Releasing Lease 019d65c4-f2c3-7b9e-a626-6f259bf280c8
Error: Connection to exporter lost

Example CI run: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/24060390906/job/70175302716?pr=254

Proposed Fix

In Listen(), instead of unconditionally deleting the queue on any exit, only delete it when the exporter is intentionally deregistering (i.e. lease released / context cancelled cleanly). On stream errors, the queue should be left intact so a reconnecting exporter can still receive the pending router token.

Alternatively, use CompareAndDelete to only delete the specific channel instance created by the current Listen invocation, preventing a reconnected Listen from having its newly-created queue deleted.

// Instead of:
defer s.cleanupListenQueue(leaseName)

// Use CompareAndDelete to only delete the queue this invocation created:
defer func() {
    s.listenQueues.CompareAndDelete(leaseName, queue)
}()
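
To illustrate what CompareAndDelete guarantees, here is a small standalone sketch. It assumes listenQueues is a sync.Map (CompareAndDelete is available on sync.Map since Go 1.20) holding channel values; staleCleanup and the way the newer queue ends up in the map are hypothetical and only there to set up the scenario.

```go
package main

import (
	"fmt"
	"sync"
)

// staleCleanup models an old Listen() invocation running its deferred
// cleanup after a newer invocation's queue has taken over the map entry.
func staleCleanup() (staleDeleted, newSurvives bool) {
	var listenQueues sync.Map
	const leaseName = "lease-a"

	oldQ := make(chan string, 1) // queue owned by the old invocation
	newQ := make(chan string, 1) // queue owned by the reconnected invocation

	listenQueues.Store(leaseName, oldQ)
	// Assumption: by the time the old cleanup runs, the entry points at newQ.
	listenQueues.Store(leaseName, newQ)

	// The old invocation only deletes the entry if it still holds oldQ.
	staleDeleted = listenQueues.CompareAndDelete(leaseName, oldQ)

	v, ok := listenQueues.Load(leaseName)
	newSurvives = ok && v.(chan string) == newQ
	return staleDeleted, newSurvives
}

func main() {
	deleted, survives := staleCleanup()
	fmt.Println("stale cleanup deleted:", deleted)  // stale cleanup deleted: false
	fmt.Println("new queue survives:", survives)    // new queue survives: true
}
```

Note that this protects a newer queue from a stale cleanup; a token already sent into the old queue (steps 2 and 3 above) would still need the first option or a hand-off to be preserved.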

Affected Files

controller/internal/service/controller_service.go
