VFolder.download and Request.connect_websocket bypass APIConfig timeouts #11403

@rapsealk

Summary

Two paths in the Backend.AI Python client (backend.ai-client==26.4.4rc2) construct their own HTTP/WS client objects for outbound calls without propagating APIConfig.connection_timeout / APIConfig.read_timeout. A wedged Manager or storage backend therefore hangs the caller indefinitely, defeating the contract the rest of the SDK honors via Request.fetch() (which builds a per-call aiohttp.ClientTimeout(sock_connect=…, sock_read=…)).

Surfaced during the SDK timeout audit on lablup/backend.ai-fasttrack#3774 (PR lablup/backend.ai-fasttrack#3776).

Affected paths

1. Request.connect_websocket (request.py:347-386)

The websocket-upgrade path constructs the upgrade request without an aiohttp.ClientTimeout, relying on aiohttp defaults. A wedged Manager during the upgrade handshake will hang indefinitely.

2. VFolder.download() (func/vfolder.py:364)

The direct-to-storage transfer constructs a fresh aiohttp.ClientSession() without propagating APIConfig timeouts. A hung storage backend will block the call indefinitely.

Downstream impact (FastTrack)

  • workflow/experiments/metrics_ingestion.py calls vfolder.download(...) from an outbox handler. A wedged storage backend hangs the outbox worker running that handler; because the worker pool is size-limited, the hung workers back up the entire experiment-metrics ingestion queue until they are killed.
  • FastTrack has shipped a thread-pool-deadline workaround in lablup/backend.ai-fasttrack#3779 to free outbox workers, but the worker thread still leaks because the SDK exposes no cancellation hook.
  • No FastTrack call site uses connect_websocket today, so that path is informational only for FastTrack, but it is a real risk for any future Manager-WS consumer.
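
To illustrate why a caller-side deadline frees the waiting caller but not the blocked thread, here is a minimal stdlib-only sketch. It is not the FastTrack code from #3779; hung_download, run_with_deadline, and demo are hypothetical names, and a threading.Event stands in for a storage backend that never answers.

```python
import concurrent.futures
import threading


def run_with_deadline(executor, fn, deadline_s):
    """Submit fn and wait at most deadline_s seconds for its result.

    Returns (future, timed_out). The caller is released on timeout,
    but the thread running fn is not interrupted.
    """
    future = executor.submit(fn)
    try:
        future.result(timeout=deadline_s)
        timed_out = False
    except concurrent.futures.TimeoutError:
        timed_out = True
    return future, timed_out


def demo():
    # Hypothetical stand-in for a download blocked on a wedged backend.
    release = threading.Event()

    def hung_download():
        release.wait()
        return "done"

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future, timed_out = run_with_deadline(executor, hung_download, 0.1)

    # The deadline fired, yet the pool thread is still inside hung_download.
    still_running = future.running()
    # cancel() cannot stop a call that has already started; it returns False.
    cancelled = future.cancel()

    # Unblock the worker so this demo exits cleanly. A real hung socket
    # offers no such switch, which is the leak described above.
    release.set()
    executor.shutdown(wait=True)
    return timed_out, still_running, cancelled


if __name__ == "__main__":
    print(demo())
```

With a single-worker pool the leaked thread also blocks every later submission, which is why a socket-level timeout inside the SDK is the more complete fix.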

Suggested fix

Have both paths build an aiohttp.ClientTimeout(sock_connect=APIConfig.connection_timeout, sock_read=APIConfig.read_timeout) from the active config and pass it through, exactly as Request.fetch() already does. This restores the SDK-wide invariant that every outbound call honors the user's APIConfig deadline.
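
A minimal sketch of the suggested shape, assuming only that the config exposes connection_timeout and read_timeout as described above. TimeoutConfig, build_client_timeout, and download_with_timeout are hypothetical names for illustration, the default values are placeholders, and none of this is the SDK's actual code.

```python
from dataclasses import dataclass
from typing import Optional

import aiohttp


@dataclass(frozen=True)
class TimeoutConfig:
    """Hypothetical stand-in for the two APIConfig fields named in this issue."""

    connection_timeout: Optional[float] = 10.0  # placeholder value
    read_timeout: Optional[float] = None  # None disables the read timeout


def build_client_timeout(config: TimeoutConfig) -> aiohttp.ClientTimeout:
    """Translate the config deadlines into aiohttp's socket-level timeouts."""
    return aiohttp.ClientTimeout(
        sock_connect=config.connection_timeout,
        sock_read=config.read_timeout,
    )


async def download_with_timeout(url: str, config: TimeoutConfig) -> bytes:
    """Fetch url through a session that carries the configured timeouts.

    Passing the timeout to ClientSession applies it to every request made
    through that session, so a stalled peer raises instead of hanging.
    """
    timeout = build_client_timeout(config)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.read()
```

Keeping the translation in one helper would let the download path and the websocket-upgrade path share the same mapping instead of each re-deriving it.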

References

  • Audit issue: lablup/backend.ai-fasttrack#3774
  • Audit PR: lablup/backend.ai-fasttrack#3776
  • FastTrack workaround: lablup/backend.ai-fasttrack#3777 / PR lablup/backend.ai-fasttrack#3779
