Summary
Two paths in the Backend.AI Python client (backend.ai-client==26.4.4rc2) construct their own HTTP/WS client objects for outbound calls without propagating APIConfig.connection_timeout / APIConfig.read_timeout. A wedged Manager or storage backend therefore hangs the caller indefinitely, defeating the contract the rest of the SDK honors via Request.fetch() (which builds a per-call aiohttp.ClientTimeout(sock_connect=…, sock_read=…)).
Surfaced during the SDK timeout audit on lablup/backend.ai-fasttrack#3774 (PR lablup/backend.ai-fasttrack#3776).
Affected paths
1. Request.connect_websocket — request.py:347-386
The websocket-upgrade path constructs the upgrade request without an aiohttp.ClientTimeout, relying on aiohttp defaults. A wedged Manager during the upgrade handshake will hang indefinitely.
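A minimal sketch of the fix (helper name and signature hypothetical, not part of the SDK): give the session that carries the upgrade request the config's deadlines, since aiohttp applies the session-level ClientTimeout to the TCP connect and the HTTP upgrade handshake as well.

```python
import aiohttp


async def open_manager_ws(
    url: str, connect_timeout: float, read_timeout: float
) -> tuple[aiohttp.ClientSession, aiohttp.ClientWebSocketResponse]:
    # The session-level ClientTimeout covers the TCP connect and the
    # HTTP upgrade request, so a wedged Manager fails fast instead of
    # hanging on aiohttp defaults.
    timeout = aiohttp.ClientTimeout(
        sock_connect=connect_timeout,
        sock_read=read_timeout,
    )
    sess = aiohttp.ClientSession(timeout=timeout)
    try:
        ws = await sess.ws_connect(url)
    except BaseException:
        await sess.close()
        raise
    return sess, ws
```

With this shape, a Manager that never completes the handshake surfaces as a timeout error within `connect_timeout`/`read_timeout` rather than blocking the caller.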
2. VFolder.download() — func/vfolder.py:364
The direct-to-storage transfer constructs a fresh aiohttp.ClientSession() without propagating APIConfig timeouts. A hung storage backend will block the call indefinitely.
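The same pattern applies here: a fresh session is fine, but it needs the caller's deadlines attached. A hedged sketch (function name and parameters hypothetical), in contrast to the current `aiohttp.ClientSession()` call that leaves `sock_read` unbounded:

```python
import aiohttp


async def download_with_deadline(
    url: str, connect_timeout: float, read_timeout: float
) -> bytes:
    # A fresh per-call session, but with the user's APIConfig deadlines
    # attached, so a hung storage backend raises a timeout instead of
    # blocking the download forever.
    timeout = aiohttp.ClientTimeout(
        sock_connect=connect_timeout,
        sock_read=read_timeout,
    )
    async with aiohttp.ClientSession(timeout=timeout) as sess:
        async with sess.get(url) as resp:
            resp.raise_for_status()
            return await resp.read()
```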
Downstream impact (FastTrack)
workflow/experiments/metrics_ingestion.py calls vfolder.download(...) from an outbox handler. A wedged storage backend hangs that outbox worker; because the worker pool is bounded, stuck workers accumulate and back up the entire experiment-metrics ingestion queue until the process is killed.
- FastTrack has shipped a thread-pool-deadline workaround in lablup/backend.ai-fasttrack#3779 to free outbox workers, but the worker thread still leaks because the SDK exposes no cancellation hook.
- No FastTrack call sites use connect_websocket today (informational only) — but it's a real risk for any future Manager-WS consumer.
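Why the workaround still leaks a thread can be reproduced with a small stdlib sketch (all names hypothetical, standing in for the FastTrack handler): a pool deadline frees the caller, but Python cannot cancel a callable that is already running.

```python
import concurrent.futures
import time


def wedged_download() -> None:
    # Stand-in for a vfolder.download() call stuck on a hung storage backend.
    time.sleep(1.0)


pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(wedged_download)
try:
    # The outbox worker's deadline: the caller is released here...
    future.result(timeout=0.1)
except concurrent.futures.TimeoutError:
    pass
# ...but cancel() cannot stop an already-running callable, so the worker
# thread (and its blocked socket) lives on until the call returns by itself.
assert future.cancel() is False
```

This is exactly the "no cancellation hook" gap: only an SDK-level timeout can unwind the blocked I/O itself.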
Suggested fix
Have both paths build an aiohttp.ClientTimeout(sock_connect=APIConfig.connection_timeout, sock_read=APIConfig.read_timeout) from the active config and pass it through, exactly as Request.fetch() already does. This restores the SDK-wide invariant that every outbound call honors the user's APIConfig deadline.
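A minimal sketch of such a helper, mirroring what the issue says Request.fetch() already does (the helper name is hypothetical; the attribute names follow the APIConfig fields cited above):

```python
import aiohttp


def build_client_timeout(config) -> aiohttp.ClientTimeout:
    """Translate APIConfig deadlines into a per-call aiohttp timeout."""
    return aiohttp.ClientTimeout(
        total=None,  # no overall cap; bound each phase individually
        sock_connect=config.connection_timeout,
        sock_read=config.read_timeout,
    )
```

Both connect_websocket and VFolder.download() could then pass the result as the `timeout=` argument of their `aiohttp.ClientSession`, making the fix a one-liner at each site.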
References
- Audit issue: lablup/backend.ai-fasttrack#3774
- Audit PR: lablup/backend.ai-fasttrack#3776
- FastTrack workaround: lablup/backend.ai-fasttrack#3777 / PR lablup/backend.ai-fasttrack#3779