Deadlocks in auth to repositories #6633
Contributing guidelines and issue reporting guide
- I've read the contributing guidelines and wholeheartedly agree. I've also read the issue reporting guide.
Well-formed report checklist
- I have found a bug, and the documentation does not mention anything about my problem
- I have found a bug, and there are no open or closed issues related to my problem
- I have provided version/information about my environment and done my best to provide a reproducer
Description of bug
Bug description
buildkitd can become effectively wedged when a client disconnects unexpectedly during the registry auth callback flow (notably in session/auth.VerifyTokenAuthority while handling a 401 Unauthorized response).
Observed behavior
- A build triggers registry auth callback.
- The client disconnects ungracefully (silent network drop, half-open TCP, abrupt termination).
- The callback path can block indefinitely waiting for a session/auth response.
- That blocked request holds resolver/auth synchronization long enough to cause:
  - lock contention in authorizer/resolver paths
  - request pileup in flightcontrol waiters for image resolution
- The daemon often still appears healthy (CPU/memory), but builds requiring registry resolution stop progressing until buildkitd is restarted.
Expected behavior
- Dead/disconnected client sessions should be detected and fail in bounded time.
- A failed or stalled auth callback should not be able to hold a lock indefinitely.
- Unrelated builds/auth flows should continue to make progress even when one auth path is slow or dead.
Related fixes in progress
- PR #6630: timeout/keepalive safeguards to prevent indefinite deadlock behavior.
- PR #6631: authorizer/session-manager concurrency refactors to reduce lock contention and blast radius.
Goroutine dump evidence
Snippet 1: Root cause (1 waiter in auth callback/session wait path)
1 @ ...
# 0x482fd8 sync.runtime_notifyListWait+0x138 /usr/local/go/src/runtime/sema.go:606
# 0x493e52 sync.(*Cond).Wait+0x72 /usr/local/go/src/sync/cond.go:71
# 0xabee2b github.com/moby/buildkit/session.(*Manager).Get+0x1cb /src/session/manager.go:179
# 0xabbfc4 github.com/moby/buildkit/session.(*Manager).Any+0x244 /src/session/group.go:78
# 0x12bd7f8 github.com/moby/buildkit/session/auth.VerifyTokenAuthority+0xb8 /src/session/auth/auth.go:73
# 0x12c9f64 github.com/moby/buildkit/util/resolver.(*authHandlerNS).get+0x1a4 /src/util/resolver/authorizer.go:74
# 0x12ca74f github.com/moby/buildkit/util/resolver.(*dockerAuthorizer).Authorize+0xef /src/util/resolver/authorizer.go:129
# 0xc7e70e github.com/containerd/containerd/v2/core/remotes/docker.(*request).authorize+0x2e /src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:546
# 0xc7f2de github.com/containerd/containerd/v2/core/remotes/docker.(*request).do+0x2de /src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:621
# 0xc7fd2e github.com/containerd/containerd/v2/core/remotes/docker.(*request).doWithRetriesInner+0x4e /src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:717
# 0xc7fae8 github.com/containerd/containerd/v2/core/remotes/docker.(*request).doWithRetries+0x68 /src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:698
# 0xc7c464 github.com/containerd/containerd/v2/core/remotes/docker.(*dockerResolver).Resolve+0xb24 /src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:308
# 0x12cfdc8 github.com/moby/buildkit/util/resolver.(*Resolver).Resolve+0x208 /src/util/resolver/pool.go:250
# 0xcc5028 github.com/moby/buildkit/util/imageutil.Config+0x508 /src/util/imageutil/config.go:107
# 0x1aeee70 github.com/moby/buildkit/source/containerimage.(*Source).ResolveImageMetadata.func2+0x90 /src/source/containerimage/source.go:185
# 0xdf142c github.com/moby/buildkit/util/flightcontrol.(*call[...]).run+0x14c /src/util/flightcontrol/flightcontrol.go:122
Snippet 2: Lock contention (at least 14 waiters total across resolver/authorizer locks)
12 @ ...
# 0x482ca4 internal/sync.runtime_SemacquireMutex+0x24 /usr/local/go/src/runtime/sema.go:95
# 0x493b7c internal/sync.(*Mutex).lockSlow+0x15c /usr/local/go/src/internal/sync/mutex.go:149
# 0x12ce70a github.com/moby/buildkit/util/resolver.(*Pool).GetResolver+0x2ea /src/util/resolver/pool.go:103
# 0x1aed574 github.com/moby/buildkit/source/containerimage.(*Source).ResolveImageMetadata+0x2b4 /src/source/containerimage/source.go:179
1 @ ...
# 0x482ca4 internal/sync.runtime_SemacquireMutex+0x24 /usr/local/go/src/runtime/sema.go:95
# 0x493b7c internal/sync.(*Mutex).lockSlow+0x15c /usr/local/go/src/internal/sync/mutex.go:149
# 0x12ca6cd sync.(*Mutex).Lock+0x6d /usr/local/go/src/sync/mutex.go:46
# 0x12ca6af github.com/moby/buildkit/util/resolver.(*dockerAuthorizer).Authorize+0x4f /src/util/resolver/authorizer.go:125
# 0xc7e70e github.com/containerd/containerd/v2/core/remotes/docker.(*request).authorize+0x2e /src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:546
1 @ ...
# 0x482ca4 internal/sync.runtime_SemacquireMutex+0x24 /usr/local/go/src/runtime/sema.go:95
# 0x493b7c internal/sync.(*Mutex).lockSlow+0x15c /usr/local/go/src/internal/sync/mutex.go:149
# 0x12cadaf sync.(*Mutex).Lock+0x8f /usr/local/go/src/sync/mutex.go:46
# 0x12cad94 github.com/moby/buildkit/util/resolver.(*dockerAuthorizer).AddResponses+0x74 /src/util/resolver/authorizer.go:150
Snippet 3: Flightcontrol pileup (19 waiters observed)
10 @ ...
# 0xdf0f97 github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7 /src/util/flightcontrol/flightcontrol.go:168
# 0xdf0033 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x233 /src/util/flightcontrol/flightcontrol.go:79
# 0xdf04d2 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92 /src/util/flightcontrol/flightcontrol.go:37
# 0x179670d github.com/moby/buildkit/util/flightcontrol.(*CachedGroup[...]).Do+0xcd /src/util/flightcontrol/cached.go:30
# 0x19853c7 github.com/moby/buildkit/frontend/dockerfile/builder.(*withResolveCache).ResolveImageConfig+0x1e7 /src/frontend/dockerfile/builder/resolvecache.go:36
5 @ ...
# 0xdf105a github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x57a /src/util/flightcontrol/flightcontrol.go:173
# 0xdf0033 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x233 /src/util/flightcontrol/flightcontrol.go:79
# 0xdf04d2 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92 /src/util/flightcontrol/flightcontrol.go:37
# 0x179670d github.com/moby/buildkit/util/flightcontrol.(*CachedGroup[...]).Do+0xcd /src/util/flightcontrol/cached.go:30
# 0x19853c7 github.com/moby/buildkit/frontend/dockerfile/builder.(*withResolveCache).ResolveImageConfig+0x1e7 /src/frontend/dockerfile/builder/resolvecache.go:36
2 @ ...
# 0xdf0f97 github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7 /src/util/flightcontrol/flightcontrol.go:168
# 0xdf0033 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x233 /src/util/flightcontrol/flightcontrol.go:79
# 0xdf04d2 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92 /src/util/flightcontrol/flightcontrol.go:37
# 0x1aed7ad github.com/moby/buildkit/source/containerimage.(*Source).ResolveImageMetadata+0x4ed /src/source/containerimage/source.go:184
1 @ ...
# 0xdf0f97 github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7 /src/util/flightcontrol/flightcontrol.go:168
# 0xdeff19 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x119 /src/util/flightcontrol/flightcontrol.go:65
# 0xdf04d2 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92 /src/util/flightcontrol/flightcontrol.go:37
1 @ ...
# 0xdf0f97 github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7 /src/util/flightcontrol/flightcontrol.go:168
# 0xdeff19 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x119 /src/util/flightcontrol/flightcontrol.go:65
# 0xdf04d2 github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92 /src/util/flightcont
Reproduction
- Start a buildkitd instance with remote client/session-based auth callback behavior.
- Trigger a build that requires pulling a private image (or any registry flow that returns 401 and requires callback credential resolution).
- During the auth callback window, drop the client connection ungracefully (for example, kill client process, blackhole network traffic, or simulate half-open TCP).
- Start additional builds that resolve the same image (and/or other registry-backed builds).
- Observe builds blocking indefinitely and goroutine accumulation in auth/resolver/flightcontrol paths.
While auth is in-flight, terminate or disconnect the client ungracefully.
Then retry additional builds and observe wedged behavior.
Version information
- buildkitd --version: 0.28.1-rootless
- buildctl --version: 0.20.2
- docker buildx version (if applicable): v0.31.1-desktop.1