Deadlocks in auth to repositories #6633

@glightfoot

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have found a bug, and the documentation does not mention anything about my problem
  • I have found a bug, and there are no open or closed issues related to my problem
  • I have provided version/information about my environment and done my best to provide a reproducer

Bug description

buildkitd can become effectively wedged when a client disconnects unexpectedly during the registry auth callback flow (notably in session/auth.VerifyTokenAuthority while a 401 Unauthorized response is being handled).

Observed behavior

  • A build triggers registry auth callback.
  • The client disconnects ungracefully (silent network drop, half-open TCP, abrupt termination).
  • The callback path can block indefinitely waiting for a session/auth response.
  • The blocked request keeps holding resolver/authorizer locks for as long as it waits, which causes:
    • lock contention in authorizer/resolver paths
    • request pileup in flightcontrol waiters for image resolution
  • The daemon often still looks healthy by CPU and memory usage, but builds that require registry resolution stop progressing until buildkitd is restarted.

Expected behavior

  • Dead/disconnected client sessions should be detected and fail in bounded time.
  • A failed or stalled auth callback should not be able to hold a lock indefinitely.
  • Unrelated builds/auth flows should continue to make progress even when one auth path is slow or dead.

Related fixes in progress

  • PR #6630: timeout/keepalive safeguards to prevent indefinite deadlock behavior.
  • PR #6631: authorizer/session-manager concurrency refactors to reduce lock contention and blast radius.
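
For readers unfamiliar with the "blast radius" idea, here is a generic illustration. It does not describe what either PR implements. It swaps one global mutex for one mutex per registry host, so a stalled flow for one host cannot block another. `hostLocks`, `forHost` and the host names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// hostLocks hands out one mutex per registry host instead of one global lock.
type hostLocks struct {
	mu    sync.Mutex // guards the map only, never held during auth work
	locks map[string]*sync.Mutex
}

func newHostLocks() *hostLocks {
	return &hostLocks{locks: map[string]*sync.Mutex{}}
}

// forHost returns the mutex dedicated to host, creating it on first use.
func (h *hostLocks) forHost(host string) *sync.Mutex {
	h.mu.Lock()
	defer h.mu.Unlock()
	l, ok := h.locks[host]
	if !ok {
		l = &sync.Mutex{}
		h.locks[host] = l
	}
	return l
}

func main() {
	h := newHostLocks()
	// Pretend an auth flow for registry-a is stuck while holding its lock.
	h.forHost("registry-a.example").Lock()

	// TryLock (Go 1.18+) shows which hosts are still usable without blocking.
	fmt.Println("registry-b available:", h.forHost("registry-b.example").TryLock())
	fmt.Println("registry-a available:", h.forHost("registry-a.example").TryLock())
}
```

This prints `true` for registry-b and `false` for registry-a: the stall is confined to the host whose flow is stuck.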

Goroutine dump evidence

Snippet 1: Root cause (1 waiter in auth callback/session wait path)

1 @ ...
#	0x482fd8	sync.runtime_notifyListWait+0x138								/usr/local/go/src/runtime/sema.go:606
#	0x493e52	sync.(*Cond).Wait+0x72										/usr/local/go/src/sync/cond.go:71
#	0xabee2b	github.com/moby/buildkit/session.(*Manager).Get+0x1cb						/src/session/manager.go:179
#	0xabbfc4	github.com/moby/buildkit/session.(*Manager).Any+0x244						/src/session/group.go:78
#	0x12bd7f8	github.com/moby/buildkit/session/auth.VerifyTokenAuthority+0xb8					/src/session/auth/auth.go:73
#	0x12c9f64	github.com/moby/buildkit/util/resolver.(*authHandlerNS).get+0x1a4				/src/util/resolver/authorizer.go:74
#	0x12ca74f	github.com/moby/buildkit/util/resolver.(*dockerAuthorizer).Authorize+0xef			/src/util/resolver/authorizer.go:129
#	0xc7e70e	github.com/containerd/containerd/v2/core/remotes/docker.(*request).authorize+0x2e		/src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:546
#	0xc7f2de	github.com/containerd/containerd/v2/core/remotes/docker.(*request).do+0x2de			/src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:621
#	0xc7fd2e	github.com/containerd/containerd/v2/core/remotes/docker.(*request).doWithRetriesInner+0x4e	/src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:717
#	0xc7fae8	github.com/containerd/containerd/v2/core/remotes/docker.(*request).doWithRetries+0x68		/src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:698
#	0xc7c464	github.com/containerd/containerd/v2/core/remotes/docker.(*dockerResolver).Resolve+0xb24		/src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:308
#	0x12cfdc8	github.com/moby/buildkit/util/resolver.(*Resolver).Resolve+0x208				/src/util/resolver/pool.go:250
#	0xcc5028	github.com/moby/buildkit/util/imageutil.Config+0x508						/src/util/imageutil/config.go:107
#	0x1aeee70	github.com/moby/buildkit/source/containerimage.(*Source).ResolveImageMetadata.func2+0x90	/src/source/containerimage/source.go:185
#	0xdf142c	github.com/moby/buildkit/util/flightcontrol.(*call[...]).run+0x14c				/src/util/flightcontrol/flightcontrol.go:122

Snippet 2: Lock contention (at least 14 waiters total across resolver/authorizer locks)

12 @ ...
#	0x482ca4	internal/sync.runtime_SemacquireMutex+0x24								/usr/local/go/src/runtime/sema.go:95
#	0x493b7c	internal/sync.(*Mutex).lockSlow+0x15c									/usr/local/go/src/internal/sync/mutex.go:149
#	0x12ce70a	github.com/moby/buildkit/util/resolver.(*Pool).GetResolver+0x2ea					/src/util/resolver/pool.go:103
#	0x1aed574	github.com/moby/buildkit/source/containerimage.(*Source).ResolveImageMetadata+0x2b4			/src/source/containerimage/source.go:179
1 @ ...
#	0x482ca4	internal/sync.runtime_SemacquireMutex+0x24							/usr/local/go/src/runtime/sema.go:95
#	0x493b7c	internal/sync.(*Mutex).lockSlow+0x15c								/usr/local/go/src/internal/sync/mutex.go:149
#	0x12ca6cd	sync.(*Mutex).Lock+0x6d										/usr/local/go/src/sync/mutex.go:46
#	0x12ca6af	github.com/moby/buildkit/util/resolver.(*dockerAuthorizer).Authorize+0x4f			/src/util/resolver/authorizer.go:125
#	0xc7e70e	github.com/containerd/containerd/v2/core/remotes/docker.(*request).authorize+0x2e		/src/vendor/github.com/containerd/containerd/v2/core/remotes/docker/resolver.go:546
1 @ ...
#	0x482ca4	internal/sync.runtime_SemacquireMutex+0x24							/usr/local/go/src/runtime/sema.go:95
#	0x493b7c	internal/sync.(*Mutex).lockSlow+0x15c								/usr/local/go/src/internal/sync/mutex.go:149
#	0x12cadaf	sync.(*Mutex).Lock+0x8f										/usr/local/go/src/sync/mutex.go:46
#	0x12cad94	github.com/moby/buildkit/util/resolver.(*dockerAuthorizer).AddResponses+0x74			/src/util/resolver/authorizer.go:150

Snippet 3: Flightcontrol pileup (19 waiters observed)

10 @ ...
#	0xdf0f97	github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7	/src/util/flightcontrol/flightcontrol.go:168
#	0xdf0033	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x233	/src/util/flightcontrol/flightcontrol.go:79
#	0xdf04d2	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92	/src/util/flightcontrol/flightcontrol.go:37
#	0x179670d	github.com/moby/buildkit/util/flightcontrol.(*CachedGroup[...]).Do+0xcd	/src/util/flightcontrol/cached.go:30
#	0x19853c7	github.com/moby/buildkit/frontend/dockerfile/builder.(*withResolveCache).ResolveImageConfig+0x1e7	/src/frontend/dockerfile/builder/resolvecache.go:36
5 @ ...
#	0xdf105a	github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x57a	/src/util/flightcontrol/flightcontrol.go:173
#	0xdf0033	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x233	/src/util/flightcontrol/flightcontrol.go:79
#	0xdf04d2	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92	/src/util/flightcontrol/flightcontrol.go:37
#	0x179670d	github.com/moby/buildkit/util/flightcontrol.(*CachedGroup[...]).Do+0xcd	/src/util/flightcontrol/cached.go:30
#	0x19853c7	github.com/moby/buildkit/frontend/dockerfile/builder.(*withResolveCache).ResolveImageConfig+0x1e7	/src/frontend/dockerfile/builder/resolvecache.go:36
2 @ ...
#	0xdf0f97	github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7	/src/util/flightcontrol/flightcontrol.go:168
#	0xdf0033	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x233	/src/util/flightcontrol/flightcontrol.go:79
#	0xdf04d2	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92	/src/util/flightcontrol/flightcontrol.go:37
#	0x1aed7ad	github.com/moby/buildkit/source/containerimage.(*Source).ResolveImageMetadata+0x4ed	/src/source/containerimage/source.go:184
1 @ ...
#	0xdf0f97	github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7	/src/util/flightcontrol/flightcontrol.go:168
#	0xdeff19	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x119	/src/util/flightcontrol/flightcontrol.go:65
#	0xdf04d2	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92	/src/util/flightcontrol/flightcontrol.go:37
1 @ ...
#	0xdf0f97	github.com/moby/buildkit/util/flightcontrol.(*call[...]).wait+0x4b7	/src/util/flightcontrol/flightcontrol.go:168
#	0xdeff19	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).do+0x119	/src/util/flightcontrol/flightcontrol.go:65
#	0xdf04d2	github.com/moby/buildkit/util/flightcontrol.(*Group[...]).Do+0x92	/src/util/flightcont

Reproduction

  1. Start a buildkitd instance that a remote client connects to, so that registry credentials are resolved through session-based auth callbacks.
  2. Trigger a build that requires pulling a private image (or any registry flow that returns 401 and requires callback credential resolution).
  3. During the auth callback window, drop the client connection ungracefully (for example, kill client process, blackhole network traffic, or simulate half-open TCP).
  4. Start additional builds that resolve the same image (and/or other registry-backed builds).
  5. Observe builds blocking indefinitely and goroutine accumulation in auth/resolver/flightcontrol paths.

In short: disconnect the client ungracefully while auth is in flight, then start further builds and observe that they wedge.
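
To compare waiter counts between runs, the tallies can be computed from an aggregated goroutine dump instead of by hand. This sketch assumes the dump is in the text form shown in the snippets above ("N @ ..." header lines followed by "#" frame lines), which is what `pprof.Lookup("goroutine").WriteTo(w, 1)` from `runtime/pprof` writes. `waiters` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strconv"
	"strings"
)

// waiters sums the goroutine counts of every stack group whose frames
// mention needle. A group starts at an "N @ ..." header line and each
// group is counted at most once.
func waiters(dump, needle string) int {
	total, count, counted := 0, 0, false
	for _, line := range strings.Split(dump, "\n") {
		if head, _, ok := strings.Cut(line, " @"); ok {
			if n, err := strconv.Atoi(strings.TrimSpace(head)); err == nil {
				count, counted = n, false // new group header
				continue
			}
		}
		if strings.HasPrefix(line, "#") && !counted && strings.Contains(line, needle) {
			total += count
			counted = true
		}
	}
	return total
}

func main() {
	var buf bytes.Buffer
	// debug=1 selects the aggregated text form with one header per unique stack.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	fmt.Println("goroutines in main.main:", waiters(buf.String(), "main.main"))
}
```

Run against a saved dump from a wedged daemon, `waiters(dump, "flightcontrol.(*call[...]).wait")` would give the kind of figure quoted for Snippet 3.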

Version information

  • buildkitd --version
    0.28.1-rootless
  • buildctl --version
    0.20.2
  • docker buildx version (if applicable)
    v0.31.1-desktop.1
