Debounce `buildkite-agent job promise-failure` via the Job API by jordandcarter · Pull Request #3996 · buildkite/agent

jordandcarter · 2026-06-15T00:00:02Z

Description

Debounce repeated buildkite-agent job promise-failure calls in the agent so each exit status results in at most one Buildkite API call per job.

Customers are likely to call promise-failure from their test suite — potentially on every failing test. Each call spawns a new buildkite-agent process and would otherwise make its own API request, all of which the server must authenticate and process before discovering there's nothing new to record. For a suite with thousands of failures that's a lot of useless load on the Agent API.

This change moves the debouncing into the agent. The Job API server owns the declaration: each promise-failure invocation POSTs its exit status to the Job API, and the server declares the failure to the Buildkite API at most once per exit status. The first caller for a given exit status declares it, blocking on the API so it can return an accurate result; concurrent callers for the same exit status wait and share that outcome; later callers return from the cached success. Terminal failures (4xx other than 429, e.g. 409 or 422) are cached too, so repeated calls don't keep re-sending a declaration the API will always reject; transient failures (5xx, network, 429) aren't cached, so a later call can retry. Because the caller blocks until the result is known, the command exits with an accurate code rather than guessing.

The response reports an outcome of declared (this call reached the Buildkite API) or debounced (it shared an earlier declaration), and the CLI logs the distinction.

The Job API is job-scoped and enabled by default, so the state is naturally per-job (no keying by job ID) and there's no experiment to opt into. When the Job API is unavailable (--no-job-api, unsupported Windows) or can't be reached, the command falls back to declaring directly; the endpoint is idempotent for the same exit status, so a duplicate is safe.

This PR also drops the --job flag added to the command in #3992. The endpoint requires the job's own token and rejects declarations for any other job, so a promised failure only ever makes sense for the current job — the command now always targets BUILDKITE_JOB_ID.

Context

Follows on from #3992, which added the buildkite-agent job promise-failure command.

Changes

jobapi: the POST /api/current-job/v0/promise-failure endpoint now declares the promised failure to the Buildkite API itself, coalescing concurrent and repeated calls so each exit status is declared at most once successfully. The declarer is injected via WithPromiseFailureDeclarer. The response carries an outcome (declared/debounced).
clicommand: declare the promised failure through the Job API (blocking for an accurate result) and log whether it was declared or debounced; fall back to declaring directly when the Job API is unavailable or unreachable. Drop the --job flag and always target the current job.

Testing

Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
Code is formatted (with go tool gofumpt -extra -w .)

Added server tests over a real socket — sequential (first call declares, repeat is debounced, a different status declares on its own), concurrent (many simultaneous calls of the same status yield exactly one API declaration, all callers see success), error handling (terminal failures are cached, transient failures aren't and can retry; status normalization to 502), and panic safety — plus a client test for DeclarePromiseFailure.

Disclosures / Credits

I used Amp to design and implement the debouncing and tests, with GPT-5 for a code review pass. I reviewed all of the changes myself.

jordandcarter · 2026-06-15T00:06:57Z

@buildsworth-bk review

buildsworth-bk-app · 2026-06-15T00:07:27Z

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/685

buildsworth-bk-app

I found one blocking behavior issue around preserving the existing --job override; details inline.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 685, then answer my questions about the findings.

jordandcarter · 2026-06-15T01:08:00Z

@buildsworth-bk review

buildsworth-bk-app · 2026-06-15T01:08:31Z

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/690

Previous concerns have been addressed.

buildsworth-bk-app

Previous concern is resolved in this revision: the command no longer accepts a separate --job target, so the Job API claim and the Buildkite API declaration are now both scoped to BUILDKITE_JOB_ID. I didn't find any new high-confidence issues. I'm leaving this as a comment rather than an approval because this changes agent Job API/CLI behavior, so it isn't an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 690, then answer my questions about the findings.

moskyb

love love love this. it's the exact kind of thing we should be using the job API for. i have one readability thing, but other than that this looks great.

moskyb · 2026-06-16T02:41:29Z

+		// Debounce repeated calls via the Job API: customers may call this on
+		// every test failure, but each exit status only needs declaring to the
+		// Buildkite API once. The first caller to claim an exit status declares
+		// it; later callers exit early without touching the API. When the Job
+		// API is unavailable (--no-job-api, unsupported Windows, or run outside
+		// a job), we fall back to declaring directly.
+		if jobAPI, err := jobapi.NewDefaultClient(ctx); err != nil {
+			l.Debugf("Job API unavailable, declaring promised failure without debouncing: %v", err)
+		} else if claimed, err := jobAPI.PromiseFailureClaim(ctx, exitStatus); err != nil {
+			l.Warnf("Couldn't reach the Job API to debounce the promised failure; declaring it directly: %v", err)
+		} else if !claimed {
+			l.Debugf("Promised exit status %d has already been claimed; skipping duplicate", exitStatus)
+			return nil
+		}


this error flow is pretty confusing and not very idiomatic for go — i'd probably extract this into a function like

func shouldSkip(ctx context.Context, l logger.Logger status int) bool { jobAPI, err := jobapi.NewDefaultClient(ctx) if err != nil { l.Debugf("Job API unavailable, declaring promised failure without debouncing: %v", err) return false } claimed, err := // ...

then just return nil if shouldSkip returns true

Before seeing this comment I went on a journey and switched from claim-before-attempt (the code you reviewed) to claim-on-success, then decided it wasn't worth the complexity and have reverted back to what you reviewed.

Plus I addressed your error flow improvement 👍

jordandcarter · 2026-06-16T03:41:24Z

@buildsworth-bk review

buildsworth-bk-app · 2026-06-16T03:41:59Z

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/760

buildsworth-bk-app

I found one blocking issue in the debounce path: the current check-then-record flow can still let duplicate processes make multiple Agent API calls. Details inline.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 760, then answer my questions about the findings.

jordandcarter · 2026-06-16T04:24:49Z

@buildsworth-bk review

buildsworth-bk-app · 2026-06-16T04:25:23Z

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/763

Previous concerns have been addressed.

buildsworth-bk-app

Previous blocking concern is resolved in this revision: the Job API claim now records the exit status atomically before the Agent API declaration, so duplicate processes do not race through the debounce check. I didn't find any new high-confidence issues; leaving this as a comment rather than an approval because this changes Job API/CLI behavior, so it isn't an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 763, then answer my questions about the findings.

moskyb · 2026-06-16T23:38:46Z

as written it seems like the flow here is (and please correct me if i'm wrong):

Job calls promise-failure
promise-failure checks with the job API to see if the exit status has been claimed for this job
if the exit hasn't been claimed, it sends the HTTP request off to the agent API

in this situation, if the agent API returns an error for the first promised failure, doesn't that mean that all subsequent promise-failure calls will fail with a "exit status has already been claimed" message, even though it hasn't? if so, that would be a nasty surprise to callers.

i'm pretty sure you've already gone back and forth on this at least once, but if that is the case then i think we'll probably need to change the flow a bit. my recommendation would probably be to have the job API handler call the agent API, and conditionally set the cached value only if the request to the agent API succeeds.

jordandcarter · 2026-06-16T23:46:06Z

if the agent API returns an error for the first promised failure, doesn't that mean that all subsequent promise-failure calls will fail with a "exit status has already been claimed" message

Yes, if all 10 retries failed then future attempts would incorrectly be told the claim is already made. In such a case the job would continue until finished and then exit normally. I thought this rare case wasn't worth the complexity of moving the bk API call within the jobapi and handling simultaneous callers.

my recommendation would probably be to have the job API handler call the agent API, and conditionally set the cached value only if the request to the agent API succeeds.

Should the caller wait for the jobapi to make the API request? Does this cause simultaneous callers to be blocked until the jobapi has made the bk API call?

jordandcarter · 2026-06-17T04:38:47Z

@buildsworth-bk review

buildsworth-bk-app · 2026-06-17T04:39:28Z

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/808

buildsworth-bk-app

I didn’t find any high-confidence issues in the current revision. I’m leaving this as a comment rather than an approval because the change moves declaration behavior into the Job API/agent path, so it isn’t an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 808, then answer my questions about the findings.

Add a POST /api/current-job/v0/promise-failure endpoint that lets the 'job promise-failure' command debounce repeated calls. The server keeps a per-job set of exit statuses already claimed; the first caller to claim an exit status receives claimed=true and is responsible for declaring the promised failure to the Buildkite API, while later callers receive claimed=false so repeated calls (e.g. on every test failure) don't hammer the API. The Job API is job-scoped and enabled by default, so it needs no extra state keyed by job ID and works without opting into an experiment.

Before declaring a promised failure to the Buildkite API, claim the exit status through the Job API. Only the first caller for a given exit status contacts the API; later callers log at debug level and exit successfully. When the Job API is unavailable (--no-job-api, unsupported Windows, or run outside a job) the command falls back to declaring directly. Drop the --job flag and always target BUILDKITE_JOB_ID. The API authenticates with the job's own token and rejects declarations for any other job, so a promised failure only ever makes sense for the current job; a --job override could never succeed.

Per review feedback, move the Buildkite API call into the Job API server so it owns declaring promised failures. The first caller declares the exit status, blocking on the API so it can return an accurate result; concurrent callers wait and share that outcome; later callers return from the cached success. Failures aren't cached, so a later call can retry. This replaces the earlier 'claim' approach where the CLI made the API call itself after claiming an exit status, which couldn't give a blocked caller an accurate exit code.

The /promise-failure response now carries an outcome ('declared' or 'debounced') so callers can tell whether their call reached the Buildkite API or shared an earlier declaration. The CLI logs the distinction.

Previously every failed declaration was evicted so a later call could retry. That meant terminal client errors (e.g. 409 when a different exit status was already declared, or 422 when the job is no longer running) were re-sent to the Buildkite API on every call — the repeated load this feature exists to avoid. Cache terminal failures (4xx other than 429) alongside successes, so repeated calls return the cached result. Transient failures (5xx, network, 429) are still evicted so a later call can retry.

Extract the shared PromiseFailure retry loop into api.Client.PromiseFailureWithRetry, used by both the executor (Job API path) and the CLI direct fallback. This removes the duplicated roko loop and lets the fallback read the HTTP status directly instead of digging it out of an ErrorResponse. A warnf callback bridges the two callers' logger interfaces (shell.Logger.Warningf vs logger.Logger.Warnf). Drop TestPromiseFailureConcurrentError (coalescing is covered by TestPromiseFailureConcurrent, transient retry by the not-cached test) and TestPromiseFailurePanicRecovers, folding its panic-eviction-then-retry assertion into TestPromiseFailurePanicConcurrent.

Rename the handler method to handlePromiseFailure so it no longer shares a name with the promiseFailure struct type, and document that debouncing keys on exit status only, so the first caller's reason wins for a given exit status.

jordandcarter · 2026-06-17T20:54:09Z

@buildsworth-bk review

buildsworth-bk-app · 2026-06-17T20:54:55Z

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/831

buildsworth-bk-app

I didn't find any high-confidence issues in the current revision. I'm leaving this as a comment rather than an approval because this changes Job API/CLI behavior, so it isn't an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 831, then answer my questions about the findings.

Repeated calls to job promise-failure with the same exit status are debounced. Log that outcome at debug instead of info to avoid spamming job logs.

jordandcarter requested review from a team as code owners June 15, 2026 00:00

jordandcarter added the feature New user-facing feature! label Jun 15, 2026

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 40dbd64 to 83b7dfc Compare June 15, 2026 00:05

buildsworth-bk-app Bot previously requested changes Jun 15, 2026

View reviewed changes

Comment thread clicommand/job_promise_failure.go Outdated

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 5e03e27 to 37f4041 Compare June 15, 2026 01:01

buildsworth-bk-app Bot reviewed Jun 15, 2026

View reviewed changes

jordandcarter force-pushed the pb-1911-extend-agent-cli branch from 3d61536 to 58c96c2 Compare June 15, 2026 02:36

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 37f4041 to 7402f7c Compare June 15, 2026 02:39

jordandcarter force-pushed the pb-1911-extend-agent-cli branch from 58c96c2 to 1169b9f Compare June 15, 2026 02:57

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 98e02eb to d9b2b7b Compare June 15, 2026 03:57

jordandcarter force-pushed the pb-1911-extend-agent-cli branch from 1e068ef to 29f2a62 Compare June 15, 2026 23:01

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 6977367 to ae9eb17 Compare June 16, 2026 02:39

moskyb reviewed Jun 16, 2026

View reviewed changes

Base automatically changed from pb-1911-extend-agent-cli to main June 16, 2026 02:46

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 4 times, most recently from 368ccb4 to b9485aa Compare June 16, 2026 03:37

buildsworth-bk-app Bot previously requested changes Jun 16, 2026

View reviewed changes

Comment thread clicommand/job_promise_failure.go Outdated

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 5910ee8 to 9b973f7 Compare June 16, 2026 04:22

buildsworth-bk-app Bot reviewed Jun 16, 2026

View reviewed changes

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 9b973f7 to cab33a0 Compare June 16, 2026 21:25

jordandcarter requested a review from moskyb June 16, 2026 23:27

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 56c552a to 62fe859 Compare June 17, 2026 04:30

buildsworth-bk-app Bot reviewed Jun 17, 2026

View reviewed changes

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 9a2ce3b to a404178 Compare June 17, 2026 09:40

jordandcarter added 7 commits June 18, 2026 08:51

jobapi: report whether a promised failure was declared or debounced

c09e76e

The /promise-failure response now carries an outcome ('declared' or 'debounced') so callers can tell whether their call reached the Buildkite API or shared an earlier declaration. The CLI logs the distinction.

jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from a404178 to cca7976 Compare June 17, 2026 20:53

buildsworth-bk-app Bot reviewed Jun 17, 2026

View reviewed changes

Log debounced promise-failure at debug level

f55ee3f

Repeated calls to job promise-failure with the same exit status are debounced. Log that outcome at debug instead of info to avoid spamming job logs.

Conversation

jordandcarter commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Context

Changes

Testing

Disclosures / Credits

Uh oh!

jordandcarter commented Jun 15, 2026

Uh oh!

buildsworth-bk-app Bot commented Jun 15, 2026

Uh oh!

buildsworth-bk-app Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jordandcarter commented Jun 15, 2026

Uh oh!

buildsworth-bk-app Bot commented Jun 15, 2026

Uh oh!

buildsworth-bk-app Bot left a comment

Choose a reason for hiding this comment

Uh oh!

moskyb left a comment

Choose a reason for hiding this comment

Uh oh!

moskyb Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jordandcarter Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

jordandcarter commented Jun 16, 2026

Uh oh!

buildsworth-bk-app Bot commented Jun 16, 2026

Uh oh!

buildsworth-bk-app Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jordandcarter commented Jun 16, 2026

Uh oh!

buildsworth-bk-app Bot commented Jun 16, 2026

Uh oh!

buildsworth-bk-app Bot left a comment

Choose a reason for hiding this comment

Uh oh!

moskyb commented Jun 16, 2026

Uh oh!

jordandcarter commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jordandcarter commented Jun 17, 2026

Uh oh!

buildsworth-bk-app Bot commented Jun 17, 2026

Uh oh!

buildsworth-bk-app Bot left a comment

Choose a reason for hiding this comment

Uh oh!

jordandcarter commented Jun 17, 2026

Uh oh!

buildsworth-bk-app Bot commented Jun 17, 2026

Uh oh!

buildsworth-bk-app Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jordandcarter commented Jun 15, 2026 •

edited

Loading

jordandcarter commented Jun 16, 2026 •

edited

Loading