Skip to content

Debounce buildkite-agent job promise-failure via the Job API#3996

Open
jordandcarter wants to merge 8 commits into
mainfrom
pb-1901-agent-debouncing-promised-failure
Open

Debounce buildkite-agent job promise-failure via the Job API#3996
jordandcarter wants to merge 8 commits into
mainfrom
pb-1901-agent-debouncing-promised-failure

Conversation

@jordandcarter

@jordandcarter jordandcarter commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description

Debounce repeated buildkite-agent job promise-failure calls in the agent so each exit status results in at most one Buildkite API call per job.

Customers are likely to call promise-failure from their test suite — potentially on every failing test. Each call spawns a new buildkite-agent process and would otherwise make its own API request, all of which the server must authenticate and process before discovering there's nothing new to record. For a suite with thousands of failures that's a lot of useless load on the Agent API.

This change moves the debouncing into the agent. The Job API server owns the declaration: each promise-failure invocation POSTs its exit status to the Job API, and the server declares the failure to the Buildkite API at most once per exit status. The first caller for a given exit status declares it, blocking on the API so it can return an accurate result; concurrent callers for the same exit status wait and share that outcome; later callers return from the cached success. Terminal failures (4xx other than 429, e.g. 409 or 422) are cached too, so repeated calls don't keep re-sending a declaration the API will always reject; transient failures (5xx, network, 429) aren't cached, so a later call can retry. Because the caller blocks until the result is known, the command exits with an accurate code rather than guessing.

The response reports an outcome of declared (this call reached the Buildkite API) or debounced (it shared an earlier declaration), and the CLI logs the distinction.

The Job API is job-scoped and enabled by default, so the state is naturally per-job (no keying by job ID) and there's no experiment to opt into. When the Job API is unavailable (--no-job-api, unsupported Windows) or can't be reached, the command falls back to declaring directly; the endpoint is idempotent for the same exit status, so a duplicate is safe.

This PR also drops the --job flag added to the command in #3992. The endpoint requires the job's own token and rejects declarations for any other job, so a promised failure only ever makes sense for the current job — the command now always targets BUILDKITE_JOB_ID.

Context

Follows on from #3992, which added the buildkite-agent job promise-failure command.

Changes

  • jobapi: the POST /api/current-job/v0/promise-failure endpoint now declares the promised failure to the Buildkite API itself, coalescing concurrent and repeated calls so each exit status is declared at most once successfully. The declarer is injected via WithPromiseFailureDeclarer. The response carries an outcome (declared/debounced).
  • clicommand: declare the promised failure through the Job API (blocking for an accurate result) and log whether it was declared or debounced; fall back to declaring directly when the Job API is unavailable or unreachable. Drop the --job flag and always target the current job.

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Added server tests over a real socket — sequential (first call declares, repeat is debounced, a different status declares on its own), concurrent (many simultaneous calls of the same status yield exactly one API declaration, all callers see success), error handling (terminal failures are cached, transient failures aren't and can retry; status normalization to 502), and panic safety — plus a client test for DeclarePromiseFailure.

Disclosures / Credits

I used Amp to design and implement the debouncing and tests, with GPT-5 for a code review pass. I reviewed all of the changes myself.

@jordandcarter jordandcarter requested review from a team as code owners June 15, 2026 00:00
@jordandcarter jordandcarter added the feature New user-facing feature! label Jun 15, 2026
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 40dbd64 to 83b7dfc Compare June 15, 2026 00:05
@jordandcarter

Copy link
Copy Markdown
Contributor Author

@buildsworth-bk review

@buildsworth-bk-app

Copy link
Copy Markdown

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/685

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking behavior issue around preserving the existing --job override; details inline.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 685, then answer my questions about the findings.

Comment thread clicommand/job_promise_failure.go Outdated
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 5e03e27 to 37f4041 Compare June 15, 2026 01:01
@jordandcarter

Copy link
Copy Markdown
Contributor Author

@buildsworth-bk review

@buildsworth-bk-app

Copy link
Copy Markdown

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/690

@buildsworth-bk-app buildsworth-bk-app Bot dismissed their stale review June 15, 2026 01:12

Previous concerns have been addressed.

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous concern is resolved in this revision: the command no longer accepts a separate --job target, so the Job API claim and the Buildkite API declaration are now both scoped to BUILDKITE_JOB_ID. I didn't find any new high-confidence issues. I'm leaving this as a comment rather than an approval because this changes agent Job API/CLI behavior, so it isn't an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 690, then answer my questions about the findings.

@jordandcarter jordandcarter force-pushed the pb-1911-extend-agent-cli branch from 3d61536 to 58c96c2 Compare June 15, 2026 02:36
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 37f4041 to 7402f7c Compare June 15, 2026 02:39
@jordandcarter jordandcarter force-pushed the pb-1911-extend-agent-cli branch from 58c96c2 to 1169b9f Compare June 15, 2026 02:57
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 98e02eb to d9b2b7b Compare June 15, 2026 03:57
@jordandcarter jordandcarter force-pushed the pb-1911-extend-agent-cli branch from 1e068ef to 29f2a62 Compare June 15, 2026 23:01
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 6977367 to ae9eb17 Compare June 16, 2026 02:39

@moskyb moskyb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love love love this. it's the exact kind of thing we should be using the job API for. i have one readability thing, but other than that this looks great.

Comment thread clicommand/job_promise_failure.go Outdated
Comment on lines +94 to +107
// Debounce repeated calls via the Job API: customers may call this on
// every test failure, but each exit status only needs declaring to the
// Buildkite API once. The first caller to claim an exit status declares
// it; later callers exit early without touching the API. When the Job
// API is unavailable (--no-job-api, unsupported Windows, or run outside
// a job), we fall back to declaring directly.
if jobAPI, err := jobapi.NewDefaultClient(ctx); err != nil {
l.Debugf("Job API unavailable, declaring promised failure without debouncing: %v", err)
} else if claimed, err := jobAPI.PromiseFailureClaim(ctx, exitStatus); err != nil {
l.Warnf("Couldn't reach the Job API to debounce the promised failure; declaring it directly: %v", err)
} else if !claimed {
l.Debugf("Promised exit status %d has already been claimed; skipping duplicate", exitStatus)
return nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error flow is pretty confusing and not very idiomatic for go — i'd probably extract this into a function like

func shouldSkip(ctx context.Context, l logger.Logger status int) bool {
	jobAPI, err := jobapi.NewDefaultClient(ctx)
	if err != nil {
		l.Debugf("Job API unavailable, declaring promised failure without debouncing: %v", err)
		return false
	}

	claimed, err := // ...

then just return nil if shouldSkip returns true

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before seeing this comment I went on a journey and switched from claim-before-attempt (the code you reviewed) to claim-on-success, then decided it wasn't worth the complexity and have reverted back to what you reviewed.

Plus I addressed your error flow improvement 👍

Base automatically changed from pb-1911-extend-agent-cli to main June 16, 2026 02:46
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 4 times, most recently from 368ccb4 to b9485aa Compare June 16, 2026 03:37
@jordandcarter

Copy link
Copy Markdown
Contributor Author

@buildsworth-bk review

@buildsworth-bk-app

Copy link
Copy Markdown

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/760

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one blocking issue in the debounce path: the current check-then-record flow can still let duplicate processes make multiple Agent API calls. Details inline.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 760, then answer my questions about the findings.

Comment thread clicommand/job_promise_failure.go Outdated
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch 2 times, most recently from 5910ee8 to 9b973f7 Compare June 16, 2026 04:22
@jordandcarter

Copy link
Copy Markdown
Contributor Author

@buildsworth-bk review

@buildsworth-bk-app

Copy link
Copy Markdown

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/763

@buildsworth-bk-app buildsworth-bk-app Bot dismissed their stale review June 16, 2026 04:29

Previous concerns have been addressed.

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous blocking concern is resolved in this revision: the Job API claim now records the exit status atomically before the Agent API declaration, so duplicate processes do not race through the debounce check. I didn't find any new high-confidence issues; leaving this as a comment rather than an approval because this changes Job API/CLI behavior, so it isn't an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 763, then answer my questions about the findings.

@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 9b973f7 to cab33a0 Compare June 16, 2026 21:25
@jordandcarter jordandcarter requested a review from moskyb June 16, 2026 23:27
@moskyb

moskyb commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

as written it seems like the flow here is (and please correct me if i'm wrong):

  • Job calls promise-failure
  • promise-failure checks with the job API to see if the exit status has been claimed for this job
  • if the exit hasn't been claimed, it sends the HTTP request off to the agent API

in this situation, if the agent API returns an error for the first promised failure, doesn't that mean that all subsequent promise-failure calls will fail with a "exit status has already been claimed" message, even though it hasn't? if so, that would be a nasty surprise to callers.

i'm pretty sure you've already gone back and forth on this at least once, but if that is the case then i think we'll probably need to change the flow a bit. my recommendation would probably be to have the job API handler call the agent API, and conditionally set the cached value only if the request to the agent API succeeds.

@jordandcarter

jordandcarter commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

if the agent API returns an error for the first promised failure, doesn't that mean that all subsequent promise-failure calls will fail with a "exit status has already been claimed" message

Yes, if all 10 retries failed then future attempts would incorrectly be told the claim is already made. In such a case the job would continue until finished and then exit normally. I thought this rare case wasn't worth the complexity of moving the bk API call within the jobapi and handling simultaneous callers.

my recommendation would probably be to have the job API handler call the agent API, and conditionally set the cached value only if the request to the agent API succeeds.

Should the caller wait for the jobapi to make the API request? Does this cause simultaneous callers to be blocked until the jobapi has made the bk API call?

@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 56c552a to 62fe859 Compare June 17, 2026 04:30
@jordandcarter

Copy link
Copy Markdown
Contributor Author

@buildsworth-bk review

@buildsworth-bk-app

Copy link
Copy Markdown

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/808

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t find any high-confidence issues in the current revision. I’m leaving this as a comment rather than an approval because the change moves declaration behavior into the Job API/agent path, so it isn’t an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 808, then answer my questions about the findings.

@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from 9a2ce3b to a404178 Compare June 17, 2026 09:40
Add a POST /api/current-job/v0/promise-failure endpoint that lets the
'job promise-failure' command debounce repeated calls. The server keeps a
per-job set of exit statuses already claimed; the first caller to claim an
exit status receives claimed=true and is responsible for declaring the
promised failure to the Buildkite API, while later callers receive
claimed=false so repeated calls (e.g. on every test failure) don't hammer
the API.

The Job API is job-scoped and enabled by default, so it needs no extra
state keyed by job ID and works without opting into an experiment.
Before declaring a promised failure to the Buildkite API, claim the exit
status through the Job API. Only the first caller for a given exit status
contacts the API; later callers log at debug level and exit successfully.
When the Job API is unavailable (--no-job-api, unsupported Windows, or run
outside a job) the command falls back to declaring directly.

Drop the --job flag and always target BUILDKITE_JOB_ID. The API
authenticates with the job's own token and rejects declarations for any
other job, so a promised failure only ever makes sense for the current
job; a --job override could never succeed.
Per review feedback, move the Buildkite API call into the Job API server
so it owns declaring promised failures. The first caller declares the
exit status, blocking on the API so it can return an accurate result;
concurrent callers wait and share that outcome; later callers return from
the cached success. Failures aren't cached, so a later call can retry.

This replaces the earlier 'claim' approach where the CLI made the API
call itself after claiming an exit status, which couldn't give a blocked
caller an accurate exit code.
The /promise-failure response now carries an outcome ('declared' or
'debounced') so callers can tell whether their call reached the Buildkite
API or shared an earlier declaration. The CLI logs the distinction.
Previously every failed declaration was evicted so a later call could
retry. That meant terminal client errors (e.g. 409 when a different exit
status was already declared, or 422 when the job is no longer running)
were re-sent to the Buildkite API on every call — the repeated load this
feature exists to avoid.

Cache terminal failures (4xx other than 429) alongside successes, so
repeated calls return the cached result. Transient failures (5xx, network,
429) are still evicted so a later call can retry.
Extract the shared PromiseFailure retry loop into
api.Client.PromiseFailureWithRetry, used by both the executor (Job API
path) and the CLI direct fallback. This removes the duplicated roko loop
and lets the fallback read the HTTP status directly instead of digging it
out of an ErrorResponse. A warnf callback bridges the two callers' logger
interfaces (shell.Logger.Warningf vs logger.Logger.Warnf).

Drop TestPromiseFailureConcurrentError (coalescing is covered by
TestPromiseFailureConcurrent, transient retry by the not-cached test) and
TestPromiseFailurePanicRecovers, folding its panic-eviction-then-retry
assertion into TestPromiseFailurePanicConcurrent.
Rename the handler method to handlePromiseFailure so it no longer shares a
name with the promiseFailure struct type, and document that debouncing
keys on exit status only, so the first caller's reason wins for a given
exit status.
@jordandcarter jordandcarter force-pushed the pb-1901-agent-debouncing-promised-failure branch from a404178 to cca7976 Compare June 17, 2026 20:53
@jordandcarter

Copy link
Copy Markdown
Contributor Author

@buildsworth-bk review

@buildsworth-bk-app

Copy link
Copy Markdown

On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/831

@buildsworth-bk-app buildsworth-bk-app Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find any high-confidence issues in the current revision. I'm leaving this as a comment rather than an approval because this changes Job API/CLI behavior, so it isn't an L1 review.

Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:

Download the buildsworth logs from build 831, then answer my questions about the findings.

Repeated calls to job promise-failure with the same exit status are
debounced. Log that outcome at debug instead of info to avoid spamming
job logs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New user-facing feature!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants