Debounce buildkite-agent job promise-failure via the Job API#3996
Debounce buildkite-agent job promise-failure via the Job API#3996jordandcarter wants to merge 8 commits into
buildkite-agent job promise-failure via the Job API#3996Conversation
40dbd64 to
83b7dfc
Compare
|
@buildsworth-bk review |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/685 |
There was a problem hiding this comment.
I found one blocking behavior issue around preserving the existing --job override; details inline.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 685, then answer my questions about the findings.
5e03e27 to
37f4041
Compare
|
@buildsworth-bk review |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/690 |
Previous concerns have been addressed.
There was a problem hiding this comment.
Previous concern is resolved in this revision: the command no longer accepts a separate --job target, so the Job API claim and the Buildkite API declaration are now both scoped to BUILDKITE_JOB_ID. I didn't find any new high-confidence issues. I'm leaving this as a comment rather than an approval because this changes agent Job API/CLI behavior, so it isn't an L1 review.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 690, then answer my questions about the findings.
3d61536 to
58c96c2
Compare
37f4041 to
7402f7c
Compare
58c96c2 to
1169b9f
Compare
98e02eb to
d9b2b7b
Compare
1e068ef to
29f2a62
Compare
6977367 to
ae9eb17
Compare
moskyb
left a comment
There was a problem hiding this comment.
love love love this. it's the exact kind of thing we should be using the job API for. i have one readability thing, but other than that this looks great.
| // Debounce repeated calls via the Job API: customers may call this on | ||
| // every test failure, but each exit status only needs declaring to the | ||
| // Buildkite API once. The first caller to claim an exit status declares | ||
| // it; later callers exit early without touching the API. When the Job | ||
| // API is unavailable (--no-job-api, unsupported Windows, or run outside | ||
| // a job), we fall back to declaring directly. | ||
| if jobAPI, err := jobapi.NewDefaultClient(ctx); err != nil { | ||
| l.Debugf("Job API unavailable, declaring promised failure without debouncing: %v", err) | ||
| } else if claimed, err := jobAPI.PromiseFailureClaim(ctx, exitStatus); err != nil { | ||
| l.Warnf("Couldn't reach the Job API to debounce the promised failure; declaring it directly: %v", err) | ||
| } else if !claimed { | ||
| l.Debugf("Promised exit status %d has already been claimed; skipping duplicate", exitStatus) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
this error flow is pretty confusing and not very idiomatic for go — i'd probably extract this into a function like
func shouldSkip(ctx context.Context, l logger.Logger status int) bool {
jobAPI, err := jobapi.NewDefaultClient(ctx)
if err != nil {
l.Debugf("Job API unavailable, declaring promised failure without debouncing: %v", err)
return false
}
claimed, err := // ...then just return nil if shouldSkip returns true
There was a problem hiding this comment.
Before seeing this comment I went on a journey and switched from claim-before-attempt (the code you reviewed) to claim-on-success, then decided it wasn't worth the complexity and have reverted back to what you reviewed.
Plus I addressed your error flow improvement 👍
368ccb4 to
b9485aa
Compare
|
@buildsworth-bk review |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/760 |
There was a problem hiding this comment.
I found one blocking issue in the debounce path: the current check-then-record flow can still let duplicate processes make multiple Agent API calls. Details inline.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 760, then answer my questions about the findings.
5910ee8 to
9b973f7
Compare
|
@buildsworth-bk review |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/763 |
Previous concerns have been addressed.
There was a problem hiding this comment.
Previous blocking concern is resolved in this revision: the Job API claim now records the exit status atomically before the Agent API declaration, so duplicate processes do not race through the debounce check. I didn't find any new high-confidence issues; leaving this as a comment rather than an approval because this changes Job API/CLI behavior, so it isn't an L1 review.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 763, then answer my questions about the findings.
9b973f7 to
cab33a0
Compare
|
as written it seems like the flow here is (and please correct me if i'm wrong):
in this situation, if the agent API returns an error for the first promised failure, doesn't that mean that all subsequent i'm pretty sure you've already gone back and forth on this at least once, but if that is the case then i think we'll probably need to change the flow a bit. my recommendation would probably be to have the job API handler call the agent API, and conditionally set the cached value only if the request to the agent API succeeds. |
Yes, if all 10 retries failed then future attempts would incorrectly be told the claim is already made. In such a case the job would continue until finished and then exit normally. I thought this rare case wasn't worth the complexity of moving the bk API call within the jobapi and handling simultaneous callers.
Should the caller wait for the jobapi to make the API request? Does this cause simultaneous callers to be blocked until the jobapi has made the bk API call? |
56c552a to
62fe859
Compare
|
@buildsworth-bk review |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/808 |
There was a problem hiding this comment.
I didn’t find any high-confidence issues in the current revision. I’m leaving this as a comment rather than an approval because the change moves declaration behavior into the Job API/agent path, so it isn’t an L1 review.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 808, then answer my questions about the findings.
9a2ce3b to
a404178
Compare
Add a POST /api/current-job/v0/promise-failure endpoint that lets the 'job promise-failure' command debounce repeated calls. The server keeps a per-job set of exit statuses already claimed; the first caller to claim an exit status receives claimed=true and is responsible for declaring the promised failure to the Buildkite API, while later callers receive claimed=false so repeated calls (e.g. on every test failure) don't hammer the API. The Job API is job-scoped and enabled by default, so it needs no extra state keyed by job ID and works without opting into an experiment.
Before declaring a promised failure to the Buildkite API, claim the exit status through the Job API. Only the first caller for a given exit status contacts the API; later callers log at debug level and exit successfully. When the Job API is unavailable (--no-job-api, unsupported Windows, or run outside a job) the command falls back to declaring directly. Drop the --job flag and always target BUILDKITE_JOB_ID. The API authenticates with the job's own token and rejects declarations for any other job, so a promised failure only ever makes sense for the current job; a --job override could never succeed.
Per review feedback, move the Buildkite API call into the Job API server so it owns declaring promised failures. The first caller declares the exit status, blocking on the API so it can return an accurate result; concurrent callers wait and share that outcome; later callers return from the cached success. Failures aren't cached, so a later call can retry. This replaces the earlier 'claim' approach where the CLI made the API call itself after claiming an exit status, which couldn't give a blocked caller an accurate exit code.
The /promise-failure response now carries an outcome ('declared' or
'debounced') so callers can tell whether their call reached the Buildkite
API or shared an earlier declaration. The CLI logs the distinction.
Previously every failed declaration was evicted so a later call could retry. That meant terminal client errors (e.g. 409 when a different exit status was already declared, or 422 when the job is no longer running) were re-sent to the Buildkite API on every call — the repeated load this feature exists to avoid. Cache terminal failures (4xx other than 429) alongside successes, so repeated calls return the cached result. Transient failures (5xx, network, 429) are still evicted so a later call can retry.
Extract the shared PromiseFailure retry loop into api.Client.PromiseFailureWithRetry, used by both the executor (Job API path) and the CLI direct fallback. This removes the duplicated roko loop and lets the fallback read the HTTP status directly instead of digging it out of an ErrorResponse. A warnf callback bridges the two callers' logger interfaces (shell.Logger.Warningf vs logger.Logger.Warnf). Drop TestPromiseFailureConcurrentError (coalescing is covered by TestPromiseFailureConcurrent, transient retry by the not-cached test) and TestPromiseFailurePanicRecovers, folding its panic-eviction-then-retry assertion into TestPromiseFailurePanicConcurrent.
Rename the handler method to handlePromiseFailure so it no longer shares a name with the promiseFailure struct type, and document that debouncing keys on exit status only, so the first caller's reason wins for a given exit status.
a404178 to
cca7976
Compare
|
@buildsworth-bk review |
|
On it! Follow along with my review here: https://buildkite.com/buildkite/buildsworth-review-pull-request/builds/831 |
There was a problem hiding this comment.
I didn't find any high-confidence issues in the current revision. I'm leaving this as a comment rather than an approval because this changes Job API/CLI behavior, so it isn't an L1 review.
Want to dig deeper? The full session log is attached to this Buildkite build. Download the session file and open a new pi session with it:
Download the buildsworth logs from build 831, then answer my questions about the findings.
Repeated calls to job promise-failure with the same exit status are debounced. Log that outcome at debug instead of info to avoid spamming job logs.
Description
Debounce repeated
buildkite-agent job promise-failurecalls in the agent so each exit status results in at most one Buildkite API call per job.Customers are likely to call
promise-failurefrom their test suite — potentially on every failing test. Each call spawns a newbuildkite-agentprocess and would otherwise make its own API request, all of which the server must authenticate and process before discovering there's nothing new to record. For a suite with thousands of failures that's a lot of useless load on the Agent API.This change moves the debouncing into the agent. The Job API server owns the declaration: each
promise-failureinvocation POSTs its exit status to the Job API, and the server declares the failure to the Buildkite API at most once per exit status. The first caller for a given exit status declares it, blocking on the API so it can return an accurate result; concurrent callers for the same exit status wait and share that outcome; later callers return from the cached success. Terminal failures (4xx other than 429, e.g. 409 or 422) are cached too, so repeated calls don't keep re-sending a declaration the API will always reject; transient failures (5xx, network, 429) aren't cached, so a later call can retry. Because the caller blocks until the result is known, the command exits with an accurate code rather than guessing.The response reports an
outcomeofdeclared(this call reached the Buildkite API) ordebounced(it shared an earlier declaration), and the CLI logs the distinction.The Job API is job-scoped and enabled by default, so the state is naturally per-job (no keying by job ID) and there's no experiment to opt into. When the Job API is unavailable (
--no-job-api, unsupported Windows) or can't be reached, the command falls back to declaring directly; the endpoint is idempotent for the same exit status, so a duplicate is safe.This PR also drops the
--jobflag added to the command in #3992. The endpoint requires the job's own token and rejects declarations for any other job, so a promised failure only ever makes sense for the current job — the command now always targetsBUILDKITE_JOB_ID.Context
Follows on from #3992, which added the
buildkite-agent job promise-failurecommand.Changes
jobapi: thePOST /api/current-job/v0/promise-failureendpoint now declares the promised failure to the Buildkite API itself, coalescing concurrent and repeated calls so each exit status is declared at most once successfully. The declarer is injected viaWithPromiseFailureDeclarer. The response carries anoutcome(declared/debounced).clicommand: declare the promised failure through the Job API (blocking for an accurate result) and log whether it was declared or debounced; fall back to declaring directly when the Job API is unavailable or unreachable. Drop the--jobflag and always target the current job.Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go tool gofumpt -extra -w .)Added server tests over a real socket — sequential (first call declares, repeat is debounced, a different status declares on its own), concurrent (many simultaneous calls of the same status yield exactly one API declaration, all callers see success), error handling (terminal failures are cached, transient failures aren't and can retry; status normalization to 502), and panic safety — plus a client test for
DeclarePromiseFailure.Disclosures / Credits
I used Amp to design and implement the debouncing and tests, with GPT-5 for a code review pass. I reviewed all of the changes myself.