fix: faster docker in releases #311
Open
pratikbin wants to merge 9 commits into chopratejas:main
* ci(docker): cut release time from ~11m to ~4m
Five optimizations applied to the publish-docker workflow:
1. Native arm64 runner (ubuntu-24.04-arm) instead of QEMU emulation.
2. Collapse 8-variant matrix into a single bake invocation per platform;
builder stage shared in-memory across targets within one buildx run.
3. Cross-run caching via type=registry (ref=<image>-buildcache:<arch>),
replacing type=gha (per-workflow-run scoped, 10GB capped).
4. Builder pre-build subsumed by matrix collapse; one bake call per
platform builds builder once and reuses it for all 8 variants.
5. Cosign signing limited to root + slim variants (public-facing tags).
Other 6 share builder/runtime layers, are cryptographically equivalent.
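The per-platform overrides from items 1-3 can be sketched as the `--set` flags a single bake invocation would receive. This is a minimal illustration, not the workflow's actual step: the function name `bake_set_flags`, the example image name, and the `mode=max` cache setting are assumptions.

```shell
# bake_set_flags IMAGE ARCH
# Prints, one argument per line, the --set overrides for one platform:
# a single target platform plus a registry-backed build cache keyed by arch.
# The function name and mode=max are illustrative assumptions.
bake_set_flags() {
  _bf_image="$1"
  _bf_arch="$2"
  _bf_cache_ref="${_bf_image}-buildcache:${_bf_arch}"
  printf '%s\n' \
    "--set" "*.platform=linux/${_bf_arch}" \
    "--set" "*.cache-from=type=registry,ref=${_bf_cache_ref}" \
    "--set" "*.cache-to=type=registry,ref=${_bf_cache_ref},mode=max"
}

# Example (hypothetical image name):
#   bake_set_flags ghcr.io/example/app arm64
```

In a workflow the same key/value pairs would more likely be passed through the bake action's `set:` input than assembled in shell.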
Architecture:
setup -> build (amd64 + arm64 parallel) -> merge (8 variants parallel,
manifest list creation only) -> sign (root + slim) +
promote-latest
Per-platform builds push by digest. Merge job creates multi-arch manifest
lists via docker buildx imagetools create - no rebuild, just plumbing.
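As a rough sketch of the merge step, the function below assembles the `docker buildx imagetools create` command that stitches per-platform digests into one tagged manifest list. It only prints the command; the function name, image name, and digests are hypothetical.

```shell
# merge_manifest_cmd IMAGE TAG DIGEST...
# Prints the imagetools command that would combine the given per-platform
# digests (pushed earlier by digest) under IMAGE:TAG. Nothing is rebuilt;
# the command only writes a manifest list referencing existing images.
merge_manifest_cmd() {
  _mm_image="$1"
  _mm_tag="$2"
  shift 2
  _mm_refs=""
  for _mm_digest in "$@"; do
    _mm_refs="${_mm_refs} ${_mm_image}@${_mm_digest}"
  done
  echo "docker buildx imagetools create -t ${_mm_image}:${_mm_tag}${_mm_refs}"
}

# Example (hypothetical values):
#   merge_manifest_cmd ghcr.io/example/app v1 sha256:aaa sha256:bbb
```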
All dynamic interpolations in run: blocks routed through env: to prevent
shell injection via ${{ }} expansion of release tags / workflow_dispatch
inputs / action outputs.
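The difference can be reproduced outside of Actions. In the sketch below, `unsafe_run` pastes an untrusted value into the script text (which is what `${{ }}` expansion inside a `run:` block amounts to), while `safe_run` passes it as an environment variable. The tag value and function names are made up for illustration.

```shell
# A hostile release tag containing a command substitution (hypothetical).
untrusted_tag='v1.0.0$(echo INJECTED)'

# Unsafe: the value becomes part of the script source before the inner
# shell parses it, so the embedded $(...) is executed.
unsafe_run() {
  sh -c "echo tag=$1"
}

# Safe: the script text is fixed and the value arrives as data through the
# environment, mirroring an env: mapping on the workflow step.
safe_run() {
  RELEASE_TAG="$1" sh -c 'echo "tag=$RELEASE_TAG"'
}

# unsafe_run "$untrusted_tag"   prints: tag=v1.0.0INJECTED
# safe_run "$untrusted_tag"     prints: tag=v1.0.0$(echo INJECTED)
```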
Baseline avg: 11m wall (over 10 prior runs).
Expected: ~4-5m wall after warm cache.
* fix(ci): use bake --set platform (singular) not platforms
* ci(docker): drop setup-python, raise buildkit parallelism
- version-sync.py uses stdlib only, runner's python3 is sufficient
- explicit BuildKit max-parallelism=16 keeps vertex execution saturated
for the 8-target single bake invocation per platform
* perf(docker): use COPY --link for builder->runtime layer references
COPY --link makes BuildKit treat the copy as a metadata-only layer
reference instead of materializing the bytes. For the multi-stage build
with 8 runtime targets all copying ~50MB of site-packages from the
builder stage, this drops per-target COPY time from 20-35s to <5s.
Requires BuildKit 0.10+ (Docker 22.x+, present on all current ubuntu-*
runners).
* ci(docker): retry cosign sign on transient HTTP/2 errors
GHCR's signature repository occasionally returns HTTP/2 INTERNAL_ERROR
on stream resets during cosign sign --recursive, killing the job despite
the underlying signing operation being safe to retry. Wraps the cosign
invocation in a 3-attempt retry with 5s/10s/20s exponential backoff.
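A retry wrapper along these lines might look as follows. This is a sketch under assumptions, not the workflow's actual step: the function name and the `RETRY_*` variables are hypothetical, and it sleeps only between attempts, so with three attempts the waits are 5s and then 10s.

```shell
# retry_with_backoff CMD [ARGS...]
# Runs CMD up to RETRY_ATTEMPTS times (default 3). After each failed
# attempt except the last it sleeps, starting at RETRY_DELAY seconds
# (default 5) and doubling each time. RETRY_SLEEP_CMD exists only so the
# sleep can be stubbed out; all three variable names are assumptions.
retry_with_backoff() {
  _rb_attempts="${RETRY_ATTEMPTS:-3}"
  _rb_delay="${RETRY_DELAY:-5}"
  _rb_sleep="${RETRY_SLEEP_CMD:-sleep}"
  _rb_n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$_rb_n" -ge "$_rb_attempts" ]; then
      echo "failed after ${_rb_n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${_rb_n} failed; retrying in ${_rb_delay}s" >&2
    "$_rb_sleep" "$_rb_delay"
    _rb_delay=$((_rb_delay * 2))
    _rb_n=$((_rb_n + 1))
  done
}

# Example (IMAGE_REF is a placeholder):
#   retry_with_backoff cosign sign --yes --recursive "$IMAGE_REF"
```

Retrying is only appropriate here because, as the commit notes, the signing operation is safe to repeat.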
* ci: bump GitHub Actions to latest major versions
| Action | Before | After |
|-------------------------------------|--------|-------|
| actions/checkout | v4 | v6 |
| actions/cache | v4 | v5 |
| actions/download-artifact | v4 | v8 |
| actions/upload-artifact | v4 | v7 |
| actions/setup-node | v4 | v6 |
| actions/setup-python | v5 | v6 |
| codecov/codecov-action | v4 | v6 |
| docker/setup-buildx-action | v3 | v4 |
| sigstore/cosign-installer | v3 | v4 |
| softprops/action-gh-release | v2 | v3 |
| wagoid/commitlint-github-action | v5 | v6 |
Already on latest floating major (no change required):
docker/{bake,login,metadata}-action, Swatinem/rust-cache,
PyO3/maturin-action, pypa/gh-action-pypi-publish.
codecov-action v6 renamed `file` input to `files` (plural) — updated.
* fix(ci): pin sigstore/cosign-installer to v4.1.1
Action does not publish a floating `v4` tag, so `uses: ...@v4`
fails to resolve. Pin to v4.1.1 (latest).
- 3-day cooldown (7 for majors) avoids immediately-broken floating tags
(recently observed: cosign-installer v4 floating tag missing right after
release, fixed only at v4.1.1)
- Monday schedule keeps the PR cadence predictable
- group docker/* and actions/* major bumps so they merge as one PR rather
than 8+ individual ones
- bump open-pull-requests-limit to 10 to accommodate fan-out
Brainstormed design for opt-in PII + secret redaction stage that runs before request reaches upstream LLM provider. Locked decisions: Presidio + regex + detect-secrets hybrid; tag-redact masking; off by default with env / header / mirrored route opt-in; user msgs only; fail-open. Awaiting user review before writing implementation plan.
Address 4 critical + 3 warning + 1 info findings from review:
- C1: move PII out of TransformPipeline into pre-pipeline handler stage so cache hits, optimize=False, _bypass, and license skips cannot bypass redaction.
- C2: header is escalate-only; `off` ignored; min_score/labels overrides gated behind HEADROOM_PII_ALLOW_HEADER_OVERRIDES.
- C3: replace 1MB skip cap with 256KB chunk-scan + overlap so padding attacks cannot leak past the filter.
- C4: add Responses API walker covering input str/list and Responses item shapes (input_text).
- W5: add DetectorRunner with bounded ThreadPoolExecutor; document thread-cancel caveat.
- W6: walker uses .get() + isinstance() guards everywhere.
- W7: forced /pii/* routes default fail-closed (HTTP 503); env/header opt-in remains fail-open. Both configurable.
- I8: align spec with actual Transform.apply signature and HeadroomConfig name; PII redactor is no longer a Transform.
Summary
Cuts release CI wall time from 11m to 3m11s (-71%), comparing the average of 10 prior runs against 4 real post-merge releases on the fork. Brings all GitHub Actions to latest majors and tunes Dependabot. Motivation: I couldn't stand how long the Docker CI was taking.
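Validated end-to-end on the `pratikbin/headroom` fork via 4 sequential PRs + 4 real releases.
Change set #1: `publish-docker` workflow architecture
Five optimizations layered into `.github/workflows/docker.yml`:
1. Native arm64 runner (`ubuntu-24.04-arm`) instead of QEMU emulation on amd64. Drops the per-arm-build cost from ~150s to ~60-90s.
2. Collapse the 8-variant matrix into a single bake invocation per platform. The `builder` stage is shared in-memory across all targets within one buildx run instead of being rebuilt 8 times across separate matrix jobs.
3. Cross-run caching via a `type=registry` cache at `<image>-buildcache:<arch>`, replacing `type=gha`. Persists across releases and is not capped by the 10GB GitHub Actions cache budget.
4. Builder pre-build subsumed by the matrix collapse: one bake call per platform builds the builder once and reuses it for all 8 variants.
5. Cosign signing limited to the root + slim variants (public-facing tags).
Architecture: setup -> build (amd64 + arm64 parallel) -> merge (8 variants parallel, manifest list creation only) -> sign (root + slim) + promote-latest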
Per-platform builds push by digest (`push-by-digest=true`, `name-canonical=true`). The `merge` job creates multi-arch manifest lists via `docker buildx imagetools create`: pure plumbing, no rebuild.
All `${{ ... }}` interpolations in `run:` blocks are routed via `env:` to prevent shell injection through release tags / workflow_dispatch inputs / action outputs.
Change set #2: `COPY --link`, drop `setup-python`, cosign retry
Targeted follow-on tweaks after the architecture rewrite:
- `COPY --link` on builder→runtime layer copies (Dockerfile). BuildKit treats the copy as a metadata-only layer reference instead of materializing the bytes. For 8 runtime targets each copying ~50MB of `site-packages` from the shared builder, this drops per-target COPY time from 20-35s to <5s.
- Drop `actions/setup-python` in the `docker.yml` build job. `version-sync.py` uses stdlib only, so the runner's pre-installed `python3` is sufficient. Saves about 8s per build job.
- Retry `cosign sign` (3 attempts, 5s/10s/20s exponential backoff). GHCR's signature repository occasionally returns `stream ID 1; INTERNAL_ERROR` on HTTP/2 stream resets during `cosign sign --recursive`, killing the job despite the underlying signing operation being safe to retry.
- Explicit BuildKit `max-parallelism=16` via `setup-buildx-action`'s `buildkitd-config-inline`. No measurable delta (already default = 4 × NumCPU = 16 on 4-vCPU runners) but documents the intent.
Change set #3: bump GitHub Actions to latest majors
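Mass bump across 11 workflows, 68 references:
| Action | Before | After |
|-------------------------------------|--------|-------|
| actions/checkout | v4 | v6 |
| actions/cache | v4 | v5 |
| actions/download-artifact | v4 | v8 |
| actions/upload-artifact | v4 | v7 |
| actions/setup-node | v4 | v6 |
| actions/setup-python | v5 | v6 |
| codecov/codecov-action | v4 | v6 (`file` → `files` input rename) |
| docker/setup-buildx-action | v3 | v4 |
| sigstore/cosign-installer | v3 | v4.1.1 (no `v4` floating tag exists) |
| softprops/action-gh-release | v2 | v3 |
| wagoid/commitlint-github-action | v5 | v6 |
Already on latest floating major (no change required): `docker/{bake,login,metadata}-action`, `Swatinem/rust-cache`, `PyO3/maturin-action`, `pypa/gh-action-pypi-publish`.
Change set #4: Dependabot tuning for github-actions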
Refines the existing `.github/dependabot.yml` `github-actions` ecosystem entry:
- `cooldown.default-days: 3`, `semver-major-days: 7`: guards against immediately-broken floating tags. Discovered via `cosign-installer v4`: the floating major tag was missing right after the v4 release, only fixed at v4.1.1 a few days later. Cooldown would have caught this.
- `schedule.day: monday`: predictable PR cadence; less merge noise mid-week.
- `open-pull-requests-limit: 10`: accommodates fan-out when many actions release together.
- Group `docker/*` major bumps (login/metadata/buildx/bake tested together) so they merge as one PR.
- Group `actions/*` major bumps (checkout/cache/setup-* family) similarly.
Critical-path breakdown (final state, validation release)
arm64 build dominates the critical path. Further reductions would require paid larger runners (`ubuntu-24.04-arm-8-cores`).
Files modified
- `.github/workflows/docker.yml`: full rewrite (setup → build matrix → merge matrix → sign + promote-latest)
- `.github/workflows/{ci,release,publish,devcontainers,docs,eval,init-e2e,init-native-e2e,wrap-e2e,rust}.yml`: action version bumps
- `Dockerfile`: `COPY --link` on `--from=builder` references in both runtime stages
- `.github/dependabot.yml`: cooldown + grouped major bumps for `github-actions`
- `docker-bake.hcl`: unchanged (platforms overridden via `set:` in bake-action)
Test plan
- `actionlint` clean across all workflows
- `workflow_dispatch` on `docker.yml`: 4 successful test runs on fork
- `release.yml` runs after each change set merged: 4 successful releases on `pratikbin/headroom`
- Signatures land in the `<image>-signatures` repo
- `:latest` promotion succeeds with annotation timestamp
Verification commands
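A minimal sketch of what such checks might look like, assuming keyless cosign signing from GitHub Actions. The function name, image name, tag, and identity pattern are hypothetical placeholders, not values from this PR; the function prints the commands rather than running them.

```shell
# print_verify_cmds IMAGE TAG IDENTITY_REGEX
# Prints two checks for a published tag without executing them, so the
# sketch runs without Docker, cosign, or registry access.
print_verify_cmds() {
  _pv_image="$1"
  _pv_tag="$2"
  _pv_identity="$3"
  # Shows the manifest list and its per-platform (amd64/arm64) entries.
  echo "docker buildx imagetools inspect ${_pv_image}:${_pv_tag}"
  # Checks the keyless signature against the GitHub Actions OIDC issuer.
  # If signatures live in a separate repository, cosign needs
  # COSIGN_REPOSITORY set to that location (assumption about the setup).
  echo "cosign verify --certificate-identity-regexp '${_pv_identity}' --certificate-oidc-issuer https://token.actions.githubusercontent.com ${_pv_image}:${_pv_tag}"
}

# Example (placeholder values):
#   print_verify_cmds ghcr.io/OWNER/IMAGE latest '^https://github.com/OWNER/REPO/'
```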
Risk
- Signing is limited to root + slim: consumers verifying `nonroot`/`code-*` tags should switch to verifying the manifest digest of the underlying image, or accept the trust transitively from root.
- `ubuntu-24.04-arm` is GA on public repos; private repos may need a plan tier check.
- Registry cache tags at `<image>-buildcache:{amd64,arm64}` will accumulate. An optional cleanup job could be added later.
Backport considerations
All four change sets can be merged independently: