fix: faster docker in releases by pratikbin · Pull Request #311 · chopratejas/headroom

pratikbin · 2026-04-29T10:27:42Z

Summary

Cuts release CI wall time from 11m to 3m11s (-71%), comparing the average of 10 prior runs against 4 real post-merge releases on the fork. Brings all GitHub Actions to their latest majors and tunes Dependabot. Basically, I couldn't stand how long the Docker CI was taking.

Validated end-to-end on pratikbin/headroom fork via 4 sequential PRs + 4 real releases.

Change set #1 — `publish-docker` workflow architecture

Five optimizations layered into .github/workflows/docker.yml:

Native arm64 runner (ubuntu-24.04-arm) instead of QEMU emulation on amd64. Drops the per-arm-build cost from ~150s to ~60-90s.
Single bake invocation per platform for all 8 variants. The Dockerfile builder stage is shared in-memory across all targets within one buildx run instead of being rebuilt 8 times across separate matrix jobs.
type=registry cache at <image>-buildcache:<arch>, replacing type=gha. Persists across releases and is not capped by the 10 GB per-repository GitHub Actions cache limit.
Builder pre-build subsumed by matrix collapse: one bake call per platform builds the builder once and reuses it for all 8 variants.
Cosign signing limited to root + slim variants (the public-facing tags). Other 6 variants share builder/runtime layers and are cryptographically equivalent; re-signing each was overhead.
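A minimal sketch of how the per-platform bake overrides for the platform and registry cache might be assembled, written as a dry run that only prints flags. The `bake_flags` helper, the `IMAGE` variable, the `ghcr.io/example/app` image name, and `mode=max` on the cache export are assumptions for illustration; the actual workflow passes its overrides through bake-action's `set:` input and may differ.

```shell
# Print the `docker buildx bake` overrides for one architecture (nothing is executed).
# The function body is a subshell so its variables stay local.
bake_flags() (
  arch="$1"                                # amd64 or arm64
  image="${IMAGE:-ghcr.io/example/app}"    # hypothetical image name
  cache="type=registry,ref=${image}-buildcache:${arch}"
  printf '%s\n' \
    "--set=*.platform=linux/${arch}" \
    "--set=*.cache-from=${cache}" \
    "--set=*.cache-to=${cache},mode=max"   # mode=max is an assumption
)

# One bake call per platform would receive these flags; quote them when
# passing to a real command so the shell does not glob the leading `*`.
bake_flags arm64
```

Because the cache reference is keyed by architecture, the amd64 and arm64 jobs never overwrite each other's cache manifests.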

Architecture:

setup → build (amd64 + arm64 parallel)
       → merge (8 variants, parallel manifest lists)
       → sign (root + slim) + promote-latest

Per-platform builds push by digest (push-by-digest=true,name-canonical=true). The merge job creates multi-arch manifest lists via docker buildx imagetools create — pure plumbing, no rebuild.
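The push-by-digest and merge steps can be sketched roughly as follows, again as a dry run that prints what would be executed. The helper names, the example image and digests, and the `name=` and `push=true` parts of the output spec are assumptions; only `push-by-digest=true,name-canonical=true` and `docker buildx imagetools create` come from the description above.

```shell
# Output override for a per-platform build that pushes by digest only (no tag).
digest_output() (
  image="$1"
  printf '%s\n' "*.output=type=image,name=${image},push-by-digest=true,name-canonical=true,push=true"
)

# Command that stitches per-arch digests into one multi-arch manifest list.
# Usage: merge_cmd <image>:<tag> <digest>...
merge_cmd() (
  ref="$1"; shift
  image="${ref%:*}"                 # strip the tag to get the repository
  cmd="docker buildx imagetools create -t ${ref}"
  for digest in "$@"; do
    cmd="${cmd} ${image}@${digest}"
  done
  printf '%s\n' "$cmd"
)

# Hypothetical image and digests, for illustration only.
digest_output ghcr.io/example/app
merge_cmd ghcr.io/example/app:1.2.3 sha256:aaa sha256:bbb
```

The merge step only writes a manifest list that points at the two existing per-arch images, which is why it needs no build context or cache.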

All ${{ ... }} interpolations in run: blocks are routed via env: to prevent shell injection through release tags / workflow_dispatch inputs / action outputs.
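To illustrate why this matters: `${{ ... }}` expressions are substituted into the script text before the shell parses it, so a hostile value becomes code. The sketch below emulates that with plain `sh -c`; the tag value, the `RELEASE_TAG` variable, and the function names are made up for the demonstration.

```shell
# A hostile tag name, as an attacker might push it (hypothetical).
TAG='v1"; echo INJECTED #'

# Unsafe: the value is pasted into the script text and then parsed,
# which is what inline ${{ }} expansion inside run: amounts to.
run_unsafe() { sh -c "echo \"releasing $1\""; }

# Safe: the script text is fixed; the value arrives via the environment,
# which is what routing through env: achieves.
run_safe() { RELEASE_TAG="$1" sh -c 'echo "releasing $RELEASE_TAG"'; }

run_unsafe "$TAG"   # runs the injected `echo INJECTED` as a second command
run_safe "$TAG"     # prints the tag verbatim, quotes and all
```

In the safe form the shell only ever sees `$RELEASE_TAG` as a quoted variable reference, so nothing in the value can change the command structure.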

Change set #2 — `COPY --link`, drop `setup-python`, cosign retry

Targeted follow-on tweaks after the architecture rewrite:

COPY --link on builder→runtime layer copies (Dockerfile). BuildKit treats the copy as a metadata-only layer reference instead of materializing the bytes. For 8 runtime targets each copying ~50MB of site-packages from the shared builder, drops per-target COPY time from 20-35s to <5s.
Drop actions/setup-python in docker.yml build job. version-sync.py uses stdlib only, the runner's pre-installed python3 is sufficient. -8s per build job.
Cosign sign retry (3 attempts, exponential backoff 5s/10s/20s). GHCR's signature endpoint occasionally returns `stream ID 1; INTERNAL_ERROR` on HTTP/2 stream resets during `cosign sign --recursive`, which kills the job even though the underlying signing operation is safe to retry.
Explicit BuildKit max-parallelism=16 via setup-buildx-action's buildkitd-config-inline. No measurable delta (already default = 4 × NumCPU = 16 on 4-vCPU runners) but documents the intent.
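The cosign retry wrapper could look something like the sketch below. This is not the code from docker.yml; the `retry` function and the `SLEEP` override are hypothetical. Note that with three attempts this version only waits twice (5s, then 10s); how the workflow maps the 20s step onto its attempts is not shown in the description.

```shell
# retry ATTEMPTS FIRST_DELAY CMD...
# Runs CMD until it succeeds or ATTEMPTS is exhausted, doubling the delay
# after each failure. The function body is a subshell, so `exit` only
# leaves the function. SLEEP is a hypothetical hook for skipping real waits.
retry() (
  attempts="$1"; delay="$2"; shift 2
  n=1
  while :; do
    "$@" && exit 0
    [ "$n" -ge "$attempts" ] && exit 1
    echo "attempt $n/$attempts failed; retrying in ${delay}s" >&2
    ${SLEEP:-sleep} "$delay"
    delay=$(( $delay * 2 ))
    n=$(( $n + 1 ))
  done
)

# Intended use (placeholders left as-is):
#   retry 3 5 cosign sign --recursive "<image>@<digest>"

# Demo with waiting disabled: `false` always fails, so all three attempts are used.
SLEEP=true retry 3 5 false || echo "gave up after 3 attempts"
```

Retrying is only appropriate here because, as noted above, the signing operation is safe to repeat; a wrapper like this should not be put around non-idempotent steps.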

Change set #3 — bump GitHub Actions to latest majors

Mass bump across 11 workflows, 68 references:

| Action | Before | After |
|--------|--------|-------|
| `actions/checkout` | v4 | v6 |
| `actions/cache` | v4 | v5 |
| `actions/download-artifact` | v4 | v8 |
| `actions/upload-artifact` | v4 | v7 |
| `actions/setup-node` | v4 | v6 |
| `actions/setup-python` | v5 | v6 |
| `codecov/codecov-action` | v4 | v6 (+ `file` → `files` input rename) |
| `docker/setup-buildx-action` | v3 | v4 (devcontainers only) |
| `sigstore/cosign-installer` | v3 | v4.1.1 (no `v4` floating tag exists) |
| `softprops/action-gh-release` | v2 | v3 |
| `wagoid/commitlint-github-action` | v5 | v6 |

Already on latest floating major (no change required): docker/{bake,login,metadata}-action, Swatinem/rust-cache, PyO3/maturin-action, pypa/gh-action-pypi-publish.

Change set #4 — Dependabot tuning for github-actions

Refines existing .github/dependabot.yml github-actions ecosystem entry:

cooldown.default-days: 3, semver-major-days: 7 — guards against immediately-broken floating tags. Discovered via cosign-installer v4: floating major tag was missing right after the v4 release, only fixed at v4.1.1 a few days later. Cooldown would have caught this.
schedule.day: monday — predictable PR cadence; less merge noise mid-week.
open-pull-requests-limit: 10 — accommodate fan-out when many actions release together.
Group docker/* major bumps (login/metadata/buildx/bake tested together) so they merge as one PR.
Group actions/* major bumps (checkout/cache/setup-* family) similarly.

Critical-path breakdown (final state, validation release)

detect-version          5s
publish-docker / setup  6s
publish-docker / build  91s  (arm64, runs in parallel with amd64 91s)
publish-docker / merge  11-36s × 8 (parallel)
publish-docker / sign   17-28s × 2 (parallel; root + slim)
publish-docker / promote-latest  15s
build (py + npm pack)   34s  (parallel with publish-docker)
publish-pypi / npm / gh-pkgs  12-31s (parallel)
create-release          14s  (sequential after all)
─────────────────────────────────────
TOTAL                   3m25s

arm64 build dominates the critical path. Further reductions would require paid larger runners (ubuntu-24.04-arm-8-cores).

Files modified

.github/workflows/docker.yml — full rewrite: setup → build matrix → merge matrix → sign + promote-latest
.github/workflows/{ci,release,publish,devcontainers,docs,eval,init-e2e,init-native-e2e,wrap-e2e,rust}.yml — action version bumps
Dockerfile — COPY --link on --from=builder references in both runtime stages
.github/dependabot.yml — cooldown + grouped major bumps for github-actions
docker-bake.hcl — unchanged (platforms overridden via set: in bake-action)

Test plan

actionlint clean across all workflows
workflow_dispatch on docker.yml — 4 successful test runs on fork
Real release.yml runs after each change set merged — 4 successful releases on pratikbin/headroom
Validation release (workflow_dispatch on main, no code change) — green at 3m25s
All 8 multi-arch manifests created (amd64 + arm64) per release
Cosign signatures verifiable at <image>-signatures repo
:latest promotion succeeds with annotation timestamp
Verify on upstream after merge

Verification commands

# Wall time of recent releases
gh run list --workflow=release.yml --repo chopratejas/headroom --limit 5 \
  --json databaseId,createdAt,updatedAt,conclusion \
  --jq '.[] | {id: .databaseId, dur: ((.updatedAt|fromdateiso8601) - (.createdAt|fromdateiso8601)), concl: .conclusion}'

# Per-job breakdown of a specific run
gh run view <RUN_ID> --repo chopratejas/headroom --json jobs \
  --jq '.jobs[] | "\(.name): \(.startedAt) → \(.completedAt) [\(.conclusion)]"'

# Verify cosign signatures on root variant
COSIGN_REPOSITORY=ghcr.io/chopratejas/headroom-signatures \
  cosign verify ghcr.io/chopratejas/headroom:<version> \
  --certificate-identity-regexp 'https://github.com/.*/headroom/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

Risk

Breaking change for downstream verifiers: cosign now signs only root + slim manifests. Consumers who previously verified specific nonroot/code-* tags should switch to verifying the manifest digest of the underlying image, or accept the trust transitively from root.
arm64 native runner availability: ubuntu-24.04-arm is GA on public repos; private repos may need plan tier check.
Registry cache size growth: <image>-buildcache:{amd64,arm64} will accumulate. Optional cleanup job could be added later.

Backport considerations

All four change sets can be merged independently:

Change set #1 alone → 4m29s (-59%)
Change sets #1 + #2 → 3m43s (-66%)
Change sets #1 + #2 + #3 → 3m11s (-71%)
Change set #4 (Dependabot) is independent of the others
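The percentages above can be reproduced with a few lines of shell arithmetic. This is a sketch that assumes the baseline is exactly 11m (660s), whereas the PR reports it as an average; the helper names are hypothetical.

```shell
# to_seconds 3m11s -> 191 ; accepts "NmSSs" or "Nm"
to_seconds() (
  t="$1"
  m="${t%%m*}"               # minutes: text before the 'm'
  s="${t#*m}"; s="${s%s}"    # seconds: text after the 'm', minus the trailing 's'
  s="${s#0}"                 # drop one leading zero so "08" is not read as octal
  echo $(( $m * 60 + ${s:-0} ))
)

# pct_cut BEFORE AFTER -> percentage reduction, rounded to the nearest integer
pct_cut() (
  before="$(to_seconds "$1")"
  after="$(to_seconds "$2")"
  echo $(( ( ($before - $after) * 100 + $before / 2 ) / $before ))
)

for t in 4m29s 3m43s 3m11s; do
  echo "11m -> $t: -$(pct_cut 11m "$t")%"
done
```

With those inputs the loop prints reductions of 59%, 66%, and 71%, matching the figures quoted for each cumulative change set.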

* ci(docker): cut release time from ~11m to ~4m

  Five optimizations applied to the publish-docker workflow:

  1. Native arm64 runner (ubuntu-24.04-arm) instead of QEMU emulation.
  2. Collapse 8-variant matrix into a single bake invocation per platform; builder stage shared in-memory across targets within one buildx run.
  3. Cross-run caching via type=registry (ref=<image>-buildcache:<arch>), replacing type=gha (per-workflow-run scoped, 10GB capped).
  4. Builder pre-build subsumed by matrix collapse; one bake call per platform builds builder once and reuses it for all 8 variants.
  5. Cosign signing limited to root + slim variants (public-facing tags). Other 6 share builder/runtime layers, are cryptographically equivalent.

  Architecture:

    setup -> build (amd64 + arm64 parallel) -> merge (8 variants parallel, manifest list creation only) -> sign (root + slim) + promote-latest

  Per-platform builds push by digest. Merge job creates multi-arch manifest lists via docker buildx imagetools create - no rebuild, just plumbing.

  All dynamic interpolations in run: blocks routed through env: to prevent shell injection via ${{ }} expansion of release tags / workflow_dispatch inputs / action outputs.

  Baseline avg: 11m wall (over 10 prior runs). Expected: ~4-5m wall after warm cache.

* fix(ci): use bake --set platform (singular) not platforms

* ci(docker): drop setup-python, raise buildkit parallelism

  - version-sync.py uses stdlib only, runner's python3 is sufficient
  - explicit BuildKit max-parallelism=16 keeps vertex execution saturated for the 8-target single bake invocation per platform

* perf(docker): use COPY --link for builder->runtime layer references

  COPY --link makes BuildKit treat the copy as a metadata-only layer reference instead of materializing the bytes. For the multi-stage build with 8 runtime targets all copying ~50MB of site-packages from the builder stage, this drops per-target COPY time from 20-35s to <5s. Requires BuildKit 0.10+ (Docker 22.x+, present on all current ubuntu-* runners).

* ci(docker): retry cosign sign on transient HTTP/2 errors

  GHCR's signature repository occasionally returns HTTP/2 INTERNAL_ERROR on stream resets during cosign sign --recursive, killing the job despite the underlying signing operation being safe to retry. Wraps the cosign invocation in a 3-attempt retry with 5s/10s/20s exponential backoff.

* ci: bump GitHub Actions to latest major versions

  | Action                              | Before | After |
  |-------------------------------------|--------|-------|
  | actions/checkout                    | v4     | v6    |
  | actions/cache                       | v4     | v5    |
  | actions/download-artifact           | v4     | v8    |
  | actions/upload-artifact             | v4     | v7    |
  | actions/setup-node                  | v4     | v6    |
  | actions/setup-python                | v5     | v6    |
  | codecov/codecov-action              | v4     | v6    |
  | docker/setup-buildx-action          | v3     | v4    |
  | sigstore/cosign-installer           | v3     | v4    |
  | softprops/action-gh-release         | v2     | v3    |
  | wagoid/commitlint-github-action     | v5     | v6    |

  Already on latest floating major (no change required): docker/{bake,login,metadata,setup-buildx}-action, Swatinem/rust-cache, PyO3/maturin-action, pypa/gh-action-pypi-publish.

  codecov-action v6 renamed `file` input to `files` (plural); updated.

* fix(ci): pin sigstore/cosign-installer to v4.1.1

  Action does not publish a floating `v4` tag, so `uses: ...@v4` fails to resolve. Pin to v4.1.1 (latest).

- 3-day cooldown (7 for majors) avoids immediately-broken floating tags (recently observed: cosign-installer v4 floating tag missing right after release, fixed only at v4.1.1)
- monday schedule pinpoints PR cadence
- group docker/* and actions/* major bumps so they merge as one PR rather than 8+ individual ones
- bump open-pull-requests-limit to 10 to accommodate fan-out

Brainstormed design for opt-in PII + secret redaction stage that runs before request reaches upstream LLM provider. Locked decisions: Presidio + regex + detect-secrets hybrid; tag-redact masking; off by default with env / header / mirrored route opt-in; user msgs only; fail-open. Awaiting user review before writing implementation plan.

Address 4 critical + 3 warning + 1 info findings from review:

- C1: move PII out of TransformPipeline into pre-pipeline handler stage so cache hits, optimize=False, _bypass, and license skips cannot bypass redaction.
- C2: header is escalate-only; `off` ignored; min_score/labels overrides gated behind HEADROOM_PII_ALLOW_HEADER_OVERRIDES.
- C3: replace 1MB skip cap with 256KB chunk-scan + overlap so padding attacks cannot leak past the filter.
- C4: add Responses API walker covering input str/list and Responses item shapes (input_text).
- W5: add DetectorRunner with bounded ThreadPoolExecutor; document thread-cancel caveat.
- W6: walker uses .get() + isinstance() guards everywhere.
- W7: forced /pii/* routes default fail-closed (HTTP 503); env/header opt-in remains fail-open. Both configurable.
- I8: align spec with actual Transform.apply signature and HeadroomConfig name; PII redactor is no longer a Transform.

pratikbin added 9 commits April 29, 2026 01:51

Merge branch 'chopratejas:main' into main

0a35123

chore: sync plugin versions to 0.11.7

b4229e2

Merge branch 'upstream/main' into main

628e3a4

pratikbin wants to merge 9 commits into chopratejas:main from pratikbin:main
