fix: faster docker in releases #311

Open
pratikbin wants to merge 9 commits into chopratejas:main from pratikbin:main

@pratikbin

Summary

Cuts release CI wall time from 11m → 3m11s (-71%), measured across 10 prior runs vs 4 post-merge real releases on the fork. Also brings all GitHub Actions to their latest majors and tunes Dependabot. The motivation: the Docker CI step was taking far too long.
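
The headline percentage can be rechecked with integer shell arithmetic. This is only a sanity check on the two durations quoted above, converted to seconds; it is not how the runs were measured.

```shell
# Durations quoted in the summary, in seconds.
before=$((11 * 60))        # 11m baseline
after=$((3 * 60 + 11))     # 3m11s after the change

# Integer percentage reduction (truncated, not rounded).
reduction=$(( (before - after) * 100 / before ))
echo "release wall time reduced by ${reduction}%"
```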

Validated end-to-end on pratikbin/headroom fork via 4 sequential PRs + 4 real releases.

Change set #1 — publish-docker workflow architecture

Five optimizations layered into .github/workflows/docker.yml:

  1. Native arm64 runner (ubuntu-24.04-arm) instead of QEMU emulation on amd64. Drops the per-arm-build cost from ~150s to ~60-90s.
  2. Single bake invocation per platform for all 8 variants. The Dockerfile builder stage is shared in-memory across all targets within one buildx run instead of being rebuilt 8 times across separate matrix jobs.
  3. type=registry cache at <image>-buildcache:<arch>, replacing type=gha. Persists across releases and is not subject to the 10 GB GitHub Actions cache limit, which applies per repository rather than per workflow run.
  4. Builder pre-build subsumed by matrix collapse: one bake call per platform builds the builder once and reuses it for all 8 variants.
  5. Cosign signing limited to root + slim variants (the public-facing tags). Other 6 variants share builder/runtime layers and are cryptographically equivalent; re-signing each was overhead.
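
As a rough sketch of items 1 to 3, the per-platform bake call might look like the command printed below. The function only prints the command; it does not run a build. The image name, the `*` target wildcard, and `mode=max` are assumptions for illustration, and the real values live in docker.yml and docker-bake.hcl.

```shell
# Print (not run) a single bake invocation for one platform.
# ASSUMPTIONS: image name, "*" wildcard and mode=max are illustrative only.
bake_cmd() (
  arch=$1    # amd64 or arm64
  image=$2   # placeholder such as ghcr.io/OWNER/IMAGE
  cache_ref="${image}-buildcache:${arch}"
  printf '%s\n' \
    "docker buildx bake \\" \
    "  --set '*.platform=linux/${arch}' \\" \
    "  --set '*.cache-from=type=registry,ref=${cache_ref}' \\" \
    "  --set '*.cache-to=type=registry,ref=${cache_ref},mode=max'"
)

bake_cmd arm64 ghcr.io/example/headroom
```

Because every target is built by the same invocation, the shared builder stage is resolved once per platform, and the per-architecture cache ref keeps amd64 and arm64 layers from overwriting each other.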

Architecture:

setup → build (amd64 + arm64 parallel)
       → merge (8 variants, parallel manifest lists)
       → sign (root + slim) + promote-latest

Per-platform builds push by digest (push-by-digest=true,name-canonical=true). The merge job creates multi-arch manifest lists via docker buildx imagetools create — pure plumbing, no rebuild.
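
A minimal sketch of the merge step, assuming the per-platform digests are already available as arguments. How the workflow actually hands digests from the build job to the merge job is not described here, and the image name, tag, and digests below are placeholders. The function prints the command rather than running it.

```shell
# Print (not run) the manifest-list command for one variant.
# ASSUMPTION: digests are passed in as arguments; names are placeholders.
merge_cmd() (
  image=$1; tag=$2; shift 2
  refs=""
  for digest in "$@"; do
    refs="${refs} ${image}@${digest}"   # one source ref per platform
  done
  echo "docker buildx imagetools create -t ${image}:${tag}${refs}"
)

merge_cmd ghcr.io/example/headroom 1.2.3-slim sha256:aaaa sha256:bbbb
```

The command only references images that are already in the registry, which is why the merge job needs no build context and no rebuild.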

All ${{ ... }} interpolations in run: blocks are routed via env: to prevent shell injection through release tags / workflow_dispatch inputs / action outputs.
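
The difference between the two styles can be reproduced in plain shell. The sketch below imitates textual `${{ ... }}` substitution by pasting a value into the script source, then imitates `env:` routing by passing the same value through the environment. The hostile tag is a made-up example.

```shell
# A hostile "release tag" (hypothetical) carrying a shell payload.
tag='v1"; echo INJECTED; echo "'

# Unsafe: the value is pasted into the script text before the shell
# parses it, which is what inline ${{ }} expansion amounts to.
unsafe_out=$(sh -c "echo \"building ${tag}\"")

# Safe: the script text is constant; the value arrives through the
# environment and is expanded as data, never parsed as code.
safe_out=$(TAG="$tag" sh -c 'echo "building $TAG"')

printf 'unsafe:\n%s\n' "$unsafe_out"
printf 'safe:\n%s\n' "$safe_out"
```

In the unsafe case the payload runs as a separate command; in the safe case the whole tag is echoed back as one literal string.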


Change set #2 — COPY --link, drop setup-python, cosign retry

Targeted follow-on tweaks after the architecture rewrite:

  • COPY --link on builder→runtime layer copies (Dockerfile). BuildKit treats the copy as a metadata-only layer reference instead of materializing the bytes. For 8 runtime targets each copying ~50MB of site-packages from the shared builder, drops per-target COPY time from 20-35s to <5s.
  • Drop actions/setup-python in docker.yml build job. version-sync.py uses stdlib only, the runner's pre-installed python3 is sufficient. -8s per build job.
  • Cosign sign retry (3 attempts, exp backoff 5s/10s/20s). GHCR's signature endpoint occasionally returns stream ID 1; INTERNAL_ERROR on HTTP/2 stream resets during cosign sign --recursive, killing the job despite the underlying signing operation being safe to retry.
  • Explicit BuildKit max-parallelism=16 via setup-buildx-action's buildkitd-config-inline. No measurable delta (already default = 4 × NumCPU = 16 on 4-vCPU runners) but documents the intent.
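
The retry wrapper described in the cosign bullet could be sketched as a small shell function. This is an illustration rather than the code in docker.yml: the function name and argument order are assumptions, and how the 5s/10s/20s schedule maps onto three attempts is not spelled out in this description.

```shell
# retry MAX_ATTEMPTS FIRST_DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or attempts run out, doubling the
# delay after each failure. Returns the exit status of the last attempt.
# ASSUMPTION: name and signature are hypothetical, for illustration only.
retry() (
  max=$1; delay=$2; shift 2
  attempt=1
  while :; do
    "$@" && exit 0
    status=$?
    [ "$attempt" -ge "$max" ] && exit "$status"
    echo "attempt ${attempt}/${max} failed (exit ${status}); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
)
```

A call such as `retry 3 5 cosign sign --yes --recursive "$IMAGE@$DIGEST"` would wait 5s and then 10s between its three attempts, with both variables supplied by the workflow. Retrying is only appropriate because the signing operation is described above as safe to repeat.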

Change set #3 — bump GitHub Actions to latest majors

Mass bump across 11 workflows, 68 references:

| Action                              | Before | After                                  |
|-------------------------------------|--------|----------------------------------------|
| actions/checkout                    | v4     | v6                                     |
| actions/cache                       | v4     | v5                                     |
| actions/download-artifact           | v4     | v8                                     |
| actions/upload-artifact             | v4     | v7                                     |
| actions/setup-node                  | v4     | v6                                     |
| actions/setup-python                | v5     | v6                                     |
| codecov/codecov-action              | v4     | v6 (+ file → files input rename)       |
| docker/setup-buildx-action          | v3     | v4 (devcontainers only)                |
| sigstore/cosign-installer           | v3     | v4.1.1 (no v4 floating tag exists)     |
| softprops/action-gh-release         | v2     | v3                                     |
| wagoid/commitlint-github-action     | v5     | v6                                     |

Already on latest floating major (no change required): docker/{bake,login,metadata}-action, Swatinem/rust-cache, PyO3/maturin-action, pypa/gh-action-pypi-publish.


Change set #4 — Dependabot tuning for github-actions

Refines existing .github/dependabot.yml github-actions ecosystem entry:

  • cooldown.default-days: 3, semver-major-days: 7 — guards against immediately-broken floating tags. Discovered via cosign-installer v4: floating major tag was missing right after the v4 release, only fixed at v4.1.1 a few days later. Cooldown would have caught this.
  • schedule.day: monday — predictable PR cadence; less merge noise mid-week.
  • open-pull-requests-limit: 10 — accommodate fan-out when many actions release together.
  • Group docker/* major bumps (login/metadata/buildx/bake tested together) so they merge as one PR.
  • Group actions/* major bumps (checkout/cache/setup-* family) similarly.

Critical-path breakdown (final state, validation release)

detect-version          5s
publish-docker / setup  6s
publish-docker / build  91s  (arm64, runs in parallel with amd64 91s)
publish-docker / merge  11-36s × 8 (parallel)
publish-docker / sign   17-28s × 2 (parallel; root + slim)
publish-docker / promote-latest  15s
build (py + npm pack)   34s  (parallel with publish-docker)
publish-pypi / npm / gh-pkgs  12-31s (parallel)
create-release          14s  (sequential after all)
─────────────────────────────────────
TOTAL                   3m25s

arm64 build dominates the critical path. Further reductions would require paid larger runners (ubuntu-24.04-arm-8-cores).


Files modified

  • .github/workflows/docker.yml — full rewrite: setup → build matrix → merge matrix → sign + promote-latest
  • .github/workflows/{ci,release,publish,devcontainers,docs,eval,init-e2e,init-native-e2e,wrap-e2e,rust}.yml — action version bumps
  • Dockerfile — COPY --link on --from=builder references in both runtime stages
  • .github/dependabot.yml — cooldown + grouped major bumps for github-actions
  • docker-bake.hcl — unchanged (platforms overridden via set: in bake-action)

Test plan

  • actionlint clean across all workflows
  • workflow_dispatch on docker.yml — 4 successful test runs on fork
  • Real release.yml runs after each change set merged — 4 successful releases on pratikbin/headroom
  • Validation release (workflow_dispatch on main, no code change) — green at 3m25s
  • All 8 multi-arch manifests created (amd64 + arm64) per release
  • Cosign signatures verifiable at <image>-signatures repo
  • :latest promotion succeeds with annotation timestamp
  • Verify on upstream after merge

Verification commands

# Wall time of recent releases
gh run list --workflow=release.yml --repo chopratejas/headroom --limit 5 \
  --json databaseId,createdAt,updatedAt,conclusion \
  --jq '.[] | {id: .databaseId, dur: ((.updatedAt|fromdateiso8601) - (.createdAt|fromdateiso8601)), concl: .conclusion}'

# Per-job breakdown of a specific run
gh run view <RUN_ID> --repo chopratejas/headroom --json jobs \
  --jq '.jobs[] | "\(.name): \(.startedAt) → \(.completedAt) [\(.conclusion)]"'

# Verify cosign signatures on root variant
COSIGN_REPOSITORY=ghcr.io/chopratejas/headroom-signatures \
  cosign verify ghcr.io/chopratejas/headroom:<version> \
  --certificate-identity-regexp 'https://github.com/.*/headroom/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com

Risk

  • Breaking change for downstream verifiers: cosign now signs only root + slim manifests. Consumers who previously verified specific nonroot/code-* tags should switch to verifying the manifest digest of the underlying image, or accept the trust transitively from root.
  • arm64 native runner availability: ubuntu-24.04-arm is GA on public repos; private repos may need plan tier check.
  • Registry cache size growth: <image>-buildcache:{amd64,arm64} will accumulate. Optional cleanup job could be added later.

Backport considerations

All four change sets can be merged independently:

* ci(docker): cut release time from ~11m to ~4m

Five optimizations applied to the publish-docker workflow:

1. Native arm64 runner (ubuntu-24.04-arm) instead of QEMU emulation.
2. Collapse 8-variant matrix into a single bake invocation per platform;
   builder stage shared in-memory across targets within one buildx run.
3. Cross-run caching via type=registry (ref=<image>-buildcache:<arch>),
   replacing type=gha (per-workflow-run scoped, 10GB capped).
4. Builder pre-build subsumed by matrix collapse; one bake call per
   platform builds builder once and reuses it for all 8 variants.
5. Cosign signing limited to root + slim variants (public-facing tags).
   Other 6 share builder/runtime layers, are cryptographically equivalent.

Architecture:
  setup -> build (amd64 + arm64 parallel) -> merge (8 variants parallel,
           manifest list creation only) -> sign (root + slim) +
           promote-latest

Per-platform builds push by digest. Merge job creates multi-arch manifest
lists via docker buildx imagetools create - no rebuild, just plumbing.

All dynamic interpolations in run: blocks routed through env: to prevent
shell injection via ${{ }} expansion of release tags / workflow_dispatch
inputs / action outputs.

Baseline avg: 11m wall (over 10 prior runs).
Expected: ~4-5m wall after warm cache.

* fix(ci): use bake --set platform (singular) not platforms
* ci(docker): drop setup-python, raise buildkit parallelism

- version-sync.py uses stdlib only, runner's python3 is sufficient
- explicit BuildKit max-parallelism=16 keeps vertex execution saturated
  for the 8-target single bake invocation per platform

* perf(docker): use COPY --link for builder->runtime layer references

COPY --link makes BuildKit treat the copy as a metadata-only layer
reference instead of materializing the bytes. For the multi-stage
build with 8 runtime targets all copying ~50MB of site-packages
from the builder stage, this drops per-target COPY time from
20-35s to <5s.

Requires BuildKit 0.10+ (bundled with Docker Engine 23.0+, present on
all current ubuntu-* runners).

* ci(docker): retry cosign sign on transient HTTP/2 errors

GHCR's signature repository occasionally returns HTTP/2 INTERNAL_ERROR
on stream resets during cosign sign --recursive, killing the job
despite the underlying signing operation being safe to retry.

Wraps the cosign invocation in a 3-attempt retry with 5s/10s/20s
exponential backoff.

* ci: bump GitHub Actions to latest major versions

| Action                              | Before | After |
|-------------------------------------|--------|-------|
| actions/checkout                    | v4     | v6    |
| actions/cache                       | v4     | v5    |
| actions/download-artifact           | v4     | v8    |
| actions/upload-artifact             | v4     | v7    |
| actions/setup-node                  | v4     | v6    |
| actions/setup-python                | v5     | v6    |
| codecov/codecov-action              | v4     | v6    |
| docker/setup-buildx-action          | v3     | v4    |
| sigstore/cosign-installer           | v3     | v4    |
| softprops/action-gh-release         | v2     | v3    |
| wagoid/commitlint-github-action     | v5     | v6    |

Already on latest floating major (no change required):
docker/{bake,login,metadata,setup-buildx}-action, Swatinem/rust-cache,
PyO3/maturin-action, pypa/gh-action-pypi-publish.

codecov-action v6 renamed `file` input to `files` (plural) — updated.

* fix(ci): pin sigstore/cosign-installer to v4.1.1

Action does not publish a floating `v4` tag, so `uses: ...@v4`
fails to resolve. Pin to v4.1.1 (latest).

- 3-day cooldown (7 for majors) avoids immediately-broken floating tags
  (recently observed: cosign-installer v4 floating tag missing right
  after release, fixed only at v4.1.1)
- Monday schedule keeps PR cadence predictable
- group docker/* and actions/* major bumps so they merge as one PR
  rather than 8+ individual ones
- bump open-pull-requests-limit to 10 to accommodate fan-out

Brainstormed design for opt-in PII + secret redaction stage that
runs before request reaches upstream LLM provider. Locked decisions:
Presidio + regex + detect-secrets hybrid; tag-redact masking; off by
default with env / header / mirrored route opt-in; user msgs only;
fail-open. Awaiting user review before writing implementation plan.

Address 4 critical + 3 warning + 1 info findings from review:

- C1: move PII out of TransformPipeline into pre-pipeline handler
  stage so cache hits, optimize=False, _bypass, and license skips
  cannot bypass redaction.
- C2: header is escalate-only; `off` ignored; min_score/labels
  overrides gated behind HEADROOM_PII_ALLOW_HEADER_OVERRIDES.
- C3: replace 1MB skip cap with 256KB chunk-scan + overlap so
  padding attacks cannot leak past the filter.
- C4: add Responses API walker covering input str/list and
  Responses item shapes (input_text).
- W5: add DetectorRunner with bounded ThreadPoolExecutor; document
  thread-cancel caveat.
- W6: walker uses .get() + isinstance() guards everywhere.
- W7: forced /pii/* routes default fail-closed (HTTP 503);
  env/header opt-in remains fail-open. Both configurable.
- I8: align spec with actual Transform.apply signature and
  HeadroomConfig name; PII redactor is no longer a Transform.