
chore(infra): extend disk-guard to cover bind-mount target roots #1026

Open
noahgift wants to merge 1 commit into main from chore/disk-guard-bind-mount-coverage

@noahgift
Contributor

Summary

Root cause timeline

Fix

New helper `prune_bind_mount_target_roots` walks each root in `$BIND_MOUNT_ROOTS` (default `/mnt/nvme-raid0/targets/aprender-ci`):

  • Always removes the `debug/` subdir (an orphan from the pre-isolation era; no current workflow mounts it)
  • In nightly mode, removes PR# subdirs older than `STALE_DAYS` days
  • In pre-job mode, removes PR# subdirs older than 60 min (aggressive disk recovery), so fresh in-flight dirs survive
  • Always preserves the `main` subdir (push-to-main CI reuses it)
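
For reviewers who want the rules above in one place, here is a minimal sketch of how such a helper could look. It is not the code from this PR: the function name `prune_bind_mount_roots_sketch`, the `DRY_RUN` switch, the `STALE_DAYS` default of 7, and the use of the subdir's own mtime as the staleness signal are all assumptions made for illustration. It also treats every subdir other than `main` and `debug` as a PR# dir.

```shell
# Illustrative sketch only; names and defaults marked "assumed" are not from the PR.
prune_bind_mount_roots_sketch() {
  local mode="$1" threshold_min root dir name
  local roots="${BIND_MOUNT_ROOTS:-/mnt/nvme-raid0/targets/aprender-ci}"
  local stale_days="${STALE_DAYS:-7}"   # assumed default
  local dry_run="${DRY_RUN:-1}"         # assumed switch; report only by default

  if [ "$mode" = "nightly" ]; then
    threshold_min=$(( stale_days * 24 * 60 ))
  else
    threshold_min=60                    # pre-job floor described above
  fi

  # Unquoted on purpose: the variable is a space-separated list of roots.
  for root in $roots; do
    [ -d "$root" ] || continue
    for dir in "$root"/*/; do
      [ -d "$dir" ] || continue         # also skips an unmatched glob
      dir="${dir%/}"
      name="${dir##*/}"
      case "$name" in
        main) continue ;;               # always preserved
        debug) ;;                       # always a prune candidate
        *)
          # Assumed staleness test: keep the dir unless its own mtime is
          # older than the threshold.
          [ -n "$(find "$dir" -maxdepth 0 -mmin "+$threshold_min" -print)" ] || continue
          ;;
      esac
      if [ "$dry_run" = "1" ]; then
        echo "would prune: $dir"
      else
        rm -rf -- "$dir"
        echo "pruned: $dir"
      fi
    done
  done
}
```

Calling `prune_bind_mount_roots_sketch pre-job` with `DRY_RUN=1` only lists candidates. A directory's mtime changes only when entries are added or removed directly inside it, so a real implementation may need a different freshness signal.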

Space-separated `BIND_MOUNT_ROOTS` env var lets sibling fleets (sovereign-ci-paiml-mcp-agent-toolkit etc.) extend coverage via config only.
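
As a rough illustration of the config-only extension, the sketch below shows how a space-separated `BIND_MOUNT_ROOTS` value splits into individual roots. The helper name `list_bind_mount_roots` is hypothetical, and any second root passed in would be a placeholder, since the PR does not state the sibling fleets' paths.

```shell
# Illustrative sketch: print one configured root per line.
list_bind_mount_roots() {
  local root
  # Unquoted expansion so the shell splits on whitespace. Paths that contain
  # spaces or glob characters are therefore not supported by this format.
  for root in ${BIND_MOUNT_ROOTS:-/mnt/nvme-raid0/targets/aprender-ci}; do
    printf '%s\n' "$root"
  done
}
```

With `BIND_MOUNT_ROOTS` unset, the function prints only the default root; with two space-separated paths, it prints both, one per line.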

Deployment

Deployed to intel 2026-04-23T12:58Z alongside the PR #1001 version update (intel was still on an older build). Nightly dry-run confirmed no unexpected candidates under the new path after manual cleanup.

```
$ sudo md5sum /usr/local/bin/runner-disk-guard.sh
921e055c55a2c8f1838aac6809d60840 /usr/local/bin/runner-disk-guard.sh
$ md5sum scripts/runner-infra/runner-disk-guard.sh
921e055c55a2c8f1838aac6809d60840 scripts/runner-infra/runner-disk-guard.sh
```
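
The manual comparison above could be wrapped in a small check so a mismatch fails loudly. This is a sketch under assumptions: the helper name `same_md5` is hypothetical, and reading the deployed copy may require `sudo`, as in the transcript above.

```shell
# Illustrative sketch: succeed only if both files have the same MD5 digest.
same_md5() {
  local a b
  # Return 2 if either file cannot be read, so "unreadable" is
  # distinguishable from "different".
  [ -r "$1" ] && [ -r "$2" ] || return 2
  a=$(md5sum "$1" | cut -d' ' -f1)
  b=$(md5sum "$2" | cut -d' ' -f1)
  [ "$a" = "$b" ]
}
```

For example, `same_md5 /usr/local/bin/runner-disk-guard.sh scripts/runner-infra/runner-disk-guard.sh || echo "deployed copy differs"`.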

Test plan

  • `bash -n` syntax-check passes
  • Manual nightly dry-run on intel emits expected log lines ("nightly: / at 61% …") with no unintended prunes
  • CI must pass (`ci / gate` + `workspace-test`)
  • Next full-disk recovery cycle should keep intel online without manual intervention

🤖 Generated with Claude Code

The disk-guard added in #1001 walked only /home/noah/data/actions-runner*/_work/*/target/
(runner-workspace target dirs totalling ~75G across 8 runners). The actual source of the
disk fill that took intel offline on 2026-04-23 was /mnt/nvme-raid0/targets/aprender-ci/*:
per-PR bind-mount target dirs from ci.yml's task-#134 isolation, holding 1.9T including a
359G orphan `debug/` dir from the pre-isolation era. Disk-guard never touched them.

Adds new BIND_MOUNT_ROOTS (default `/mnt/nvme-raid0/targets/aprender-ci`) and a
prune_bind_mount_target_roots() helper:

- Always prunes `debug/` subdir (orphan, no current workflow bind-mounts it).
- Prunes PR# subdirs stale past a minute threshold (nightly: STALE_DAYS×24×60 min;
  pre-job: 60-min floor so fresh in-flight dirs survive full-disk recovery).
- Preserves `main` (push-to-main CI reuses it).

Space-separated BIND_MOUNT_ROOTS env var lets the same script cover sibling fleets
(sovereign-ci-paiml-mcp-agent-toolkit etc.) without code changes.

Deployed to intel 2026-04-23T12:58Z; nightly dry-run confirmed no unexpected prune
candidates under the new path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) on April 23, 2026 13:01