
multi data disk support #20

Merged
CMGS merged 18 commits into master from feat/multi-data-disk
Apr 27, 2026

Conversation

@CMGS (Contributor) commented Apr 27, 2026

No description provided.

CMGS added 18 commits April 27, 2026 13:37
Introduce explicit Role classification on StorageConfig so the rest of
the codebase can stop inferring disk type from RO/Serial/filename. Adds
MountPoint, FSType, DirectIO fields used by the upcoming user data-disk
feature, and a runtime validator that enforces Role-RO consistency,
data-disk fstype/mount/serial rules, and DirectIO scope.

VMConfig grows a transient DataDisks slice that the CLI populates from
--data-disk and Create consumes when calling PrepareDataDisks.

This is the type-layer scaffold; producers and classifiers switch in
follow-up commits.
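
As a rough illustration of the type-layer scaffold described above, the shape might look like the sketch below. The field names (Role, MountPoint, FSType, DirectIO) come from the commit message; the concrete Role values, the validator body, and everything else are assumptions, not the actual cocoon code.

```go
package main

import "fmt"

// Hypothetical Role classification; values are illustrative.
type Role string

const (
	RoleLayer  Role = "layer"  // read-only base layer
	RoleCOW    Role = "cow"    // writable copy-on-write disk
	RoleCidata Role = "cidata" // cloud-init seed disk
	RoleData   Role = "data"   // user data disk
)

// StorageConfig sketch: RO/Serial predate this branch; Role,
// MountPoint, FSType, DirectIO are the newly added fields.
type StorageConfig struct {
	Path       string
	RO         bool
	Serial     string
	Role       Role
	MountPoint string // data disks only
	FSType     string // data disks only
	DirectIO   *bool  // nil = backend default
}

// validate enforces the Role-RO consistency and DirectIO scope
// rules the commit describes, in simplified form.
func (sc StorageConfig) validate() error {
	switch sc.Role {
	case RoleLayer, RoleCidata:
		if !sc.RO {
			return fmt.Errorf("%s disk %q must be read-only", sc.Role, sc.Path)
		}
	case RoleCOW, RoleData:
		if sc.RO {
			return fmt.Errorf("%s disk %q must be writable", sc.Role, sc.Path)
		}
	default:
		return fmt.Errorf("missing role on %q", sc.Path)
	}
	if sc.DirectIO != nil && sc.Role != RoleData {
		return fmt.Errorf("DirectIO only applies to data disks")
	}
	return nil
}

func main() {
	bad := StorageConfig{Path: "base.img", Role: RoleLayer, RO: false}
	fmt.Println(bad.validate())
}
```

The point of the explicit Role is visible here: the validator can reject a writable layer outright instead of every consumer re-deriving disk type from RO/Serial heuristics.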

Double-write phase: every site that constructs a StorageConfig now sets
Role alongside the existing RO/Serial fields. Classifiers still consume
RO/Serial; the next commit flips them to Role atomically.

Producers updated:
  - images/oci, images/cloudimg: base layers tagged Layer
  - hypervisor/utils.PrepareOCICOW: COW tagged COW
  - cloudhypervisor.prepareCloudimg: overlay COW, cidata Cidata
  - cloudhypervisor.ensureCloneCidata: appended cidata Cidata
  - cloudhypervisor.rebuildStorageConfigs (clone reconstruction): Role
    inferred from chDisk via roleFromCHDisk
  - firecracker.rebuildCloneStorage: layers Layer, COW

The CH/FC clone-time reconstructions need Role here (not in a later
commit) because once classifiers switch to Role, configs without Role
would silently misbehave on the clone path.

All disk classification now reads StorageConfig.Role instead of inferring
from RO, Serial==CowSerial, or basename matching. With the producer
double-write in place, this is a complete swap.

  - args.activeDisks: skips Role==Cidata; isCidataDisk deleted
  - helper.ReverseLayerSerials: filters Role==Layer
  - clone.updateCOWPath: matches Role==COW; the directBoot/non-directBoot
    branch collapses since Role is unambiguous
  - clone.updateCloneCidataPath, ensureCloneCidata: Role==Cidata
  - clone tests retargeted to Role
  - hypervisor.ExtractBlobIDs / VerifyBaseFiles: Role==Layer
  - firecracker.buildCmdline: nLayers via Role==Layer
  - firecracker.withSourceCOWLocked: Role==COW
  - firecracker.rebuildCloneStorage iterates by Role
  - firecracker.saveSnapshotMeta value-copies the StorageConfig so Role
    survives serialization (avoids losing newly added fields again)
  - cmd/vm/run.printPostCloneHints: cloudimg detection via ImageType
  - cmd/vm/debug.printFCDebug: nLayers via Role

User-facing flag --data-disk lands. Each value is parsed as
size=20G[,name=...][,fstype=ext4|none][,mount=/mnt/x][,directio=on|off|auto];
omitted name is filled with dataN, omitted mount with /mnt/<name>, omitted
fstype with ext4. fstype=none requires empty mount. Names cap at 20 chars
to match Linux's virtio by-id truncation, and the cocoon- prefix is
reserved.

hypervisor.PrepareDataDisks creates each <runDir>/data-<name>.raw via
OpenFile+Truncate (the PrepareOCICOW pattern; os.Truncate would not
create the file); formatting with mkfs.ext4 reuses InitCOWFilesystem,
and fstype=none leaves the file unformatted.

Both backends append data disks after their writable rootfs disk:
  - CH OCI:        layers... + cow + data...
  - CH cloudimg:   overlay + data... + cidata
                   (cidata stays last so activeDisks() filtering after
                    first boot doesn't shift devPath of data disks)
  - FC OCI:        layers... + cow + data...

Snapshot/clone/restore still see only the new disks as raw files; the
follow-up commits make them roundtrip correctly.
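
A minimal sketch of how such a spec string could be parsed, assuming the grammar described above. parseDataDiskSpec and parseSize here are illustrative stand-ins for the real cmd/core helpers; the directio tristate is omitted for brevity.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DataDiskSpec holds one parsed --data-disk value (sketch).
type DataDiskSpec struct {
	SizeBytes int64
	Name      string
	FSType    string
	Mount     string
	MountSet  bool // distinguishes mount=<empty> from omitted
}

func parseDataDiskSpec(s string) (DataDiskSpec, error) {
	spec := DataDiskSpec{FSType: "ext4"} // default fstype per the commit
	for _, kv := range strings.Split(s, ",") {
		k, v, ok := strings.Cut(kv, "=")
		if !ok {
			return spec, fmt.Errorf("expected key=value, got %q", kv)
		}
		switch k {
		case "size":
			n, err := parseSize(v)
			if err != nil {
				return spec, err
			}
			spec.SizeBytes = n
		case "name":
			// cocoon- prefix reserved; 20-char cap matches virtio by-id truncation
			if strings.HasPrefix(v, "cocoon-") || len(v) > 20 {
				return spec, fmt.Errorf("invalid name %q", v)
			}
			spec.Name = v
		case "fstype":
			if v != "ext4" && v != "none" {
				return spec, fmt.Errorf("unsupported fstype %q", v)
			}
			spec.FSType = v
		case "mount":
			spec.Mount, spec.MountSet = v, true
		default:
			return spec, fmt.Errorf("unknown key %q", k)
		}
	}
	if spec.SizeBytes < 16<<20 {
		return spec, fmt.Errorf("size must be at least 16MiB")
	}
	if spec.FSType == "none" && spec.Mount != "" {
		return spec, fmt.Errorf("fstype=none requires empty mount")
	}
	return spec, nil
}

// parseSize handles the G/M suffixes used in the examples above.
func parseSize(v string) (int64, error) {
	mult := int64(1)
	switch {
	case strings.HasSuffix(v, "G"):
		mult, v = 1<<30, strings.TrimSuffix(v, "G")
	case strings.HasSuffix(v, "M"):
		mult, v = 1<<20, strings.TrimSuffix(v, "M")
	}
	n, err := strconv.ParseInt(v, 10, 64)
	return n * mult, err
}

func main() {
	spec, err := parseDataDiskSpec("size=20G,name=db,mount=/mnt/db")
	fmt.Println(spec, err)
}
```

The MountSet bool mirrors the mount=<empty>-vs-omitted distinction the tests below exercise; the dataN auto-name and /mnt/<name> default fill-in would happen in a later normalize step.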

CH writes a cocoon.json sidecar inside every snapshot tar that mirrors
config.json's disk shape — same order, same length — but carries the
cocoon-only fields (Role, MountPoint, FSType, DirectIO) that CH's schema
cannot hold.

Sidecar is built by parsing the snapshot's just-written config.json,
then looking each disk path up in rec.StorageConfigs. This deliberately
avoids activeDisks(rec): a cloudimg VM snapshotted between FirstBooted=true
and the next stop still has cidata attached in CH's view, so config.json
shows it; activeDisks(rec) would not, breaking the sidecar/config.json
length invariant patchCHConfig depends on.

Clone reads the sidecar, validates it through ValidateStorageConfigs
(may come from imported snapshots, untrusted), then proceeds with the
StorageConfigs already carrying full Role/MountPoint/FSType. The old
rebuildStorageConfigs helper and the roleFromCHDisk inference are
removed; the sidecar makes them unnecessary. The
TestRebuildStorageConfigs fixture is dropped along with them.

direct.cloneSnapshotFiles needs no change: cocoon.json falls into the
Meta classifier and is plain-copied into the clone's runDir.
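
The length-and-order invariant can be illustrated with a toy version of the construction, assuming simplified chDisk/StorageConfig shapes (the real code parses the snapshot's just-written config.json):

```go
package main

import "fmt"

// Simplified stand-ins for CH's disk entry and cocoon's StorageConfig.
type chDisk struct{ Path string }

type StorageConfig struct {
	Path string
	Role string
}

// buildSidecar mirrors config.json's disk order exactly: one sidecar
// entry per chDisk, resolved by path against the record's configs.
// Using chDisks (not a filtered view like activeDisks) preserves the
// sidecar/config.json length invariant described above.
func buildSidecar(chDisks []chDisk, recCfgs []StorageConfig) ([]StorageConfig, error) {
	byPath := make(map[string]StorageConfig, len(recCfgs))
	for _, sc := range recCfgs {
		byPath[sc.Path] = sc
	}
	out := make([]StorageConfig, 0, len(chDisks))
	for _, d := range chDisks {
		sc, ok := byPath[d.Path]
		if !ok {
			return nil, fmt.Errorf("disk %q in config.json has no StorageConfig", d.Path)
		}
		out = append(out, sc)
	}
	return out, nil
}

func main() {
	sidecar, err := buildSidecar(
		[]chDisk{{"l0"}, {"cow.raw"}, {"cidata.iso"}},
		[]StorageConfig{{"cow.raw", "cow"}, {"l0", "layer"}, {"cidata.iso", "cidata"}},
	)
	fmt.Println(len(sidecar), err)
}
```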

Snapshot now reflinks each Role==Data disk into the staging dir during
the pause window (same window as COW) so the tar carries them and
streamed/direct clone places them at the new runDir. CH and FC follow
identical patterns; data disks are not subject to FirstBooted filtering,
so they're enumerated directly off rec.StorageConfigs rather than via
activeDisks().

cleanSnapshotFiles matchers in both backends' direct.go now include
data-*.raw so a restore (which goes kill-VM → clean → reflink) cannot
leave stale data disks from a previous incarnation alongside the freshly
extracted ones. CH also cleans the cocoon.json sidecar; FC already had
its sidecar in the matcher under a different name.

withSourceCOWLocked only held the COW lock; with data disks added, a
concurrent snapshot reflinking those same files races against a clone
that renames+symlinks them via createDriveRedirects. The new helper
collects every Role==COW || Role==Data path, sorts by string, and
locks them sequentially via withCOWPathLocked. The dictionary order
guarantees concurrent callers acquire locks in the same sequence, so
deadlock is impossible. recoverStaleBackup runs once per path after
its lock is held, matching the previous semantics for the COW.

clone.go's call site switches to the new helper; snapshot.go replaces
its bare withCOWPathLocked(cowPath, ...) the same way. The old
withSourceCOWLocked is removed.
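
The ordering core of that helper can be sketched as below. The real helper builds on withCOWPathLocked and also runs recoverStaleBackup per path; the names here (withWritableDisksLocked, lockFor) are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// pathLocks maps each disk path to its own mutex (sketch of a
// per-path lock registry).
var pathLocks sync.Map // path -> *sync.Mutex

func lockFor(path string) *sync.Mutex {
	m, _ := pathLocks.LoadOrStore(path, &sync.Mutex{})
	return m.(*sync.Mutex)
}

// withWritableDisksLocked locks every path in dictionary order, runs
// fn, then unlocks in reverse. Because all callers sort identically,
// two concurrent callers always request locks in the same sequence,
// so circular wait — and therefore deadlock — cannot occur.
func withWritableDisksLocked(paths []string, fn func() error) error {
	sorted := append([]string(nil), paths...)
	sort.Strings(sorted)
	for _, p := range sorted {
		lockFor(p).Lock()
	}
	defer func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			lockFor(sorted[i]).Unlock()
		}
	}()
	return fn()
}

func main() {
	err := withWritableDisksLocked(
		[]string{"data-b.raw", "cow.raw", "data-a.raw"},
		func() error {
			fmt.Println("all writable disks locked")
			return nil
		})
	fmt.Println(err)
}
```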

Add hypervisor.ValidateSnapshotIntegrity (sidecar runs through
ValidateStorageConfigs and every Role==Data disk has its data-<serial>.raw
present) and hypervisor.ValidateRoleSequence (snapshot is a strict prefix
of rec; only allowed extension is trailing cidata). Each backend wraps
these into a preflight that runs BEFORE killing the live VM, so a
malformed/imported snapshot can't cost an outage.

Clone-side fixes:
  - CH restorePatchStorageConfigs filters Role==Cidata when
    ensureCloneCidata appended one (replaces tail-slice that broke when
    data disks shifted cidata away from the tail in the unfiltered list)
  - CH buildStateReplacements iterates min(len(chCfg), len(storageConfigs))
    so the +1 cidata append no longer wipes path replacements
  - CH updateDataDiskPaths rewrites Role==Data Path to the clone's runDir
    using Serial as the basename source
  - CH ensureCloneCidata calls types.ValidateStorageConfigs after the
    cidata append, since sidecar-time validation can't see the appendee
  - FC rebuildCloneStorage rebuilds in original sidecar order, rewriting
    COW + Data paths to the clone runDir, keeping Layer paths verbatim,
    and refusing the impossible Role==Cidata case explicitly

Restore-side fixes:
  - CH/FC streaming Restore preflight at staging before kill
  - CH/FC DirectRestore preflight at srcDir before kill (required
    splitting prepareRestore into resolve + kill so preflight sits between)
  - CH restoreAfterExtract uses len(sidecar) for the prefix slice instead
    of len(snapshotCfg.Disks); equivalent today but explicit about which
    file is the source of truth
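
The ValidateRoleSequence rule reads naturally as a prefix check; the following is a sketch under that reading (the signature, the string role spelling, and the interpretation that rec may extend the snapshot by at most one trailing cidata are all assumptions):

```go
package main

import "fmt"

// validateRoleSequence checks that the snapshot's role sequence is a
// prefix of the record's, and that the record extends it by at most
// one trailing cidata (the disk ensureCloneCidata may append).
func validateRoleSequence(snapshot, rec []string) error {
	if len(snapshot) > len(rec) {
		return fmt.Errorf("snapshot has more disks (%d) than record (%d)",
			len(snapshot), len(rec))
	}
	for i, r := range snapshot {
		if rec[i] != r {
			return fmt.Errorf("role mismatch at index %d: snapshot %q vs record %q",
				i, r, rec[i])
		}
	}
	extra := rec[len(snapshot):]
	if len(extra) > 1 || (len(extra) == 1 && extra[0] != "cidata") {
		return fmt.Errorf("record extends snapshot by more than a trailing cidata")
	}
	return nil
}

func main() {
	fmt.Println(validateRoleSequence(
		[]string{"layer", "cow", "data"},
		[]string{"layer", "cow", "data", "cidata"},
	))
}
```

Running this before killing the live VM is what makes the check cheap insurance: a rejected sequence aborts the restore while the old VM is still serving.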

A cloud-init mounts: section lands in user-data so cloudimg+CH guests
auto-mount their data disks. Each spec is a YAML-quoted 6-tuple
[device, mount, fstype, options, freq, passno]; quoting goes through
the same yamlQuote function as the password field, so an imported
sidecar can't break out of the YAML scalar. Default options is
"defaults,nofail" so a missing or corrupt disk doesn't keep the guest
from booting.

CH generateCidata now accepts the storageConfigs slice so it can build
the mount list itself: Role==Data with non-empty MountPoint and a
known FSType (none → guest is responsible for mkfs+mount, skip) become
mounts: rows. Device path uses /dev/disk/by-id/virtio-<serial>, which
CH exposes via chDisk.Serial. FC has no serial field on fcDrive so this
path doesn't apply there — Firecracker data disks remain /dev/vdX-only.
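
A sketch of one such mounts: row, assuming the 6-tuple layout and by-id device path described above. The freq/passno values ("0", "2") are illustrative, and yamlQuote follows the standard single-quoted-YAML rule of doubling embedded quotes; only the "defaults,nofail" default and the /dev/disk/by-id/virtio-<serial> device path come from the commit text.

```go
package main

import (
	"fmt"
	"strings"
)

// yamlQuote renders s as a single-quoted YAML scalar: the only escape
// needed is doubling embedded single quotes, so a hostile string
// cannot break out of the scalar.
func yamlQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// mountRow renders one [device, mount, fstype, options, freq, passno]
// tuple for a data disk with the given serial and mount point.
func mountRow(serial, mount, fstype string) string {
	dev := "/dev/disk/by-id/virtio-" + serial
	cols := []string{dev, mount, fstype, "defaults,nofail", "0", "2"}
	quoted := make([]string, len(cols))
	for i, c := range cols {
		quoted[i] = yamlQuote(c)
	}
	return "  - [" + strings.Join(quoted, ", ") + "]"
}

func main() {
	fmt.Println("mounts:")
	fmt.Println(mountRow("db", "/mnt/db", "ext4"))
}
```

Quoting every column through the same escape path is what closes the imported-sidecar injection the commit mentions: a serial or mount point containing ' or YAML syntax stays inert inside the scalar.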

DirectIO becomes per-disk on CH: storageConfigToDisk and patchDisks both
prefer sc.DirectIO when non-nil, else fall back to !sc.RO && !noDirectIO.
A workload with mixed cache requirements (e.g. database + build cache on
one VM) can now opt each disk in or out independently. FC has no
equivalent knob, so configureVM emits a warning the first time DirectIO is
seen on a data disk and proceeds with IoEngine=Async.

ValidateStorageConfigs now runs at FinalizeCreate (catches buggy
producers before persisting), Start (catches DB corruption), and
Snapshot (catches drift before serialization). Clone and Restore
already validate via the snapshot preflight, so the matrix is now
complete.

cmd/vm/debug also warns when --data-disk is set: debug only prints the
launcher command, it never runs PrepareDataDisks, so silently dropping
the flag would mislead users into thinking the printed command would
materialize data disks.

- types/storage_test: ValidDataDiskName regex/length/reserved-prefix
  cases; ValidateStorageConfigs invariants per Role (RO mismatch,
  bad fstype, mount-with-fstype-none, non-absolute mount, DirectIO on
  non-data, nil entry, missing role).

- cmd/core/helpers_test: parseDataDiskSpec covers minimal/maximal
  flags, directio on/off/auto tristate, MountPointSet distinguishing
  mount=<empty> from omitted, plus the rejection cases (size below
  16MiB, reserved prefix, bad name, xfs, unknown key).
  normalizeDataDiskSpecs covers auto-name skip-already-used, default
  fstype/mount fill-in, fstype=none rejecting mount, explicit empty
  mount preservation, and duplicate-name rejection.

- metadata/metadata_test: mounts: section appears with Mounts non-empty,
  is omitted otherwise, and yamlQuote escapes single quotes (a hostile
  imported sidecar can't break out of the YAML scalar).

- clone_test: BuildStateReplacements no longer wipes the prefix when
  ensureCloneCidata leaves storageConfigs one longer than chCfg.Disks.
  RestorePatchStorageConfigs drops only the appended cidata, preserves
  trailing data disks, and is a no-op for direct-boot or
  snapshot-had-cidata cases.

- types: export FSTypeExt4 / FSTypeNone constants so cmd/core, hypervisor,
  and cloudhypervisor stop repeating the literals (goconst)
- hypervisor.utils: fold MinDataDiskSize into the existing const block
  per code-style "one block per file"
- shadow: rename inner err to vErr at every site where ValidateStorageConfigs
  shadowed an outer err (govet shadow)
- cloudhypervisor.restore: drop the now-unused prepareRestore helper;
  DirectRestore inlines resolve+preflight+kill
- gofumpt: trim a stray blank line in firecracker/clone.go

make lint clean on linux + darwin; tests pass.

/simplify pass:
- DataDiskBaseName centralizes the data-<serial>.raw filename, used by
  PrepareDataDisks, ValidateSnapshotIntegrity, BaseConfig.DataDiskPath,
  CH updateDataDiskPaths, and FC rebuildCloneStorage instead of each
  reconstructing the same string
- IsDataDiskFile replaces the inline strings.HasPrefix+HasSuffix in CH
  and FC cleanSnapshotFiles matchers
- ReflinkDataDisks consolidates the identical Role==Data reflink loop
  that lived in both CH snapshot.go and FC snapshot.go (8 lines × 2)
- PrepareOCICOW now calls createSparseFile, dropping the duplicate
  OpenFile+Truncate+Close ladder
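
The two filename helpers are simple enough to sketch directly; the data-<serial>.raw scheme comes from the commit text, while the function bodies are assumed:

```go
package main

import (
	"fmt"
	"strings"
)

// DataDiskBaseName is the single source of truth for the on-disk
// filename of a data disk with the given serial.
func DataDiskBaseName(serial string) string {
	return "data-" + serial + ".raw"
}

// IsDataDiskFile reports whether a basename matches the data disk
// naming scheme, replacing the inline prefix+suffix checks that the
// cleanSnapshotFiles matchers used to repeat.
func IsDataDiskFile(name string) bool {
	return strings.HasPrefix(name, "data-") && strings.HasSuffix(name, ".raw")
}

func main() {
	fmt.Println(DataDiskBaseName("db"), IsDataDiskFile("data-db.raw"))
}
```

Centralizing the string means a future rename touches one function instead of five call sites reconstructing it independently.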

Three findings from review, all confirmed real:

P1: CH/FC preflight let snapshots through even when CH state.json,
memory-range-*, FC vmstate, mem, COW or cidata files were missing.
killForRestore would already have run by the time vm.restore /
snapshot-load actually noticed, costing an outage on a malformed tar.

  - hypervisor.ValidateSnapshotIntegrity now stats every snapshot-resident
    file in the sidecar (Role==COW, Cidata, Data) by basename. Layers are
    shared blobs, not in srcDir, so they're skipped.
  - cloudhypervisor.validateSnapshotIntegrity adds state.json + at least
    one memory-range-* check on top of the common preflight.
  - firecracker.preflightRestore adds vmstate + mem checks on top of the
    common preflight (cow.raw is covered by the sidecar walk now).

P1: FC restore / DirectRestore only locked the COW path. With data disks,
a previous clone that crashed mid-redirect could leave a stale
data-<x>.raw.cocoon-clone-backup. Restore happily wrote a new
data-<x>.raw next to it, then the next clone or snapshot acquired that
disk's lock, ran recoverStaleBackup, and renamed the backup back over
the freshly restored data — silently corrupting it.

  - Both restore paths switch to
    withSourceWritableDisksLocked(rec.StorageConfigs, ...), which already
    runs recoverStaleBackup on every writable disk (COW + Data) in
    dictionary order. Stale backups get healed before the restore touches
    the disks, and held locks block concurrent clone/snapshot from racing
    the swap.

writeSnapshotMeta builds the sidecar by walking chCfg.Disks in order, so
sidecar[i] and chCfg.Disks[i] must agree on Path and ReadOnly. We already
validated count, structural fields, file presence, and role sequence, but
left the per-index identity unchecked.

A drifted sidecar (tampered, partial overwrite, snapshot built by an
older cocoon version) would have slipped through preflight; clone's
patchCHConfig would then have written each sidecar entry's path into the
wrong disk slot, leaving CH device state pointing at the wrong files.

Per-index Path + ReadOnly comparison closes the gap; cost is one short
loop. Role is not on chCfg, so it stays out of this check.
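
The "one short loop" can be sketched as follows, with simplified disk shapes; the real comparison runs inside the restore preflight against parsed config.json and sidecar structures:

```go
package main

import "fmt"

// chDisk mirrors the fields CH's config.json actually carries.
type chDisk struct {
	Path     string
	ReadOnly bool
}

// sidecarEntry mirrors the cocoon.json side; Role has no counterpart
// on chDisk, so it is deliberately excluded from the comparison.
type sidecarEntry struct {
	Path string
	RO   bool
	Role string
}

// checkSidecarIdentity enforces the per-index Path/ReadOnly agreement
// described above, on top of the length check.
func checkSidecarIdentity(sidecar []sidecarEntry, disks []chDisk) error {
	if len(sidecar) != len(disks) {
		return fmt.Errorf("sidecar has %d entries, config.json has %d disks",
			len(sidecar), len(disks))
	}
	for i := range sidecar {
		if sidecar[i].Path != disks[i].Path || sidecar[i].RO != disks[i].ReadOnly {
			return fmt.Errorf("sidecar[%d] disagrees with config.json disk %d", i, i)
		}
	}
	return nil
}

func main() {
	err := checkSidecarIdentity(
		[]sidecarEntry{{"cow.raw", false, "cow"}},
		[]chDisk{{"cow.raw", true}},
	)
	fmt.Println(err)
}
```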

Senior /code review pass found the preflight helpers
(ValidateSnapshotIntegrity, ValidateRoleSequence, DataDiskBaseName,
IsDataDiskFile) had no direct unit coverage despite being security-
critical: they're what gate restore against tampered or imported
sidecars. They're pure functions, so coverage is cheap.

Tests cover: file presence per Role, layer entries skipped (shared blob
not in srcDir), structural rejection delegated to ValidateStorageConfigs,
prefix-only role sequence with allowed trailing cidata, sidecar-longer-
than-rec rejection, and the data disk filename helpers.

Senior /code review (full repo walk, 178 files) flagged three places
where this branch had introduced or aggravated violations of the
"exported above unexported" rule from SKILL.md:

- cloudhypervisor/helper.go: ReverseLayerSerials sat between two private
  helpers (hasMemoryRangeFile and vmAPI). Lifted it to the top of the
  helper section, right after the var block.
- hypervisor/utils.go: createSparseFile and snapshotResidentBasename are
  private callees of PrepareDataDisks / PrepareOCICOW /
  ValidateSnapshotIntegrity. Moved both to the bottom of the file so the
  exported API stays contiguous.
- cmd/core/helpers.go: parseDataDiskFlags + parseDataDiskSpec +
  normalizeDataDiskSpecs sat between two exported VMConfig flag
  builders. Moved to the bottom of the private-helper region next to
  mergeResourceFlags.

The two remaining violations the audit found (hypervisor/backend.go's
SocketPath and firecracker/create.go's EnsureVmlinux) predate this
branch and are out of scope.

SKILL.md says "One single const block per file, placed at the top
(after imports)" — strictly read, that means before any type
declarations. The codebase had eight files with type-before-const; with
const moved up, the typed-constants pattern now forward-references the
type name, which Go permits at package scope. Fixed across the whole
repo, not just this branch:

- types/storage.go, types/vm.go
- hypervisor/utils.go, hypervisor/cloudhypervisor/helper.go
- cmd/images/handler.go
- progress/oci/oci.go, progress/cloudimg/cloudimg.go
- config/config.go

The two earlier-flagged baseline function-ordering complaints
(hypervisor/backend.go SocketPath, firecracker/create.go EnsureVmlinux)
were false positives from my first-pass scanner: it conflated methods
with standalone functions. SKILL.md applies "exported above unexported"
within each category, and both files transition cleanly from
method-section (ending with private method) to standalone-section
(starting with public function). No fix needed there; my v2 script
confirms zero violations after the const moves.

@CMGS changed the title from Feat/multi data disk to multi data disk support Apr 27, 2026
@CMGS CMGS merged commit e4e7462 into master Apr 27, 2026
4 checks passed