Symptom
After cocoon vm clone --on-demand <hibernate-import-snapshot> resumes a Windows 11 guest, the guest has zero network adapters visible to the OS. Networking is completely dead — no DHCP, no ARP, no traffic of any kind on the host TAP.
Reproduce
- Start a Windows 11 cocoon VM and wait for it to fully boot and for the agent to come online (DHCP'd, reachable).
- cocoon snapshot save --name <name> <vm-id> (hibernate)
- cocoon snapshot export <name> -o - | <push to OCI>
- Later: pull, then cocoon snapshot import --name <name>-hibernate-import
- cocoon vm clone --output json --name <vm-name> --cpu 4 --memory 4294967296 --network cocoon-dhcp --on-demand <name>-hibernate-import
What we observed
Pre-hibernate VM had:
- Guest MAC: a6:a1:d1:30:58:4e
- DHCP-assigned IP: 172.20.0.46
- Lease in /var/lib/cocoon/net/leases.json for that MAC
Post-wake (clone):
- cocoon vm list: state=running, first_booted=true
- vm.info (cloud-hypervisor API): state=Running, net[0].mac=6a:54:5c:d8:f7:e0 (a new MAC, regenerated)
- cloud-hypervisor process: alive, ~115% CPU (so guest is executing instructions)
- Guest memory: 2.9 GiB resident (snapshot fully restored)
- 30s tcpdump -nei tap<id>-0 inside the netns: 0 packets from guest, only one host-originated IPv6 router-solicit
- From host bridge cni0 (172.20.0.1): ping 172.20.0.46 → Destination Host Unreachable, ip neigh → FAILED
Then via cocoon vm console <id> SAC → cmd channel:
C:\Windows\System32>ipconfig /all
Windows IP Configuration
Host Name . . . . . . . . . . . . : COCOON-VM
Primary Dns Suffix . . . . . . . :
Node Type . . . . . . . . . . . . : Hybrid
IP Routing Enabled. . . . . . . . : No
WINS Proxy Enabled. . . . . . . . : No
C:\Windows\System32>arp -a
No ARP Entries Found.
C:\Windows\System32>ipconfig /renew
Windows IP Configuration
The operation failed as no adapter is in the state permissible for
this operation.
There is no "Ethernet adapter" section at all in ipconfig /all — Windows sees zero NICs. So this is not a stale-MAC / stale-DHCP-lease issue: the entire device is gone from the guest's view.
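For anyone scripting this repro, the "zero NICs" condition can be checked mechanically against captured console output. The following is a minimal sketch, not part of cocoon: the helper name countAdapterSections is hypothetical, it assumes English-locale ipconfig output where adapter sections start with an unindented header such as "Ethernet adapter Ethernet:", and the "healthy" sample is constructed for illustration only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countAdapterSections counts adapter section headers (for example
// "Ethernet adapter Ethernet:") in English-locale `ipconfig /all` output.
// Hypothetical helper for illustration; not part of cocoon.
func countAdapterSections(output string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		// Section headers are not indented; property lines are.
		if line == "" || line[0] == ' ' || line[0] == '\t' {
			continue
		}
		trimmed := strings.TrimRight(line, " \r")
		if strings.Contains(trimmed, " adapter ") && strings.HasSuffix(trimmed, ":") {
			n++
		}
	}
	return n
}

func main() {
	// Shape of the output captured above: global section only.
	broken := "Windows IP Configuration\n" +
		"   Host Name . . . . . . . . . . . . : COCOON-VM\n" +
		"   Node Type . . . . . . . . . . . . : Hybrid\n"
	// Constructed sample of what a guest with one NIC would add.
	healthy := broken + "\nEthernet adapter Ethernet:\n" +
		"   DHCP Enabled. . . . . . . . . . . : Yes\n"

	fmt.Println("broken adapters: ", countAdapterSections(broken))
	fmt.Println("healthy adapters:", countAdapterSections(healthy))
}
```

A count of 0 after wake distinguishes this failure from a stale-lease problem, where the adapter section would still be present.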
Root cause (code-level)
The resume sequence in hypervisor/cloudhypervisor/clone.go is:
// restoreAndResumeClone, lines ~162–197
restoreVM(ctx, hc, runDir, opts.onDemand) // vm.restore — VM paused, snapshot loaded
hotSwapNets(ctx, hc, opts.snapshotCfg.Nets, opts.networkConfigs) // <-- problem
...
resumeVM(ctx, hc) // vm.resume
hotSwapNets() (clone.go:333-353):
// Remove the snapshot-encoded NICs (with their original MACs)
for _, oldNet := range oldNets {
removeDeviceVM(ctx, hc, oldNet.ID) // virtio-net hot-removed
}
// Then attach freshly-allocated NICs with new MACs
for _, nc := range networkConfigs {
addNetVM(ctx, hc, networkConfigToNet(nc)) // virtio-net hot-added with new MAC
}
So between vm.restore and vm.resume:
- The original virtio-net device — the one Windows had bound at the time of hibernate, with PCI slot/BDF, MAC, queue config baked into its in-memory driver state — is hot-removed.
- A fresh virtio-net device with a freshly-generated MAC (no preservation; see below) is hot-added.
When vm.resume fires, Windows continues from the snapshot expecting its original NIC. From its perspective, that device is now gone. The newly-added device is a different PCI device that Windows would have to enumerate via PnP — but the snapshot was taken with the OS in a state where that PnP enumeration has already completed for the old device, and the OS doesn't naturally re-enumerate a hot-added virtio-net into a usable adapter on resume. Result: zero adapters.
Linux guests probably tolerate this better (kernel re-binds virtio drivers on hotplug), but for Windows it's effectively fatal.
Why the MAC is regenerated
Even if the same PCI slot were reused, the MAC is regenerated, which would also break DHCP-lease continuity:
- types/snapshot.go:7-16: SnapshotConfig only persists NICs int (count), not MAC/PCI/queue config. So at restore time we can't know what MAC the snapshot was taken with.
- cmd/vm/run.go:75-145: the clone path calls initNetwork(...) (line 504) without ever passing the existing parameter.
- network/cni/create.go:64-88: the Config() signature supports existing ...*types.NetworkConfig for MAC reuse, but the clone path doesn't use it.
- network/bridge/bridge_linux.go:201-206: generateMAC() produces a fresh random MAC via crypto/rand.
Suggested fix direction
The "remove old, add new" approach is fundamentally incompatible with snapshot restore for Windows guests. Two possible fixes:
1. Don't hot-swap at all on wake. Persist enough NIC config in the snapshot (SnapshotConfig.Nets: MAC, PCI slot, num_queues, etc.) that the wake path can reconstruct the exact same virtio-net topology before vm.restore. The TAP/MAC/queue config for the new clone must match the snapshot 1:1.
2. If host-side TAP must be recreated (e.g. node-bound MACs aren't portable across nodes), wire the existing ...*types.NetworkConfig capability already present in network/cni/create.go:64-88 through the clone flow so at least the same MAC is reused. This still won't fix the Windows hot-swap issue but would let Linux guests resume cleanly.
Option 1 is the only one that keeps Windows wakes working. Option 2 alone is insufficient.
Environment
- cocoon: current main (binary deployed on cocoonset-node-1, asia-southeast1-b GKE cocoonset cluster)
- cloud-hypervisor: bundled
- Guest: Windows 11 (cocoon simular/win11-hot:20260505-testing-1), DHCP via cocoon-dhcp
- Network plugin: CNI
Related