Flatten rootfs qcow2 before archiving (cross-worker portability) #132
Open
breardon2011 wants to merge 1 commit into main
The cached rootfs.qcow2 for a sandbox is a thin qcow2 overlay with its
backing file set to `/data/firecracker/images/default.ext4`. The base
ext4 image is rebuilt independently on each worker from the same
Dockerfile, but `mkfs.ext4` generates a random UUID per build, so the
raw byte content of default.ext4 differs between workers even when the
logical filesystem content is identical.
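One way to confirm the divergence is to hash the base image on each worker and compare the digests. The following is a minimal Python sketch, not code from this repo; the helper name `file_md5` is hypothetical, and the path is the one quoted above.

```python
import hashlib
from pathlib import Path


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading it in chunks so a
    multi-GB image never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Path taken from the description above; run this on each worker
    # and compare the printed digests by hand.
    image = Path("/data/firecracker/images/default.ext4")
    if image.exists():
        print(image, file_md5(str(image)))
    else:
        print(f"{image} not found on this host")
```

Differing digests on two workers only show that the raw bytes differ; they say nothing about whether the logical filesystem content matches.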
When a checkpoint or hibernation archive is uploaded from worker A and
downloaded on worker B, the qcow2's backing reference still points at
the local path. Worker B resolves unchanged clusters through ITS OWN
default.ext4, which has different bytes. The restored guest's ext4
metadata (captured in the memory snapshot at savevm time) references
specific cluster contents and checksums; the mismatch surfaces in the
guest as EBADMSG ("Bad message") on every file read — so `ip`,
`hostname`, dynamic linker lookups, etc. all fail and the fork is
effectively useless.
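The failure mode can be illustrated with a toy copy-on-write model. This is not how qcow2 is implemented; it is a simplified Python sketch in which an overlay stores only changed clusters and falls back to whatever base image the local host provides. All class and variable names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Overlay:
    """Toy overlay: holds only the clusters written since creation."""
    deltas: dict[int, bytes] = field(default_factory=dict)

    def write(self, index: int, data: bytes) -> None:
        self.deltas[index] = data

    def read(self, index: int, backing: list[bytes]) -> bytes:
        # Unchanged clusters are resolved through the backing image
        # that happens to be present on the current host.
        return self.deltas.get(index, backing[index])

    def flatten(self, backing: list[bytes]) -> Overlay:
        # Copy every cluster into the overlay so no read ever needs
        # the backing image again (the idea behind the fix).
        merged = {i: self.read(i, backing) for i in range(len(backing))}
        return Overlay(merged)


# Same logical content, but cluster 0 differs (e.g. a per-build UUID).
base_worker_a = [b"uuid-A", b"libc", b"busybox"]
base_worker_b = [b"uuid-B", b"libc", b"busybox"]

overlay = Overlay()
overlay.write(2, b"busybox+patch")  # the only guest write

# Thin overlay: cluster 0 depends on which worker reads it.
print(overlay.read(0, base_worker_a))  # b'uuid-A'
print(overlay.read(0, base_worker_b))  # b'uuid-B'

# Flattened on worker A: worker B now sees worker A's bytes.
portable = overlay.flatten(base_worker_a)
print(portable.read(0, base_worker_b))  # b'uuid-A'
```

In the real system the guest's in-memory ext4 state corresponds to worker A's bytes, so the second read above is the one that would surface as a checksum failure.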
Observed in prod (opencomputer-prod eastus2) with PR #128 deployed:
10-checkpoint fork test, 5/10 forks corrupt. Looking at the worker
distribution, every fork that landed on the SAME worker as the source
sandbox passed, and every fork that landed on the OTHER worker
corrupted. md5 of default.ext4 differed between the two workers.
Fix: before archiving, `qemu-img rebase -b "" <rootfs>` merges the
backing file's data into the overlay so the qcow2 is fully
self-contained and cross-worker portable. `rebase` preserves internal
savevm snapshots (unlike `convert`), which is required so loadvm on
the destination can still restore the "cp-<id>" snapshot.
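The flatten step and a post-check can be sketched as follows. This is an illustrative Python wrapper, not the code in this PR; the function names are hypothetical. It assumes `qemu-img` is on `PATH` and relies on the `backing-filename` and `snapshots` keys that `qemu-img info --output=json` reports.

```python
import json
import subprocess


def rebase_args(rootfs: str) -> list[str]:
    """argv for flattening: an empty -b value drops the backing
    reference after merging its data into the overlay."""
    return ["qemu-img", "rebase", "-b", "", rootfs]


def is_self_contained(info: dict) -> bool:
    """True if parsed `qemu-img info --output=json` shows no backing file."""
    return "backing-filename" not in info


def snapshot_names(info: dict) -> list[str]:
    """Names of internal snapshots listed in the parsed info output."""
    return [s["name"] for s in info.get("snapshots", [])]


def flatten_and_check(rootfs: str, snapshot: str) -> None:
    """Flatten rootfs in place, then verify the result. Requires qemu-img."""
    subprocess.run(rebase_args(rootfs), check=True)
    out = subprocess.run(
        ["qemu-img", "info", "--output=json", rootfs],
        check=True, capture_output=True, text=True,
    ).stdout
    info = json.loads(out)
    if not is_self_contained(info):
        raise RuntimeError(f"{rootfs} still has a backing file")
    if snapshot not in snapshot_names(info):
        raise RuntimeError(f"snapshot {snapshot!r} missing after rebase")
```

Checking for the snapshot name after the rebase is a cheap guard for the property the fix depends on: the `cp-<id>` snapshot must survive flattening.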
Applied to both archival paths:
- CreateCheckpoint: reflink-stage the archive files to a temp dir,
flatten rootfs there, tar+upload. Keeps the local-fork cacheDir
copy as a thin overlay (fast same-worker forks still work).
- doHibernate: flatten happens in-place in archiveDir before tar.
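The stage-then-flatten flow for the checkpoint path can be sketched like this. It is a simplified Python illustration under stated assumptions, not the repo's implementation: `build_archive` and its parameters are hypothetical, a plain copy stands in for the reflink staging, and the flatten step is passed in as a callable so the sketch does not depend on `qemu-img`.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path
from typing import Callable


def build_archive(
    cache_dir: Path,
    archive_path: Path,
    flatten: Callable[[Path], None],
    rootfs_name: str = "rootfs.qcow2",
) -> None:
    """Stage cache_dir into a temp dir, flatten only the staged rootfs,
    and tar the staged files. cache_dir itself is never modified."""
    with tempfile.TemporaryDirectory() as tmp:
        stage = Path(tmp) / "stage"
        # Plain copy here; the PR uses reflinks so staging stays cheap.
        shutil.copytree(cache_dir, stage)
        flatten(stage / rootfs_name)
        with tarfile.open(archive_path, "w") as tar:
            for item in sorted(stage.iterdir()):
                tar.add(item, arcname=item.name)
```

The hibernate path needs no staging in this model: calling `flatten` directly on the rootfs inside the archive directory before creating the tar corresponds to the in-place variant.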
Tradeoff: archive size grows from ~150MB to ~1.5GB because the base
ext4 content is now embedded. Necessary for cross-worker correctness;
optimization for size can come later via deterministic default.ext4
builds (fixed UUID + fixed hash_seed).
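A possible shape for the deterministic build is sketched below. It is an assumption-laden Python illustration, not something this PR implements: the helper name is hypothetical, and it assumes the installed e2fsprogs supports `-U <uuid>` and the `-E hash_seed=<uuid>` extended option.

```python
import uuid


def deterministic_mkfs_args(image: str, fs_uuid: str, hash_seed: str) -> list[str]:
    """argv for mkfs.ext4 with a pinned filesystem UUID and hash seed.

    Both values must parse as UUIDs; uuid.UUID raises ValueError otherwise.
    """
    fs = str(uuid.UUID(fs_uuid))
    seed = str(uuid.UUID(hash_seed))
    return ["mkfs.ext4", "-U", fs, "-E", f"hash_seed={seed}", image]
```

Pinning these two values may not be sufficient on its own. Other sources of variation, such as creation timestamps, could also need fixing before images come out byte-identical, and that is outside what this PR describes.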
Same approach used on the autoscaling-etc branch for
MigrateToS3Flatten.
Summary
After PR #128 merged, prod surfaced a residual cross-worker fork corruption: forks that land on a different worker from the one that created the checkpoint come up with EBADMSG on every file read. Same-worker forks work fine.
Root cause
The cached `rootfs.qcow2` for each sandbox is a thin qcow2 overlay with backing file = `/data/firecracker/images/default.ext4`. The base ext4 image is rebuilt independently on each worker from the same Dockerfile, but:
- `mkfs.ext4` generates a random UUID per build.
So the byte content of `default.ext4` differs between workers even when the logical filesystem content is identical. The qcow2 overlay only stores cluster deltas; unchanged clusters are resolved through the backing file. On a cross-worker download, the target worker resolves those unchanged clusters through ITS OWN `default.ext4` (different bytes), and the guest's restored ext4 metadata (captured in the memory snapshot at savevm time) fails checksum verification → EBADMSG.
Verified by inspection: md5 of `default.ext4` differs between `oc-worker-1` and `oc-worker-2`; the cross-worker forks that failed were the ones that had to download from S3 and resolve through a different backing.
Fix
Before archiving, `qemu-img rebase -b "" <rootfs>` merges backing-file content into the overlay so the qcow2 is self-contained. Unlike `qemu-img convert`, `rebase` preserves internal savevm snapshots, which is critical because `loadvm` on the destination needs the `cp-<id>` snapshot intact.
Applied to both archival paths:
- `CreateCheckpoint`: reflink-stage to a temp archive dir, flatten rootfs there, tar+upload. Leaves the local-fork `cacheDir` copy as a thin overlay so same-worker forks stay fast.
- `doHibernate`: flatten in-place in `archiveDir` before tar.
Same approach as the `autoscaling-etc` branch's `MigrateToS3Flatten`; that branch already solved this for live migration, but the fix never made it to main.
Tradeoff
Archive size grows from ~150MB to ~1.5GB because the base ext4 content is now embedded. Necessary for cross-worker correctness; size optimization can come later via deterministic `default.ext4` builds (fixed UUID + fixed `hash_seed`).
Test plan
- `default.ext4` across prod workers and inspecting the qcow2's `backing file` field.
- `scripts/integration-tests/02-fork-no-corruption.ts` against prod: all 10 forks should pass regardless of which worker they land on.
- `sdks/typescript/examples/test-secret-store-fork.ts` against prod: should return to 27/27.