
Flatten rootfs qcow2 before archiving (cross-worker portability)#132

Open
breardon2011 wants to merge 1 commit into main from fix/cross-worker-rootfs-backing


Summary

After PR #128 merged, prod surfaced a residual cross-worker fork corruption: forks that land on a different worker from the one that created the checkpoint come up with EBADMSG on every file read. Same-worker forks work fine.

Root cause

The cached rootfs.qcow2 for each sandbox is a thin qcow2 overlay whose backing file is /data/firecracker/images/default.ext4. The base ext4 image is rebuilt independently on each worker from the same Dockerfile, but:

  • mkfs.ext4 generates a random UUID per build.
  • The ext4 inode table layout can vary between builds.

So the raw byte content of default.ext4 differs between workers even when the logical filesystem content is identical. The qcow2 overlay stores only the clusters the guest has written; unchanged clusters are resolved through the backing file. On a cross-worker download, the target worker resolves those unchanged clusters through ITS OWN default.ext4 (different bytes), so the guest's restored ext4 metadata (captured in the memory snapshot at savevm time) no longer matches what is on disk, checksum verification fails, and reads return EBADMSG.

Verified by inspection: md5 of default.ext4 differs between oc-worker-1 and oc-worker-2; the cross-worker forks that failed were the ones that had to download from S3 and resolve through a different backing.
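
To illustrate the failure mode, here is a toy model of overlay reads, not qcow2 itself. The 4-byte cluster size, the byte strings, and the function names are all made up for the illustration; the only point is that an unallocated cluster returns whatever the local backing image holds, and that flattening removes that dependency.

```python
from typing import Dict

CLUSTER = 4  # toy cluster size in bytes (real qcow2 defaults to 64 KiB)


def read_cluster(overlay: Dict[int, bytes], backing: bytes, index: int) -> bytes:
    """Return cluster `index` from the overlay if allocated, else from the backing image."""
    if index in overlay:
        return overlay[index]
    start = index * CLUSTER
    return backing[start:start + CLUSTER]


def flatten(overlay: Dict[int, bytes], backing: bytes) -> Dict[int, bytes]:
    """Copy every cluster into the overlay so no read falls through to the backing image."""
    count = len(backing) // CLUSTER
    return {i: read_cluster(overlay, backing, i) for i in range(count)}


# Same logical content, different raw bytes (stand-in for a per-build random UUID).
backing_a = b"UUIDAAAAdata"
backing_b = b"UUIDBBBBdata"
overlay = {2: b"DATA"}  # the guest only ever wrote cluster 2

# Thin overlay: cluster 1 depends on which worker's backing image is used.
thin_on_a = read_cluster(overlay, backing_a, 1)
thin_on_b = read_cluster(overlay, backing_b, 1)

# Flattened on worker A: the same read on worker B no longer touches backing_b.
flat = flatten(overlay, backing_a)
flat_on_b = read_cluster(flat, backing_b, 1)
```

In this model `thin_on_a` and `thin_on_b` differ, while `flat_on_b` matches what worker A saw.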

Fix

Before archiving, qemu-img rebase -b "" <rootfs> merges the backing file's content into the overlay so the qcow2 is self-contained and no longer references default.ext4. Unlike qemu-img convert, rebase preserves internal savevm snapshots, which is critical because loadvm on the destination needs the cp-<id> snapshot intact.
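
A minimal sketch of the flatten step with a post-check, written in Python for illustration only (this is not the code in the PR). The helper names are hypothetical. It assumes `qemu-img` is on PATH and relies on the `backing-filename` and `snapshots` fields of `qemu-img info --output=json`.

```python
import json
import subprocess
from typing import List


def rebase_command(rootfs: str) -> List[str]:
    # An empty -b argument means "no backing file"; in the default (safe) mode
    # qemu-img copies the backing data into the image before dropping the link.
    return ["qemu-img", "rebase", "-b", "", rootfs]


def info_command(rootfs: str) -> List[str]:
    return ["qemu-img", "info", "--output=json", rootfs]


def is_portable(info: dict, snapshot_name: str) -> bool:
    """True if the image has no backing file and still carries the named internal snapshot."""
    no_backing = "backing-filename" not in info
    names = {snap.get("name") for snap in info.get("snapshots", [])}
    return no_backing and snapshot_name in names


def flatten_rootfs(rootfs: str, snapshot_name: str) -> None:
    """Flatten `rootfs` in place, then verify the result before it is archived."""
    subprocess.run(rebase_command(rootfs), check=True)
    result = subprocess.run(info_command(rootfs), check=True,
                            capture_output=True, text=True)
    if not is_portable(json.loads(result.stdout), snapshot_name):
        raise RuntimeError(f"{rootfs} is not self-contained or lost {snapshot_name}")
```

Checking the snapshot name after the rebase guards against archiving an image that loadvm could not restore on the destination.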

Applied to both archival paths:

  • CreateCheckpoint — reflink-stage to a temp archive dir, flatten rootfs there, tar+upload. Leaves the local-fork cacheDir copy as a thin overlay so same-worker forks stay fast.
  • doHibernate — flatten in-place in archiveDir before tar.
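
The CreateCheckpoint flow above (stage, flatten the staged copy, tar) can be sketched as follows. This is a Python illustration under assumptions, not the PR's implementation: `build_archive`, `reflink_copy`, and the `flatten` callback are hypothetical names, and the sketch assumes GNU `cp` for `--reflink=auto`, falling back to a plain copy elsewhere. Staging on the same filesystem as the cache (via `stage_parent`) is what lets a reflink avoid copying data.

```python
import shutil
import subprocess
import tarfile
import tempfile
from pathlib import Path
from typing import Callable, Optional


def reflink_copy(src: Path, dst: Path) -> None:
    """Copy with a reflink where supported; fall back to a regular copy otherwise."""
    try:
        subprocess.run(["cp", "--reflink=auto", str(src), str(dst)],
                       check=True, stderr=subprocess.DEVNULL)
    except (subprocess.CalledProcessError, FileNotFoundError):
        shutil.copy2(src, dst)


def build_archive(cache_dir: Path, archive_path: Path,
                  flatten: Callable[[Path], None],
                  stage_parent: Optional[Path] = None,
                  rootfs_name: str = "rootfs.qcow2") -> None:
    """Stage cache files, flatten only the staged rootfs, then tar the staging dir."""
    with tempfile.TemporaryDirectory(dir=stage_parent) as tmp:
        stage = Path(tmp)
        for src in cache_dir.iterdir():
            if src.is_file():
                reflink_copy(src, stage / src.name)
        # Only the staged copy is flattened; cache_dir keeps its thin overlay.
        flatten(stage / rootfs_name)
        with tarfile.open(archive_path, "w") as tar:
            for staged in sorted(stage.iterdir()):
                tar.add(staged, arcname=staged.name)
```

Passing the flatten step in as a callback keeps the staging logic independent of how the image is actually flattened.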

This is the same approach as MigrateToS3Flatten on the autoscaling-etc branch. That branch already solved this for live migration, but the fix never made it to main.

Tradeoff

Archive size grows from ~150 MB to ~1.5 GB (roughly 10x) because the base ext4 content is now embedded. This is necessary for cross-worker correctness; size optimization can come later via deterministic default.ext4 builds (fixed UUID + fixed hash_seed).
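
As a rough sketch of what the deterministic-build follow-up might look like, the helper below assembles an `mkfs.ext4` invocation with a pinned UUID and directory hash seed (the `-U` and `-E hash_seed=` options of mke2fs). The function name and the example UUIDs are hypothetical. Pinning these two values may not be sufficient on its own, since other inputs such as timestamps can also vary, so the resulting images would still need to be compared across workers.

```python
import uuid
from typing import List


def deterministic_mkfs_command(image: str, fs_uuid: str, hash_seed: str) -> List[str]:
    """Build an mkfs.ext4 command line with a fixed filesystem UUID and hash seed.

    Assumes `image` already exists at its final size. Raises ValueError if either
    value is not a valid UUID string.
    """
    fs_uuid = str(uuid.UUID(fs_uuid))      # validate and normalize
    hash_seed = str(uuid.UUID(hash_seed))  # hash_seed is also given as a UUID
    return ["mkfs.ext4", "-U", fs_uuid, "-E", f"hash_seed={hash_seed}", image]


# Hypothetical fixed values that would be checked in alongside the Dockerfile.
cmd = deterministic_mkfs_command(
    "default.ext4",
    "11111111-2222-3333-4444-555555555555",
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
)
```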

Test plan

  • Verified root cause by comparing md5 of default.ext4 across prod workers and inspecting the qcow2's backing file field.
  • After merge + deploy, re-run scripts/integration-tests/02-fork-no-corruption.ts against prod — all 10 forks should pass regardless of which worker they land on.
  • Re-run sdks/typescript/examples/test-secret-store-fork.ts against prod — should return to 27/27.
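
The md5 comparison in the first item can be reproduced with a small helper like the one below. It is a sketch only: the function names are hypothetical, and the digests are assumed to be collected from each worker by whatever means is available, then compared in one place.

```python
import hashlib
from pathlib import Path
from typing import Dict


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-GB image is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def workers_disagree(digests: Dict[str, str]) -> bool:
    """True if at least two workers report different digests for the base image."""
    return len(set(digests.values())) > 1
```

For example, calling `workers_disagree` with one digest per worker (keyed by worker name) returns True as soon as any two workers hold different bytes for default.ext4.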

The cached rootfs.qcow2 for a sandbox is a thin qcow2 overlay with its
backing file set to `/data/firecracker/images/default.ext4`. The base
ext4 image is rebuilt independently on each worker from the same
Dockerfile, but `mkfs.ext4` generates a random UUID per build, so the
raw byte content of default.ext4 differs between workers even when the
logical filesystem content is identical.

When a checkpoint or hibernation archive is uploaded from worker A and
downloaded on worker B, the qcow2's backing reference still points at
the local path. Worker B resolves unchanged clusters through ITS OWN
default.ext4, which has different bytes. The restored guest's ext4
metadata (captured in the memory snapshot at savevm time) references
specific cluster contents and checksums; the mismatch surfaces in the
guest as EBADMSG ("Bad message") on every file read — so `ip`,
`hostname`, dynamic linker lookups, etc. all fail and the fork is
effectively useless.

Observed in prod (opencomputer-prod eastus2) with PR #128 deployed:
10-checkpoint fork test, 5/10 forks corrupt. Looking at the worker
distribution, every fork that landed on the SAME worker as the source
sandbox passed, and every fork that landed on the OTHER worker
corrupted. md5 of default.ext4 differed between the two workers.

Fix: before archiving, `qemu-img rebase -b "" <rootfs>` merges the
backing file's data into the overlay so the qcow2 is fully
self-contained and cross-worker portable. `rebase` preserves internal
savevm snapshots (unlike `convert`), which is required so loadvm on
the destination can still restore the "cp-<id>" snapshot.

Applied to both archival paths:
- CreateCheckpoint: reflink-stage the archive files to a temp dir,
  flatten rootfs there, tar+upload. Keeps the local-fork cacheDir
  copy as a thin overlay (fast same-worker forks still work).
- doHibernate: flatten happens in-place in archiveDir before tar.

Tradeoff: archive size grows from ~150MB to ~1.5GB because the base
ext4 content is now embedded. Necessary for cross-worker correctness;
optimization for size can come later via deterministic default.ext4
builds (fixed UUID + fixed hash_seed).

Same approach used on the autoscaling-etc branch for
MigrateToS3Flatten.
vercel bot commented Apr 14, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| opensandbox | Ready | Preview, Comment | Apr 14, 2026 4:54am |
