feat(engine): sparse delta compression for disk-based weight updates #1125
Open
Description
Summary
Add sparse delta encoding and compression for the type="disk" weight update path, reducing checkpoint transfer volume by ~50-100× for RL training.
Motivation
Between consecutive RL training steps, >98% of bf16 parameters remain bit-identical. The xccl path now skips unchanged parameters using hash-based detection. However, the disk path (_update_weights_from_disk) still writes and reads the full model checkpoint every time.
For cross-region and decentralized RL setups where weight sync happens through shared object storage (S3/GCS), the disk path is the primary transfer mechanism. Sending only the changed elements instead of the full checkpoint would drastically reduce transfer time and storage cost.
Proposed Approach
- Detect changed elements: After the optimizer step, compare the current weights against the previous weights element-wise in bf16. Typically only ~1-2% of elements change.
- Sparse encode: For each parameter, store only (indices, values) of changed elements instead of the full tensor.
- Compress: Apply lossless compression (e.g., zstd) to the sparse representation. Index sorting + delta encoding makes the index stream highly compressible.
- Checkpoint chain: Periodically write full "anchor" checkpoints (every N steps). Between anchors, write only sparse deltas. This bounds reconstruction cost.
- Reconstruct: Inference workers download base + chain of deltas, apply sequentially, verify per-tensor checksums for bit-identical reconstruction.
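
The detect, sparse-encode, compress, and reconstruct steps above could look roughly like the following for a single flattened parameter. This is a minimal sketch, not the proposed implementation: the names `encode_delta` and `apply_delta` and the byte layout are hypothetical, bf16 values are modeled as raw 16-bit patterns in an `array("H")`, and the standard-library `zlib` stands in for zstd so the example has no third-party dependencies.

```python
import struct
import zlib
from array import array


def encode_delta(prev: array, curr: array) -> bytes:
    """Encode the elements of `curr` that differ from `prev` as a compressed blob.

    Both arrays hold raw bf16 bit patterns as unsigned 16-bit integers.
    Hypothetical layout: [n_elements, n_changed, index gaps..., values...].
    """
    if len(prev) != len(curr):
        raise ValueError("prev and curr must have the same length")
    # Bitwise comparison: an element counts as changed only if its bits differ.
    idx = [i for i, (a, b) in enumerate(zip(prev, curr)) if a != b]
    # Indices are already sorted, so store gaps between neighbours
    # (delta encoding); small repeated gaps compress well.
    gaps = [j - i for i, j in zip([0] + idx[:-1], idx)]
    vals = [curr[i] for i in idx]
    payload = (
        struct.pack("<II", len(curr), len(idx))
        + struct.pack(f"<{len(idx)}I", *gaps)
        + struct.pack(f"<{len(idx)}H", *vals)
    )
    # zlib is a stand-in here; the proposal names zstd as an example codec.
    return zlib.compress(payload)


def apply_delta(prev: array, blob: bytes) -> array:
    """Rebuild the current tensor from the previous one plus a delta blob."""
    raw = zlib.decompress(blob)
    n, k = struct.unpack_from("<II", raw, 0)
    if n != len(prev):
        raise ValueError("delta was built for a tensor of a different size")
    gaps = struct.unpack_from(f"<{k}I", raw, 8)
    vals = struct.unpack_from(f"<{k}H", raw, 8 + 4 * k)
    out = array("H", prev)  # copy, so the caller's base tensor is untouched
    pos = 0
    for gap, val in zip(gaps, vals):
        pos += gap  # undo the delta encoding of the indices
        out[pos] = val
    return out


# Usage: three of 1000 elements change between steps.
prev = array("H", [0x3F80] * 1000)  # 0x3F80 is 1.0 in bf16
curr = array("H", prev)
curr[3], curr[500], curr[999] = 0x3F81, 0x0000, 0x4000
blob = encode_delta(prev, curr)
restored = apply_delta(prev, blob)
```

Because values are copied as bit patterns and never recomputed, the round trip is bit-identical by construction. A real implementation would operate on tensors rather than Python arrays and would need to pick an index width that covers the largest parameter.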
Key properties
- Lossless: Bit-identical reconstruction guaranteed (no floating-point drift)
- Bounded chain: Full anchor every N steps prevents unbounded delta accumulation
- Integrity: Per-tensor checksum verification after reconstruction
- Independent of xccl path: This is a separate optimization for the disk-based weight update flow
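
The bounded-chain and integrity properties can be illustrated with a small sketch. Everything here is an assumption for illustration: the helper names `chain_for` and `reconstruct`, the rule that anchors fall on steps divisible by N, the use of SHA-256 as the checksum, and the simplified delta representation (a mapping from byte offset to new byte value over one tensor's raw bytes).

```python
import hashlib


def chain_for(step: int, anchor_every: int) -> tuple[int, list[int]]:
    """Return (anchor_step, delta_steps) needed to reconstruct `step`.

    Assumes anchors are written at steps divisible by `anchor_every`,
    so at most anchor_every - 1 deltas ever need to be applied.
    """
    anchor = step - step % anchor_every
    return anchor, list(range(anchor + 1, step + 1))


def reconstruct(anchor: bytes, deltas: list[dict[int, int]], expected_sha256: str) -> bytes:
    """Apply deltas in order to the anchor bytes, then verify the checksum."""
    buf = bytearray(anchor)
    for delta in deltas:  # order matters: later deltas overwrite earlier ones
        for offset, value in delta.items():
            buf[offset] = value
    digest = hashlib.sha256(buf).hexdigest()
    if digest != expected_sha256:
        raise ValueError("checksum mismatch: reconstruction is not bit-identical")
    return bytes(buf)


# Usage: with N = 4, step 10 needs the anchor at step 8 plus deltas 9 and 10.
anchor_step, delta_steps = chain_for(10, anchor_every=4)

anchor_bytes = bytes(8)
deltas = [{1: 0x80}, {1: 0x81, 6: 0x3F}]
target = bytes([0, 0x81, 0, 0, 0, 0, 0x3F, 0])
result = reconstruct(anchor_bytes, deltas, hashlib.sha256(target).hexdigest())
```

In this sketch the checksum would be computed by the trainer from the true weights and shipped alongside each delta, so a missing or misordered delta in the chain is detected on the inference side rather than silently producing wrong weights.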
Files to modify
- areal/engine/fsdp_engine.py: _update_weights_from_disk(), _save_model_to_hf()
- areal/experimental/engine/archon_weight_sync.py: update_weights_from_disk()
- areal/experimental/engine/archon_checkpoint.py: save_model_to_hf()
- areal/utils/sparse_checkpoint.py (new): sparse encoding/decoding/compression utilities