feat(engine): sparse delta compression for disk-based weight updates #1125

@rchardx

Summary

Add sparse delta encoding and compression for the type="disk" weight update path, reducing checkpoint transfer volume by ~50-100× for RL training.

Motivation

Between consecutive RL training steps, >98% of bf16 parameters remain bit-identical. The xccl path now skips unchanged parameters using hash-based detection. However, the disk path (_update_weights_from_disk) still writes and reads the full model checkpoint every time.

For cross-region and decentralized RL setups where weight sync happens through shared object storage (S3/GCS), the disk path is the primary transfer mechanism. Sending only the changed elements instead of the full checkpoint would drastically reduce transfer time and storage cost.

Proposed Approach

  1. Detect changed elements: After the optimizer step, compare the current weights with the previous weights element-wise in bf16. Typically only ~1-2% of elements change.
  2. Sparse encode: For each parameter, store only the (indices, values) of the changed elements instead of the full tensor.
  3. Compress: Apply lossless compression (e.g., zstd) to the sparse representation. Sorting the indices and delta-encoding them makes the index stream highly compressible.
  4. Checkpoint chain: Periodically write full "anchor" checkpoints (every N steps). Between anchors, write only sparse deltas. This bounds reconstruction cost to at most one anchor plus the deltas written since it.
  5. Reconstruct: Inference workers download the base anchor plus the chain of deltas, apply the deltas sequentially, and verify per-tensor checksums to confirm bit-identical reconstruction.
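
Steps 1-3 can be sketched for a single tensor as follows. This is a minimal illustration, not the proposed implementation: it uses only the Python standard library, represents a bf16 tensor as an `array` of raw 16-bit patterns, and substitutes `zlib` for zstd so that it runs without extra dependencies. The function names `encode_sparse_delta` and `apply_sparse_delta` and the byte layout are assumptions made for this example.

```python
import struct
import zlib
from array import array


def encode_sparse_delta(prev: array, curr: array) -> bytes:
    """Encode the elements of `curr` that differ from `prev` as a compressed blob.

    Both inputs hold raw bf16 bit patterns as uint16 ('H') values.
    The layout (header, index gaps, values) is hypothetical.
    """
    if len(prev) != len(curr):
        raise ValueError("tensors must have the same number of elements")
    # Step 1: exact element-wise comparison on the raw bits (no tolerance).
    changed = [i for i, (a, b) in enumerate(zip(prev, curr)) if a != b]
    # Step 2: keep only (indices, values). The indices are already sorted,
    # so store the gap to the previous index instead of the index itself.
    gaps = [i - j for i, j in zip(changed, [0] + changed[:-1])]
    values = [curr[i] for i in changed]
    # Step 3: lossless compression (zlib here as a stand-in for zstd).
    # Each gap is packed as uint32, so this sketch assumes gaps below 2**32.
    payload = (
        struct.pack("<QQ", len(curr), len(changed))
        + struct.pack(f"<{len(gaps)}I", *gaps)
        + struct.pack(f"<{len(values)}H", *values)
    )
    return zlib.compress(payload, 9)


def apply_sparse_delta(prev: array, blob: bytes) -> array:
    """Rebuild the current tensor from the previous one plus a delta blob."""
    raw = zlib.decompress(blob)
    numel, count = struct.unpack_from("<QQ", raw, 0)
    if numel != len(prev):
        raise ValueError("delta was encoded for a tensor of a different size")
    gaps = struct.unpack_from(f"<{count}I", raw, 16)
    values = struct.unpack_from(f"<{count}H", raw, 16 + 4 * count)
    out = array("H", prev)  # copy, so the previous tensor is left untouched
    index = 0
    for gap, value in zip(gaps, values):
        index += gap  # undo the gap encoding by accumulating
        out[index] = value
    return out


# Usage: three of 1000 elements change between two steps.
prev = array("H", [0x3F80] * 1000)  # 0x3F80 is 1.0 in bf16
curr = array("H", prev)
curr[3], curr[500], curr[999] = 0x3F81, 0x4000, 0x0000
blob = encode_sparse_delta(prev, curr)
restored = apply_sparse_delta(prev, blob)
assert restored == curr
```

Storing gaps rather than absolute indices keeps the index stream made of small, repetitive integers, which is what general-purpose compressors handle well. A production version would more likely operate on tensors directly and choose the index width per parameter.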

Key properties

  • Lossless: Bit-identical reconstruction guaranteed (no floating-point drift)
  • Bounded chain: Full anchor every N steps prevents unbounded delta accumulation
  • Integrity: Per-tensor checksum verification after reconstruction
  • Independent of xccl path: This is a separate optimization for the disk-based weight update flow
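
The bounded-chain and integrity properties can be illustrated with a small in-memory sketch. It is an assumption-laden example, not the proposed on-disk format: the `store` dictionary stands in for object storage, `ANCHOR_EVERY` is a hypothetical value for N, deltas are plain `{index: value}` dictionaries without compression, and SHA-256 is one possible choice of per-tensor checksum.

```python
import hashlib
from array import array

ANCHOR_EVERY = 4  # hypothetical N: write a full anchor every 4 steps


def checksum(tensor: array) -> str:
    # Hashes native-endian bytes; a real format would fix the byte order.
    return hashlib.sha256(tensor.tobytes()).hexdigest()


def write_step(store: dict, step: int, prev, curr: dict) -> None:
    """Write a full anchor every ANCHOR_EVERY steps, otherwise a sparse delta."""
    if prev is None or step % ANCHOR_EVERY == 0:
        tensors = {name: array("H", t) for name, t in curr.items()}
        store[step] = {"kind": "anchor", "tensors": tensors}
        return
    delta = {
        name: {i: b for i, (a, b) in enumerate(zip(prev[name], t)) if a != b}
        for name, t in curr.items()
    }
    sums = {name: checksum(t) for name, t in curr.items()}
    store[step] = {"kind": "delta", "delta": delta, "checksums": sums}


def reconstruct(store: dict, step: int) -> dict:
    """Load the nearest anchor at or before `step`, then replay the deltas."""
    anchor = step
    while store[anchor]["kind"] != "anchor":
        anchor -= 1
    tensors = {n: array("H", t) for n, t in store[anchor]["tensors"].items()}
    for s in range(anchor + 1, step + 1):
        entry = store[s]
        for name, changes in entry["delta"].items():
            for index, value in changes.items():
                tensors[name][index] = value
            # Verify each tensor after every delta so corruption is caught early.
            if checksum(tensors[name]) != entry["checksums"][name]:
                raise ValueError(f"checksum mismatch for {name!r} at step {s}")
    return tensors


# Usage: six training steps, each changing one element of one tensor.
store, history, prev = {}, [], None
weights = {"w": array("H", [0x3F80] * 8)}
for step in range(6):
    curr = {"w": array("H", weights["w"])}
    curr["w"][step] = 0x4000 + step
    write_step(store, step, prev, curr)
    history.append(curr)
    prev = weights = curr

assert reconstruct(store, 3) == history[3]  # anchor 0 plus deltas 1, 2, 3
assert reconstruct(store, 5) == history[5]  # anchor 4 plus delta 5
```

With this layout a worker never replays more than `ANCHOR_EVERY - 1` deltas, and a corrupted or missing delta surfaces as a checksum failure at the step where it occurs rather than as silently wrong weights.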

Files to modify

  • areal/engine/fsdp_engine.py: _update_weights_from_disk(), _save_model_to_hf()
  • areal/experimental/engine/archon_weight_sync.py: update_weights_from_disk()
  • areal/experimental/engine/archon_checkpoint.py: save_model_to_hf()
  • New: areal/utils/sparse_checkpoint.py — Sparse encoding/decoding/compression utilities
