
fix: clone async checkpoint tensors to CPU to prevent GPU OOM #21631

Open
karthik-idikuda wants to merge 4 commits into Lightning-AI:master from karthik-idikuda:fix/async-checkpoint-clone-to-cpu

Conversation

karthik-idikuda commented Mar 31, 2026

Summary

Clone async checkpoint tensors to CPU to prevent GPU OOM during async saves.

Details

AsyncCheckpointIO._clone_tensor() previously called t.detach().clone(), which allocates new GPU memory for each cloned tensor. For large model checkpoints (e.g., 15GB+), this doubles GPU memory usage during checkpoint saves, which can cause OOM errors.

Before (GPU clone):

[ASYNC CHECKPOINT BEFORE clone] GPU 0: allocated=21.54 GB
[ASYNC CHECKPOINT AFTER clone]  GPU 0: allocated=37.54 GB  ← +16 GB!

After (CPU clone):

[ASYNC CHECKPOINT BEFORE clone] GPU 0: allocated=21.54 GB
[ASYNC CHECKPOINT AFTER clone]  GPU 0: allocated=21.54 GB  ← no change

Changes

  • async_plugin.py: Changed t.detach().clone() to t.detach().cpu().clone() in _clone_tensor(). Moving the tensor to CPU before cloning avoids doubling GPU memory; CPU memory is typically abundant, and the clone still prevents the race condition with in-place parameter mutation.
  • test_async_checkpoint.py: Added test_async_checkpoint_clones_tensors_to_cpu() to verify cloned tensors are on CPU and retain correct values.

Fixes #21630


📚 Documentation preview 📚: https://pytorch-lightning--21631.org.readthedocs.build/en/21631/

github-actions bot added the pl (Generic label for PyTorch Lightning package) label on Mar 31, 2026
karthik-idikuda force-pushed the fix/async-checkpoint-clone-to-cpu branch from 71adffe to 95d6ff2 on March 31, 2026 10:33
Move tensor cloning to CPU in AsyncCheckpointIO._clone_tensor() to prevent
doubling GPU memory usage during async checkpoint saves.

Previously, _clone_tensor() called t.detach().clone() which allocates new GPU
memory for each cloned tensor. For large model checkpoints (e.g., 15GB+), this
can cause GPU OOM errors since the entire checkpoint is temporarily duplicated
in GPU memory.

The fix changes the operation to t.detach().cpu().clone(), which moves tensors
to CPU before cloning. CPU memory is typically abundant and this achieves the
same race-condition prevention without the GPU memory overhead.

Fixes Lightning-AI#21630
karthik-idikuda force-pushed the fix/async-checkpoint-clone-to-cpu branch from 95d6ff2 to 865d600 on March 31, 2026 10:55
codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (612ab08) to head (019a502).
⚠️ Report is 4 commits behind head on master.
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (612ab08) and HEAD (019a502).

HEAD has 920 fewer uploads than BASE:

Flag               BASE (612ab08)   HEAD (019a502)
cpu                251              42
lightning_fabric   80               0
pytest             125              0
python3.12         72               12
python             18               3
lightning          90               15
python3.11         36               6
python3.13         53               9
python3.12.7       54               9
python3.10         18               3
pytorch_lightning  81               27
pytorch2.7         9                3
pytest-full        126              42
pytorch2.1         18               6
pytorch2.4.1       9                3
pytorch2.5.1       9                3
pytorch2.2.2       9                3
pytorch2.9         18               6
pytorch2.10        18               6
pytorch2.8         18               6
pytorch2.3         9                3
pytorch2.6         9                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21631     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23898    23877     -21     
=========================================
- Hits        20678    18799   -1879     
- Misses       3220     5078   +1858     

karthik-idikuda and others added 2 commits April 1, 2026 07:03
For CUDA tensors, cpu() already allocates a new host-memory copy, so an
additional clone() is unnecessary and wastes memory bandwidth. For CPU
tensors cpu() is a no-op, so clone() remains necessary.

Co-authored-by: TheGreatFrankie
justusschock (Member) commented Apr 1, 2026

I am not sure we want this to change. The reason being that a transfer from GPU to CPU is a synchronization point and the whole point of async checkpointing is to avoid those. If in doubt, you can still use synchronous checkpointing.

@TheGreatFrankie

> I am not sure we want this to change. The reason being that a transfer from GPU to CPU is a synchronization point and the whole point of async checkpointing is to avoid those. If in doubt, you can still use synchronous checkpointing.

🤔 But the original GPU-to-GPU clone is also a synchronization point, so I don't fully understand your point.
If a GPU-to-GPU clone is a synchronization point and a GPU-to-CPU copy is also one, why not clone to CPU and save GPU memory?


Labels

pl Generic label for PyTorch Lightning package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AsyncCheckpointIO Should Clone() to CPU not on GPU

4 participants