Commit 71adffe
Nytrynox
fix: clone async checkpoint tensors to CPU to prevent GPU OOM
Move tensor cloning to CPU in AsyncCheckpointIO._clone_tensor() to prevent
doubling GPU memory usage during async checkpoint saves.
Previously, _clone_tensor() called t.detach().clone(), which allocates new GPU
memory for every cloned tensor. For large model checkpoints (e.g., 15 GB or more),
this can cause GPU OOM errors because the entire checkpoint is temporarily
duplicated in GPU memory.
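The duplication can be put in rough numbers. The sketch below is a simplified model, not a measurement: `peak_gpu_bytes` is a hypothetical helper, and the 24 GiB device size and 18 GiB resident figure are assumed values chosen only for illustration.

```python
GIB = 1024**3


def peak_gpu_bytes(resident: int, checkpoint: int, clone_on_gpu: bool) -> int:
    """Estimate peak GPU usage while an async checkpoint snapshot exists.

    resident: bytes already allocated on the GPU, including the tensors
        that are about to be checkpointed.
    checkpoint: bytes of tensor data in the checkpoint.
    clone_on_gpu: True models t.detach().clone() on GPU tensors,
        False models t.detach().cpu().clone().
    """
    # A GPU-side clone adds a second full copy of the checkpoint on the device;
    # a CPU-side clone adds nothing to GPU usage in this simplified model.
    snapshot_on_gpu = checkpoint if clone_on_gpu else 0
    return resident + snapshot_on_gpu


# Assumed figures, for illustration only.
device_capacity = 24 * GIB
resident = 18 * GIB
checkpoint = 15 * GIB

for clone_on_gpu in (True, False):
    peak = peak_gpu_bytes(resident, checkpoint, clone_on_gpu)
    fits = peak <= device_capacity
    print(f"clone_on_gpu={clone_on_gpu}: peak={peak / GIB:.0f} GiB, fits={fits}")
```

Under these assumed numbers the GPU-side clone would need 33 GiB on a 24 GiB device, while the CPU-side clone leaves GPU usage at 18 GiB.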
The fix changes the operation to t.detach().cpu().clone(), which moves each tensor
to CPU before cloning. Host memory is typically more plentiful than GPU memory, and
the CPU copy is still independent of the live tensor, so it provides the same
race-condition protection without the GPU memory overhead.
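A minimal sketch of the pattern described above, applied to a nested checkpoint structure. It is not the Lightning implementation, which this page does not show: the names `snapshot_to_cpu` and `_looks_like_tensor` are hypothetical, and the duck-typed tensor check is an assumption made so the block runs without importing PyTorch. In real code the check would more likely be `isinstance(obj, torch.Tensor)`.

```python
from typing import Any


def _looks_like_tensor(obj: Any) -> bool:
    # Duck-typed stand-in for isinstance(obj, torch.Tensor) (assumption for this sketch).
    return all(callable(getattr(obj, name, None)) for name in ("detach", "cpu", "clone"))


def snapshot_to_cpu(obj: Any) -> Any:
    """Return a copy of a checkpoint-like structure whose tensors live in CPU memory."""
    if _looks_like_tensor(obj):
        # Same chain as in the commit message: the copy is made in CPU memory,
        # so no second copy of the tensor is allocated on the GPU.
        return obj.detach().cpu().clone()
    if isinstance(obj, dict):
        return {key: snapshot_to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [snapshot_to_cpu(value) for value in obj]
    if isinstance(obj, tuple):
        return tuple(snapshot_to_cpu(value) for value in obj)
    # Non-tensor leaves (ints, strings, ...) are returned unchanged.
    return obj
```

`Tensor.cpu()` returns the original object when the tensor is already in CPU memory, so the trailing `.clone()` is what guarantees an independent copy in that case. For a GPU tensor, `.cpu()` already produces a new CPU tensor and the extra clone costs only host memory.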
Fixes #21630
1 parent 612ab08 commit 71adffe
File tree
2 files changed, +21 -3 lines changed:
- src/lightning/pytorch/plugins/io
- tests/tests_pytorch/plugins
Diff summary:
- src/lightning/pytorch/plugins/io: original lines 98-100 replaced by new lines 98-102 (+5 -3)
- tests/tests_pytorch/plugins: new lines 54-69 added (+16)