
Add broadcast_sigterm_every_n_steps to reduce SIGTERM broadcast overhead #21640

Open
c-pozzi wants to merge 7 commits into Lightning-AI:master from c-pozzi:fix/sigterm-broadcast-interval-21487

Conversation


@c-pozzi c-pozzi commented Apr 4, 2026

Summary

Adds a broadcast_sigterm_every_n_steps parameter to the Trainer that controls how often SIGTERM status is broadcast across ranks in distributed training. Default is 1 (every step), preserving current behavior.

The NCCL broadcast in _broadcast_sigterm_tensor (#20825) adds a fixed-cost CPU-GPU sync every step. For fast training loops this overhead is significant relative to step time. Benchmark details and measurements in #21487.

Tradeoff

Worst-case SIGTERM detection delay is (N-1) × step_time. This is safe because SIGTERM grace periods are 30–120s, and users who benefit most (fast loops) have the smallest absolute delay (e.g., N=10, step=0.5ms → 4.5ms).
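The worst-case figure can be verified with one line of arithmetic (illustrative helper, not part of the PR):

```python
# Illustrative check of the worst-case detection delay; not part of the PR.
def worst_case_delay_ms(n: int, step_time_ms: float) -> float:
    # A SIGTERM arriving just after a broadcast waits up to N-1 further steps.
    return (n - 1) * step_time_ms

assert worst_case_delay_ms(10, 0.5) == 4.5  # the example from the text
assert worst_case_delay_ms(1, 0.5) == 0.0   # N=1: no added delay
```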

Design

The parameter lives on Trainer following the existing every_n pattern (log_every_n_steps, check_val_every_n_epoch, reload_dataloaders_every_n_epochs) rather than Strategy, which holds communication mechanics, not loop frequency policy.

```python
trainer = Trainer(broadcast_sigterm_every_n_steps=10)
```
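For illustration, the step-gating this parameter implies can be sketched as a small helper; the function name and signature here are hypothetical, not the actual Lightning internals:

```python
# Hypothetical sketch of the gating logic implied by the parameter;
# `should_broadcast_sigterm` is an illustrative name, not Lightning code.
def should_broadcast_sigterm(global_step: int, every_n_steps: int = 1) -> bool:
    """Return True when SIGTERM status should be broadcast on this step.

    The default every_n_steps=1 reproduces the existing every-step behavior.
    """
    if every_n_steps < 1:
        raise ValueError("broadcast_sigterm_every_n_steps must be >= 1")
    return global_step % every_n_steps == 0

# With N=5, only every fifth step pays the NCCL broadcast cost.
assert [s for s in range(10) if should_broadcast_sigterm(s, 5)] == [0, 5]
```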

Test plan

  • Validation rejects values < 1
  • Default value is 1
  • Broadcast interval logic correct for N=1, 5, 10

Closes #21487


📚 Documentation preview 📚: https://pytorch-lightning--21640.org.readthedocs.build/en/21640/

Allow users to control how often SIGTERM status is broadcast across
ranks, reducing CPU-GPU sync overhead for fast training loops while
preserving the default every-step behavior.

Closes Lightning-AI#21487
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Apr 4, 2026
pre-commit-ci bot and others added 3 commits April 4, 2026 11:14
When broadcast_sigterm_every_n_steps > 1, SIGTERM could arrive between
broadcasts near the end of an epoch. Without a forced check, rank 0
would exit while other ranks hang waiting at the next collective
(e.g. validation barrier). This adds a forced broadcast whenever the
epoch ends or validation is about to start.
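The fix described above can be sketched in plain Python; the function and its keyword flags are illustrative names, not the code in this PR:

```python
# Illustrative sketch of the epoch-boundary flush; names are hypothetical.
def needs_sigterm_broadcast(
    step: int,
    every_n_steps: int,
    *,
    epoch_end: bool = False,
    starting_validation: bool = False,
) -> bool:
    # Force the broadcast at loop boundaries so every rank agrees on
    # SIGTERM status before entering the next collective; otherwise rank 0
    # could exit while other ranks hang at a validation barrier.
    if epoch_end or starting_validation:
        return True
    return step % every_n_steps == 0
```

With N=5, a SIGTERM arriving at step 3 would normally wait until step 5, but the forced flush at epoch end or before validation delivers it immediately.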
@c-pozzi c-pozzi marked this pull request as ready for review April 6, 2026 09:06

c-pozzi commented Apr 6, 2026

The test was setting _devices_flag=2 to simulate multi-GPU, but the
code checks trainer.world_size (from strategy.world_size) which
remained 1 in the test Trainer. Mock the property directly instead.
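A minimal, self-contained illustration of that testing fix with `unittest.mock` (`FakeTrainer` is a stand-in, not the real Lightning Trainer):

```python
# Patching a read-only property on the class, rather than setting an
# unrelated flag on the instance. FakeTrainer stands in for the Trainer,
# whose world_size is derived from strategy.world_size.
from unittest.mock import PropertyMock, patch

class FakeTrainer:
    @property
    def world_size(self) -> int:
        return 1  # single-process default, like strategy.world_size

trainer = FakeTrainer()
with patch.object(FakeTrainer, "world_size", new_callable=PropertyMock, return_value=2):
    assert trainer.world_size == 2  # property replaced on the class
assert trainer.world_size == 1      # restored after the context exits
```

Patching the property on the class is required because properties are descriptors looked up on the type, not the instance.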

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79%. Comparing base (bb7820f) to head (4aa8af7).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (bb7820f) and HEAD (4aa8af7): HEAD has 2053 fewer uploads than BASE.

| Flag | BASE (bb7820f) | HEAD (4aa8af7) |
| --- | ---: | ---: |
| cpu | 503 | 42 |
| python | 36 | 3 |
| lightning_fabric | 162 | 0 |
| pytest | 252 | 0 |
| python3.13 | 108 | 9 |
| lightning | 179 | 15 |
| python3.11 | 71 | 6 |
| python3.12 | 144 | 12 |
| python3.10 | 36 | 3 |
| python3.12.7 | 108 | 9 |
| pytorch2.1 | 36 | 6 |
| pytest-full | 251 | 42 |
| pytorch_lightning | 162 | 27 |
| pytorch2.7 | 18 | 3 |
| pytorch2.8 | 36 | 6 |
| pytorch2.10 | 36 | 6 |
| pytorch2.3 | 17 | 3 |
| pytorch2.2.2 | 18 | 3 |
| pytorch2.9 | 36 | 6 |
| pytorch2.5.1 | 18 | 3 |
| pytorch2.4.1 | 18 | 3 |
| pytorch2.6 | 18 | 3 |
Additional details and impacted files
```text
@@            Coverage Diff            @@
##           master   #21640     +/-   ##
=========================================
- Coverage      87%      79%     -8%
=========================================
  Files         270      267      -3
  Lines       23934    23885     -49
=========================================
- Hits        20713    18809   -1904
- Misses       3221     5076   +1855
```

c-pozzi and others added 2 commits April 6, 2026 13:26
- Fix epoch boundary test: mock world_size property instead of
  _devices_flag, which didn't affect trainer.world_size
- Rewrite interval test to call real advance() with mocked distributed
  instead of reimplementing the logic in the test
- Add ddp_spawn integration test exercising real NCCL broadcasts on
  2 GPUs with non-aligned step count to trigger epoch-end flush
@mojtababahrami

This is a great way to handle it! Super thanks!



Successfully merging this pull request may close these issues.

forced CPU-GPU synchronization at the end of every training step resulting in underutilization of GPU
