Add broadcast_sigterm_every_n_steps to reduce SIGTERM broadcast overhead #21640
Open
c-pozzi wants to merge 7 commits into Lightning-AI:master
Conversation
Allow users to control how often SIGTERM status is broadcast across ranks, reducing CPU-GPU sync overhead for fast training loops while preserving the default every-step behavior. Closes Lightning-AI#21487
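The gating itself amounts to a step-modulo check. A minimal sketch of the idea (illustrative only; `should_broadcast` is a hypothetical helper, not Lightning's actual internals):

```python
# Illustrative sketch of step-gated SIGTERM broadcasting; the helper name
# and signature are hypothetical, not Lightning's real API.

def should_broadcast(step: int, every_n: int) -> bool:
    """Return True on steps where the SIGTERM flag should be synced across ranks."""
    if every_n <= 1:
        return True  # default: broadcast every step (current behavior)
    return step % every_n == 0

# With every_n=10, only 1 step in 10 pays the CPU-GPU sync cost.
broadcast_steps = [s for s in range(20) if should_broadcast(s, 10)]
# broadcast_steps == [0, 10]
```

The default of `1` keeps the old every-step behavior, so the change is opt-in.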
When broadcast_sigterm_every_n_steps > 1, SIGTERM could arrive between broadcasts near the end of an epoch. Without a forced check, rank 0 would exit while other ranks hang waiting at the next collective (e.g. validation barrier). This adds a forced broadcast whenever the epoch ends or validation is about to start.
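The hazard and the fix can be sketched in plain Python (hypothetical helper and flags; the real change forces the broadcast inside the loop code at these boundaries):

```python
# Illustrative sketch of the epoch-boundary flush. With every_n > 1, a
# SIGTERM caught on rank 0 between broadcast steps would otherwise never
# reach the other ranks before they block at the next collective.

def should_broadcast(step: int, every_n: int,
                     epoch_end: bool = False, val_start: bool = False) -> bool:
    if epoch_end or val_start:
        return True  # forced flush so all ranks agree before any collective
    return every_n <= 1 or step % every_n == 0

# An epoch of 7 steps with every_n=10: only step 0 would broadcast ...
plain = [should_broadcast(s, 10) for s in range(7)]
# ... but marking the last step as epoch_end forces a final sync.
flushed = plain[:-1] + [should_broadcast(6, 10, epoch_end=True)]
```

Without that final forced sync, rank 0 could exit on a signal the other ranks never saw, deadlocking them at the next barrier.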
The test was setting `_devices_flag=2` to simulate multi-GPU, but the code checks `trainer.world_size` (from `strategy.world_size`), which remained 1 in the test Trainer. Mock the property directly instead.
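Patching a read-only property like `world_size` is done with `unittest.mock.PropertyMock`; a self-contained sketch using a stand-in class (not the real Lightning Trainer):

```python
from unittest.mock import PropertyMock, patch

class Trainer:
    """Stand-in for the real Trainer; world_size normally comes from the strategy."""
    @property
    def world_size(self) -> int:
        return 1  # single-process default, like the test Trainer in the PR

# Setting attributes such as _devices_flag does not change what the
# property returns; patching the property on the class does.
with patch.object(Trainer, "world_size", new_callable=PropertyMock) as ws:
    ws.return_value = 2
    patched = Trainer().world_size   # reports multi-GPU inside the patch
unpatched = Trainer().world_size     # restored to 1 outside the patch
```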
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff            @@
##           master   #21640      +/- ##
=========================================
- Coverage      87%      79%      -8%
=========================================
  Files         270      267       -3
  Lines       23934    23885      -49
=========================================
- Hits        20713    18809    -1904
- Misses       3221     5076    +1855
- Fix epoch boundary test: mock the `world_size` property instead of `_devices_flag`, which didn't affect `trainer.world_size`
- Rewrite interval test to call the real `advance()` with mocked distributed calls instead of reimplementing the logic in the test
- Add a ddp_spawn integration test exercising real NCCL broadcasts on 2 GPUs with a non-aligned step count to trigger the epoch-end flush
This is a great way to handle it! Super thanks!
Summary

Adds a `broadcast_sigterm_every_n_steps` parameter to the Trainer that controls how often SIGTERM status is broadcast across ranks in distributed training. Default is `1` (every step), preserving current behavior. The NCCL broadcast in `_broadcast_sigterm_tensor` (#20825) adds a fixed-cost CPU-GPU sync every step. For fast training loops this overhead is significant relative to step time. Benchmark details and measurements in #21487.

Tradeoff
Worst-case SIGTERM detection delay is `(N-1) × step_time`. This is safe because SIGTERM grace periods are typically 30–120 s, and the users who benefit most (fast loops) also see the smallest absolute delay (e.g., N=10, step=0.5 ms → 4.5 ms worst case).

Design
The parameter lives on `Trainer`, following the existing `every_n` pattern (`log_every_n_steps`, `check_val_every_n_epoch`, `reload_dataloaders_every_n_epochs`), rather than on `Strategy`, which owns communication mechanics, not loop-frequency policy.

Test plan
Closes #21487
📚 Documentation preview 📚: https://pytorch-lightning--21640.org.readthedocs.build/en/21640/