
Add broadcast_sigterm_every_n_steps to reduce SIGTERM broadcast overhead #21640

Open
c-pozzi wants to merge 7 commits into Lightning-AI:master from c-pozzi:fix/sigterm-broadcast-interval-21487

Conversation


@c-pozzi c-pozzi commented Apr 4, 2026

Summary

Adds a broadcast_sigterm_every_n_steps parameter to the Trainer that controls how often SIGTERM status is broadcast across ranks in distributed training. Default is 1 (every step), preserving current behavior.

The NCCL broadcast in _broadcast_sigterm_tensor (#20825) adds a fixed-cost CPU-GPU sync every step. For fast training loops this overhead is significant relative to step time. Benchmark details and measurements in #21487.

Tradeoff

Worst-case SIGTERM detection delay is (N-1) × step_time. This is safe because SIGTERM grace periods are 30–120s, and users who benefit most (fast loops) have the smallest absolute delay (e.g., N=10, step=0.5ms → 4.5ms).
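The worst-case figure can be verified with one line of arithmetic (illustrative helper, not part of the PR):

```python
# Illustrative check of the worst-case detection delay; not part of the PR.
def worst_case_delay_ms(n: int, step_time_ms: float) -> float:
    # A SIGTERM arriving just after a broadcast waits up to N-1 further steps.
    return (n - 1) * step_time_ms

assert worst_case_delay_ms(10, 0.5) == 4.5  # the example from the text
assert worst_case_delay_ms(1, 0.5) == 0.0   # N=1: no added delay
```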

Design

The parameter lives on Trainer following the existing every_n pattern (log_every_n_steps, check_val_every_n_epoch, reload_dataloaders_every_n_epochs) rather than Strategy, which holds communication mechanics, not loop frequency policy.

```python
trainer = Trainer(broadcast_sigterm_every_n_steps=10)
```
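For illustration, the step-gating this parameter implies can be sketched as a small helper; the function name and signature here are hypothetical, not the actual Lightning internals:

```python
# Hypothetical sketch of the gating logic implied by the parameter;
# `should_broadcast_sigterm` is an illustrative name, not Lightning code.
def should_broadcast_sigterm(global_step: int, every_n_steps: int = 1) -> bool:
    """Return True when SIGTERM status should be broadcast on this step.

    The default every_n_steps=1 reproduces the existing every-step behavior.
    """
    if every_n_steps < 1:
        raise ValueError("broadcast_sigterm_every_n_steps must be >= 1")
    return global_step % every_n_steps == 0

# With N=5, only every fifth step pays the NCCL broadcast cost.
assert [s for s in range(10) if should_broadcast_sigterm(s, 5)] == [0, 5]
```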

Test plan

  • Validation rejects values < 1
  • Default value is 1
  • Broadcast interval logic correct for N=1, 5, 10

Closes #21487


📚 Documentation preview 📚: https://pytorch-lightning--21640.org.readthedocs.build/en/21640/

Allow users to control how often SIGTERM status is broadcast across
ranks, reducing CPU-GPU sync overhead for fast training loops while
preserving the default every-step behavior.

Closes Lightning-AI#21487
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Apr 4, 2026
pre-commit-ci bot and others added 3 commits April 4, 2026 11:14
When broadcast_sigterm_every_n_steps > 1, SIGTERM could arrive between
broadcasts near the end of an epoch. Without a forced check, rank 0
would exit while other ranks hang waiting at the next collective
(e.g. validation barrier). This adds a forced broadcast whenever the
epoch ends or validation is about to start.
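The fix described above can be sketched in plain Python; the function and its keyword flags are illustrative names, not the code in this PR:

```python
# Illustrative sketch of the epoch-boundary flush; names are hypothetical.
def needs_sigterm_broadcast(
    step: int,
    every_n_steps: int,
    *,
    epoch_end: bool = False,
    starting_validation: bool = False,
) -> bool:
    # Force the broadcast at loop boundaries so every rank agrees on
    # SIGTERM status before entering the next collective; otherwise rank 0
    # could exit while other ranks hang at a validation barrier.
    if epoch_end or starting_validation:
        return True
    return step % every_n_steps == 0
```

With N=5, a SIGTERM arriving at step 3 would normally wait until step 5, but the forced flush at epoch end or before validation delivers it immediately.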
@c-pozzi c-pozzi marked this pull request as ready for review April 6, 2026 09:06

c-pozzi commented Apr 6, 2026

The test was setting _devices_flag=2 to simulate multi-GPU, but the
code checks trainer.world_size (from strategy.world_size) which
remained 1 in the test Trainer. Mock the property directly instead.
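A minimal, self-contained illustration of that testing fix with `unittest.mock` (`FakeTrainer` is a stand-in, not the real Lightning Trainer):

```python
# Patching a read-only property on the class, rather than setting an
# unrelated flag on the instance. FakeTrainer stands in for the Trainer,
# whose world_size is derived from strategy.world_size.
from unittest.mock import PropertyMock, patch

class FakeTrainer:
    @property
    def world_size(self) -> int:
        return 1  # single-process default, like strategy.world_size

trainer = FakeTrainer()
with patch.object(FakeTrainer, "world_size", new_callable=PropertyMock, return_value=2):
    assert trainer.world_size == 2  # property replaced on the class
assert trainer.world_size == 1      # restored after the context exits
```

Patching the property on the class is required because properties are descriptors looked up on the type, not the instance.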

codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79%. Comparing base (bb7820f) to head (4aa8af7).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (bb7820f) and HEAD (4aa8af7): HEAD has 2053 fewer uploads than BASE.

| Flag | BASE (bb7820f) | HEAD (4aa8af7) |
| --- | ---: | ---: |
| cpu | 503 | 42 |
| python | 36 | 3 |
| lightning_fabric | 162 | 0 |
| pytest | 252 | 0 |
| python3.13 | 108 | 9 |
| lightning | 179 | 15 |
| python3.11 | 71 | 6 |
| python3.12 | 144 | 12 |
| python3.10 | 36 | 3 |
| python3.12.7 | 108 | 9 |
| pytorch2.1 | 36 | 6 |
| pytest-full | 251 | 42 |
| pytorch_lightning | 162 | 27 |
| pytorch2.7 | 18 | 3 |
| pytorch2.8 | 36 | 6 |
| pytorch2.10 | 36 | 6 |
| pytorch2.3 | 17 | 3 |
| pytorch2.2.2 | 18 | 3 |
| pytorch2.9 | 36 | 6 |
| pytorch2.5.1 | 18 | 3 |
| pytorch2.4.1 | 18 | 3 |
| pytorch2.6 | 18 | 3 |
Additional details and impacted files
```text
@@            Coverage Diff            @@
##           master   #21640     +/-   ##
=========================================
- Coverage      87%      79%     -8%
=========================================
  Files         270      267      -3
  Lines       23934    23885     -49
=========================================
- Hits        20713    18809   -1904
- Misses       3221     5076   +1855
```

c-pozzi and others added 2 commits April 6, 2026 13:26
- Fix epoch boundary test: mock world_size property instead of
  _devices_flag, which didn't affect trainer.world_size
- Rewrite interval test to call real advance() with mocked distributed
  instead of reimplementing the logic in the test
- Add ddp_spawn integration test exercising real NCCL broadcasts on
  2 GPUs with non-aligned step count to trigger epoch-end flush
@mojtababahrami

This is a great way to handle it! Super thanks!



Successfully merging this pull request may close these issues.

forced CPU-GPU synchronization at the end of every training step resulting in underutilization of GPU
