Bug description
In ignite/distributed/utils.py, the one_rank_first context manager is susceptible to a distributed deadlock if an exception occurs within the context block on the target rank.
The current implementation uses two barriers to synchronize ranks. If the process with the designated rank encounters an error (e.g., a network failure during data download or a disk I/O error) within the yield block, it will never reach the second barrier. Consequently, all other processes that are either waiting at the first barrier or expecting to synchronize at the second will hang indefinitely.
Steps to reproduce
Run the following logic in a distributed environment (2+ ranks):
import ignite.distributed as idist

# Simulate a crash only on rank 0
with idist.one_rank_first(rank=0):
    if idist.get_rank() == 0:
        raise RuntimeError("Rank 0 crashed!")
# Other ranks are now stuck at the first barrier or will hit the second
Expected behavior
The exception should propagate, and the entire distributed job should terminate gracefully. Instead, the non-crashed ranks hang, requiring a manual kill of the processes.
Code Snippet
Current implementation in ignite/distributed/utils.py:
if current_rank != rank:
    barrier()
yield  # <--- If an exception happens here on 'rank'
if current_rank == rank:
    barrier()  # <--- Other ranks never reach or pass synchronization
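One possible direction for a fix (a sketch only, not the library's actual code) is to move the trailing barrier into a `finally` block, so the target rank always releases the waiting ranks even when the body raises. The `get_rank`/`barrier` stubs below stand in for the real `ignite.distributed` primitives so the behavior can be illustrated in a single process:

```python
from contextlib import contextmanager

# Stubs standing in for ignite.distributed primitives (illustration only).
def get_rank():
    return 0

calls = []  # records the order of events for the demonstration

def barrier():
    calls.append("barrier")

@contextmanager
def one_rank_first(rank=0):
    """Sketch of an exception-safe variant: the trailing barrier runs in
    ``finally``, so the target rank reaches it even if the body raises."""
    current_rank = get_rank()
    if current_rank != rank:
        barrier()  # non-target ranks wait for the target rank to finish
    try:
        yield
    finally:
        if current_rank == rank:
            barrier()  # always release the waiting ranks, even on error

# Even when rank 0 crashes inside the block, the barrier is still reached,
# so the exception can propagate instead of deadlocking the other ranks.
try:
    with one_rank_first(rank=0):
        calls.append("body")
        raise RuntimeError("Rank 0 crashed!")
except RuntimeError:
    calls.append("caught")
```

With this shape, the exception still propagates on the crashing rank (so the job can terminate), while the other ranks are unblocked from the first barrier rather than hanging forever.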