I slightly adapted the cifar10 example in this fork, basically removing python-fire and making the script launchable via torch.distributed.launch, so that it can be executed as a standalone script with clearml-task.
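For context, here is a minimal sketch (not the actual adapted script) of what a torch.distributed.launch-compatible entry point looks like; the filename and the placeholder model are mine, not from the fork:

```python
# The launcher passes --local_rank to each process it spawns and sets
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE in the environment.
# Launched with e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 train_cifar10.py
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # init_process_group reads rank/world size from the env vars
    # set by torch.distributed.launch
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    model = torch.nn.Linear(32 * 32 * 3, 10).cuda(args.local_rank)  # placeholder model
    model = DDP(model, device_ids=[args.local_rank])
    # ... build dataloaders, run the training loop, etc. ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```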
I executed the following script with nproc_per_node in [1, 2, 3, 4] on an AWS g4dn.12xlarge instance (4x T4 GPUs) and got the following results. Here I disabled DataParallel, as discussed in pytorch/ignite#2447 ("DataParallel is used by auto_model with single GPU").
I am increasing the batch size by 16 each time I add a GPU, so that each GPU processes the same number of samples. I didn't change the default number of processes (8) for any of the runs, because I never observed the GPUs being under-used (utilization dropping below 95%).
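To make the scaling rule concrete, this is roughly what the configuration amounts to; `PER_GPU_BATCH`, `make_loader`, and the use of `DistributedSampler` are my illustration, not necessarily how the adapted script is written, and it assumes the process group is already initialized:

```python
# Per-GPU batch stays fixed at 16 while the effective global batch
# grows as PER_GPU_BATCH * world_size (the "+16 per GPU" rule).
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

PER_GPU_BATCH = 16   # assumed per-GPU batch size
NUM_WORKERS = 8      # the default kept across all runs


def make_loader(dataset):
    # DistributedSampler shards the dataset across processes, so each
    # GPU sees PER_GPU_BATCH samples per step regardless of how many
    # GPUs participate.
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=PER_GPU_BATCH,
                      sampler=sampler, num_workers=NUM_WORKERS)
```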
GPU utilization as reported by ClearML
I was expecting to observe a quasi-linear improvement in training time as GPUs are added, but that isn't the case. Am I missing something?
PS: Here are the requirements I used to execute the script