Distributed training

Mammoth supports distributed training via DataParallel. To use it, pass the --distributed=dp argument to utils/main.py. Mammoth then wraps the model with the make_dp function from the distributed module, which uses all GPUs available on the machine.
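To make the wrapping step concrete, here is a minimal sketch of what a helper of this kind might look like. It is not Mammoth's make_dp: the name wrap_data_parallel is hypothetical, and the only API assumed is PyTorch's torch.nn.DataParallel. The sketch falls back to the unwrapped model when fewer than two GPUs are visible, so it also runs on a CPU-only machine.

```python
import torch
from torch import nn


def wrap_data_parallel(model: nn.Module) -> nn.Module:
    """Hypothetical helper (not Mammoth's make_dp): use every visible GPU."""
    if torch.cuda.device_count() > 1:
        # DataParallel replicates the module on each GPU and scatters
        # every input batch along dimension 0.
        return nn.DataParallel(model.cuda())
    # With zero or one GPU there is nothing to parallelize.
    return model


model = wrap_data_parallel(nn.Linear(8, 2))
device = next(model.parameters()).device

# The caller still passes one full batch; splitting happens inside the wrapper.
outputs = model(torch.randn(16, 8, device=device))
print(outputs.shape)
```

Because the wrapper keeps the same call signature as the original module, the rest of a training loop does not need to change.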

DataParallel training replicates the model on every GPU, splits each batch across the replicas, and runs the forward and backward passes on each GPU in parallel. The gradients from all replicas are then accumulated on the primary GPU, where the model parameters are updated. This is the simplest form of multi-GPU training in PyTorch and, for now, the only one Mammoth fully supports.
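The following toy example, in plain Python with no GPU involved, illustrates why splitting a batch does not change the update. It assumes a one-parameter model y = w * x with a mean squared error loss; the function names and data are made up for illustration. Each shard computes its own gradient, and combining the shard gradients weighted by shard size gives the same value as the gradient over the whole batch.

```python
import math


def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def sharded_grad(w, xs, ys, num_devices):
    """Split the batch into contiguous shards, one per simulated device,
    and combine the per-shard gradients weighted by shard size."""
    n = len(xs)
    shard_size = math.ceil(n / num_devices)
    total = 0.0
    for start in range(0, n, shard_size):
        shard_x = xs[start:start + shard_size]
        shard_y = ys[start:start + shard_size]
        total += mean_grad(w, shard_x, shard_y) * len(shard_x) / n
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]

full = mean_grad(0.5, xs, ys)
split = sharded_grad(0.5, xs, ys, num_devices=4)
print(math.isclose(full, split))
```

The weighting by shard size matters when the batch does not divide evenly across devices; with equal shards it reduces to a plain average.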

Important

For now, DataParallel is the only fully supported mode, because synchronizing the memory buffer across multiple GPUs after each batch is difficult. Experimental support for DistributedDataParallel training on a Slurm cluster is available through the make_ddp function in the distributed module.
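For readers unfamiliar with DistributedDataParallel under Slurm, the sketch below shows one common way a process can discover its place in the job. It is not taken from make_ddp and makes no claim about how Mammoth does this: slurm_process_info and ProcessInfo are hypothetical names, while SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID are environment variables that Slurm sets for each task launched with srun.

```python
import os
from typing import Mapping, NamedTuple


class ProcessInfo(NamedTuple):
    rank: int        # global index of this process within the job
    world_size: int  # total number of processes in the job
    local_rank: int  # index of this process on its own node


def slurm_process_info(env: Mapping[str, str] = os.environ) -> ProcessInfo:
    """Hypothetical helper: read the process layout from Slurm's environment."""
    return ProcessInfo(
        rank=int(env["SLURM_PROCID"]),
        world_size=int(env["SLURM_NTASKS"]),
        local_rank=int(env["SLURM_LOCALID"]),
    )


# Simulated environment so the sketch also runs outside a Slurm job.
fake_env = {"SLURM_PROCID": "5", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"}
info = slurm_process_info(fake_env)
print(info)
```

In a typical PyTorch setup, rank and world_size would be passed to torch.distributed.init_process_group, and local_rank would select the GPU for the process.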