Step 1: Communicators and Optimizers¶

In the following, we explain how to modify your code using Chainer to enable distributed training with ChainerMN. We take Chainer’s MNIST example and modify it in a step-by-step manner to see the standard way of using ChainerMN.

Creating a Communicator¶

We first need to create a communicator. A communicator is in charge of communication between workers. A communicator can be created as follows:

comm = chainermn.create_communicator()

Workers in a node have to use different GPUs. For this purpose, intra_rank property of communicators is useful. Each worker in a node is assigned a unique intra_rank starting from zero. Therefore, it is often convenient to use the intra_rank-th GPU.

The following line of code is found in the original MNIST example:

chainer.cuda.get_device_from_id(args.gpu).use()

which we modify as follows:

device = comm.intra_rank
chainer.cuda.get_device_from_id(device).use()

Creating a Multi-Node Optimizer¶

This is the most important step. We need to insert the communication right after backprop and right before optimization. In ChainerMN, it is done by creating a multi-node optimizer.

Method create_multi_node_optimizer receives a standard Chainer optimizer, and it returns a new optimizer. The returned optimizer is called multi-node optimizer. It behaves exactly same as the supplied original standard optimizer (e.g., you can add hooks such as WeightDecay), except that it communicates model parameters and gradients properly in a multi-node setting.

The following is the code line found in the original MNIST example:

optimizer = chainer.optimizers.Adam()

To obtain a multi-node optimizer, we modify that part as follows:

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.Adam(), comm)

Run¶

With the above two changes, your script is ready for distributed training. Invoke your script with mpiexec or mpirun (see your MPI’s manual for details). The following is an example of executing the training with four processes at localhost:

$ mpiexec -n 4 python train_mnist.py

In the non-GPU mode, you may see a warning like shown below, but this message is harmless, and you can ignore it for now

Warning: using naive communicator because only naive supports CPU-only execution

If you have multiple GPUs on the localhost, 4 for example, you may also want to try:

$ mpiexec -n 4 python train_mnist.py --gpu

Multi-node execution¶

If you can successfully run the multi-process version of the MNIST example, you are almost ready for multi-node execution. The simplest way is to specify the --host argument to the mpiexec command. Let’s suppose you have two GPU-equipped computing nodes: host00 and host01, each of which has 4 GPUs, and so you have 8 GPUs in total:

$ mpiexec -n 8 -host host00,host01 python train_mnist.py

The script should print similar results to the previous intra-node execution.

Copying datasets¶

In the MNIST example, the rank 0 process reads the entire portion of the dataset and scatters it to other processes. In some applications, such as the ImageNet ChainerMN example, however, only the pathes to each data file are scattered and each process reads the actual data files. In such cases, all datasets must be readable on all computing nodes in the same location. You don’t need to worry about this if you use NFS (Network File System) or any other similar data synchronizing system. Otherwise, you need to manually copy data files between nodes using scp or rsync.

If you have trouble¶

If you have any trouble running the sample programs in your environment, go to the Step-by-Step Troubleshooting page and follow the steps to check your environment and configuration.

Next Steps¶

With only the above two changes distributed training is already performed. Thus, the model parameters are updated by using gradients that are aggregated over all the workers. However, this MNIST example still has a few areas in need of improvment. In the next page, we will see how to address the following problems:

Training period is wrong; ‘one epoch’ is not one epoch.
Evaluation is not parallelized.
Status outputs to stdout are repeated and annoying.