API Reference¶
Communicators¶
-
chainermn.create_communicator(communicator_name='hierarchical', mpi_comm=None)¶
Create a ChainerMN communicator.
Different communicators provide different approaches to communication, so they have different performance characteristics. The default communicator hierarchical is expected to perform well on a variety of environments, so one need not change communicators in most cases. However, choosing the proper communicator may give better performance. The following communicators are available.
Name             CPU  GPU  NCCL              Recommended Use Cases
pure_nccl             OK   Required (>= v2)  pure_nccl is recommended when NCCL2 is available in the environment.
hierarchical          OK   Required          Each node has a single NIC or HCA
two_dimensional       OK   Required          Each node has multiple NICs or HCAs
single_node           OK   Required          Single node with multiple GPUs
flat                  OK   N/A
naive            OK   OK                     Testing on CPU mode
Parameters:
- communicator_name – The name of the communicator (naive, flat, hierarchical, two_dimensional, pure_nccl, or single_node)
- mpi_comm – MPI4py communicator
Returns: ChainerMN communicator
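A minimal usage sketch (it assumes the script is launched through MPI, e.g. mpiexec, with mpi4py installed; the rank, size, and intra_rank attributes follow the communicator interface):
import chainermn

# Create the default communicator; one process per MPI rank, e.g.
#   mpiexec -n 4 python train.py
comm = chainermn.create_communicator('hierarchical')
print('process %d of %d (intra-node rank %d)'
      % (comm.rank, comm.size, comm.intra_rank))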
Optimizers and Evaluators¶
-
chainermn.create_multi_node_optimizer(actual_optimizer, communicator, double_buffering=False)¶
Create a multi node optimizer from a Chainer optimizer.
Parameters:
- actual_optimizer – Chainer optimizer (e.g., chainer.optimizers.Adam).
- communicator – ChainerMN communicator.
- double_buffering – If True, all-reduce and other processing (such as forward and backward) are overlapped using double buffering. There are cases where accuracy is affected because the gradients of the previous iteration are used for the update. This flag is supported by PureNcclCommunicator only.
Returns: The multi node optimizer based on actual_optimizer.
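A minimal sketch of wrapping a standard optimizer (the model below is a hypothetical classifier; the naive communicator is used only so the example runs on CPU):
import chainer
import chainer.links as L
import chainermn

comm = chainermn.create_communicator('naive')   # CPU-only communicator
model = L.Classifier(L.Linear(784, 10))         # hypothetical model

# The wrapper all-reduces gradients across workers before each update.
optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.Adam(), comm)
optimizer.setup(model)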
-
chainermn.create_multi_node_evaluator(actual_evaluator, communicator)¶
Create a multi node evaluator from a normal evaluator.
This method actually patches the given evaluator so that it works in a multi node environment. It adds several hidden attributes starting with the _mn_ prefix.
Parameters:
- actual_evaluator – evaluator to be patched (e.g., chainer.training.extensions.Evaluator)
- communicator – ChainerMN communicator
Returns: The multi-node patched actual_evaluator.
Note
Once patched, the original evaluator no longer works correctly in a non-MPI environment.
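A minimal sketch of patching a standard Evaluator (test_iter, model, device, comm, and trainer are assumed to exist in the surrounding training script):
import chainer
import chainermn

# Patch the Evaluator so evaluation results are aggregated over all workers.
evaluator = chainer.training.extensions.Evaluator(test_iter, model, device=device)
evaluator = chainermn.create_multi_node_evaluator(evaluator, comm)
trainer.extend(evaluator)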
Dataset Utilities¶
-
chainermn.scatter_dataset(dataset, comm, root=0, shuffle=False, seed=None, max_buf_len=268435456)¶
Scatter the given dataset to the workers in the communicator.
The dataset of worker 0 (i.e., the worker whose comm.rank is 0) is scattered to all workers. The datasets given on other workers are ignored. The dataset is split into sub datasets of almost equal sizes and scattered to the workers. To create a sub dataset, chainer.datasets.SubDataset is used.
Parameters:
- dataset – A dataset (e.g., list, numpy.ndarray, chainer.datasets.TupleDataset, …).
- comm – ChainerMN communicator or MPI4py communicator.
- shuffle (bool) – If True, the order of examples is shuffled before being scattered.
- root (int) – The root process of the scatter operation.
- seed (int) – Seed for the generator used for the permutation of indexes. If an integer convertible to a 32-bit unsigned integer is specified, it is guaranteed that each sample in the given dataset always belongs to a specific subset. If None, the permutation is changed randomly.
- max_buf_len (int) – Max buffer size to be used at broadcasting binaries. Must not be larger than 2147483647.
Returns: Scattered dataset.
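A common pattern, shown as a sketch: only rank 0 loads the dataset and every worker receives its own shard (comm is an existing ChainerMN communicator):
import chainer
import chainermn

if comm.rank == 0:
    train, _ = chainer.datasets.get_mnist()  # any dataset works here
else:
    train = None  # non-root workers pass None; their value is ignored

train = chainermn.scatter_dataset(train, comm, shuffle=True)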
-
chainermn.datasets.create_empty_dataset(dataset)¶
Creates an empty dataset for models with no inputs and outputs.
This function generates an empty dataset, i.e., __getitem__() only returns None. The resulting dataset is compatible with the original one. Such datasets are used for models which neither take any inputs nor return any outputs; for example, models whose forward() starts with chainermn.functions.recv() and ends with chainermn.functions.send().
Parameters: dataset – Dataset to convert.
Returns: Dataset consisting of only the patterns in the original one.
Return type: TransformDataset
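A sketch of the intended use: a worker in the middle of a model-parallel pipeline has no real inputs or targets, so it iterates over an empty stand-in for the real dataset (train and batch_size are assumed to exist):
import chainer
import chainermn

# Replace the real dataset with a dataset whose items are all None.
empty_train = chainermn.datasets.create_empty_dataset(train)
train_iter = chainer.iterators.SerialIterator(empty_train, batch_size)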
Links¶
-
class chainermn.MultiNodeChainList(comm)¶
Combines multiple non-connected components of a computational graph.
This class combines chainer.Chain objects, each of which represents one of the non-connected components of the computational graph. In __call__(), the object returned by one chainer.Chain (which represents a pointer) is passed to the next chainer.Chain, in order to keep the computational graph connected and make backprop work properly.
Users add each chainer.Chain with the add_link() method. Each chain is invoked in forward computation in the order it was added, and in backward computation in the reversed order.
Example (basic usage)
This is a simple example of a model which sends its outputs to the rank=1 machine:
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn


class SimpleModelSub(chainer.Chain):

    def __init__(self, n_in, n_hidden, n_out):
        super(SimpleModelSub, self).__init__(
            l1=L.Linear(n_in, n_hidden),
            l2=L.Linear(n_hidden, n_out))

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        return self.l2(h1)


class SimpleModel(chainermn.MultiNodeChainList):

    def __init__(self, comm, n_in, n_hidden, n_out):
        super(SimpleModel, self).__init__(comm)
        self.add_link(
            SimpleModelSub(n_in, n_hidden, n_out),
            rank_in=None, rank_out=1)
Example (split MLP on 2 processes)
This is another example, in which two models interact with each other:
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn


class MLP(chainer.Chain):

    def __init__(self, n_in, n_hidden, n_out):
        super(MLP, self).__init__(
            l1=L.Linear(n_in, n_hidden),
            l2=L.Linear(n_hidden, n_hidden),
            l3=L.Linear(n_hidden, n_out))

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        h2 = F.relu(self.l2(h1))
        return self.l3(h2)


class Model0(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(Model0, self).__init__(comm)
        self.add_link(MLP(10000, 5000, 2000), rank_in=None, rank_out=1)
        self.add_link(MLP(100, 50, 10), rank_in=1, rank_out=None)


class Model1(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(Model1, self).__init__(comm)
        self.add_link(MLP(2000, 500, 100), rank_in=0, rank_out=0)
Model0 is expected to be on rank=0, and Model1 is expected to be on rank=1. The first MLP in Model0 sends its outputs to Model1, then the MLP in Model1 receives them and sends its outputs to the second MLP in Model0.
Example (sending tuples)
This is an example of sending a tuple:
import chainer
import chainer.functions as F
import chainermn


class NN0(chainer.Chain):

    def __call__(self, x):
        y0 = some_calculation_nn0_0(x)
        y1 = some_calculation_nn1_1(x)
        return y0, y1


class NN1(chainer.Chain):

    def __call__(self, y):
        y0, y1 = y  # unpack tuple from NN0
        return some_calculation_nn1(y0, y1)


class Model_on_Process_0(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(Model_on_Process_0, self).__init__(comm=comm)
        self.add_link(NN0(), rank_in=None, rank_out=1)


class Model_on_Process_1(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(Model_on_Process_1, self).__init__(comm=comm)
        self.add_link(NN1(), rank_in=0, rank_out=None)
In this example, Model_on_Process_0 sends the two-element tuple (y0, y1) (returned by NN0.__call__) to Model_on_Process_1, where it is unpacked as shown in NN1.__call__.
Parameters: comm (chainermn.communicators._base.CommunicatorBase) – ChainerMN communicator.
-
add_link(link, rank_in=None, rank_out=None)¶
Register one connected link with its in/out ranks.
Parameters:
- link (chainer.Link) – The link object to be registered.
- rank_in (int, list, or None) – Ranks from which it receives data. If None is specified, the model does not receive data from any machine.
- rank_out (int, list, or None) – Ranks to which it sends data. If None is specified, the model does not send data to any machine.
-
class chainermn.links.MultiNodeBatchNormalization(size, comm, decay=0.9, eps=2e-05, dtype=<type 'numpy.float32'>, use_gamma=True, use_beta=True, initial_gamma=None, initial_beta=None)¶
Batch normalization layer that can use the whole batch stats.
When using chainer.link.BatchNormalization, batch mean and std are computed independently for the local batch in each worker. When the local batch size is too small, training is unstable due to unreliable batch stats.
In contrast, when using this MultiNodeBatchNormalization, workers communicate to conduct ‘correct’ batch normalization (e.g., obtaining the mean and std of the whole global batch).
This link works only with Chainer >= 2.0.0.
Parameters:
- size (int or tuple of ints) – Size (or shape) of channel dimensions.
- comm (ChainerMN communicator) – communicator to share the batch stats.
- decay (float) – Decay rate of moving average. It is used during training.
- eps (float) – Epsilon value for numerical stability.
- dtype (numpy.dtype) – Type to use in computing.
- use_gamma (bool) – If True, use the scaling parameter. Otherwise, use unit (1), which has no effect.
- use_beta (bool) – If True, use the shifting parameter. Otherwise, use unit (0), which has no effect.
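A minimal sketch of using this link as a drop-in replacement for L.BatchNormalization inside a Chain (ConvBlock and its sizes are illustrative; comm is an existing ChainerMN communicator):
import chainer
import chainer.functions as F
import chainer.links as L
import chainermn


class ConvBlock(chainer.Chain):

    def __init__(self, comm, n_out):
        super(ConvBlock, self).__init__(
            conv=L.Convolution2D(None, n_out, ksize=3, pad=1),
            # The communicator is passed at construction time so the layer can
            # aggregate batch statistics over all workers.
            bn=chainermn.links.MultiNodeBatchNormalization(n_out, comm))

    def __call__(self, x):
        return F.relu(self.bn(self.conv(x)))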
Functions¶
-
chainermn.functions.send(x, communicator, rank, tag=0)¶
Send elements to the target process.
This function returns a dummy variable that only holds the computational graph. If backward() is invoked on this dummy variable, it will try to receive gradients from the target process and send them back to the parent nodes.
Parameters:
- x (Variable) – Variable holding a matrix which you would like to send.
- communicator (chainer.communicators.CommunicatorBase) – ChainerMN communicator.
- rank (int) – Target process specifier.
- tag (int) – Optional message ID (MPI feature).
Returns: A dummy variable with no actual data, only holding the computational graph. Please refer to chainermn.functions.pseudo_connect for details.
Return type: Variable
-
chainermn.functions.recv(communicator, rank, delegate_variable=None, tag=0, device=-1, force_tuple=False)¶
Receive elements from the target process.
This function returns data received from the target process. If backward() is invoked, it will try to send gradients to the target process.
Note
If you define a non-connected computational graph on one process, you have to use delegate_variable to specify the output of the previous computational graph component. Otherwise backward() does not work well. Please refer to chainermn.functions.pseudo_connect for details.
Parameters:
- communicator (chainer.communicators.CommunicatorBase) – ChainerMN communicator.
- rank (int) – Target process specifier.
- delegate_variable (chainer.Variable) – Pointer to the other non-connected component.
- tag (int) – Optional message ID (MPI feature).
- device (int) – Target device specifier.
- force_tuple (bool) – If False (the default), a Variable is returned when the number of outputs is one. Otherwise, this method returns a tuple even when the number of outputs is one.
Returns: Data received from the target process. If backward() is invoked on this variable, it will send gradients to the target process.
Return type: Variable
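A minimal sketch of hand-written point-to-point communication between two processes (comm, model0, and model1 are assumed to exist; the rank numbers are illustrative):
import chainermn.functions


def forward(comm, model0, model1, x):
    if comm.rank == 0:
        h = model0(x)
        # send() returns a dummy variable that keeps the backward path alive;
        # return it so backward() can propagate gradients back through the send.
        return chainermn.functions.send(h, comm, rank=1)
    else:
        h = chainermn.functions.recv(comm, rank=0)
        return model1(h)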
-
chainermn.functions.pseudo_connect(delegate_variable, *actual_variables)¶
Connect independent connected graph components.
This function returns the received arguments directly, except for the first one, delegate_variable. In backward computation, it returns the received gradients directly, adding a zero grad corresponding to delegate_variable. The details of delegate_variable are described in the following notes.
Note
In the model-parallel framework, models on each process might have many non-connected components. Here we call a given graph non-connected when multiple inter-process communications are needed for its computation. For example, consider the following example:
class ConnectedGraph(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(ConnectedGraph, self).__init__(comm)
        self.add_link(ConnectedGraphSub(), rank_in=3, rank_out=1)
This model receives inputs from the rank=3 process and sends its outputs to the rank=1 process. The entire graph can be seen as one connected component ConnectedGraphSub. Please refer to the documentation of MultiNodeChainList for details.
On the other hand, see the next example:
class NonConnectedGraph(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(NonConnectedGraph, self).__init__(comm)
        self.add_link(NonConnectedGraphSubA(), rank_in=3, rank_out=1)
        self.add_link(NonConnectedGraphSubB(), rank_in=1, rank_out=2)
This model consists of two components: first, NonConnectedGraphSubA receives inputs from the rank=3 process and sends its outputs to the rank=1 process, and then NonConnectedGraphSubB receives inputs from the rank=1 process and sends its outputs to the rank=2 process. Here multiple inter-process communications are invoked between NonConnectedGraphSubA and NonConnectedGraphSubB, so the graph is regarded as non-connected.
Such non-connected models can be problematic in backward computation. Chainer traces back the computational graph from the output variable, but a naive implementation of chainermn.functions.recv would not take any inputs; it receives data via MPI_Recv, so the backward path vanishes there.
To prevent this, dummy variables, which we call delegate_variable, are used. In principle, chainermn.functions.send does not return any outputs because it sends data to the other process by MPI_Send. However, in our implementation chainermn.functions.send returns a dummy / empty variable, which is called delegate_variable. This variable does not hold any data; it is only used to retain the backward computation path. We can guarantee the backward computation simply by passing delegate_variable to the next chainermn.functions.recv (chainermn.functions.recv has an optional argument to receive delegate_variable).
Note
In some cases the intermediate graph component returns the model outputs. See the next example:
class NonConnectedGraph2(chainermn.MultiNodeChainList):

    def __init__(self, comm):
        super(NonConnectedGraph2, self).__init__(comm)
        self.add_link(NonConnectedGraphSubA(), rank_in=1, rank_out=None)
        self.add_link(NonConnectedGraphSubB(), rank_in=None, rank_out=1)
This model first receives inputs from the rank=1 process and produces the model outputs (specified by rank_out=None) in NonConnectedGraphSubA. Then, using the model inputs (specified by rank_in=None), NonConnectedGraphSubB sends its outputs to the rank=1 process. Since MultiNodeChainList.__call__ returns the outputs of the last component (in this case, the outputs of NonConnectedGraphSubB), a naive implementation cannot output the returned value of NonConnectedGraphSubA as the model outputs. In this case, pseudo_connect should be used.
pseudo_connect takes two arguments. The first one, delegate_variable, is what we explained in the note above. In this case, the returned value of NonConnectedGraphSubB corresponds to delegate_variable. The second one, actual_variables, is “what we want delegate_variable to imitate”. In NonConnectedGraph2, we obtain the returned value of NonConnectedGraphSubB as the model outputs, but what we actually want is the returned value of NonConnectedGraphSubA. At the same time, we want to trace back this resulting variable in backward computation. Using pseudo_connect, we can make a variable whose data is the same as the returned value of NonConnectedGraphSubA, and whose backward computation traces back through NonConnectedGraphSubB first.
pseudo_connect should also be used in some pathological cases, for example, where multiple chainermn.functions.send calls occur sequentially.
Parameters:
- delegate_variable (chainer.Variable) – Pointer to the previous non-connected graph component.
- actual_variables (tuple of chainer.Variable) – Actual values which delegate_variable imitates.
Returns: A variable with the given values combined with the delegating variable.
Return type: Variable
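A hedged sketch of the “sequential sends” case mentioned above (comm, h1, and h2 are assumed to exist; rank numbers are illustrative). The delegate variable returned by the first send is tied to the data of the next component so a single backward path covers both sends:
import chainermn.functions

# First send: returns a dummy (delegate) variable holding only the graph.
phi = chainermn.functions.send(h1, comm, rank=1)

# Tie the delegate variable to the next data so the backward pass also
# reaches the first send.
h2_connected = chainermn.functions.pseudo_connect(phi, h2)

# The second send now carries the combined backward path.
phi2 = chainermn.functions.send(h2_connected, comm, rank=2)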
-
chainermn.functions.all_to_all(comm, xs, device=-1)¶
Differentiable all-to-all communication between workers.
This function invokes all-to-all communication among the processes specified by the communicator. Backward is invoked just as for ordinary chainer functions, passing the input gradients back. Unlike point-to-point communication such as chainermn.functions.send and chainermn.functions.recv, users need not care about delegate variables, since backward() will not be invoked until all gradients from the output side arrive. Please refer to chainermn.functions.pseudo_connect for details about delegate variables.
Parameters:
- comm – ChainerMN communicator.
- xs (list of chainer.Variables) – Variables to send.
- device (int) – Target device specifier.
Returns: Received variables. d: a delegate variable.
Return type: ys (list of chainer.Variables)
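A minimal sketch (comm is an existing ChainerMN communicator; each worker contributes one small array per peer):
import numpy as np
import chainer
import chainermn.functions

# Each worker prepares comm.size variables, one destined for every worker.
xs = [chainer.Variable(np.full((2, 3), comm.rank, dtype=np.float32))
      for _ in range(comm.size)]

# After the exchange this worker holds one variable from every peer
# (see the Returns entry above for the exact return structure).
result = chainermn.functions.all_to_all(comm, xs)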
Trainer extensions¶
-
class chainermn.extensions.AllreducePersistent(model, comm)¶
Chainer extension to average persistent variables over workers.
When called, this extension invokes all-reduce communication among workers to compute averages of the persistent variables in the model, and the persistent variables are updated to those averages. Currently, integer persistent variables are ignored, and only float persistent variables are handled.
This extension is mainly intended to improve the running mean and variance of BatchNormalization by increasing the effective number of examples. It does not need to be called frequently; calling it just before storing or evaluating the model is enough.
Parameters:
- model (chainer.link.Link) – Target link object.
- comm (ChainerMN communicator) – communicator to compute averages.
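A minimal sketch of registering the extension (model, comm, and trainer are assumed to exist; the trigger interval is illustrative):
import chainermn

# Average BatchNormalization's running statistics across workers once per
# epoch, so snapshots and evaluation see consistent persistent values.
allreduce_persistent = chainermn.extensions.AllreducePersistent(model, comm)
trainer.extend(allreduce_persistent, trigger=(1, 'epoch'))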
-
chainermn.create_multi_node_checkpointer(name, comm, cp_interval=5, gc_interval=5, path=None)¶
Create a multi-node checkpointer object.
Generational snapshot extension to allow fault tolerance; it keeps several old snapshots so that each MPI process can roll back to a synchronized snapshot. Snapshot files are identified as '<name>.<rank>.<iteration>'.
- <name> … identifier of the run for which the snapshot is kept
- <rank> … which process owned the model
- <iteration> … iteration number
This extension keeps several files for each execution and allows users to resume the whole job at the latest snapshots of each MPI process, at the iteration where all snapshots agree.
As this object is a usual Chainer extension, users can just create this object and pass it to the trainer as an extension:
checkpointer = create_multi_node_checkpointer(name=run_id, comm=comm)
trainer.extend(checkpointer, trigger=(25, 'iteration'))
To run recovery at startup, before the first iteration, run checkpointer.maybe_load(trainer, optimizer) before trainer.run(). If nothing is recovered (i.e. no snapshot is found), trainer.updater.iteration will remain 0. Otherwise it will have the value from the snapshot, and training will resume from that iteration. optimizer is optional, but passing it lets the multi node optimizer avoid the initial broadcast when all snapshot data among nodes are in sync.
After training has finished without errors, all those temporary checkpoints are cleaned up on all nodes.
Another example, using the checkpointer without a trainer, would be:
checkpointer = create_multi_node_checkpointer(name=run_id, comm=comm)
checkpointer.maybe_load(obj_you_want_to_snap, optimizer)

while True:
    ## Training loop
    ...
    updater.update()
    ...
    checkpointer.save(obj_you_want_to_snap)  # Make a checkpoint
Parameters:
- name (str) – unique id of the run
- comm – communicator in ChainerMN
- cp_interval (int) – minimum number of checkpoints to preserve
- gc_interval (int) – interval to collect non-preserved checkpoints