Experiment is no longer actively maintained. For improved functionality, please use Run instead.
Currently, only the PyTorch framework is supported for distributed experiments.
What is a distributed experiment?
A distributed experiment is a single machine learning run on top of multiple nodes or multiple GPUs. The results of a distributed experiment consist of logs, metrics, and artifacts for each worker, which you can find under the corresponding tabs.

Multi-node training is not always the optimal solution. We recommend trying several experiments with a few epochs to see if multi-node training is the correct choice for you.
Environment variables
VESSL automatically sets the environment variables below based on the configuration.

NUM_NODES: Number of workers
NUM_TRAINERS: Number of GPUs per node
RANK: The global rank of the node
MASTER_ADDR: The address of the master node service
MASTER_PORT: The port number on the master address
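As an illustration, here is a minimal Python sketch that reads these variables inside a worker; the mapping comments anticipate the torch.distributed.launch options used in Step 2 below.

```python
import os

# Values injected by VESSL into every worker container.
num_nodes = int(os.environ["NUM_NODES"])         # maps to --nnodes
gpus_per_node = int(os.environ["NUM_TRAINERS"])  # maps to --nproc_per_node
node_rank = int(os.environ["RANK"])              # maps to --node_rank
master_addr = os.environ["MASTER_ADDR"]          # maps to --master_addr
master_port = os.environ["MASTER_PORT"]          # maps to --master_port

# Total number of training processes across all nodes.
world_size = num_nodes * gpus_per_node
print(f"node {node_rank}/{num_nodes}, world size {world_size}, "
      f"master {master_addr}:{master_port}")
```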
Creating a distributed experiment
Using Web Console
Running a distributed experiment on the web console is similar to running a single-node experiment. To create a distributed experiment, you only need to specify the number of workers. Other options are the same as those of a single-node experiment.
Using CLI
To run a distributed experiment using the CLI, the number of nodes must be set to an integer greater than one, as in the sketch below.
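This is a sketch only: the --num-nodes option name is an assumption rather than a documented flag, so check `vessl experiment create --help` for the exact spelling.

```sh
# Sketch: the flag name below is an assumption, not a confirmed CLI option.
# Any value greater than 1 makes the experiment distributed.
vessl experiment create --num-nodes 2
```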
Examples: Distributed CIFAR
You can find the full example code here.
Step 1: Prepare the CIFAR-10 dataset
Download the CIFAR dataset with the script below and add it to your organization as a vessl-type dataset.
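A minimal sketch of such a script, using the official CIFAR-10 download URL; registering the extracted files as a dataset can then be done in the web console or with the vessl CLI.

```sh
# Download and extract the official CIFAR-10 (Python version) archive.
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvzf cifar-10-python.tar.gz

# The archive unpacks into cifar-10-batches-py/; add that directory to your
# organization as a vessl-type dataset.
```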
Step 2: Create a distributed experiment
To run a distributed experiment, we recommend using the torch.distributed.launch package. An example start command that runs on two nodes with one GPU per node is shown below.
The environment variables above supply the values for the launcher options --node_rank, --master_addr, --master_port, --nproc_per_node, and --nnodes.
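Here is a sketch of that start command; main.py stands in for the training entrypoint, whose actual name you can find in the linked example code.

```sh
# Two nodes, one GPU per node. VESSL injects the environment variables used here.
python -m torch.distributed.launch \
  --nnodes=$NUM_NODES \
  --nproc_per_node=$NUM_TRAINERS \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  main.py
```

torch.distributed.launch spawns one process per GPU on each node and passes a --local_rank argument to each copy of the script.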