Batch Run

Batch runs execute a series of commands defined in your YAML configuration and then terminate. They are suitable for large-scale, long-running tasks such as model training, where GPU resources can significantly shorten training times.

A Simple Batch Run

Here is an example of a simple batch run YAML configuration. It specifies the Docker image to use, the resources required for the run, and the commands to be executed during the run.

Simple batch run definition
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - command: |
      nvidia-smi

In this example, resources.preset=gpu-l4-small requests an L4 GPU instance on the vessl-gcp-oregon cluster. The nvidia-smi command is then executed to print the output of the NVIDIA System Management Interface, after which the run terminates.
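Because command is a multi-line block, one step can hold several shell commands. The sketch below is a variation on the example above, not a configuration from the original. It assumes the lines are executed in order as a shell script and that PyTorch is importable in the torch image. The run name and the python line are illustrative additions.

```yaml
name: gpu-batch-check  # hypothetical run name
description: Check GPU visibility from the driver and from PyTorch.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - command: |
      # Driver-level view of the GPU
      nvidia-smi
      # Framework-level check (assumes PyTorch is installed in this image)
      python -c "import torch; print(torch.cuda.is_available())"
```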

Termination Protection

You can also define termination protection in a batch run. Termination protection keeps your run active for a specified duration even after your commands have finished executing. This can be useful for debugging or retrieving intermediate files.

Enable termination protection
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/torch:2.3.1-cuda12.1-r5
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - command: |
      nvidia-smi
termination_protect: true

In this example, termination_protect: true keeps the container from being terminated after the nvidia-smi command finishes.

Train a Thin-Plate Spline Motion Model with GPU resource

Now let’s look at a more complex batch run configuration. This configuration file describes a batch run that trains a Thin-Plate Spline Motion Model on a GPU.

Batch run YAML for training Thin-Plate Spline Motion Model
name: Thin-Plate-Spline-Motion-Model
description: "Animate your own image in the desired way with a batch run on VESSL."
image: nvcr.io/nvidia/pytorch:21.05-py3
resources:
  cluster: vessl-gcp-oregon
  preset: gpu-l4-small
run:
  - workdir: /root/examples/deprecated/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt
      python run.py --config config/vox-256.yaml --device_ids 0
import:
  /root/examples: git://github.com/vessl-ai/examples
  /root/examples/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/

In this batch run, the Docker image nvcr.io/nvidia/pytorch:21.05-py3 is used, and a GPU instance (resources.preset=gpu-l4-small) is allocated for the run. This ensures that the training job runs on a GPU.

The model and scripts used in this run are fetched from a GitHub repository (/root/examples: git://github.com/vessl-ai/examples). The vox dataset is imported from an S3 bucket into /root/examples/vox.

The commands executed in the run first install the requirements and then train the model using the run.py script.

This example demonstrates how you can set up a batch run for GPU-backed training of a machine learning model with a single YAML configuration.
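Since run is a list, the install and training commands could also be written as two separate steps. The fragment below is a sketch of the run section only, with the other fields unchanged. It assumes steps are executed in order and that each step accepts its own workdir, as the single step above does. This behavior is an assumption and has not been verified against the platform.

```yaml
run:
  # Step 1: install the Python dependencies
  - workdir: /root/examples/deprecated/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt
  # Step 2: start training (assumed to run after step 1 completes)
  - workdir: /root/examples/deprecated/thin-plate-spline-motion-model
    command: |
      python run.py --config config/vox-256.yaml --device_ids 0
```

Splitting the steps keeps the environment setup visibly separate from the training command. The single-step form in the original example remains the confirmed one.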

What’s Next

For more advanced configurations and examples, please visit VESSL Hub.

VESSL Hub

A variety of YAML examples that you can use as references