Batch runs execute a series of commands defined in your YAML configuration and then terminate. They are suitable for large-scale, long-running tasks such as model training, where running on GPU resources can significantly shorten training times.
A Simple Batch Run
Here is an example of a simple batch run YAML configuration. It specifies the Docker image to use, the resources required for the run, and the commands to be executed during the run.
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.10-py3-202306140422
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - command: |
      nvidia-smi
In this example, resources.preset=v1.v100-1.mem-52 requests a V100 GPU instance. Next, the nvidia-smi command is executed to display the NVIDIA System Management Interface output, after which the run terminates.
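Because a batch run executes a series of commands, the command block can hold more than one line. The following is a minimal sketch rather than an official example: it reuses only the fields shown above and adds a second, illustrative command that checks whether PyTorch can see the GPU. It assumes PyTorch is installed in this image, which the image name suggests but the text above does not confirm.

```yaml
# Sketch only: same fields as the simple example; the second command is an illustrative addition.
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.10-py3-202306140422
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - command: |
      # Show the GPU allocated to the run.
      nvidia-smi
      # Assumption: PyTorch is available in this image.
      python -c "import torch; print(torch.cuda.is_available())"
```

The lines starting with # inside the command block are part of the block's text, so they are passed to the shell along with the commands and treated there as shell comments.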
You can also enable termination protection in a batch run. Termination protection keeps your run active for a specified duration even after your commands have finished executing. This can be useful for debugging or retrieving intermediate files.
name: gpu-batch-run
description: Run a GPU-backed batch run.
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.10-py3-202306140422
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - command: |
      nvidia-smi
termination_protect: true
In this example, termination_protect: true prevents the container from being terminated after nvidia-smi finishes running.
Train a Thin-Plate Spline Motion Model with GPU resource
Now let’s look at a more complex batch run configuration. This configuration file describes a batch run that trains a Thin-Plate Spline Motion Model on a V100 GPU.
name: Thin-Plate-Spline-Motion-Model
description: "Animate your own image in the desired way with a batch run on VESSL."
image: nvcr.io/nvidia/pytorch:21.05-py3
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - workdir: /root/examples/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt
      python run.py --config config/vox-256.yaml --device_ids 0
import:
  /root/examples: git://github.com/vessl-ai/examples
  /root/examples/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/
In this batch run, the Docker image nvcr.io/nvidia/pytorch:21.05-py3 is used, and a V100 GPU (resources.preset=v1.v100-1.mem-52) is allocated for the run. This ensures that the training job runs on a V100 GPU.
The model and scripts used in this run are fetched from the GitHub repository listed under import.
The commands executed in the run first install the requirements and then train the model using the run.py script.
This example demonstrates how you can set up a batch run for GPU-backed training of a machine learning model with a single YAML configuration.
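Termination protection can be useful for a training run like this one, for example to inspect intermediate files once training finishes. The sketch below simply combines the two configurations shown on this page. It assumes that termination_protect is accepted at the top level of a configuration that also uses workdir and import; treat it as an illustration, not a verified configuration.

```yaml
# Sketch only: the training example plus the termination_protect key from the earlier example.
name: Thin-Plate-Spline-Motion-Model
description: "Animate your own image in the desired way with a batch run on VESSL."
image: nvcr.io/nvidia/pytorch:21.05-py3
resources:
  cluster: aws-apne2
  preset: v1.v100-1.mem-52
run:
  - workdir: /root/examples/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt
      python run.py --config config/vox-256.yaml --device_ids 0
import:
  /root/examples: git://github.com/vessl-ai/examples
  /root/examples/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/
# Assumption: placed at the top level, as in the simple example.
termination_protect: true
```

Keeping the run active after training means the GPU instance stays allocated, so this setting is best reserved for runs you expect to inspect.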
For more advanced configurations and examples, please visit VESSL Hub.
A variety of YAML examples that you can use as references