Overview

In the background, VESSL Clusters runs on GPU-accelerated Docker containers and Kubernetes pods. It abstracts the complex compute backends and system details of Kubernetes-backed GPU infrastructure into an easy-to-use web interface and simple CLI commands. Data scientists and machine learning researchers without a software or DevOps background can use VESSL’s single-line curl command to set up and configure on-premise GPU servers for ML.

VESSL’s cluster integration is composed of four primitives.

  • VESSL API Server: Enables communication between the user and the GPU clusters, through which users can launch containerized ML workloads.
  • VESSL Cluster Agent: Sends information about the clusters and the workloads running on them, such as node specifications and model metrics.
  • Control plane node: Acts as the cluster-wide control tower and orchestrates subsidiary worker nodes.
  • Worker nodes: Run the specified ML workloads based on the runtime spec and environment received from the control plane node.

Integrating more powerful, multi-node GPU clusters for your team is as easy as integrating your personal laptop. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server.

Step-by-step Guide

1. Install dependencies

You can install all the dependencies required for cluster integration using a single-line curl command. The command:

  • Installs Docker if it’s not already installed.
  • Installs and configures NVIDIA container runtime.
  • Installs k0s, a lightweight Kubernetes distribution, and designates and configures a control plane node.
  • Generates a token and a command for connecting worker nodes to the control plane node configured above.

If you want the control plane node to be dedicated to admin and monitoring tasks, without running any ML workloads on it, add the --taint-controller flag at the end of the command.

curl -sSLf https://install.dev.vssl.ai | sudo bash -s -- --role=controller
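As a small sketch of how the optional flag described above fits onto the controller command, the helper below only prints the command string so you can review it before running anything. The function name `controller_install_cmd` and its `yes` argument are hypothetical and exist only for this illustration; the URL and flags are the ones given in this guide.

```shell
#!/bin/sh
# Hypothetical helper (not part of VESSL): prints the controller install
# command from this guide. Pass "yes" to append --taint-controller.
# It only prints the command; nothing is downloaded or installed.
controller_install_cmd() {
  cmd="curl -sSLf https://install.dev.vssl.ai | sudo bash -s -- --role=controller"
  if [ "${1:-no}" = "yes" ]; then
    # Per this guide, the flag goes at the end of the command.
    cmd="$cmd --taint-controller"
  fi
  printf '%s\n' "$cmd"
}
```

Calling `controller_install_cmd yes` prints the dedicated control plane variant, which you can then copy and run yourself.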

After installing all the dependencies, the command prints a follow-up command that includes a token. Run this follow-up command on each server you want to add to the control plane as a worker node. If you don’t want to add a worker node, you can skip to the next step.

curl -sSLf https://install.dev.vssl.ai | sudo bash -s -- --role worker --token '[TOKEN_HERE]'

You can confirm that your control plane and worker node have been successfully configured using a k0s command.

sudo k0s kubectl get nodes
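If you would rather check the result in a script than by eye, the following is a minimal sketch. It assumes the standard `kubectl get nodes` column layout, with the node name in the first column and STATUS in the second; the function name `all_nodes_ready` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin.
# Succeeds only if at least one node is listed and every node's STATUS
# (second column) is exactly "Ready". Nodes with any other status, such as
# "NotReady" or "Ready,SchedulingDisabled", are printed and cause a failure.
all_nodes_ready() {
  awk '
    { total++ }
    $2 != "Ready" { bad++; print "not ready: " $1 " (" $2 ")" }
    END { if (total == 0 || bad > 0) exit 1 }
  '
}
```

A possible invocation is `sudo k0s kubectl get nodes --no-headers | all_nodes_ready && echo "all nodes ready"`.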

If you encounter an error while installing with the script, try a manual installation instead.

2. Install VESSL agent

First, copy the k0s admin kubeconfig to ~/.kube/config in your home directory.

sudo chmod +r /var/lib/k0s/pki/admin.conf
mkdir -p ~/.kube
cp /var/lib/k0s/pki/admin.conf ~/.kube/config
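kubectl expects ~/.kube/config to be a regular file, so it can be worth confirming that before moving on. The check below is a minimal sketch; the function name `check_kubeconfig` is hypothetical and it does not validate the file's contents.

```shell
#!/bin/sh
# Hypothetical helper: verifies that the kubeconfig path (default
# ~/.kube/config) is a readable regular file rather than a directory.
check_kubeconfig() {
  path="${1:-$HOME/.kube/config}"
  if [ -d "$path" ]; then
    echo "error: $path is a directory; expected a file" >&2
    return 1
  fi
  if [ ! -f "$path" ] || [ ! -r "$path" ]; then
    echo "error: $path is missing or not readable" >&2
    return 1
  fi
  echo "ok: $path"
}
```

Run `check_kubeconfig` with no arguments to test the default location, or pass another path as the first argument.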

You are now ready to integrate the Kubernetes cluster with VESSL. Make sure the VESSL client is installed on the server and configured for your organization.

pip install vessl --upgrade
vessl configure

The following single-line command connects your Kubernetes-backed GPU cluster to VESSL. Note the --mode=multi flag, which specifies multi-node cluster integration.

vessl cluster create --name='[CLUSTER_NAME_HERE]' --mode=multi

At this point, the integration is complete.

To confirm the integration, use the VESSL CLI command below or visit the Clusters page.

vessl cluster list
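For scripted checks, one option is to search the command's output for your cluster name. The sketch below makes no assumption about the output format of `vessl cluster list` beyond the name appearing as a whole word somewhere in it; the function name `cluster_listed` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and succeeds if the given cluster
# name appears in it as a whole word. The name is matched as a fixed string,
# not as a regular expression.
cluster_listed() {
  if [ -z "${1:-}" ]; then
    echo "usage: cluster_listed CLUSTER_NAME" >&2
    return 2
  fi
  grep -qwF -- "$1"
}
```

A possible invocation is `vessl cluster list | cluster_listed '[CLUSTER_NAME_HERE]' && echo "cluster found"`.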

Destroy and delete the cluster

To destroy a cluster created by VESSL, stop and then reset k0s:

sudo k0s stop
sudo k0s reset

To complete the deletion, you may need to reboot your machine.

After destroying a cluster, you can delete it from the cluster page.

Troubleshooting