In the background, VESSL Clusters leverages GPU-accelerated Docker containers and Kubernetes pods. It abstracts the complex compute backends and system details of Kubernetes-backed GPU infrastructure into an easy-to-use web interface and simple CLI commands. Data scientists and machine learning researchers without software or DevOps backgrounds can use VESSL’s single-line curl command to set up and configure on-premise GPU servers for ML.
VESSL’s cluster integration is composed of four primitives.
VESSL API Server — Enables communication between the user and the GPU clusters, through which users can launch containerized ML workloads.
VESSL Cluster Agent — Sends information about the cluster and the workloads running on it, such as node specifications and model metrics.
Control plane node — Manages the Kubernetes cluster and schedules workloads onto the worker nodes.
Worker nodes — Run specified ML workloads based on the runtime spec and environment received from the control plane node.
Integrating more powerful, multi-node GPU clusters for your team is as easy as integrating your personal laptop. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server.
OS: Ubuntu 18.04+ / CentOS 7.9+
Each machine should be connected to the internet
Nodes must be reachable on the following ports/protocols (refer to the k0s documentation)
| Protocol | Port | Service | Direction | Notes |
|----------|------|---------|-----------|-------|
| TCP | 2380 | etcd peers | controller <-> controller | |
| TCP | 6443 | kube-apiserver | Worker, CLI => controller | Authenticated Kube API using Kube TLS client certs, ServiceAccount tokens with RBAC |
| TCP | 179 | kube-router | worker <-> worker | BGP routing sessions between peers |
| UDP | 4789 | Calico | worker <-> worker | Calico VXLAN overlay |
| TCP | 10250 | kubelet | Master, Worker => Host * | Authenticated kubelet API for the master node kube-apiserver |
| TCP | 9443 | k0s-api | controller <-> controller | k0s controller join API, TLS with token auth |
| TCP | 8132 | konnectivity | worker <-> controller | Konnectivity is used as “reverse” tunnel between kube-apiserver and worker kubelets |
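Before installing, you can verify that these ports are reachable between nodes. For example, a quick TCP check with netcat from a worker node, where <controller-ip> is a placeholder for your control plane node’s address:

# Check that the controller's kube-apiserver (6443) and k0s join API (9443) are reachable
nc -vz <controller-ip> 6443
nc -vz <controller-ip> 9443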
For NVIDIA GPUs, you must install CUDA and the NVIDIA driver. We strongly recommend using CUDA >= 11.1 and NVIDIA driver >= 450.80.02.
Here is a command you can use to check your CUDA and NVIDIA driver versions:
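# nvidia-smi prints the installed NVIDIA driver version and the maximum CUDA version it supports
nvidia-smi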
Installs k0s, a lightweight Kubernetes distribution, and designates and configures a control plane node.
Generates a token and a command for connecting worker nodes to the control plane node configured above.
If you wish to use your control plane node solely for admin and monitoring purposes, meaning no ML workloads will run on it, add a --taint-controller flag at the end of the command.
After installing all the dependencies, the command returns a follow-up command with a token. You can use this to add worker nodes to the control plane. If you don’t want to add additional worker nodes, you can skip to the next step.
To test that the setup works, run the following container:
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Install k0s
k0s is an open-source, all-inclusive Kubernetes distribution, configured with all of the features needed to build a Kubernetes cluster.
VESSL recommends using k0s to install Kubernetes on on-premise machines.
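For reference, here is a minimal sketch of a manual single-node k0s setup, following the upstream k0s quick-start; the VESSL curl command above automates these steps for you:

# Download the k0s binary
curl -sSLf https://get.k0s.sh | sudo sh
# Install and start a single-node controller (workers can be joined later with tokens)
sudo k0s install controller --single
sudo k0s start
# Verify the node is up
sudo k0s status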
You are now ready to integrate the Kubernetes cluster with VESSL. Make sure you have VESSL Client installed on the server and configured for your organization.
pip install vessl --upgrade
vessl configure
The following single-line command connects your Kubernetes-backed GPU cluster to VESSL. Note the --mode multi flag, which specifies multi-node cluster integration.
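The exact subcommand may vary across VESSL CLI versions, so treat this as a hypothetical sketch rather than the definitive invocation; only the --mode multi flag comes from this guide, and the cluster name is a placeholder:

vessl cluster create --name my-gpu-cluster --mode multi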
Sometimes workloads that do not request GPUs can still gain access to all of the machine’s GPUs.
This is a known bug in NVIDIA Docker and Kubernetes; NVIDIA provides a how-to guide for it here.
TL;DR: you can use the following commands to fix it.
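The commands below are a sketch based on NVIDIA’s published mitigation, which makes the container runtime expose GPUs through volume mounts instead of the NVIDIA_VISIBLE_DEVICES environment variable; verify the keys against /etc/nvidia-container-runtime/config.toml on your machine:

# Stop honoring NVIDIA_VISIBLE_DEVICES from unprivileged containers
sudo sed -i 's/^#\?accept-nvidia-visible-devices-envvar-when-unprivileged = .*/accept-nvidia-visible-devices-envvar-when-unprivileged = false/' /etc/nvidia-container-runtime/config.toml
# Expose GPUs to containers via volume mounts instead
sudo sed -i 's/^#\?accept-nvidia-visible-devices-as-volume-mounts = .*/accept-nvidia-visible-devices-as-volume-mounts = true/' /etc/nvidia-container-runtime/config.toml

For the volume-mount strategy to take effect, NVIDIA’s guide also has you run the k8s-device-plugin with its device-list-strategy set to volume-mounts.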