VESSL Clusters comes with a built-in cluster dashboard that visualizes cluster usage and status down to each node and workload. This is enabled by the VESSL Cluster Agent, which sends real-time information about each cluster and the workloads running on it, such as node specifications and model metrics.
The dashboard is automatically set up when you integrate your cloud or on-premises servers using the vessl cluster create command.
Users on the Enterprise plan can use a customized VESSL Cluster Agent to route monitoring information to their own monitoring tools, such as Datadog and Grafana. Contact us at support@vessl.ai for more details.
Multi-cluster monitoring of resource usage and ongoing workloads is available under Clusters. Here, you can get an overview of the integrated clusters.
Clicking a cluster takes you to its Overview tab, which holds more detailed information about the cluster.
The Cluster status overview section presents the basic information about the cluster including the connection and incident status.
The section contains the following information:
Total node: Shows the total number of nodes in the cluster.
Available node: Shows the number of nodes you can use.
Failed node: Shows the number of nodes that are in a failed status.
“Failed node” detailed explanation and actions
A “Failed node” refers to a node where the network communication between the Kubernetes Control Plane and the kubelet is disrupted, leaving its status unknown. Since communication errors can occur due to various reasons, identifying the root cause requires direct inspection of the node.
Steps to take:
The cluster administrator should inspect the node by checking:
The kubelet logs (The debugging feature is included in the Logs page)
The node’s status
The network connectivity
For more information about communication between nodes and the control plane, refer to the official Kubernetes documentation.
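As a starting point for checking node status, the sketch below groups nodes by their Kubernetes Ready condition. It assumes you have saved or piped the output of kubectl get nodes -o json; the sample data and node names are hypothetical. In Kubernetes, a node whose kubelet has stopped reporting to the control plane ends up with a Ready condition of Unknown. Treating that state as the dashboard's "Failed node" is an assumption based on the description above, not a confirmed detail of how VESSL computes it.

```python
import json

# Sample shaped like the output of `kubectl get nodes -o json`, trimmed to the
# fields used here. Node names are hypothetical.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "gpu-node-2"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"conditions": [{"type": "MemoryPressure", "status": "False"},
                               {"type": "Ready", "status": "False"}]}}
  ]
}
""")


def ready_status(node: dict) -> str:
    """Return the status of the node's Ready condition, or "Unknown" if absent."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status", "Unknown")
    return "Unknown"


def summarize(node_list: dict) -> dict:
    """Group node names by Ready status: "True", "False", or "Unknown"."""
    groups = {"True": [], "False": [], "Unknown": []}
    for node in node_list.get("items", []):
        groups.setdefault(ready_status(node), []).append(node["metadata"]["name"])
    return groups


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    # Assumption: Ready=Unknown corresponds to the dashboard's "Failed node".
    print("Unreachable nodes (Ready=Unknown):", summary["Unknown"])
```

To run this against a real cluster, replace SAMPLE with json.load(sys.stdin) and pipe the kubectl output into the script. Nodes listed under Unknown are the ones whose kubelet logs and network connectivity are worth inspecting first.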
Quotas & Usage shows the organization-wide and personal resource quotas for the cluster, including the number of GPU hours and the number of GPUs and CPUs that can be occupied. These quotas are set by the organization admin. For details, refer to the next section of the documentation, which covers VESSL Cluster's cluster administration features.
This section shows you how much CPU, GPU, and memory have been requested (and allocated) and are currently being used.
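The difference between requested and used resources can be illustrated with a small calculation. This is a minimal sketch with hypothetical figures: the ResourceStatus class and the idea of expressing both values as a percentage of total capacity are assumptions for illustration, not the dashboard's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ResourceStatus:
    capacity: float   # total amount available on the cluster (assumed field)
    requested: float  # amount requested by, and allocated to, workloads
    used: float       # amount actually being consumed right now

    def allocation_pct(self) -> float:
        """Share of capacity that workloads have requested."""
        return 100.0 * self.requested / self.capacity if self.capacity else 0.0

    def utilization_pct(self) -> float:
        """Share of capacity that is currently in use."""
        return 100.0 * self.used / self.capacity if self.capacity else 0.0


# Hypothetical figures for illustration only.
gpu = ResourceStatus(capacity=8, requested=6, used=4)
print(f"GPU requested: {gpu.allocation_pct():.0f}%, used: {gpu.utilization_pct():.0f}%")
```

A large gap between the two percentages suggests that workloads are reserving more than they consume, which can be a prompt to revisit their resource requests.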
This section shows all ongoing workloads on the cluster with information on the type, status, occupying node, resource, creator, and the created date.
Under Nodes, you can view all the worker nodes tied to the cluster with their name, status, real-time CPU, memory, disk and GPU usage, ongoing workloads by their type, and overall health status (Healthy).
Clicking a node name shows more in-depth information about that node.
Under Workloads, you can view the workload log related to the cluster with the current status, occupying node, resource consumption, and a visualization of the usage history. If you are an organization admin, clicking the workload name guides you to the detailed workload page under Project or Workspace.