This section describes Nvidia GPU support on Konvoy and assumes familiarity with Kubernetes GPU support. AMD GPUs are not currently supported.
Konvoy GPU Overview
GPU support on Konvoy is provided through the nvidia-container-runtime. With Nvidia GPU support enabled, Konvoy configures the container runtime to run GPU containers and installs everything needed to power up the Nvidia GPU devices. Konvoy runs every Nvidia GPU-dependent component as a DaemonSet, making the components easier to manage and upgrade.
The following components provide Nvidia GPU support on Konvoy:
- libnvidia-container and nvidia-container-runtime: Konvoy uses containerd as the Kubernetes container runtime by default. libnvidia-container and nvidia-container-runtime act as a shim between containerd and runc, which simplifies container runtime integration with GPU support and avoids the need for nvidia-docker2.
- Nvidia Device Plugin: Konvoy exposes Nvidia GPUs to Kubernetes through this device plugin. It enables running GPU-enabled containers on Kubernetes and tracks the number of available GPUs on each node, as well as their health.
- NVIDIA Data Center GPU Manager: Contains a Prometheus exporter that provides Nvidia GPU metrics.
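Because each of these components runs as a DaemonSet, you can verify them directly with kubectl once the addon is deployed. The following is a minimal sketch; <gpu-node-name> is a placeholder for one of your GPU nodes, and the exact DaemonSet names depend on your addon configuration:

# List the Nvidia DaemonSets deployed by the addon
kubectl get daemonset -A | grep nvidia
# Confirm that a GPU node advertises the nvidia.com/gpu resource
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu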
Requirements
- CentOS 7 or Ubuntu 16.04/18.04 with the IPMI driver enabled and the Nouveau driver disabled.
- NVIDIA GPU with Fermi architecture version 2.1 or greater.
- Nvidia driver running on each GPU host node. Nvidia driver version 450.51.06-1 is recommended. Follow the official Nvidia Driver Installation Guide to set up the driver on the host. For example:
- CentOS 7
# Update the system and install build prerequisites
sudo yum update -y
sudo yum -y group install "Development Tools"
sudo yum -y install kernel-devel epel-release dkms
# Blacklist the Nouveau driver on the kernel command line
sudo sed -i '/^GRUB_CMDLINE_LINUX=/s/"$/ nouveau.blacklist=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"/' /etc/default/grub
# Rebuild the initramfs without Nouveau and regenerate the GRUB configuration
sudo dracut --omit-drivers nouveau -f
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
# Ensure the Nouveau driver is no longer loaded
lsmod | grep -i nouveau
# Install remaining dependencies
sudo yum install -y tar bzip2 make automake gcc gcc-c++ pciutils elfutils-libelf-devel libglvnd-devel iptables firewalld vim bind-utils wget
# Add the CUDA repository and install the recommended driver
distribution=rhel7
ARCH=$( /bin/arch )
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/${ARCH}/cuda-$distribution.repo
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo yum clean expire-cache
sudo yum install -y nvidia-driver-latest-dkms-3:450.51.06-1.el7.x86_64
- Ubuntu 16.04/18.04 LTS
# Install the kernel headers required to build the driver
sudo apt-get install linux-headers-$(uname -r)
# Derive the distribution string, for example ubuntu1804
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
# Pin the CUDA repository so its packages take precedence
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
# Install the CUDA repository public GPG key. On Ubuntu 16.04, replace https with http in the command below.
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub
# Add the CUDA repository and install the driver
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
sudo apt-get update
sudo apt-get -y install cuda-drivers-460
- Verify the Nvidia driver is running on the host:
[centos@ip-10-0-130-28 ~]$ nvidia-smi
Fri Jun 11 09:05:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Configuration
Enable Nvidia GPU on Konvoy
To enable Nvidia GPU support on Konvoy, add the Nvidia GPU nodePools in ClusterProvisioner and ClusterConfiguration, then enable the nvidia addon. Here is an example cluster.yaml:
kind: ClusterProvisioner
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  provider: aws
  nodePools:
  - name: gpu-worker
    count: 4
    machine:
      rootVolumeSize: 80
      rootVolumeType: gp2
      imagefsVolumeEnabled: true
      imagefsVolumeSize: 160
      imagefsVolumeType: gp2
      imagefsVolumeDevice: xvdb
      type: p2.xlarge
---
kind: ClusterConfiguration
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  nodePools:
  - name: gpu-worker
    gpu:
      nvidia: {}
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.20-4.3.0
    addonsList:
    - name: nvidia
      enabled: true
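After updating cluster.yaml, apply the configuration with the Konvoy CLI. A brief sketch, assuming the standard workflow:

# Provision the GPU node pool and deploy the enabled addons
konvoy up
# Or, if the node pool already exists, deploy only the addons
konvoy deploy addons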
How to prevent other workloads from running on GPU nodes
Use Kubernetes taints to ensure that only dedicated workloads are deployed on GPU machines. You must add tolerations to your GPU workloads to deploy them on the dedicated GPU nodes (see the example workload after the configuration below). See Kubernetes Taints and Tolerations for details.
Setting custom tolerations for an addon replaces the tolerations set in the config repository for that addon. To add tolerations to an addon while keeping those set in the config repository, copy the config repository's tolerations into your cluster.yaml as well. Here is an example cluster.yaml using custom taints and tolerations:
kind: ClusterProvisioner
apiVersion: konvoy.mesosphere.io/v1beta2
......
spec:
  ......
  nodePools:
  - name: gpu-worker
    count: 4
    machine:
      rootVolumeSize: 80
      rootVolumeType: gp2
      imagefsVolumeEnabled: true
      imagefsVolumeSize: 160
      imagefsVolumeType: gp2
      imagefsVolumeDevice: xvdb
      type: p2.xlarge
  ......
---
kind: ClusterConfiguration
......
spec:
  ......
  nodePools:
  - name: gpu-worker
    gpu:
      nvidia: {}
    labels:
    - key: dedicated
      value: gpu-worker
    taints:
    - key: dedicated
      value: gpu-worker
      effect: NoExecute
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.20-4.3.0
    addonsList:
    ......
    - name: nvidia
      enabled: true
      values: |
        nvidia-device-plugin:
          tolerations:
            - key: "dedicated"
              operator: "Equal"
              value: "gpu-worker"
              effect: "NoExecute"
        nvidia-dcgm-exporter:
          tolerations:
            - key: "dedicated"
              operator: "Equal"
              value: "gpu-worker"
              effect: "NoExecute"
......
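With the taint in place, a GPU workload needs both a matching toleration and a GPU resource request to be scheduled on the dedicated nodes. Here is a minimal sketch of such a workload; the pod name and image are illustrative, not part of the Konvoy configuration:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  tolerations:
  - key: "dedicated"               # matches the taint defined above
    operator: "Equal"
    value: "gpu-worker"
    effect: "NoExecute"
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base   # illustrative CUDA base image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # request one GPU from the device plugin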
Nvidia GPU Monitoring
Konvoy uses the NVIDIA Data Center GPU Manager to export GPU metrics to Prometheus. By default, Konvoy provides a Grafana dashboard called NVIDIA DCGM Exporter Dashboard to monitor GPU metrics. This GPU dashboard is shown in Konvoy's Grafana UI.
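The exported metrics can also be queried directly in Prometheus. A brief sketch using two metric names published by the DCGM exporter (verify the exact names against your exporter version):

# Average GPU utilization per node, in percent
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)
# GPU framebuffer memory in use, in MiB
DCGM_FI_DEV_FB_USED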
Upgrade
Konvoy can automatically upgrade the Nvidia GPU addon. However, GPU workloads must be drained before the Nvidia addon is upgraded. To upgrade:
- Delete all GPU workloads on the GPU nodes where the Nvidia addon needs to be upgraded.
- Delete the existing Nvidia addon:
  kubectl delete clusteraddon nvidia
- Wait for all Nvidia-related resources in the Terminating state to be cleaned up. You can check the pod status with:
  kubectl get pod -A | grep nvidia
- Specify the desired configVersion in your cluster.yaml, then deploy the addons to upgrade the Nvidia GPU addon:
  konvoy deploy addons
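After the deployment completes, confirm that the addon was recreated and its pods are healthy, using the same resources referenced in the steps above:

kubectl get clusteraddon nvidia
kubectl get pod -A | grep nvidia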
Debugging
- Determine whether all Nvidia pods are in the Running state, as expected:
  kubectl get pod -A | grep nvidia
- If any Nvidia pods are crashing, returning errors, or flapping, collect the logs of the problematic pod. For example:
  kubectl logs -n kube-system nvidia-kubeaddons-nvidia-device-plugin-dxtch
- To recover from this problem, you must restart all Nvidia-addon pods running on the SAME host. In the example below, all Nvidia resources are restarted on the node ip-10-0-129-201.us-west-2.compute.internal:
$ kubectl get pod -A -o wide | grep nvidia
kube-system   nvidia-kubeaddons-nvidia-device-plugin-dxtch   1/1   Running   0   4m20s   192.168.57.153    ip-10-0-129-191.us-west-2.compute.internal   <none>   <none>
kube-system   nvidia-kubeaddons-nvidia-device-plugin-j4dm2   1/1   Running   0   4m20s   192.168.39.88     ip-10-0-128-134.us-west-2.compute.internal   <none>   <none>
kube-system   nvidia-kubeaddons-nvidia-device-plugin-qb29b   1/1   Running   0   4m20s   192.168.119.35    ip-10-0-128-208.us-west-2.compute.internal   <none>   <none>
kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-8ngx9   2/2   Running   0   4m20s   192.168.57.154    ip-10-0-129-191.us-west-2.compute.internal   <none>   <none>
kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6   2/2   Running   0   4m20s   192.168.243.100   ip-10-0-129-201.us-west-2.compute.internal   <none>   <none>
kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-ttjqs   2/2   Running   0   4m20s   192.168.39.89     ip-10-0-128-134.us-west-2.compute.internal   <none>   <none>
kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-xqj6r   2/2   Running   0   4m20s   192.168.119.36    ip-10-0-128-208.us-west-2.compute.internal   <none>   <none>

$ kubectl delete pod -n kubeaddons nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6
pod "nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6" deleted

$ kubectl delete pod -n kube-system nvidia-kubeaddons-nvidia-device-plugin-tsbk2
pod "nvidia-kubeaddons-nvidia-device-plugin-tsbk2" deleted
- To collect more debug information on the Nvidia addon, run:
  helm get nvidia-kubeaddons
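Pod events often reveal why an Nvidia pod is crashing or flapping before the logs do. A brief sketch, reusing the example pod name from above:

# Inspect events and container state for the problematic pod
kubectl describe pod -n kube-system nvidia-kubeaddons-nvidia-device-plugin-dxtch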
For a full list of attributed third-party software, see D2IQ Legal.