These instructions show how to prepare a node so that Kommander can launch GPU services on it.
You can prepare a node in either of the following ways: by provisioning it with Konvoy 2 on AWS, or by installing the GPU drivers manually.
Before you begin
This procedure requires the following items and configurations:
- Nodes must provide an Nvidia GPU.
- For AWS, select a GPU instance type from the Accelerated Computing section of the AWS instance types.
- Nodes must run CentOS 7.
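Both node requirements can be confirmed from a shell on the node before you continue. The following is a minimal sketch, not part of the Konvoy tooling: the function names is_centos7 and has_nvidia_gpu are hypothetical, and the sketch assumes that /etc/os-release exists and that lspci (from the pciutils package) is installed.

```shell
#!/bin/sh
# Sketch: check the two node prerequisites. Function names are illustrative.

# Succeeds if the given os-release file describes CentOS 7.
# Defaults to /etc/os-release; another path can be passed for testing.
is_centos7() {
    file="${1:-/etc/os-release}"
    grep -q '^ID="\{0,1\}centos"\{0,1\}$' "$file" &&
        grep -q '^VERSION_ID="\{0,1\}7"\{0,1\}$' "$file"
}

# Succeeds if lspci output, read from stdin, lists an Nvidia device.
has_nvidia_gpu() {
    grep -qi 'nvidia'
}

# Usage on a node (assumes pciutils is installed):
#   is_centos7 && echo "CentOS 7: ok"
#   lspci | has_nvidia_gpu && echo "Nvidia GPU: ok"
```

Reading the lspci output from stdin keeps the GPU check usable on saved output as well as on a live node.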
Use Konvoy 2 on AWS
To provision GPU nodes using Konvoy 2 on AWS:
- Use konvoy-image-builder to create an Amazon AMI with the GPU override:
    konvoy-image build images/ami/centos-7.yaml --overrides overrides/nvidia.yaml
- Begin the Konvoy installation, up to and including the konvoy create cluster aws command.
- Edit the ${CLUSTER_NAME}.yaml file:
  - Update the instanceType of the worker nodepool to an instance type that provides Nvidia GPUs, for example p2.xlarge.
  - Add an ami.id that references the image generated by konvoy-image-builder. This simplified example uses AMI ID ami-0d931a15fdf46f14f; substitute the ID from your konvoy-image-builder output.

    [...]
    ---
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSMachineTemplate
    metadata:
      [...]
      name: <CLUSTER_NAME>-md-0
      [...]
    spec:
      template:
        spec:
          [...]
          instanceType: p2.xlarge
          [...]
          ami:
            id: ami-0d931a15fdf46f14f
- Continue the Konvoy installation.
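A quick check that the edited file contains both changes can catch a missed edit before the installation continues. This is a sketch under stated assumptions: check_gpu_nodepool is a hypothetical helper, not a Konvoy command, and it only pattern-matches instanceType and ami id lines with grep. It does not parse YAML and does not confirm that the instance type actually provides a GPU.

```shell
#!/bin/sh
# Sketch: confirm that a cluster YAML file sets an instance type and an AMI ID.
# check_gpu_nodepool is a hypothetical helper, not a Konvoy command.
check_gpu_nodepool() {
    file="$1"
    # Look for an "instanceType: <value>" line.
    grep -Eq '^[[:space:]]*instanceType:[[:space:]]*[^[:space:]]+' "$file" || {
        echo "missing instanceType in $file" >&2
        return 1
    }
    # Look for an "id: ami-<hex>" line, as in the ami.id example.
    grep -Eq '^[[:space:]]*id:[[:space:]]*ami-[0-9a-f]+' "$file" || {
        echo "missing ami.id in $file" >&2
        return 1
    }
    echo "instanceType and ami.id found in $file"
}

# Usage: check_gpu_nodepool "${CLUSTER_NAME}.yaml"
```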
Manual Deployment
For clusters not covered in the previous procedure, run the following commands on each GPU node to configure the drivers:
CentOS 7
sudo yum update -y
sudo yum -y group install "Development Tools"
sudo yum -y install kernel-devel epel-release
sudo yum -y install dkms
sudo sed -i '/^GRUB_CMDLINE_LINUX=/s/"$/ nouveau.blacklist=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"/' /etc/default/grub
sudo dracut --omit-drivers nouveau -f
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
lsmod | grep -i nouveau # after the node is back up, confirm nouveau is not loaded (no output expected)
sudo yum install -y tar bzip2 make automake gcc gcc-c++ pciutils elfutils-libelf-devel libglvnd-devel iptables firewalld vim bind-utils wget
distribution=rhel7
ARCH=$( /bin/arch )
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/${ARCH}/cuda-$distribution.repo
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo yum clean expire-cache
sudo yum install -y nvidia-driver-latest-dkms-3:460.73.01-1.el7.x86_64
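Because the sed command edits /etc/default/grub in place, it can be worth previewing its effect on a copy first. The sketch below applies the same substitution pattern to a throwaway sample file and prints the result. The helper name append_kernel_args and the sample GRUB_CMDLINE_LINUX value are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: preview how the sed expression appends kernel arguments to the
# GRUB_CMDLINE_LINUX line. append_kernel_args is a hypothetical helper.
append_kernel_args() {
    file="$1"
    args="$2"   # must not contain "/" or "&", which are special to sed here
    # On the GRUB_CMDLINE_LINUX line only, replace the closing quote at the
    # end of the line with " <args>" followed by a new closing quote.
    # Without -i, sed prints the result and leaves the file unchanged.
    sed "/^GRUB_CMDLINE_LINUX=/s/\"\$/ ${args}\"/" "$file"
}

# Work on a throwaway sample, not the real /etc/default/grub.
sample=$(mktemp)
printf 'GRUB_TIMEOUT=1\nGRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto"\n' > "$sample"
append_kernel_args "$sample" "rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"
rm -f "$sample"
# Prints:
#   GRUB_TIMEOUT=1
#   GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"
```

Only the line that starts with GRUB_CMDLINE_LINUX= changes; every other line passes through untouched.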
Verification
Verify that the Nvidia driver is working by running:
nvidia-smi
When the drivers are installed successfully, the output looks similar to the following:
Fri Jun 11 09:05:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
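For scripted checks across several nodes, the driver version can be extracted from the nvidia-smi header instead of being read by eye. The following sketch assumes the header format shown in the sample output. The helper name driver_version is hypothetical, and the expected version 460.73.01 is taken from the install command in this procedure.

```shell
#!/bin/sh
# Sketch: pull the "Driver Version" field out of nvidia-smi header output
# read from stdin. driver_version is a hypothetical helper.
driver_version() {
    # Print only the captured version number from the first matching line.
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage on a GPU node:
#   [ "$(nvidia-smi | driver_version)" = "460.73.01" ] && echo "driver ok"
```

As an alternative that avoids parsing the table, nvidia-smi --query-gpu=driver_version --format=csv,noheader prints the driver version once per GPU.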