These instructions show how to prepare a node so that Kommander can launch GPU services on it.
You can prepare a node in either of the following ways: with Konvoy 2 on AWS, or by manual deployment.
Before you begin
This procedure requires the following items and configurations:
- Nodes must provide an Nvidia GPU.
- For AWS, select a GPU instance type from the Accelerated Computing section of the AWS instance types.
- Nodes must run CentOS 7.
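The prerequisites above can be checked on a candidate node with a short script. The following is a minimal sketch, not part of the Kommander tooling: the function names `is_centos7` and `has_nvidia_gpu` are hypothetical, and the GPU check assumes `lspci` (from the pciutils package) is already available on the node.

```shell
# Hypothetical helpers for checking the prerequisites; not part of Kommander.

# Succeeds when the os-release file (default /etc/os-release) describes CentOS 7.
is_centos7() {
  os_release_file="${1:-/etc/os-release}"
  grep -q '^ID="\{0,1\}centos"\{0,1\}$' "$os_release_file" &&
    grep -q '^VERSION_ID="\{0,1\}7' "$os_release_file"
}

# Succeeds when lspci-style text on stdin mentions an Nvidia device.
has_nvidia_gpu() {
  grep -qi 'nvidia'
}
```

On a node, the two checks could be combined as `is_centos7 && lspci | has_nvidia_gpu && echo "node meets the prerequisites"`.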
Use Konvoy 2 on AWS
To provision GPU nodes using Konvoy 2 on AWS:
- Use konvoy-image-builder to create an Amazon AMI with the GPU override:
    konvoy-image build images/ami/centos-7.yaml --overrides overrides/nvidia.yaml
- Begin the Konvoy Installation up to and including the dkp create cluster aws command.
- Edit the ${CLUSTER_NAME}.yaml file:
  - Update the instanceType of the worker node pool to an instance type that provides Nvidia GPUs, for example p2.xlarge.
  - Add an ami.id to reference the image generated by konvoy-image-builder. This simplified example uses AMI ID ami-0d931a15fdf46f14f; substitute the one from the konvoy-image-builder output.

    [...]
    ---
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: AWSMachineTemplate
    metadata:
      [...]
      name: <CLUSTER_NAME>-md-0
      [...]
    spec:
      template:
        spec:
          [...]
          instanceType: p2.xlarge
          [...]
          ami:
            id: ami-0d931a15fdf46f14f
- Continue the Konvoy Installation.
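Before continuing the installation, it can help to confirm that the edited manifest really carries the GPU instance type and the AMI ID. The sketch below is a plain text check, not a YAML parser, and the function name `check_gpu_machine_template` is hypothetical. It assumes the instanceType and id keys each sit on their own line, as in the example above.

```shell
# Hypothetical helper; performs a line-based text check, not YAML parsing.
# Usage: check_gpu_machine_template <manifest> <instance-type> <ami-id>
check_gpu_machine_template() {
  manifest="$1"; instance_type="$2"; ami_id="$3"
  awk -v t="$instance_type" -v a="$ami_id" '
    $1 == "instanceType:" && $2 == t { type_ok = 1 }   # worker pool instance type
    $1 == "id:" && $2 == a           { ami_ok = 1 }    # AMI built by konvoy-image-builder
    END { exit !(type_ok && ami_ok) }
  ' "$manifest"
}
```

For example, `check_gpu_machine_template "${CLUSTER_NAME}.yaml" p2.xlarge ami-0d931a15fdf46f14f && echo "manifest looks ready"`, with the AMI ID replaced by the one from your own konvoy-image-builder output.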
Manual Deployment
For clusters not covered in the previous procedure, run the following commands on each GPU node to configure the drivers:
CentOS 7
# Update the system and install the build prerequisites
sudo yum update -y
sudo yum -y group install "Development Tools"
sudo yum -y install kernel-devel epel-release
sudo yum -y install dkms

# Append kernel arguments that blacklist the nouveau driver, rebuild the initramfs and GRUB config, then reboot
sudo sed -i '/^GRUB_CMDLINE_LINUX=/s/"$/ module_name.blacklist=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"/' /etc/default/grub
sudo dracut --omit-drivers nouveau -f
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# After the node restarts, confirm that nouveau is not loaded (expect no output)
lsmod | grep -i nouveau

# Install supporting packages
sudo yum install -y tar bzip2 make automake gcc gcc-c++ pciutils elfutils-libelf-devel libglvnd-devel iptables firewalld vim bind-utils wget

# Add the Nvidia CUDA repository
distribution=rhel7
ARCH=$( /bin/arch )
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/${ARCH}/cuda-$distribution.repo

# Install kernel headers that match the running kernel, then the Nvidia driver
sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
sudo yum clean expire-cache
sudo yum install -y nvidia-driver-latest-dkms-3:460.73.01-1.el7.x86_64
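Because the sed command in the procedure edits /etc/default/grub in place, it can be useful to preview the result first and to script the nouveau check. The sketch below reuses the same sed expression without the -i flag, so the file is left untouched. The function names `preview_grub_edit` and `nouveau_loaded` are hypothetical.

```shell
# Hypothetical helpers; they only read their inputs and change nothing on disk.

# Prints the GRUB_CMDLINE_LINUX line as it would look after the edit.
preview_grub_edit() {
  grub_file="${1:-/etc/default/grub}"
  sed '/^GRUB_CMDLINE_LINUX=/s/"$/ module_name.blacklist=1 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau"/' "$grub_file" |
    grep '^GRUB_CMDLINE_LINUX='
}

# Succeeds when lsmod-style text on stdin mentions the nouveau module.
nouveau_loaded() {
  grep -qi 'nouveau'
}
```

Run `preview_grub_edit` before the in-place edit to inspect the new kernel arguments. Note that the sed expression appends its arguments every time it runs, so applying it twice duplicates them. After the reboot, `lsmod | nouveau_loaded && echo "nouveau is still loaded"` flags a node where the blacklist did not take effect.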
Verification
Verify that the Nvidia driver is working by running:
nvidia-smi
When the drivers are installed successfully, the output looks similar to the following:
Fri Jun 11 09:05:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
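To script this verification, the driver version can be pulled out of the banner and compared with the version installed earlier. This is a minimal sketch that assumes the banner layout shown above. The function name `driver_version_from_banner` is hypothetical.

```shell
# Hypothetical helper; assumes the "Driver Version: X.Y.Z" banner format shown above.

# Prints the driver version found in nvidia-smi banner text on stdin.
driver_version_from_banner() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

For example, `[ "$(nvidia-smi | driver_version_from_banner)" = "460.73.01" ] && echo "driver version matches"` confirms that the running driver is the one installed by the procedure.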
 Kommander Documentation