Ensure that you read the Operations Guide before this document, as it contains several important pieces of context.
Understanding & Troubleshooting Storage Drivers
This section provides insight into how storage on Konvoy works at a lower level, through storage drivers. The intent is to help the reader understand the components involved and be better prepared to troubleshoot issues in production systems.
Some storage drivers differ significantly from others. For the purposes of this documentation, we cover the AWS, GCP, and Azure storage drivers that ship by default in a Konvoy deployment.
If you’re using another driver, such as Portworx, make sure to review the third-party upstream documentation for your solution. In the case of Portworx (a D2iQ partner), their upstream troubleshooting documentation can be found here.
Driver Structure
In Konvoy, storage solutions are often described in terms of a “driver”. Conceptually a driver may contain the following components, or something similar, depending on the implementor:
- storageclasses
- controller
- node plugin pods
Below, we cover these components in more detail using the AWS driver as an example.
Storage Class
Driver installations include their own StorageClass (SC) resources that identify which PersistentVolumeClaims (PVCs) they operate on and are responsible for.
For instance, the aws-ebs-csi-driver provides the following:
kubectl get storageclass
NAME                             PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
awsebscsiprovisioner (default)   ebs.csi.aws.com   Delete          WaitForFirstConsumer   true                   17s
When investigating storage issues you can verify which SC is actually in use for a pod by looking at the corresponding volume definitions (using an example pod):
kubectl get pod nginx-stateful-5bdc6968df-gxhgq -o=go-template='{{range .spec.volumes}}{{.persistentVolumeClaim.claimName}}{{"\n"}}{{end}}' | grep -v 'no value'
nginx-data
For each entry this output produces (you can have an arbitrary number of volumes associated with a pod, and they can each use a different SC) you can view the SC in use with:
kubectl get pvc nginx-data -o=go-template='{{.spec.storageClassName}}{{"\n"}}'
awsebscsiprovisioner
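If you need to do this often, the two lookups can be chained. The following is a minimal sketch, assuming a Bash shell; substitute your own pod name. The if-guard in the template replaces the grep -v 'no value' filtering shown above:

POD=nginx-stateful-5bdc6968df-gxhgq
for claim in $(kubectl get pod "$POD" -o=go-template='{{range .spec.volumes}}{{if .persistentVolumeClaim}}{{.persistentVolumeClaim.claimName}}{{"\n"}}{{end}}{{end}}'); do
  # Print each claim alongside the storage class that backs it
  echo "$claim -> $(kubectl get pvc "$claim" -o=go-template='{{.spec.storageClassName}}')"
done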
Controllers
Kubernetes Controllers are programs (generally running in a pod on the cluster) that watch API resources and react to changes in state. They are responsible for driving the current state of the cluster to the desired state.
The storage driver controller watches for new PVC resources deployed to the cluster that are configured to use its storage class. The driver responds by automatically creating and binding the PV needed to satisfy that PVC.
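For illustration, here is a minimal PVC that targets the default AWS storage class shown earlier; applying it is what the controller’s watch reacts to. The claim name nginx-data matches the examples in this document, but the access mode and size are placeholder values:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: awsebscsiprovisioner
  resources:
    requests:
      storage: 1Gi
EOF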
It is helpful to first find the controller pods themselves, so you can see their state at a glance. Using the AWS driver as an example:
kubectl -n kube-system get pods | egrep 'ebs-csi.*controller'
ebs-csi-controller-0            6/6     Running   0          21m
ebs-csi-snapshot-controller-0   1/1     Running   0          21m
In the above example you see two controllers with two different purposes:
- ebs-csi-controller-0 - the “main” controller responsible for general storage provisioning
- ebs-csi-snapshot-controller-0 - the auxiliary controller responsible for volume snapshots
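The 6/6 readiness of ebs-csi-controller-0 above suggests it bundles several containers; the main plugin typically runs alongside standard CSI sidecar containers. You can list them with the same go-template technique used earlier (the exact container names vary by driver version):

kubectl -n kube-system get pod ebs-csi-controller-0 -o=go-template='{{range .spec.containers}}{{.name}}{{"\n"}}{{end}}'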
When creating a PVC, you can view its events to see which controller performs actions on it:
kubectl describe pvc nginx-data
Events:
  Type    Reason                Age   From                                                                        Message
  ----    ------                ----  ----                                                                        -------
  Normal  WaitForFirstConsumer  21m   persistentvolume-controller                                                 waiting for first consumer to be created before binding
  Normal  ExternalProvisioning  21m   persistentvolume-controller                                                 waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  Normal  Provisioning          21m   ebs.csi.aws.com_ebs-csi-controller-0_3f916a6a-2845-4377-8987-9cea1062c02d  External provisioner is provisioning volume for claim "default/nginx-data"
From the last line in the example above, the controller ebs-csi-controller-0 has begun provisioning the volume for the PVC nginx-data. When it completes, it emits an event with a Reason (such as ProvisioningSucceeded) and a message about the event:
  Normal  ProvisioningSucceeded  21m   ebs.csi.aws.com_ebs-csi-controller-0_3f916a6a-2845-4377-8987-9cea1062c02d  Successfully provisioned volume pvc-a53be984-d4a3-4c8f-b257-99201df5de74
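Once provisioning succeeds, the claim should report a Bound status and a matching PersistentVolume should exist. A quick way to confirm both, using the PV name from the event above:

kubectl get pvc nginx-data
kubectl get pv pvc-a53be984-d4a3-4c8f-b257-99201df5de74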
Controllers are responsible for setting up workflows and reporting information about tasks as they complete. The job of doing the low-level work, connecting the storage to the right pods, takes place in the “CSI Plugin” pods, documented in the following section.
Nodes
Kubernetes Nodes are often important to storage drivers because the node is where the csi-plugin creates and connects the remote (or local) filesystems and storage devices.
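Each node advertises the CSI drivers registered with its kubelet through a CSINode object, which makes for a quick sanity check that the plugin actually registered on the node in question (substitute a real node name):

kubectl get csinodes
kubectl get csinode <node-name> -o yaml
# Registered drivers appear under spec.drivers; for the AWS driver,
# ebs.csi.aws.com should be listed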
Sometimes, the controller has done its job but the underlying storage is still not working properly. The answers may be in the underlying system the Kubernetes components are running on.
Failures at this level (for example, at the Linux device level or the cloud storage provider level) can grow far beyond the scope of this document. If you deployed Konvoy using the AWS, GCP, or Azure storage drivers, refer to the cloud storage provider’s documentation to debug these issues further.
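Before escalating to the provider, it can be worth confirming whether the block device ever reached the node. The following sketch assumes SSH access to the node; the pod name is the example used earlier:

kubectl get pod nginx-stateful-5bdc6968df-gxhgq -o wide
# On the node the pod is scheduled to, list attached block devices;
# an attached EBS volume typically appears as an nvme or xvd device
ssh <node-address> lsblk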
If you’re using Portworx, refer to the upstream Portworx troubleshooting documentation mentioned earlier.
Plugin Pods
Plugin pods are related to the controllers described in the section above, but they are the actual instruments of the Container Storage Interface (CSI), connecting storage to the appropriate pods.
These are generally implemented as DaemonSets, which deploy a pod to every node. Each pod handles provisioning storage on its node for the driver’s storage class and runs whatever containers are needed to support that particular storage implementation.
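Because the DaemonSet places one plugin pod per node, the pod worth inspecting for a given workload is the one sharing its node. A minimal sketch, assuming a Bash shell and the AWS driver’s ebs-csi-node naming:

NODE=$(kubectl get pod nginx-stateful-5bdc6968df-gxhgq -o=go-template='{{.spec.nodeName}}')
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName="$NODE" | grep ebs-csi-node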
AWS Overview
In this section we provide an overview of the AWS Driver’s node plugin pods, starting with a look at the underlying pods running on each node:
kubectl -n kube-system get daemonsets
NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
ebs-csi-node   7         7         7       7            7           beta.kubernetes.io/os=linux   21m
kubectl -n kube-system get pods | grep ebs-csi-node
ebs-csi-node-2t4b6   3/3     Running   0          21m
ebs-csi-node-46bql   3/3     Running   0          21m
ebs-csi-node-5p82m   3/3     Running   0          21m
ebs-csi-node-7tkxt   3/3     Running   0          21m
ebs-csi-node-886wg   3/3     Running   0          21m
ebs-csi-node-wchrr   3/3     Running   0          21m
ebs-csi-node-wqx59   3/3     Running   0          21m
Each of these pods consists of three containers that run on the node and coordinate connecting storage to the pods that are scheduled to that node and have volume claims:
kubectl -n kube-system get pods ebs-csi-node-2dmp5 -o=go-template='{{range .spec.containers}}{{.name}}{{"\n"}}{{end}}'
ebs-plugin
node-driver-registrar
liveness-probe
The purposes of the containers running on each node are:
- ebs-plugin: connects to the AWS EBS API and does the API work to create and connect EBS storage to the underlying Linux system
- node-driver-registrar: a standard CSI sidecar container which connects the node’s Kubelet to the CSI driver (to dig deeper see the node-driver-registrar documentation)
- liveness-probe: a standard CSI sidecar container which monitors and reports the health of the CSI driver (to dig deeper see the liveness-probe documentation)
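When scanning a node plugin pod for errors, it is often fastest to pull the logs from all three containers at once rather than one at a time:

kubectl -n kube-system logs ebs-csi-node-2dmp5 --all-containers=true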
When troubleshooting problems in this part of the driver stack (see the “Examples” section below for additional context), problems that arise because of EBS API errors generally show up in the logs for the ebs-plugin container. For example:
kubectl -n kube-system logs ebs-csi-node-2dmp5 ebs-plugin
I0811 20:39:19.763926 1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.5.0
I0811 20:39:19.772867 1 mount_linux.go:163] Cannot run systemd-run, assuming non-systemd OS
I0811 20:39:19.772887 1 mount_linux.go:164] systemd-run failed with: exit status 1
I0811 20:39:19.772900 1 mount_linux.go:165] systemd-run output: Failed to create bus connection: No such file or directory
I0811 20:39:19.772970 1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.5.0
panic: EC2 instance metadata is not available
At this point, the troubleshooting moves to the narrower scope of debugging the plugin and the AWS API.
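When the plugin is the suspect, the affected workload pod’s events usually identify the failing stage; event reasons such as FailedAttachVolume or FailedMount can then be cross-referenced with the ebs-plugin logs shown above:

kubectl describe pod nginx-stateful-5bdc6968df-gxhgq
# Inspect the Events section for attach or mount failures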