This guide describes how to replace a failed worker node. If you need to replace a control plane node, see Replace a failed control plane node instead.
When a node crashes due to a system failure, Konvoy lets you replace it with a new worker node by following the steps below.
First, check whether the node is still reachable from the kubectl command-line interface by running the following command:
kubectl get nodes
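For reference, the output looks roughly like the following; the node names, roles, and versions here are only illustrative, and a failed node may show up as NotReady or be missing from the list entirely:
$ kubectl get nodes
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-128-170.us-west-2.compute.internal   NotReady   <none>   12d   v1.16.4
ip-10-0-129-113.us-west-2.compute.internal   Ready      <none>   12d   v1.16.4
ip-10-0-193-118.us-west-2.compute.internal   Ready      master   12d   v1.16.4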
If the node is in the Ready state, drain it first, that is, move all running pods off the node before destroying it:
kubectl drain <worker-node-id>
The drain command may fail with an error indicating that some pods cannot be evicted unless additional flags are supplied, for example:
$ kubectl drain ip-10-0-128-170.us-west-2.compute.internal
node/ip-10-0-128-170.us-west-2.compute.internal cordoned
error: unable to drain node "ip-10-0-128-170.us-west-2.compute.internal", aborting command...
There are pending nodes to be drained:
ip-10-0-128-170.us-west-2.compute.internal
cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-8j5qz, kube-system/ebs-csi-node-ft6b6, kube-system/kube-proxy-9vpxs, kubeaddons/fluentbit-kubeaddons-fluent-bit-k5q7f, kubeaddons/prometheus-kubeaddons-prometheus-node-exporter-xbpsn
cannot delete Pods with local storage (use --delete-local-data to override): kubeaddons/alertmanager-prometheus-kubeaddons-prom-alertmanager-0, kubeaddons/prometheusadapter-kubeaddons-prometheus-adapter-6b8975fc48d6h2b, velero/velero-kubeaddons-5d85fcdcb9-762ll
In that case, you can continue by appending the flags suggested in the error message:
kubectl drain ip-10-0-128-170.us-west-2.compute.internal --ignore-daemonsets --delete-local-data
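To confirm the drain completed, you can verify that the node is now marked as SchedulingDisabled and that only DaemonSet-managed pods remain on it; the node name below is the same hypothetical one used in the example above:
kubectl get node ip-10-0-128-170.us-west-2.compute.internal
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-128-170.us-west-2.compute.internal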
If, on the other hand, the node is not in the Ready state, you can skip the draining step.
Next, destroy the node so that a new one can be provisioned in its place. The procedure depends on whether the cluster was provisioned through a cloud provider or runs on on-premises machines.
Cloud-Provider
AWS
When provisioning a cluster with Konvoy on AWS, you can manually delete the failed machine through the AWS web console.
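If you prefer the AWS CLI over the web console, a sketch of the same step is shown below; the lookup by private DNS name and the placeholder instance ID are illustrative, so double-check that you are terminating the correct instance:
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=ip-10-0-128-170.us-west-2.compute.internal" \
  --query "Reservations[].Instances[].InstanceId" --output text
aws ec2 terminate-instances --instance-ids <instance-id>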
Once the machine has terminated, run:
konvoy up
This command reconciles the state of the cluster and provisions and configures a new worker node.
The new worker node should then appear as Ready in the output of kubectl get nodes.
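If you want to follow the node joining in real time, one option is to watch the node list until the new entry reports Ready (interrupt with Ctrl-C when done):
kubectl get nodes --watch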
On-Prem
When you provision your own infrastructure via an inventory.yaml file, manually remove the failed node by deleting its entry from the inventory.yaml file.
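For illustration only, a worker entry in inventory.yaml might look roughly like the sketch below; the address, the node_pool name, and the overall layout are assumptions, so follow the schema already used in your existing inventory file rather than this sketch:
node:
  hosts:
    10.0.128.170:
      ansible_host: 10.0.128.170
      node_pool: worker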
Next, add an entry for the new worker node to the inventory file and run:
konvoy up
This command installs all prerequisites and Kubernetes components needed for the node to join the cluster and start receiving pods.
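As a final check, confirm that the new node has registered and is schedulable, and that pods begin landing on it; the node identifier below is a placeholder for whatever hostname your new worker reports:
kubectl get nodes
kubectl describe node <new-worker-node-id>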