Replace a failed control plane node
-
Confirm the failed control plane node is not communicating with other nodes.
kubectl get nodes --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"
NAME READY konvoy-example-cluster-control-plane-0 True konvoy-example-cluster-control-plane-1 Unknown konvoy-example-cluster-control-plane-2 True
If the node’s
READY
column does not sayTrue
, then the node is not ready. In the above example, thekonvoy-example-cluster-control-plane-1
node is not ready. -
Permanently remove the failed node.
For example, if the node is an AWS EC2 instance, use the AWS CLI or Console to terminate the instance.
-
Identify an etcd member ready to accept etcd API requests.
kubectl -n kube-system get pod --selector=tier=control-plane,component=etcd --output=custom-columns="NAME":".metadata.name","READY":".status.conditions[?(@.type==\"Ready\")].status"
NAME READY etcd-konvoy-example-cluster-control-plane-0 True etcd-konvoy-example-cluster-control-plane-1 False etcd-konvoy-example-cluster-control-plane-2 True
In the above example, both
etcd-konvoy-example-cluster-control-plane-0
andetcd-konvoy-example-cluster-control-plane-2
are ready. Choose one and note (or copy) its name. -
Identify the failed etcd member.
Find the ID of the etcd member for the failed control plane node.
READY_ETCD_MEMBER="<name of etcd member from previous step>" ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://127.0.0.1:2379" kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member list"
1d021ffdd096a804, started, konvoy-example-cluster-control-plane-1, https://172.17.0.6:2380, https://172.17.0.6:2379, false 40fd14fa28910cab, started, konvoy-example-cluster-control-plane-0, https://172.17.0.4:2380, https://172.17.0.4:2379, false 87651970646a8073, started, konvoy-example-cluster-control-plane-2, https://172.17.0.5:2380, https://172.17.0.5:2379, false
In the above example, the failed control node is
konvoy-example-cluster-control-plane-1
, so the etcd ID is1d021ffdd096a804
. Note, or copy, this ID. -
Remove the failed etcd member.
READY_ETCD_MEMBER="<name of etcd member from previous steps>" ETCD_ID_TO_REMOVE="<etcd member ID from previous step>" ETCDCTL="ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt --endpoints=https://127.0.0.1:2379" kubectl -n kube-system exec -it "$READY_ETCD_MEMBER" -- /bin/sh -c "$ETCDCTL member remove $ETCD_ID_TO_REMOVE"
Member 1d021ffdd096a804 removed from cluster a6ea9ad1b116d02f
-
Delete the failed Node from the Kubernetes API.
kubectl delete node konvoy-example-cluster-control-plane-1
-
Create a new control plane node to replace the failed node.
konvoy up