Recovering a crashed Kubernetes node
This section details a manual operation required to revive Kubernetes pods that reside on a crashed node due to an existing Kubernetes limitation.
Identifying a crashed node
When a worker node shuts down or crashes, all pods in a StatefulSet that reside on it become unavailable. In these scenarios, the node status is NotReady, and the pod status appears as Terminating.
For
example:
$> kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 6d v1.14.1
k8s-node1 Ready <none> 6d v1.14.1
k8s-node3 NotReady <none> 6d v1.14.1
$> kubectl get pods --all-namespaces -o wide | grep default
default sanity-statefulset-0 1/1 Terminating 0 19m 10.244.2.37 k8s-node3
Recovering a crashed node
Follow the following procedure to recover from a crashed node (see a full example at the end of
the procedure):
- Find for the volumeattachment of the created
pod:
$> kubectl get volumeattachment - Copy the volumeattachment name.
- Delete the
volumeattachment:
$> kubectl delete volumeattachment <volumeattachment name> - Delete the
pod:
$> kubectl delete pod <pod name> --grace-period=0 --force - Verify that the pod is now in a Running state and that the pod has moved to worker-node1.
$> kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 6d v1.14.1
k8s-node1 Ready <none> 6d v1.14.1
k8s-node3 NotReady <none> 6d v1.14.1
$> kubectl get pods --all-namespaces -o wide | grep default
default sanity-statefulset-0 1/1 Terminating 0 19m 10.244.2.37 k8s-node3
$> kubectl get volumeattachment
NAME AGE
csi-5944e1c742d25e7858a8e48311cdc6cc85218f1156dd6598d4cf824fb1412143 10m
$> kubectl delete volumeattachment csi-5944e1c742d25e7858a8e48311cdc6cc85218f1156dd6598d4cf824fb1412143
volumeattachment.storage.k8s.io "csi-5944e1c742d25e7858a8e48311cdc6cc85218f1156dd6598d4cf824fb1412143" deleted
$> kubectl delete pod sanity-statefulset-0 --grace-period=0 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "sanity-statefulset-0" deleted
$> kubectl get pods --all-namespaces -o wide | grep default
default sanity-statefulset-0 1/1 Running 0 26s 10.244.1.210 k8s-node1