Recovering a crashed Kubernetes node
This section details a manual operation required to revive Kubernetes pods that reside on a crashed node.
Identifying a crashed node
When a worker node shuts down or crashes, all stateful pods that reside on it become unavailable,
and the node status appears as NotReady.
# kubectl get nodes
NAME          STATUS     AGE   VERSION
kuber-node1   Ready      2h    v1.7.5
kuber-node2   NotReady   2h    v1.7.5
kuber-serv1   Ready      2h    v1.7.5
When the node remains in this status for more than five minutes (the default setting; see the
note below for how to change this value), the following occurs:
- The status of each pod scheduled on that node becomes Unknown.
- A replacement pod is scheduled on another node in the cluster and remains in the
ContainerCreating status while the original pod's volume is still attached to the crashed node.
As a result, the pod appears twice, on two different nodes and with two different statuses, as illustrated below.
# kubectl get pods -o wide
NAME                           READY   STATUS              RESTARTS   AGE   IP           NODE
sanity-deployment-2414-538d2   1/1     Unknown             0          15m   IP_address   kuber-node2
sanity-deployment-2414-n8cfv   0/1     ContainerCreating   0          34s   <none>       kuber-node1
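To follow the transition and inspect why the replacement pod is stuck in ContainerCreating, the standard kubectl commands apply (the pod name below is taken from the example output above):

```
# Follow pod status changes across nodes as they happen
kubectl get pods -o wide --watch

# The Events section of the output explains why the replacement pod
# cannot start (typically the volume is still attached to the crashed node)
kubectl describe pod sanity-deployment-2414-n8cfv
```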
Note: The time period between the node failure and the creation of a new pod on another node
is user-configurable through the pod-eviction-timeout setting of the kube-controller-manager
(default: five minutes).
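As a sketch, on clusters where the kube-controller-manager runs as a static pod (the manifest path below assumes a kubeadm-style setup; adjust it for your distribution), the timeout can be changed by adding the --pod-eviction-timeout flag to its command line:

```yaml
# Excerpt of /etc/kubernetes/manifests/kube-controller-manager.yaml
# (path is an assumption; the kubelet restarts the pod when the file changes)
spec:
  containers:
  - command:
    - kube-controller-manager
    # Evict pods from an unreachable node after 2 minutes instead of the 5m default
    - --pod-eviction-timeout=2m0s
```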
Recovering a crashed node
To allow Kubernetes to recover the stateful pods from a crashed node and schedule them on a
functional node in the cluster, do one of the following:
- Remove the crashed node from the cluster to free up all of its pods (kubectl delete node
<node_name>),
or
- Force-delete the stateful pods that are stuck in the Unknown state (kubectl delete
pods <pod_name> --grace-period=0 --force -n <namespace>).
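Applied to the example above, the two options look as follows (the node and pod names are taken from the earlier output; the default namespace is assumed for this sketch):

```
# Option 1: remove the crashed node; Kubernetes frees all pods bound to it
kubectl delete node kuber-node2

# Option 2: force-delete only the stuck pod, the one shown as Unknown
# in the example output (assumed to be in the default namespace)
kubectl delete pod sanity-deployment-2414-538d2 --grace-period=0 --force
```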
Once the crashed node or the stuck pods are deleted, the replacement pod can start on the node where it was scheduled. Its status changes from ContainerCreating to Running; see the example below for the sanity-deployment-2414-n8cfv pod.
If the crashed node recovers on its own, or if the user reboots it, no additional actions are required to release its pods; they recover automatically once the node rejoins the cluster. When a crashed node recovers, the following occurs:
- The pod with the Unknown status is deleted.
- The volume(s) are detached from the crashed node.
- The volume(s) are attached to the node on which the new pod is scheduled.
- After the mandatory five-minute timeout, as set by Kubernetes itself, the pod runs on the scheduled node. The pod status changes from ContainerCreating to Running. See the example below for the sanity-deployment-2414-n8cfv pod.
# kubectl get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP           NODE
sanity-deployment-2414-n8cfv   1/1     Running   0          8m    IP_address   kuber-node1
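The manual force-delete step can also be scripted. The sketch below assumes the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE NAME READY STATUS RESTARTS AGE, so the status is the fourth column); the demonstration reuses the sample output from this section instead of a live cluster:

```shell
# Print "namespace pod" pairs for pods stuck in the Unknown state.
unknown_pods() {
  awk '$4 == "Unknown" { print $1, $2 }'
}

# Against a live cluster you would pipe kubectl into it:
#   kubectl get pods --all-namespaces --no-headers | unknown_pods |
#     while read -r ns pod; do
#       kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
#     done

# Demonstration on the sample output from this section:
printf '%s\n' \
  'default sanity-deployment-2414-538d2 1/1 Unknown 0 15m' \
  'default sanity-deployment-2414-n8cfv 0/1 ContainerCreating 0 34s' |
  unknown_pods
# → default sanity-deployment-2414-538d2
```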