Troubleshooting

This section can help you detect and solve problems that you might encounter when using the IBM Storage Enabler for Containers.

Checking logs

You can use the IBM Storage Enabler for Containers logs for problem identification. To collect and display the logs related to the different components of IBM Storage Enabler for Containers, use the following commands:
  • Log collection – ./ubiquity_cli.sh -a collect_logs. The logs are kept in the ./ubiquity_collect_logs_MM-DD-YYYY-h:m:s folder, which is created in the directory from which the log collection command was run.
  • IBM Storage Enabler for Containers – $> kubectl logs -n ubiquity deploy/ubiquity.
  • IBM Storage Enabler for Containers database – $> kubectl logs -n ubiquity deploy/ubiquity-db.
  • IBM Storage Kubernetes Dynamic Provisioner – $> kubectl logs -n ubiquity deploy/ubiquity-k8s-provisioner.
  • IBM Storage Kubernetes FlexVolume for a pod – $> kubectl logs -n ubiquity ubiquity-k8s-flex-<pod_ID>. In addition, events for all pods on a specific Kubernetes node are recorded in the ubiquity-k8s-flex.log file, located by default in the /var/log directory. You can change this directory by setting the ubiquityK8sFlex.flexLogDir parameter in the values.yml file.
  • Controller-manager:
    • Static pod – kubectl get pods -n kube-system to display the master pod name. Then, kubectl logs -n kube-system pod_name to check the logs.
    • Non-static pod – journalctl to display the system journal. Then, search for the lines that have controller-manager entries.
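The per-component commands above can be combined into a small helper script. The following is a minimal sketch, assuming the default ubiquity namespace and the deployment names listed above:

```shell
#!/bin/bash
# Sketch: collect the main IBM Storage Enabler for Containers logs into one
# directory. Assumes the default "ubiquity" namespace; adjust if yours differs.
collect_enabler_logs() {
  out="enabler-logs-$(date +%F-%H%M%S)"
  mkdir -p "$out"
  kubectl logs -n ubiquity deploy/ubiquity                 > "$out/ubiquity.log"
  kubectl logs -n ubiquity deploy/ubiquity-db              > "$out/ubiquity-db.log"
  kubectl logs -n ubiquity deploy/ubiquity-k8s-provisioner > "$out/provisioner.log"
  # One FlexVolume log per node; loop over all flex pods.
  for pod in $(kubectl get pods -n ubiquity -o name | grep ubiquity-k8s-flex); do
    kubectl logs -n ubiquity "$pod" > "$out/$(basename "$pod").log"
  done
  echo "$out"
}

# Run only when kubectl is available on this node.
if command -v kubectl >/dev/null 2>&1; then
  collect_enabler_logs
fi
```

This is not a replacement for ./ubiquity_cli.sh -a collect_logs, which remains the supported collection method; it only illustrates which logs the triage steps below draw on.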

Detecting errors

This is an overview of actions that you can take to pinpoint a potential cause for a stateful pod failure. The table at the end of the procedure describes the problems and provides possible corrective actions.
  1. Run the ubiquity_cli.sh -a status_wide command to verify that:
    • All Kubernetes pods are in Running state.
    • All PVCs are in Bound state.
    • A ubiquity-k8s-flex pod exists on each node in the cluster. If you have three master nodes and five worker nodes, you must see eight ubiquity-k8s-flex pods.
  2. If you find no errors but are still unable to create or delete pods with PVCs, continue to the next step.
  3. Display the malfunctioned stateful pod ($> kubectl describe pod <pod_ID>). The pod description usually contains information about the possible cause of the failure. Then, proceed with reviewing the IBM Storage Enabler for Containers logs.
  4. Display the IBM Storage Kubernetes FlexVolume log for the active master node (the node that the controller-manager is running on). Use the $> kubectl logs -n ubiquity ubiquity-k8s-flex-<pod_ID_running_on_master_node> command. As the controller-manager triggers the storage system volume mapping, the log displays details of the FlexVolume attach or detach operations.
    Additional information can be obtained from the controller-manager log as well.
  5. Review the IBM Storage Kubernetes FlexVolume log for the worker node, on which the container pod is scheduled. Use the $> kubectl logs -n ubiquity ubiquity-k8s-flex-<pod_ID_running_on_worker_node> command. As the kubelet service on the worker node triggers the FlexVolume mount and unmount operations, the log is expected to display the complete volume mounting flow.
    Additional information can be obtained from the kubelet service as well, using the $> journalctl -u kubelet command.
  6. Display the IBM Storage Enabler for Containers server log ($> kubectl logs -n ubiquity deploy/ubiquity command) or its database log ($> kubectl logs -n ubiquity deploy/ubiquity-db command) to check for possible failures.
  7. Display the IBM Storage Dynamic Provisioner log ($> kubectl logs -n ubiquity ubiquity-k8s-provisioner) to identify any problem related to volume provisioning.
  8. View the Spectrum Connect log (hsgsrv.log) for a list of additional events related to the storage system and volume operations.
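The pod-coverage check in step 1 can be scripted. The following is a minimal sketch, assuming the default ubiquity namespace:

```shell
#!/bin/bash
# Sketch: verify that one ubiquity-k8s-flex pod runs per cluster node
# (step 1 above). Assumes the default "ubiquity" namespace.
check_flex_coverage() {
  nodes=$(kubectl get nodes --no-headers 2>/dev/null | wc -l)
  flex=$(kubectl get pods -n ubiquity --no-headers 2>/dev/null | grep -c ubiquity-k8s-flex)
  if [ "$nodes" -eq "$flex" ]; then
    echo "OK: $flex flex pods on $nodes nodes"
  else
    echo "Mismatch: $nodes nodes but only $flex flex pods"
  fi
}

# Run only when kubectl is available on this node.
if command -v kubectl >/dev/null 2>&1; then
  check_flex_coverage
fi
```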
Table 1. Troubleshooting for IBM Storage Enabler for Containers
Description Corrective action
IBM Storage Kubernetes FlexVolume log for the active master node has no attach operations Verify that:
  • Controller-manager pod can access the Kubernetes plug-in directory. See Compatibility and requirements for instructions on configuring the access.
  • The correct hostname of the node is defined on the storage systems with the valid WWPN or IQN of the node, as described in Compatibility and requirements. This information appears in the controller-manager log.
IBM Storage Kubernetes FlexVolume log for the worker node that runs the pod has no new entries, except for ubiquitytest (Kubernetes 1.6 or 1.7 only) Restart the kubelet on Kubernetes worker and master nodes. See Performing installation.
IBM Storage Kubernetes FlexVolume log for the worker node that runs the pod contains errors, related to WWN identification in the multipath -ll output Check that:
  • Fibre Channel zoning configuration of the host is correct.
  • The Kubernetes node name is defined properly on the storage system.
  • Node rescan process was successful.
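A quick way to run the WWN check above from the worker node; <PVC_WWN> is a placeholder that you take from the FlexVolume log:

```shell
#!/bin/bash
# Sketch: check whether a volume's WWN is visible to multipath on this node.
wwn_visible() {
  multipath -ll 2>/dev/null | grep -qi "$1"
}

# <PVC_WWN> is a placeholder; replace it with the WWN from the FlexVolume log.
if wwn_visible "<PVC_WWN>"; then
  echo "WWN visible to multipath"
else
  echo "WWN not found: check zoning, the host definition, and the rescan"
fi
```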
No connectivity between the FlexVolume pod and the IBM Storage Enabler for Containers server Log into the node and run the FlexVolume in a test mode ($> /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ubiquity-k8s-flex/ubiquity-k8s-flex testubiquity).

If there is an error, make sure that the IP address of the ubiquity service is the same as the one configured in the ubiquity-configmap.yml file. If not, configure the IP address properly, then delete the FlexVolume DaemonSet and re-create it to apply the new address value.
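The address comparison can be scripted as below. This is only a sketch: the service name (ubiquity), ConfigMap name (ubiquity-configmap), and data key (UBIQUITY-IP-ADDRESS) are assumptions; verify them against your deployment before relying on the output.

```shell
#!/bin/bash
# Sketch: compare the ubiquity service cluster IP with the address configured
# in the FlexVolume ConfigMap. Object and key names are assumptions.
check_flex_address() {
  svc_ip=$(kubectl get svc -n ubiquity ubiquity -o jsonpath='{.spec.clusterIP}')
  cfg_ip=$(kubectl get configmap -n ubiquity ubiquity-configmap \
      -o jsonpath="{.data['UBIQUITY-IP-ADDRESS']}")
  if [ "$svc_ip" = "$cfg_ip" ]; then
    echo "ConfigMap address matches service IP ($svc_ip)"
  else
    echo "Mismatch: service=$svc_ip configured=$cfg_ip (fix the ConfigMap, then re-create the DaemonSet)"
  fi
}

# Run only when kubectl is available on this node.
if command -v kubectl >/dev/null 2>&1; then
  check_flex_address
fi
```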

Failure to mount a storage volume to a Kubernetes node If the FlexVolume fails to locate a WWPN within multipath devices, verify your multipathing configuration and connectivity to a storage system. See Compatibility and requirements.
IBM Storage Enabler for Containers database fails to achieve the Running status after the configured timeout expires
  • Check the kubectl logs for the FlexVolume pod on the node where the database was scheduled to run. Verify that the mount and rescan operations were successful. Another reason might be that the Docker image pulling is taking too much time, preventing the deployment from becoming active.
  • Check the kubectl logs for the FlexVolume pod that runs on the master node. Check any error related to attachment of the ibm-ubiquity-db volume.
  • Check the Kubernetes scheduling. Verify that it does not exceed the timeout configured in the installation script.
  • After you resolve the issue, verify that the ibm-ubiquity-db status is Running.
IBM Storage Enabler for Containers database persists in the Creating status. In addition, the Volume has not been added to the list of VolumesInUse in the node's volume status message is stored in the /var/log/messages file on the node where the database is deployed. To resolve this, move kube-controller-manager.yaml out of /etc/kubernetes/manifests/ and then back in, so that the controller-manager pod is re-created:
mv /etc/kubernetes/manifests/kube-controller-manager.yaml  /tmp
sleep 5
mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
sleep 15
# Check that the controller-manager pod is running:
$> kubectl get pod -n kube-system  | grep controller-manager
# Verify that it is in the Running state.
Persistent volume remains in the Delete state, failing to release Review the Provisioner log ($> kubectl logs -n ubiquity deploy/ubiquity-k8s-provisioner) to identify the reason for the deletion failure. Use the $> kubectl delete command to delete the volume. Then, contact the storage administrator to remove the persistent volume on the storage system itself.
Communication link between IBM Storage Dynamic Provisioner and other solution elements fails due to Provisioner token expiration IBM Storage Dynamic Provisioner uses a token that in some environments has an expiration time, for example twelve hours. To keep the link alive for an unlimited time, you can use a service-account token without expiration time. You can replace the current token with the service-account token, as follows:
$> TOKEN=$(kubectl get secret --namespace default $(kubectl get secret
--namespace default | grep service-account | awk '{print $1}') -o yaml | 
grep token: | awk '{print $2}' | base64 -d)

$> kubectl config set-credentials <mycluster.user> --token=${TOKEN}
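A slightly more robust variant of the extraction above uses jsonpath instead of parsing YAML with grep and awk; the secret-name filter ("service-account" in the default namespace) is the same assumption as in the original commands:

```shell
#!/bin/bash
# Sketch: pull the service-account token via jsonpath rather than grep/awk.
# Assumes a secret whose name contains "service-account" in the default
# namespace, as in the commands above.
get_sa_token() {
  secret=$(kubectl get secret --namespace default -o name \
           | grep service-account | head -n1)
  kubectl get "$secret" --namespace default -o jsonpath='{.data.token}' \
    | base64 -d
}

# Run only when kubectl is available on this node.
if command -v kubectl >/dev/null 2>&1; then
  TOKEN=$(get_sa_token)
  echo "Extracted token of ${#TOKEN} bytes"
fi
```

After extracting the token, apply it with the kubectl config set-credentials command shown above, replacing <mycluster.user> with your kubeconfig user name.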
A pod creation fails and the following error is stored in the FlexVolume log of the node intended for the pod: DEBUG 4908 executor.go:63 utils::Execute Command executed with args and error and output. [[{command=iscsiadm} {args=[-m session --rescan]} {error=iscsiadm: No session found.} {output=}]]" Verify that the node has iSCSI connectivity to the storage system. If the node has none, see the Compatibility and requirements section for instructions on how to discover and log into iSCSI targets on the storage system.
Status of a stateful pod on a malfunctioned (crashed) node is Unknown Manually recover the crashed node, as described in the Recovering a crashed Kubernetes node section.
A pod becomes unresponsive, persisting in the ContainerCreating status. The "error=command [mount] execution failure [exit status 32]" error is stored in the FlexVolume log of the node, where the pod was scheduled.

The failure occurs because the mountPoint already exists on this node. This might happen due to earlier invalid pod deletion.

Manually recover the pod, using the following procedure:
  1. Check if there is a symbolic link to the mountPoint by running $> ls -l /var/lib/kubelet/pods/<pod_ID>/volumes/ibm~ubiquity-k8s-flex/<PVC_ID>.
  2. If the file exists and there is a symbolic link to /ubiquity/<PVC_WWN>, remove it by running rm /var/lib/kubelet/pods/<pod_ID>/volumes/ibm~ubiquity-k8s-flex/<PVC_ID>.
  3. Unmount the PV by running umount /ubiquity/<PVC_WWN>.
  4. Wait for several minutes for Kubernetes to rerun the mountFlow. Then, at the end of the process, display the FlexVolume log by running kubectl logs -n ubiquity ubiquity-k8s-flex-<pod_ID_on_the_node> to verify the Running status of the pod.
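Steps 1-3 of the recovery procedure can be sketched as a small script; <pod_ID>, <PVC_ID>, and <PVC_WWN> are placeholders that you must take from the pod description and the FlexVolume log:

```shell
#!/bin/bash
# Sketch of recovery steps 1-3 above. Pass the symbolic-link path and the
# mount point taken from the FlexVolume log.
recover_stale_mount() {
  link="$1"   # /var/lib/kubelet/pods/<pod_ID>/volumes/ibm~ubiquity-k8s-flex/<PVC_ID>
  mnt="$2"    # /ubiquity/<PVC_WWN>
  if [ -L "$link" ]; then
    rm "$link"                       # step 2: remove the stale symbolic link
  fi
  umount "$mnt" 2>/dev/null \
    || echo "umount of $mnt failed (it may already be unmounted)"
  # Step 4: wait a few minutes, then check the FlexVolume log on the node.
}

# Example invocation (placeholders, do not run as-is):
# recover_stale_mount \
#   "/var/lib/kubelet/pods/<pod_ID>/volumes/ibm~ubiquity-k8s-flex/<PVC_ID>" \
#   "/ubiquity/<PVC_WWN>"
```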
A pod becomes unresponsive, persisting in the ContainerCreating status. An error indicating a failure to discover a new volume WWN, while running the multipath -ll command, is stored in the FlexVolume log. This log belongs to the node, where the pod was scheduled. Restart the multipathd service by running the service multipathd restart command on the worker node, where the pod was scheduled.