Pod in CrashLoopBackOff

How to debug and resolve a pod that's in a CrashLoopBackOff state

Situation

A pod is in a CrashLoopBackOff state. This is typically detected through kubectl get pods or through Kubernetes alerts.

A crash loop means Kubernetes has tried to start the pod, but it keeps crashing. After each restart, Kubernetes increases the delay (back-off) before attempting the next start.

Possible causes

Some common causes are:

  • upstream dependencies such as databases are not available (e.g. they do not accept new connections)
  • a configuration mistake in your application
  • out of memory (OOM killed)

A good example is a newly configured network policy that blocks DNS queries to CoreDNS.

Diagnosis

  • Examine the logs to find the cause, either through kubectl logs or Grafana Explore.
  • Use kubectl describe pod to inspect events and container state.
  • Make sure network policies are correct if your application requires connectivity to external resources.
  • Check resource usage (most often memory); see the command after this list.
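To check whether the last restart was caused by the OOM killer, inspect the container's last termination state (the pod name is a placeholder):

kubectl get pod mycrashlooppod -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

This prints OOMKilled when the container was killed for exceeding its memory limit. If a metrics-server is installed in the cluster, kubectl top pod mycrashlooppod shows current usage.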

View logging

Crash-looping pods are often not in a Running state. To see the application output from the crash, use the --previous flag:

kubectl logs mycrashlooppod --previous --tail=100

You may need to adjust the --tail flag to get more or fewer log lines.
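If the pod runs multiple containers, select the crashing one with -c (the container name below is a placeholder):

kubectl logs mycrashlooppod -c mycontainer --previous --tail=100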

Get events

Events are a helpful indicator to figure out whether resource usage or failing health checks are causing the crash.

You can either use kubectl describe:

kubectl describe pod mycrashlooppod

Or use kubectl get events with --field-selector:

$ kubectl get event --field-selector involvedObject.name=nginx-9d97dbffb-rvgt2
LAST SEEN   TYPE     REASON      OBJECT                      MESSAGE
52s         Normal   Scheduled   pod/nginx-9d97dbffb-rvgt2   Successfully assigned default/nginx-9d97dbffb-rvgt2 to docker-desktop
52s         Normal   Pulled      pod/nginx-9d97dbffb-rvgt2   Container image "nginx:1.19.8" already present on machine
52s         Normal   Created     pod/nginx-9d97dbffb-rvgt2   Created container nginx
52s         Normal   Started     pod/nginx-9d97dbffb-rvgt2   Started container nginx

Remediation

DNS issues

If the logs indicate an issue with resolving a hostname (e.g. a database connection URL), check the following:

  • When using network policies in the pod’s namespace, make sure a policy is in place that allows connectivity to CoreDNS; a sketch follows this list.
  • Check for a mistake in the hostname. Note that some clusters do not use .cluster.local. Make sure the service exists.
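As a sketch (label selectors vary per cluster; the kube-system and k8s-app: kube-dns labels below are common defaults, not guaranteed), an egress policy that allows DNS traffic to CoreDNS could look like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-namespace          # namespace of the crashing pod (placeholder)
spec:
  podSelector: {}                  # apply to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53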

Failing livenessProbe

If a livenessProbe fails, Kubernetes restarts the container.

If this happens during start-up, your initialDelaySeconds is likely configured too low. We’d also recommend configuring a startupProbe.
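A minimal sketch of a container spec combining both probes (the /healthz path, port and timings are assumptions; adapt them to your application):

containers:
  - name: myapp
    image: myapp:1.0.0
    startupProbe:
      httpGet:
        path: /healthz             # assumed health endpoint
        port: 8080
      periodSeconds: 10
      failureThreshold: 30         # allows up to 30 * 10s = 300s for start-up
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3

The livenessProbe only takes effect once the startupProbe has succeeded, so a slow-starting application is not killed prematurely.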

A failing livenessProbe is an indicator that your application or its runtime is not functioning properly. Often a restart of the container solves the problem. However, if this occurs too often, Kubernetes will put the pod in CrashLoopBackOff.

Common causes are:

  • Resource exhaustion (see the sketch after this list)
    • Too low CPU limits
    • Memory saturation
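For resource exhaustion, review the container’s requests and limits. A minimal sketch (the values are illustrative, not recommendations):

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Memory usage above the limit results in an OOM kill; CPU usage above the limit is throttled, which can cause liveness probes to time out.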