Alerts - Kubernetes cluster alerts

Standard Kubernetes cluster alerts

Default Alerts

AME Kubernetes comes with a set of default alerts, based on the work done in the Kubernetes-mixin project and the kube-prometheus project. Avisi Cloud tweaked these alerts to fit AME Kubernetes.

This page serves as a reference for when one of these alerts fires within your cluster. Each alert gives a brief description of what it means, a list of possible causes and suggestions on how to resolve the issue.

Control plane

💡 On AME Kubernetes, Avisi Cloud monitors the control plane related alerts for you. Note that some control plane components cannot be scraped directly.

KubeAPIDown [critical]

KubeAPI has disappeared from Prometheus target discovery.

No API server means all management functionality in Kubernetes is frozen; the cluster cannot be managed during that time. Anything already running will remain running as long as no other systems break.

Possible causes

  • No connectivity from Prometheus to the API server.
  • API server is down, or the load balancer in front of it is down.

KubeControllerManagerDown [critical]

KubeControllerManager has disappeared from Prometheus target discovery.

No controller manager means no pods can be created for deployments, deployments will freeze, etc.

Possible causes

  • Controller manager is offline.
  • No network connectivity between Prometheus and controller manager.

KubeSchedulerDown [critical]

KubeScheduler has disappeared from Prometheus target discovery.

No scheduler means no pods can be scheduled on the cluster.

Possible causes

  • Scheduler is offline.
  • No network connectivity between Prometheus and the scheduler.

KubeletDown [critical]

Kubelet has disappeared from Prometheus target discovery.

Kubelet is offline or there is no network connectivity between Prometheus and a Kubelet.

Possible causes

  • Could be an early indicator of a full node outage.
  • The kubelet may have just restarted with an incorrect configuration.
  • The kubelet cannot reach the API server (in which case KubeAPIDown should also fire).

Workloads

Workload alerts trigger for any pod / container running within the cluster.

KubePodCrashLooping [warning]

A Kubernetes pod is in a crash loop back-off state.

A crash loop means Kubernetes tried to start a pod, but it has crashed too often. After each restart, Kubernetes increases the delay before attempting another start.

Possible causes

Some common causes include:

  • Upstream dependencies such as databases are unavailable (e.g. they do not accept new connections).
  • A configuration mistake in your application.
  • Out of memory (OOMKilled).

Check:

  • Examine logs to find the cause. This can be done either through kubectl logs or using Grafana Explore.
  • Use kubectl describe pod to find the cause.
  • Make sure network policies are correct in case your application requires connectivity to external resources.
  • Check resource usage (most often memory).
  • Check out our runbook.
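For example, assuming a crash-looping pod named my-app-12345 in namespace my-namespace (placeholder names), the checks above look like this:

```shell
# Logs of the currently running container (placeholder pod/namespace names)
kubectl logs my-app-12345 -n my-namespace

# Logs of the previous, crashed container instance
kubectl logs my-app-12345 -n my-namespace --previous

# Events, restart count, and last state (e.g. OOMKilled) of the pod
kubectl describe pod my-app-12345 -n my-namespace
```

The --previous flag is often the key one: it shows output from the container instance that actually crashed, rather than the freshly restarted one.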

KubePodNotReady [warning]

A pod failed to start within a reasonable amount of time.

A pod was not able to become ready. This could be because it was not scheduled, is unable to start properly or because it was evicted.

Possible causes

  • Pod health check not succeeding (readinessProbe).
  • Pod cannot be scheduled due to scheduling constraints.
  • No resources left in the cluster.
  • Pod has been evicted.
  • Image could not be found.

Check:

  • Use kubectl describe pod to find the cause. The event log should indicate the issue in most situations.

Please note: if the pod is managed by a StatefulSet and has a failing health check, you may need to manually delete the pod.
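The steps above can be run as follows; the pod and namespace names are placeholders:

```shell
# Inspect the event log for scheduling or readiness failures
kubectl describe pod my-app-0 -n my-namespace

# For a StatefulSet pod stuck on a failing readiness probe,
# deleting the pod lets the StatefulSet controller recreate it
kubectl delete pod my-app-0 -n my-namespace
```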

KubeDaemonSetRolloutStuck [warning]

Kubernetes daemonset has pods that are outdated but not updated.

Remediation:

  • Trigger a new rollout by performing kubectl rollout restart daemonset <name>

Example:

kubectl rollout restart ds my-daemonset

KubeJobCompletion [warning]

A pod created by a job has taken more than one hour to run to completion.

Possible causes

  • Job pods keep failing, and the job keeps being retried.
  • The job has been running for a long time and is stuck.

Check:

  • Examine logs to find the cause using kubectl logs <pod>.
  • Use kubectl describe pod <pod> to get more information.
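For example, assuming a job named my-job in namespace my-namespace (placeholder names):

```shell
# List the pods created by the job
kubectl get pods -n my-namespace -l job-name=my-job

# Examine logs and events of a job pod
kubectl logs my-job-abcde -n my-namespace
kubectl describe pod my-job-abcde -n my-namespace

# If the job is stuck and safe to re-run, delete it so it can be recreated
kubectl delete job my-job -n my-namespace
```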

Resources

KubeCPUOvercommit [warning]

Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.

Remediation:

Provision additional capacity in the cluster. Use the Grafana USE dashboards as a starting point.

KubeMemOvercommit [warning]

Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure

Remediation:

Provision additional capacity in the cluster. Use the Grafana USE dashboards as a starting point.

KubeCPUQuotaOvercommit [warning]

Cluster has overcommitted CPU resource requests for Namespaces

Remediation:

Adjust quotas, or investigate the causes of overcommitting resources. Use the Grafana USE dashboards as a starting point.

KubeMemQuotaOvercommit [warning]

Cluster has overcommitted memory resource requests for Namespaces

Remediation:

  • Adjust quotas, or investigate the causes of overcommitting resources.
  • Use the Grafana USE dashboards as a starting point.
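To inspect quota usage before adjusting anything, commands like the following can help (quota and namespace names are placeholders):

```shell
# List resource quotas and their current usage in the namespace
kubectl get resourcequota -n my-namespace

# Show which requests and limits count against a specific quota
kubectl describe resourcequota my-quota -n my-namespace

# See which pods in the namespace consume the most memory
kubectl top pods -n my-namespace --sort-by=memory
```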

KubeQuotaAlmostFull [info]

Namespace resource quota usage is approaching its configured limit.

Possible causes

  • Workloads in the namespace request nearly all of the resources allowed by the quota.

KubeQuotaFullyUsed [info]

Namespace resource quota is fully used.

Possible causes

  • Workloads in the namespace request all of the resources allowed by the quota. New pods cannot be created until resources are freed up or the quota is increased.

KubeQuotaExceeded [warning]

Namespace resource quota usage has exceeded the configured limit.

Possible causes

  • Resource usage in the namespace exceeds the quota, for example because the quota was lowered after workloads were already running.

KubePersistentVolumeFillingUp [critical]

Persistent volume claim is expected to fill up; only a small percentage of space is left available.

Action is required to avoid downtime.

Possible causes

  • Application or database writing to the disk

Check:

  • You can view historical usage metrics in Grafana.

Remediation:

  • Clean up data on disk, or increase disk capacity by modifying the persistent volume claim's resource size. Note: this can only be done on clusters that support CSI volume resizing.
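For example, assuming a claim named my-claim mounted at /data in pod my-pod (placeholder names), usage can be checked and the claim resized like this:

```shell
# Check current usage inside the pod that mounts the volume
kubectl exec -n my-namespace my-pod -- df -h /data

# Increase the requested size; this requires a StorageClass with
# allowVolumeExpansion: true (CSI resizing support)
kubectl patch pvc my-claim -n my-namespace \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```

Volumes can typically only be grown, not shrunk, so choose the new size with some headroom.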

KubePersistentVolumeFillingUp [warning]

Persistent volume claim is expected to fill up within four days.

Possible causes

  • Application or database writing to the disk.

Check:

  • You can view historical usage metrics in Grafana.

Remediation:

  • Clean up data on disk, or increase disk capacity by modifying the persistent volume claim's resource size. Note: this can only be done on clusters that support CSI volume resizing.

Kube system

KubeNodeNotReady [critical]

A node has been unready for more than 15 minutes.

Possible causes

A node is in an unready state. The node may be offline, or otherwise not functioning correctly.

This warrants closer investigation.

Common reasons for a node being NotReady:

  • node outage
  • resource pressure, such as disk, memory, etc.

Remediation:

  • Kubernetes will attempt to evict pods from an unready node if the cause is resource pressure. You can verify this by running kubectl describe node and observing the node's events and conditions.
  • If it is a hardware failure, AME will auto-replace the node after 15 minutes.
  • If this is a network partition issue, the situation may auto recover. Otherwise the node will be replaced by AME.
  • Avisi Cloud Support should already have been notified of the issue.
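For example, to inspect the node's conditions and events (the node name is a placeholder):

```shell
# Show node conditions (Ready, MemoryPressure, DiskPressure) and recent events
kubectl describe node my-node-1

# Quick overview of the status of all nodes
kubectl get nodes -o wide
```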

KubeVersionMismatch [warning]

There are different versions of Kubernetes components running.

Possible causes

Kubernetes only supports running different versions during an upgrade. During normal operations, all components should run the same version.

Check using the output of kubectl get node -o wide and kubectl version.

KubeClientErrors [warning]

A Kubernetes API client is experiencing a high rate of errors when talking to the API server.

Possible causes

  • A misbehaving controller or application repeatedly making failing API requests.
  • Insufficient RBAC permissions, resulting in repeated authorization errors.
  • API server overload or connectivity issues.

KubeletTooManyPods [warning]

There are too many pods running on a node.

A node is at its maximum pod capacity and cannot start any new pods.

Possible causes

  • Kubernetes nodes have a default limit of 110 pods per node. Move pods to a different node, or allocate more cluster capacity.

Check:

  • You can view the total number of pods on a node in Grafana, using the kubelet dashboard.
  • You can view all pods on a node by using kubectl describe node. This displays a list of the pods running on that node.
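A quick way to see the pod count per node from the command line (the node name in the second command is a placeholder):

```shell
# Count pods per node across the cluster
# (column 8 of `kubectl get pods -o wide` is the NODE column)
kubectl get pods --all-namespaces -o wide | tail -n +2 \
  | awk '{print $8}' | sort | uniq -c | sort -rn

# Show a node's pod capacity and the pods currently running on it
kubectl describe node my-node-1
```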