Standard Kubernetes cluster alerts
Default Alerts
AME Kubernetes comes with a set of default alerts, based on the work done in the Kubernetes-mixin project and the kube-prometheus project. Avisi Cloud tweaked these alerts to fit AME Kubernetes.
This page serves as a reference for when one of these alerts fires within your cluster. Each alert entry includes a brief description of what the alert means, a list of possible causes, and suggestions on how to resolve the issue.
Control plane
On AME Kubernetes, Avisi Cloud monitors the control plane related alerts for you. Note that you will not be able to scrape some control plane components yourself.
KubeAPIDown [critical]
KubeAPI has disappeared from Prometheus target discovery.
No API server means all functionality in Kubernetes is frozen and the cluster cannot be managed. Anything already running will remain running, as long as no other systems break.
Possible causes
- No connectivity from Prometheus to the API server
- API server is down, or its load balancer is down.
KubeControllerManagerDown [critical]
KubeControllerManager has disappeared from Prometheus target discovery.
No controller manager means no new pods can be created by deployments, rollouts will freeze, etc.
Possible causes
- Controller manager is offline.
- No network connectivity between Prometheus and controller manager.
KubeSchedulerDown [critical]
KubeScheduler has disappeared from Prometheus target discovery.
No scheduler means no new pods can be scheduled on the cluster.
Possible causes
- Scheduler is offline.
- No network connectivity between Prometheus and the scheduler.
KubeletDown [critical]
Kubelet has disappeared from Prometheus target discovery.
The kubelet is offline, or there is no network connectivity between Prometheus and the kubelet.
Possible causes
- Could be an early indicator of a full node outage.
- This could also mean the kubelet just restarted and its configuration is incorrect.
- The kubelet could not reach the API server (in which case KubeAPIDown should also fire).
Workloads
Workload alerts trigger for any pod / container running within the cluster.
KubePodCrashLooping [warning]
A kubernetes pod is in a crash loop back-off state.
A crash loop means Kubernetes tried to start a pod, but it has crashed too often. After each restart, Kubernetes increases the delay before attempting another start.
Possible causes
Some common causes are:
- Upstream dependencies such as databases are not available (e.g. they do not accept new connections)
- A configuration mistake in your application
- Out of memory (OOM killed)
Check:
- Examine logs to find causes. This can be done either through kubectl logs or using Grafana Explore.
- Use kubectl describe pod to find causes.
- Make sure network policies are correct in case your application requires connectivity to external resources.
- Check resource usage (most often memory).
- Check out our run book.
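As a rough sketch, the checks above usually come down to a couple of commands (the pod and namespace names are placeholders):
```
# Logs of the previously crashed container instance
kubectl logs <pod> -n <namespace> --previous

# Events, restart count, and last state (look for OOMKilled or configuration errors)
kubectl describe pod <pod> -n <namespace>
```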
KubePodNotReady [warning]
A pod failed to start within a reasonable amount of time.
A pod was not able to become ready. This could be because it was not scheduled, was unable to start properly, or was evicted.
Possible causes
- Pod health check not succeeding (readinessProbe)
- Pod cannot be scheduled due to scheduling constraints
- No resources left
- Evicted pods
- Image could not be found
Check:
- Use kubectl describe pod <pod> to find the cause. The event log should indicate the issue in most situations.
Please note: if the pod is managed by a StatefulSet and has a failing health check, you may need to delete the pod manually.
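As a sketch, the kubectl describe check above could look like this (the pod and namespace names are placeholders):
```
# Was the pod scheduled? The NODE column stays empty if it was not.
kubectl get pod <pod> -n <namespace> -o wide

# Events show failing readiness probes, scheduling constraints, image pull errors or evictions
kubectl describe pod <pod> -n <namespace>
```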
KubeDaemonSetRolloutStuck [warning]
A Kubernetes DaemonSet has pods that are outdated and have not been updated.
Remediation:
- Trigger a new rollout by running kubectl rollout restart daemonset <name>
Example (the DaemonSet name and namespace below are placeholders):
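```
kubectl rollout restart daemonset <name> -n <namespace>

# Optionally follow the rollout until all pods have been replaced
kubectl rollout status daemonset <name> -n <namespace>
```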
KubeJobCompletion [warning]
A pod created by a job has taken more than one hour to run to completion.
Possible causes
- The job's pods keep failing, and the job keeps being retried.
- The job has been running for a long time and is stuck.
Check:
- Examine logs to find causes using kubectl logs <pod>
- Use kubectl describe pod <pod> to get information
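As a rough sketch, the following commands can show whether the job is stuck or whether its pods keep failing (the job and namespace names are placeholders):
```
# COMPLETIONS and DURATION show whether the job is making progress
kubectl get jobs -n <namespace>

# Events and failure reasons for the job
kubectl describe job <job> -n <namespace>

# Logs of the pod(s) created by the job
kubectl logs job/<job> -n <namespace>
```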
Resources
KubeCPUOvercommit [warning]
Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.
Remediation:
- Provision additional capacity in the cluster. Use the Grafana USE dashboards as a starting point.
KubeMemOvercommit [warning]
Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.
Remediation:
- Provision additional capacity in the cluster. Use the Grafana USE dashboards as a starting point.
KubeCPUQuotaOvercommit [warning]
Cluster has overcommitted CPU resource requests for Namespaces.
Remediation:
- Adjust quotas, or investigate the causes of overcommitting resources. Use the Grafana USE dashboards as a starting point.
KubeMemQuotaOvercommit [warning]
Cluster has overcommitted memory resource requests for Namespaces.
Remediation:
- Adjust quotas, or investigate the causes of overcommitting resources.
- Use the Grafana USE dashboards as a starting point.
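One way to compare requested resources against node capacity and namespace quotas, as a rough sketch (the node and namespace names are placeholders):
```
# Requested CPU and memory versus allocatable capacity on a node
kubectl describe node <node> | grep -A 8 "Allocated resources"

# Quota usage per namespace
kubectl describe resourcequota -n <namespace>
```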
KubeQuotaAlmostFull [info]
A namespace is close to using the full amount of one of its resource quotas.
Possible causes
- Workloads in the namespace request or use nearly the entire quota. Review the quota and either increase it or reduce resource usage.
KubeQuotaFullyUsed [info]
A namespace is using the full amount of one of its resource quotas.
Possible causes
- Workloads in the namespace use the entire quota. New workloads may fail to start until the quota is increased or usage is reduced.
KubeQuotaExceeded [warning]
A namespace has exceeded one of its resource quotas.
Possible causes
- Workloads in the namespace consume more than the quota allows. Increase the quota or reduce resource usage.
KubePersistentVolumeFillingUp [critical]
Persistent volume claim is expected to fill up; only a small percentage of space is left available.
Action is required to avoid downtime.
Possible causes
- Application or database writing to the disk
Check:
- You can view historical usage metrics in Grafana
Remediation:
- Clean up data on disk, or increase disk capacity by modifying the persistent volume claim's disk resource size. Note: this can only be done on clusters that support CSI resizing.
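As a sketch, resizing a claim on a cluster that supports CSI volume expansion could look like this (the PVC name, namespace and size are placeholders):
```
# Current requested size and status of the claim
kubectl get pvc <pvc-name> -n <namespace>

# Request a larger size; requires a StorageClass with allowVolumeExpansion: true
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```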
KubePersistentVolumeFillingUp [warning]
Persistent volume claim is expected to fill up within four days.
Possible causes
- Application or database writing to the disk
Check:
- You can view historical usage metrics in Grafana
Remediation:
- Clean up data on disk, or increase disk capacity by modifying the persistent volume claim's disk resource size. Note: this can only be done on clusters that support CSI resizing.
Kube system
KubeNodeNotReady [critical]
A node has been unready for more than 15 minutes.
Possible causes
A node is in an unready status. The node may be offline, or there may be another reason it is not functioning correctly.
This is a cause to investigate more closely.
Common reasons for a node not being Ready:
- Node outage
- Resource pressure, such as disk, memory, etc.
Remediation:
- Kubernetes will attempt to evict pods from an unready node if the cause is resource pressure. You can validate this by running kubectl describe node and observing the node's events and conditions.
- If it is a hardware failure, AME will auto-replace the node after 15 minutes.
- If this is a network partition issue, the situation may auto-recover. Otherwise the node will be replaced by AME.
- Avisi Cloud Support should already have been notified of the issue.
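A minimal way to confirm a node's state (the node name is a placeholder):
```
# Which nodes are NotReady
kubectl get nodes

# Conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) and recent events
kubectl describe node <node>
```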
KubeVersionMismatch [warning]
There are different versions of Kubernetes components running
Possible causes
Kubernetes only supports running different versions during an upgrade. During normal operations, all components should run the same version.
Check using the output of kubectl get node -o wide and kubectl version.
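For example:
```
# The KUBELET VERSION column shows the version running on each node
kubectl get nodes -o wide

# Client and API server versions
kubectl version
```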
KubeClientErrors [warning]
A Kubernetes API client is experiencing a high rate of errors when talking to the API server.
Possible causes
- A component or workload in the cluster is receiving errors from the API server, for example due to expired credentials, missing RBAC permissions, or an overloaded API server. Check the logs of the client that is reporting the errors.
KubeletTooManyPods [warning]
There are too many pods running on a node.
A node is at its maximum capacity and cannot start any new pods.
Possible causes
- Kubernetes nodes have a default limit of 110 pods per node. Move pods to a different node, or allocate more cluster capacity.
Check:
- You can view the total number of pods on a node in Grafana, using the kubelet dashboard.
- You can view all pods on a node by using kubectl describe node. This displays a list of the pods running on that node.
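A quick way to count the pods on a specific node from the command line, as a sketch (the node name is a placeholder):
```
# All pods scheduled on the node (subtract one line for the header)
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node> | wc -l

# The "Non-terminated Pods" section lists every pod on the node; Capacity shows the pod limit
kubectl describe node <node>
```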