How do you use auto scaling for your deployments?
Kubernetes offers the valuable functionality of horizontally scaling your deployments (also known as horizontal pod autoscaling or HPA), which we will delve into in this post. We'll explore how to configure this feature and highlight some potential pitfalls to avoid. This article is divided into two parts; the first part concentrates on autoscaling using CPU and memory metrics, while part two will focus on autoscaling with ingress-nginx and Linkerd.
Benefits of Pod Auto Scaling
Horizontal Pod Autoscaling is a powerful feature in Kubernetes that allows resources to be used more efficiently, especially during peak usage periods. It eliminates the need for manual intervention, allowing the system to automatically add or remove resources as needed, ensuring that your applications have the capacity to handle demand. This ensures that your applications remain available and performant even under high load, avoiding issues like slowdowns and unavailability.
Horizontal Pod auto scaling in Kubernetes provides several advantages, including:
- Improved performance: By automatically increasing the number of pods based on demand, auto scaling ensures that your application has the resources it needs to handle increased traffic and avoid performance issues.
- Cost efficiency: With auto scaling, you only use the resources you need, which can help reduce costs associated with over-provisioning or under-utilizing resources.
- High availability: Auto scaling ensures that your application is always available, even during periods of high traffic or unexpected demand.
- Scalability: As your application grows, auto scaling allows you to easily and efficiently increase resources to accommodate increased demand.
Requirements
- Your Kubernetes cluster must support the Metrics API. Most managed Kubernetes vendors support this out of the box, with the most common implementation being the metrics-server.
- You can validate whether your cluster supports this by running the kubectl top node command.
Output example (node names and values are illustrative and will differ in your cluster):
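```
$ kubectl top node
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1    253m         12%    1805Mi          46%
node-2    318m         15%    2112Mi          54%
```

If the command returns node metrics like the above, the Metrics API is available. If it instead returns an error indicating the Metrics API is not available, you will need to install metrics-server (or an equivalent) first.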
Setting it up
In this section, we'll dive into a technical example of how to set up horizontal pod auto scaling using Kubernetes. We'll walk through the necessary configurations and commands to get it up and running.
Deploying our example application
Autoscaling is configured using a resource called HorizontalPodAutoscaler (HPA) in Kubernetes. HPA supports both Deployments and StatefulSets.
As an example, we start by installing a new Deployment called myapp (example.yaml):
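The manifest below is a minimal sketch; the nginx image, the Service, and the resource values are assumptions you can adjust. The CPU request matters for what follows, because the HPA calculates utilization as a percentage of the requested CPU.

```yaml
# example.yaml - minimal sketch of the myapp Deployment and a Service to reach it.
# The image and resource values are illustrative; adjust them to your needs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  # no replicas field - the HPA will manage the replica count (see the note on this below)
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            # the CPU request is what the HPA measures utilization against
            requests:
              cpu: 100m
              memory: 64Mi
            limits:
              cpu: 250m
              memory: 128Mi
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 80
```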
We apply the example.yaml into the cluster:
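```
$ kubectl apply -f example.yaml
deployment.apps/myapp created
service/myapp created
```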
To simulate traffic, we deploy a load generator using the slow_cooker project from Buoyant, the company behind Linkerd. The following command creates a pod named load-generator that sends 100 requests per second to the deployed nginx with a concurrency of 10:
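A sketch of one way to do this; the buoyantio/slow_cooker image tag and the http://myapp Service URL are assumptions. Note that slow_cooker's -qps flag is per connection, so 10 qps across 10 concurrent connections adds up to roughly 100 requests per second in total:

```
kubectl run load-generator --image=buoyantio/slow_cooker:1.2.0 --command -- \
  slow_cooker -qps 10 -concurrency 10 http://myapp
```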
You can monitor various metrics, including latency, by checking the logs of the load-generator pod. After waiting a minute or so, running kubectl top pod shows an increase in the CPU usage of the nginx pod (the output below is illustrative):
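```
$ kubectl top pod
NAME                     CPU(cores)   MEMORY(bytes)
load-generator           15m          8Mi
myapp-7d9f6c5b8d-x2lqm   86m          12Mi
```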
Creating the HorizontalPodAutoscaler Policy
Now that we have deployed our application and generated some load, we can start configuring a HorizontalPodAutoscaler (HPA) policy. An HPA policy specifies the minimum and maximum number of replicas for a deployment or statefulset, and the metrics that should trigger scaling. In this way, the HPA ensures that the number of replicas is automatically adjusted based on the application's resource usage.
To create an HPA policy, we will use the kubectl autoscale command. We will set the minimum number of replicas to 2, the maximum number to 10, and the target CPU utilization to 50%. The target CPU utilization specifies the average CPU utilization across all pods in the deployment that the HPA should aim for.
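```
kubectl autoscale deployment myapp --min=2 --max=10 --cpu-percent=50
```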
After running this command, Kubernetes will create an HPA resource named myapp with a CPU utilization target of 50%. The HPA will then monitor the resource usage of the myapp deployment and automatically adjust the number of replicas to maintain the target CPU utilization.
You can also output it to yaml using the following command, for use with a GitOps pipeline approach:
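One way to do this (a sketch; exact flag support can vary between kubectl versions) is to run the same autoscale command as a client-side dry run and write the generated manifest to a file:

```
kubectl autoscale deployment myapp --min=2 --max=10 --cpu-percent=50 \
  --dry-run=client -o yaml > myapp-hpa.yaml
```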
HPA Scaling Behavior
It's important to note that the HPA scales based on the average utilization across all replica pods, so a single busy pod will not necessarily trigger scaling if the overall average stays below the target. For example, with the 100m CPU request from our example and a 50% target, one pod at 90m and another at 55m average out to roughly 72% utilization, which exceeds the target and triggers a scale-up. Keep in mind that the auto scaling decision is based on this aggregate CPU utilization: if the average utilization meets or exceeds the threshold, the scaling process will take effect.
Additionally, there is a tolerance built into the HPA scaling behavior. By default, anything within 10% of the target utilization will not trigger an autoscale (either up or down).
Scaling algorithm
You can read more about the algorithm behind the HPA in the Kubernetes HPA documentation. In short, the controller derives the desired replica count from the ratio between the current and target metric values:
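```
desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
```

For example, two replicas averaging 75% CPU utilization against a 50% target give ceil(2 * 75 / 50) = 3 replicas.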
Seeing it in action
After applying the HPA policy, there will be a slight delay before it starts to scale your deployment. This delay is due to the HPA controller collecting metrics and determining whether to scale up or down based on current utilization levels. This typically takes about a minute or so.
Shortly after, a new pod will be created (the names and ages in the output below are illustrative):
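```
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
load-generator           1/1     Running   0          12m
myapp-7d9f6c5b8d-x2lqm   1/1     Running   0          25m
myapp-7d9f6c5b8d-9kznt   1/1     Running   0          40s
```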
You will notice that the CPU utilization of both pods is now below the target utilization (again, the values shown are illustrative):
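```
$ kubectl top pod
NAME                     CPU(cores)   MEMORY(bytes)
load-generator           15m          8Mi
myapp-7d9f6c5b8d-x2lqm   45m          12Mi
myapp-7d9f6c5b8d-9kznt   42m          11Mi
```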
Things to be aware of when using autoscaling
A few things you will want to take into consideration when making use of auto scaling:
Cluster capacity
Ensure that your cluster has sufficient capacity to handle increased workloads, or that it supports node auto-scaling. Note that with node auto-scaling, it may take some time before new nodes are ready. Starting new pods is much faster than provisioning new nodes within a cluster. To handle an initial pod auto-scaling burst, you will need to have some capacity available while you (or your cloud provider/service provider) can provision new machines to join your cluster as nodes.
Configure reliable deployments
When using auto scaling, ensure that you have configured certain properties on your deployment to avoid connection errors during scaling operations. These properties are largely the same as those required to support zero-downtime deployments. This comes down to configuring a readinessProbe and a startupProbe.
Using a readinessProbe and a startupProbe helps make sure a pod is fully able to serve traffic before it receives any new load.
For example (a minimal sketch for the nginx container from our example; the paths, ports, and timings are assumptions you should tune to your application):
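```yaml
containers:
  - name: nginx
    image: nginx:1.25
    ports:
      - containerPort: 80
    # startupProbe: give the application time to boot before other probes and traffic kick in
    startupProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 2
      failureThreshold: 30
    # readinessProbe: only route traffic to the pod while this check succeeds
    readinessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
      failureThreshold: 3
```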
Graceful shutdowns
It is important to ensure that your application performs a graceful shutdown and waits for a certain amount of time before exiting. This is necessary to ensure that all traffic to the pod being terminated has stopped. Since Kubernetes is a distributed system, it takes some time before all systems know that a pod on another node is shutting down.
As a result, requests may still be in-flight as the pod enters the termination state. To handle this gracefully, your application either needs to keep serving traffic at this stage, or you could consider using a preStop lifecycle hook to delay shutdown, especially if your application exits quickly.
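A minimal sketch of such a hook, assuming a sleep binary is available in the container image; the 10-second delay is an arbitrary example and should roughly match how long your cluster needs to stop routing traffic to the terminating pod:

```yaml
lifecycle:
  preStop:
    exec:
      # keep the container alive briefly so in-flight requests can finish
      # and endpoints/load balancers stop sending new traffic to the pod
      command: ["sleep", "10"]
```

Keep in mind that the pod's terminationGracePeriodSeconds (30 seconds by default) must leave enough room for both the hook and your application's own shutdown.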
For upstream components, it is advisable to configure a retry mechanism for failed connections. Although retrying may not be feasible for all types of requests or transactions, it helps avoid errors during scaling events, at the cost of slightly increased latency for the retried requests.
Avoid Configuring Replicas
It is not recommended to configure the replicas field in your deployment when using HPA. Doing so will result in conflicts, and every time you perform a new deployment, the number of pods will scale up/down to the value in your Deployment. After this, the HPA controller will reset it again, resulting in unnecessary pod terminations and creations.
Instead, use the HorizontalPodAutoscaler resource to configure the minimum desired number of replicas by setting spec.minReplicas.
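For reference, this is what the HPA created earlier looks like declaratively, with the minimum and maximum replica counts configured on the autoscaler itself (using the autoscaling/v2 API):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```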
Part 2 - Pod auto scaling using Linkerd
In this blog post, we have explored the concept of horizontal pod autoscaling in Kubernetes. We have discussed the benefits of pod auto scaling and the different types of metrics used for auto scaling. We have also gone through a technical example of how to set up horizontal pod auto scaling and the things to watch out for when using it.
In addition to using CPU and memory metrics, horizontal pod auto scaling in Kubernetes can be configured to use custom metrics. The popular Prometheus metrics adapter can be used to scale up/down based on requests per second, latency or any other custom metric available.
In part two of this series, we will delve into the details of how to use custom metrics for auto scaling pods. Stay tuned for the next post!