Auto scaling in Kubernetes (Part 1)

Posted March 20, 2023 by Thomas Kooi ‐ 9 min read

Learn how to configure horizontal autoscaling in Kubernetes using CPU and memory metrics, and get some tips on best practices to follow.

How do you use auto scaling for your deployments?

Kubernetes offers the valuable functionality of horizontally scaling your deployments (also known as horizontal pod autoscaling or HPA), which we will delve into in this post. We’ll explore how to configure this feature and highlight some potential pitfalls to avoid. This article is divided into two parts; the first part concentrates on autoscaling using CPU and memory metrics, while part two will focus on autoscaling with ingress-nginx and Linkerd.

Benefits of Pod Auto Scaling

Horizontal Pod Autoscaling is a powerful feature in Kubernetes that allows resources to be used more efficiently, especially during peak usage periods. It eliminates the need for manual intervention, allowing the system to automatically add or remove resources as needed, ensuring that your applications have the capacity to handle demand. This ensures that your applications remain available and performant even under high load, avoiding issues like slowdowns and unavailability.

Horizontal Pod auto scaling in Kubernetes provides several advantages, including:

  • Improved performance: By automatically increasing the number of pods based on demand, auto scaling ensures that your application has the resources it needs to handle increased traffic and avoid performance issues.
  • Cost efficiency: With auto scaling, you only use the resources you need, which can help reduce costs associated with over-provisioning or under-utilizing resources.
  • High availability: Auto scaling ensures that your application is always available, even during periods of high traffic or unexpected demand.
  • Scalability: As your application grows, auto scaling allows you to easily and efficiently increase resources to accommodate increased demand.

Requirements

  • Your Kubernetes cluster must support the Metrics API. Most managed Kubernetes vendors support this out of the box, with the most common implementation being the metrics-server.
  • You can validate whether your cluster supports this by running the kubectl top node command.

Output example:

$ kubectl top node
NAME                                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-7.eu-west-1.compute.internal    100m         5%     1781Mi          51%
ip-10-0-0-92.eu-west-1.compute.internal   162m         8%     1927Mi          56%
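
If the kubectl top command fails, you can check whether the Metrics API is registered in your cluster. This assumes the common metrics-server implementation, which serves the metrics.k8s.io API group:

$ kubectl get apiservice v1beta1.metrics.k8s.io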

Setting it up

In this section, we’ll dive into a technical example of how to set up horizontal pod auto scaling using Kubernetes. We’ll walk through the necessary configurations and commands to get it up and running.

Deploying our example application

Autoscaling is configured using a resource called HorizontalPodAutoscaler (HPA) in Kubernetes. An HPA can target any resource that implements the scale subresource, most commonly Deployments and StatefulSets.

As an example, we start by creating a new Deployment called myapp (example.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: nginx
        resources:
          # Only limits are set here, so Kubernetes defaults the requests
          # to the same values. The HPA measures utilization as a
          # percentage of the requests.
          limits:
            memory: "128Mi"
            cpu: "100m"
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: default
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 80

We apply the example.yaml into the cluster:

$ kubectl apply -f example.yaml 
deployment.apps/myapp created
service/myapp created

$ kubectl get pod
NAME                     READY   STATUS        RESTARTS   AGE
myapp-58fd9b8cb7-4pf4d   1/1     Running       0          4s

To simulate traffic, we deploy a load generator using the slow_cooker project from Buoyant, the company behind Linkerd. The following command creates a pod named load-generator that sends 100 requests per second to the deployed nginx with a concurrency of 10:

kubectl run load-generator --image=buoyantio/slow_cooker -- -qps 100 -concurrency 10 http://myapp
[Diagram: myapp example deployment with service and load generator]

You can monitor various metrics, including latency, by checking the logs of the load-generator pod. After waiting for a minute or so, running kubectl top pod shows an increase in the CPU usage of the nginx pod:

$ kubectl top pod
NAME                    CPU(cores)   MEMORY(bytes)   
load-generator          128m         5Mi             
myapp-5664749b7-bblqk   79m          2Mi         

Creating the HorizontalPodAutoscaler Policy

Now that we have deployed our application and generated some load, we can start configuring a HorizontalPodAutoscaler (HPA) policy. An HPA policy specifies the minimum and maximum number of replicas for a Deployment or StatefulSet, and the metrics that should trigger scaling. In this way, the HPA ensures that the number of replicas is automatically adjusted based on the application’s resource usage.

To create an HPA policy, we will use the kubectl autoscale command. We will set the minimum number of replicas to 1, the maximum number to 10, and the target CPU utilization to 60%. The target CPU utilization specifies the average CPU utilization across all pods in the deployment that the HPA should aim for.

kubectl autoscale deployment myapp --cpu-percent=60 --min=1 --max=10

After running this command, Kubernetes will create an HPA resource named myapp and set its CPU utilization target to 60%. The HPA will then monitor the resource usage of the myapp deployment and automatically adjust the number of replicas to maintain the target CPU utilization.

You can also output the policy as YAML using the following command, for use with a GitOps pipeline approach:

kubectl autoscale deployment myapp --cpu-percent=60 --min=1 --max=10 -o yaml --dry-run=client
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60
[Diagram: myapp example with the HPA and scaled-up replica set]
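
Note that kubectl generates a manifest using the older autoscaling/v1 API. On Kubernetes 1.23 or later, you can express the same policy with the autoscaling/v2 API, which also supports memory and custom metrics. A minimal equivalent sketch:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60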

HPA Scaling Behavior

It’s important to note that the HPA scales based on the average utilization across all replica pods, not on individual pods. A single pod with high CPU utilization will not trigger scaling if the other pods keep the average below the target. For example, with a CPU request of 100m and a target of 60%, pods using 90m and 55m average 72.5% and will trigger a scale-up, while pods using 90m and 20m average only 55% and will not.

Additionally, there is a tolerance built into the HPA scaling behavior. By default, anything within 10% of the target utilization will not trigger an autoscale (either up or down).

Scaling algorithm

You can read more about the algorithm behind the HPA in the Kubernetes HPA documentation.
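
In short, the controller derives the desired replica count from the ratio between the current and target metric values:

desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]

For example, with one replica at 75% utilization and a 60% target, the HPA computes ceil(1 * 75 / 60) = ceil(1.25) = 2 replicas, which matches the scale-up shown in the next section.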

Seeing it in action

After applying the HPA policy, there will be a slight delay before it starts to scale your deployment. This delay is due to the HPA controller collecting metrics and determining whether to scale up or down based on current utilization levels. This typically takes about a minute or so.

$ kubectl get hpa
NAME    REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   75%/60%   1         10        1          51s

Shortly after, a new pod will be created:

$ kubectl get hpa
NAME    REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
myapp   Deployment/myapp   40%/60%   1         10        2          8m42s

$ kubectl describe hpa myapp
Name:                                                  myapp
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Sat, 23 Mar 2023 12:30:10 +0200
Reference:                                             Deployment/myapp
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  40% (40m) / 60%
Min replicas:                                          1
Max replicas:                                          10
Deployment pods:                                       2 current / 2 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age    From                       Message
  ----    ------             ----   ----                       -------
  Normal  SuccessfulRescale  5m26s  horizontal-pod-autoscaler  New size: 2; reason: cpu resource utilization (percentage of request) above target

You will notice that the CPU utilization of both pods is now below the target utilization:

NAME                    CPU(cores)   MEMORY(bytes)   
load-generator          139m         5Mi             
myapp-5664749b7-bblqk   41m          2Mi             
myapp-5664749b7-lrj8g   54m          2Mi  
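
You can also follow the autoscaler's decisions live while the load changes:

$ kubectl get hpa myapp --watch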

Things to be aware of when using autoscaling

There are a few things you will want to take into consideration when making use of auto scaling:

Cluster capacity

Ensure that your cluster has sufficient capacity to handle increased workloads, or that it supports node auto-scaling. Note that with node auto-scaling, it may take some time before new nodes are ready. Starting new pods is much faster than provisioning new nodes within a cluster. To handle an initial pod auto-scaling burst, you will need to have some capacity available while you (or your cloud provider/service provider) can provision new machines to join your cluster as nodes.

Configure reliable deployments

When using auto scaling, ensure that you have configured certain properties on your deployment to avoid connection errors during scaling operations. These properties are largely the same as those required for zero-downtime deployments, and come down to configuring a readinessProbe and a startupProbe.

Using a readinessProbe and a startupProbe helps make sure a pod is fully able to serve traffic before it receives any load.

For example:

# These probes assume the containerPort is named "http"; with the example
# Deployment above, which uses an unnamed port, use "port: 80" instead.
startupProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5

Graceful shutdowns

It is important to ensure that your application performs a graceful shutdown and waits for a certain amount of time before exiting. This is necessary to ensure that all traffic to the pod being terminated has stopped. Since Kubernetes is a distributed system, it takes some time before all systems know that a pod on another node is shutting down.

As a result, requests may still be in-flight as the pod enters the termination state. To handle this gracefully, your application needs to keep serving traffic at this stage; alternatively, consider using a preStop lifecycle hook, especially if your application otherwise shuts down quickly.
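
For example, a short sleep in a preStop hook keeps the container alive while the endpoint removal propagates through the cluster. This assumes your container image ships a sleep binary; the 10-second value below is an illustrative assumption, so tune it to your environment and make sure terminationGracePeriodSeconds (30 seconds by default) leaves room for it plus your application's shutdown time:

lifecycle:
  preStop:
    exec:
      # Keep serving in-flight requests while other components learn
      # that this pod is terminating.
      command: ["sleep", "10"]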

For upstream components, it is advisable to configure a retry mechanism for failed connections. Retrying may not be feasible for every type of request or transaction, but where it is, it avoids dropped requests at the cost of a slight increase in latency.
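
As a simple illustration of the idea, curl can retry refused connections; treat this as a stand-in for whatever retry logic your client library or service mesh provides:

curl --retry 3 --retry-connrefused http://myapp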

Avoid Configuring Replicas

It is not recommended to configure the replicas field in your Deployment when using HPA. Doing so results in conflicts: every time you apply the manifest, the number of pods is reset to the value in your Deployment, after which the HPA controller scales it back again, resulting in unnecessary pod terminations and creations.

Instead, use the HorizontalPodAutoscaler resource to configure the minimum desired amount of replicas by setting spec.minReplicas.

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # This will make sure there are always at least 3 pods in the myapp deployment
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60

Part 2 - Pod Auto Scaling using Linkerd

In this blog post, we have explored the concept of horizontal pod autoscaling in Kubernetes. We have discussed the benefits of pod auto scaling and the different types of metrics used for auto scaling. We have also gone through a technical example of how to set up horizontal pod auto scaling and the things to watch out for when using it.

In addition to using CPU and memory metrics, horizontal pod auto scaling in Kubernetes can be configured to use custom metrics. The popular Prometheus metrics adapter can be used to scale up/down based on requests per second, latency or any other custom metric available.

In part two of this series, we will delve into the details of how to use custom metrics for auto scaling pods. Stay tuned for the next post!