Kubernetes - Horizontal and Vertical Pod Autoscaling



Autoscaling is a powerful Kubernetes feature that helps us ensure our applications remain responsive and cost-efficient by adjusting the number of pods or resources they consume based on demand. In this chapter, we’ll explore two types of autoscaling in Kubernetes: Horizontal Pod Autoscaling and Vertical Pod Autoscaling.

What is Horizontal Pod Autoscaling?

Horizontal Pod Autoscaling (HPA) automatically increases or decreases the number of pod replicas in a Deployment or ReplicaSet based on observed CPU/memory usage or custom metrics.

Imagine we have a web service handling fluctuating traffic. Instead of manually scaling the number of pods, HPA can help us scale it up when traffic spikes and scale down when it's idle.

Prerequisites for HPA

Before we can use HPA, make sure:

  • The Metrics Server is deployed and running in the cluster.

Why the Metrics Server is important

  • Metrics Server collects CPU/memory usage from each pod via Kubelets.
  • It exposes these metrics through the Kubernetes metrics.k8s.io API.
  • HPA queries this API periodically.
  • If usage crosses the configured threshold (e.g., 50% CPU), HPA adjusts the number of replicas accordingly.
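
The replica count HPA converges on follows a simple proportional rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). As a rough sketch (the numbers here are hypothetical), we can check the arithmetic in plain shell:

```shell
# Sketch of the HPA scaling formula with hypothetical numbers:
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
current_replicas=2
current_cpu=65   # observed average CPU utilization, in percent
target_cpu=50    # target utilization configured on the HPA
# Integer ceiling division: (a + b - 1) / b
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "desired replicas: $desired"
```

With these numbers the formula yields 3 replicas, the kind of scale-up we'll provoke later in this chapter.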

To install the Metrics Server:

$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Output

service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created

Then verify it's working:

$ kubectl get deployment metrics-server -n kube-system

Output

NAME             READY   UP-TO-DATE   AVAILABLE   AGE
metrics-server   1/1     1            1           42s

Using Horizontal Pod Autoscaler (HPA)

Let’s walk through a simple example of setting up HPA for an NGINX deployment.

Create a Deployment

Create a file called nginx-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 100m
          limits:
            cpu: 200m

Apply the deployment:

$ kubectl apply -f nginx-deployment.yaml

Output

deployment.apps/nginx-deployment created

Create the HPA

Now we’ll create a Horizontal Pod Autoscaler that adjusts between 2 and 5 pods based on CPU usage:

$ kubectl autoscale deployment nginx-deployment --cpu-percent=50 --min=2 --max=5

Output

horizontalpodautoscaler.autoscaling/nginx-deployment autoscaled

This command tells Kubernetes to maintain an average CPU utilization of 50% across all pods.
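
The same autoscaler can also be defined declaratively. A minimal equivalent manifest, using the autoscaling/v2 API, might look like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```

The declarative form is preferable when the HPA should live in version control alongside the deployment it scales.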

Monitor the Autoscaler

Use the following command to check the current status:

$ kubectl get hpa

Output

NAME               REFERENCE                     TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
nginx-deployment   Deployment/nginx-deployment   0%/50%    2         5         2          5m

To test it, we can generate CPU load on the pods using a stress tool like busybox or a custom load generator.

Generate CPU Load to Trigger HPA

We'll run a busybox pod and use it to continuously hit the NGINX service. This will indirectly increase CPU usage and trigger scaling via HPA.

Expose the NGINX Deployment

We’ll start by creating a ClusterIP service so we can target it with load:

$ kubectl expose deployment nginx-deployment --port=80 --target-port=80

Output

service/nginx-deployment exposed
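
For reference, the imperative command above corresponds roughly to this declarative Service manifest (it reuses the deployment's app: nginx label as the selector):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-deployment
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
```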

Create the Load Generator (busybox)

Now we’ll create a pod that repeatedly curls the NGINX service to cause CPU load.

Create a file named busybox-load.yaml with the following lines:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-load
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "while true; do wget -q -O- http://nginx-deployment; done"]

Apply it:

$ kubectl apply -f busybox-load.yaml

Output

pod/busybox-load created

This will continuously send HTTP requests to the NGINX service, keeping the pods busy.

Monitor HPA Reaction

Watch how the autoscaler responds:

$ kubectl get hpa -w

Output

NAME               REFERENCE                     TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
nginx-deployment   Deployment/nginx-deployment   65%/50%     2         5         3          4m22s

This output confirms:

  • Our application (nginx-deployment) is under high CPU load, averaging 65%.
  • The autoscaler saw that usage exceeded the target (50%), so it scaled the number of pods from 2 to 3.
  • HPA is working as expected: it dynamically adjusted our pod count to meet CPU demand.

What is Vertical Pod Autoscaling?

While HPA adjusts the number of pods, Vertical Pod Autoscaling (VPA) adjusts the CPU and memory requests and limits of individual pods. This is useful for applications with unpredictable memory or CPU requirements where simply adding pods doesn’t help.

VPA is ideal for batch jobs or services that can’t scale horizontally.

Enabling Vertical Pod Autoscaling

VPA isn’t installed by default in most clusters. Here’s how we can install it using the official YAMLs:

$ git clone https://github.com/kubernetes/autoscaler.git
$ cd autoscaler/vertical-pod-autoscaler

Output

Cloning into 'autoscaler'...
remote: Enumerating objects: 219519, done.
remote: Counting objects: 100% (1546/1546), done.
remote: Compressing objects: 100% (1095/1095), done.
remote: Total 219519 (delta 972), reused 453 (delta 451), pack-reused 217973 (from 4)
Receiving objects: 100% (219519/219519), 245.21 MiB | 14.42 MiB/s, done.
Resolving deltas: 100% (141772/141772), done.
Updating files: 100% (7924/7924), done

Apply the Custom Resource Definitions (CRDs)

$ kubectl apply -f deploy/vpa-crd.yaml

Output

customresourcedefinition.apiextensions.k8s.io/verticalpodautoscalers.autoscaling.k8s.io created
customresourcedefinition.apiextensions.k8s.io/verticalpodautoscalercheckpoints.autoscaling.k8s.io created

This defines the VPA objects (VerticalPodAutoscaler) that Kubernetes can understand.

Apply the RBAC Permissions

$ kubectl apply -f deploy/vpa-rbac.yaml

Output

clusterrole.rbac.authorization.k8s.io/vpa-system-role created
clusterrolebinding.rbac.authorization.k8s.io/vpa-system-binding created

This grants the necessary permissions for the VPA components to function securely.

Deploy the VPA Components

Apply the Recommender deployment:

$ kubectl apply -f deploy/recommender-deployment.yaml

Output

deployment.apps/vpa-recommender created

Apply the Admission Controller Deployment

$ kubectl apply -f deploy/admission-controller-deployment.yaml

Output

deployment.apps/vpa-admission-controller created
service/vpa-webhook created

Apply the Updater Deployment

$ kubectl apply -f deploy/updater-deployment.yaml

Output

deployment.apps/vpa-updater created

Verify that the VPA Components are Running

$ kubectl get pods -n kube-system | grep vpa

Output

vpa-admission-controller-6d5b66d7f5-mtx5g   1/1     Running   0          5m
vpa-recommender-6b87f56c7b-q84g5            1/1     Running   0          5m
vpa-updater-56f6f777d8-2knrj                1/1     Running   0          5m

Using Vertical Pod Autoscaler (VPA)

Let’s create a sample deployment and attach a VPA configuration.

Create a Deployment

Create a file vpa-demo.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vpa-demo
  template:
    metadata:
      labels:
        app: vpa-demo
    spec:
      containers:
      - name: app
        image: vish/stress
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 200m
            memory: 200Mi
        args:
        - -cpus
        - "2"

Apply the deployment:

$ kubectl apply -f vpa-demo.yaml

Output

deployment.apps/vpa-demo created

Create the VPA Object

Now we’ll create the VPA configuration (vpa.yaml):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-demo
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       vpa-demo
  updatePolicy:
    updateMode: "Auto"

Apply it:

$ kubectl apply -f vpa.yaml

Output

verticalpodautoscaler.autoscaling.k8s.io/vpa-demo created

Monitor VPA Recommendations

After applying the VPA configuration, the VPA starts monitoring the deployment and producing resource recommendations. With updateMode set to Auto, the VPA updater evicts pods whose requests drift too far from the recommendation, and the admission controller applies the new values when the pods are recreated.

To view the VPA recommendations, run:

$ kubectl describe vpa vpa-demo

Output

Name:          vpa-demo
Namespace:     default
Target Ref:    Deployment/vpa-demo
Update Policy: Auto
Recommendations:
  Container: app
  Target:
    CPU:    100m
    Memory: 100Mi
  Current:
    CPU:    200m
    Memory: 200Mi
  Utilization:
    CPU:    50%
    Memory: 50%

The recommendations show the target and current resource requests, along with CPU and memory utilization. Based on this, VPA may adjust the resource requests the next time the pods are recreated.
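
If we want to keep recommendations within safe bounds, VPA supports a resourcePolicy section. Here is a sketch (the minAllowed/maxAllowed values are arbitrary examples, not tuned recommendations):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-demo
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 50m
        memory: 50Mi
      maxAllowed:
        cpu: "1"
        memory: 500Mi
```

This keeps the updater from shrinking requests below a known-safe floor or inflating them past a budget ceiling.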

When to Use HPA vs VPA

Use Case                        HPA           VPA
Web services under traffic      Yes           Sometimes useful
Long-running jobs               No            Ideal
Resource tuning for workloads   No            Yes
Fast scaling needs              Best choice   Not optimal

You can also combine both (with care) by setting HPA for replicas and VPA in "Off" mode to only get recommendations.
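
For that recommendation-only setup, the VPA object simply switches updateMode. Pods are never evicted, and the suggested values show up under kubectl describe vpa:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-demo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vpa-demo
  updatePolicy:
    updateMode: "Off"
```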

Conclusion

In this tutorial, we’ve learned how Kubernetes helps us scale workloads using Horizontal and Vertical Pod Autoscaling.

  • HPA adjusts the number of pods based on CPU, memory, or custom metrics.
  • VPA automatically adjusts pod resource requests and limits.

Both autoscaling types are crucial tools in building a resilient, cost-effective, and responsive infrastructure. Try them out in your test environments, and start tuning your workloads the smart way.
