What is the Horizontal Pod Autoscaler, and how does it decide when to scale?
Quick Answer
The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of replicas of a scalable workload (a Deployment, ReplicaSet, or StatefulSet) based on observed metrics compared against a target you configure. The most common metrics are average CPU or memory utilization across the Pods, but custom and external metrics are also supported via the metrics APIs. The HPA periodically checks current metric values, computes the replica count needed to bring the metric back toward the target, and updates the workload's replica count accordingly, with no manual scaling required.
Detailed Answer
Defining an HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # target: keep average CPU usage around 60% of requested CPU
This HPA targets the web Deployment and scales its replica count between 2 and 10, aiming to keep average CPU utilization across its Pods near 60% of each Pod's requested CPU. Note that utilization is measured relative to the request, not the limit. That is why setting sensible CPU requests, as covered in the requests/limits question, is a prerequisite for the HPA to make sensible decisions at all.
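To make the "relative to the request" point concrete, here is a minimal Python sketch of how a utilization percentage can be derived from per-Pod usage and requests. The function name and the millicore figures are hypothetical, and the sketch assumes usage is summed across Pods and divided by the summed requests; the real controller also handles cases such as missing metrics and Pods that are not yet ready, which are omitted here.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Return CPU usage as a percentage of *requested* CPU across a set of Pods.

    Both arguments are per-Pod lists in millicores (hypothetical sample data).
    """
    if len(usage_millicores) != len(request_millicores):
        raise ValueError("need one usage value and one request per Pod")
    total_request = sum(request_millicores)
    if total_request <= 0:
        # Without a CPU request there is nothing to measure utilization against.
        raise ValueError("Pods must declare a positive CPU request")
    return 100 * sum(usage_millicores) / total_request


# Three Pods, each requesting 500m CPU. Their limits play no part in the result.
usage = [450, 300, 150]
requests = [500, 500, 500]
print(cpu_utilization_percent(usage, requests))  # 60.0
```

Because the denominator is the request, a Pod whose limit is higher than its request can report utilization above 100%: a single Pod using 750m against a 500m request comes out at 150%.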
The basic algorithm
The HPA controller periodically (by default, every 15 seconds) queries the metrics API for the current average utilization across the target's Pods, and computes a desired replica count using roughly:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))
For example, if current average CPU utilization is 90% against a 60% target with 4 current replicas, then ceil(4 * (90/60)) = 6. The HPA scales up to 6 replicas, which should bring the average back down toward the target once load is spread across more Pods.
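The formula above can be sketched in a few lines of Python. This is an illustration rather than the controller's actual code: the function name and parameters are hypothetical, and two details not spelled out in the text are added as assumptions, namely the controller's default 10% tolerance (no change when the ratio is close to 1.0) and clamping to minReplicas/maxReplicas.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of: ceil(currentReplicas * (currentMetricValue / targetMetricValue))."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Assumption: within the tolerance band, leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))   # 6
print(desired_replicas(4, 63, 60, min_replicas=2, max_replicas=10))   # 4 (within tolerance)
print(desired_replicas(8, 120, 60, min_replicas=2, max_replicas=10))  # 10 (capped at max)
print(desired_replicas(4, 15, 60, min_replicas=2, max_replicas=10))   # 2 (raised to min)
```

The first call reproduces the worked example: 4 replicas at 90% against a 60% target yields 6.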
Requires the metrics-server (or another metrics API) to be running
The HPA doesn't collect metrics itself. It queries the Kubernetes metrics APIs: the resource metrics API (metrics.k8s.io), typically served by the metrics-server add-on, for basic CPU/memory metrics, and the custom or external metrics APIs, served by an adapter (often backed by Prometheus), for anything beyond basic resource utilization. Without metrics-server (or an equivalent) installed and functioning in the cluster, an HPA configured for CPU/memory has no data to act on and won't scale at all. This is a common early "why isn't autoscaling working" gotcha on a newly set up cluster.
Scaling on custom and external metrics
metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
Beyond basic CPU/memory, the HPA (autoscaling/v2) supports scaling on custom application-level metrics (e.g., requests per second, queue depth) exposed through a custom metrics adapter, as in the fragment above, which targets an average of 1000 requests_per_second per Pod. It also supports external metrics that originate outside the cluster entirely (e.g., the depth of an external cloud message queue). Either way, scaling decisions can reflect the metric that best captures the workload's real bottleneck rather than being limited to CPU/memory alone.
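With an AverageValue target, the comparison is between raw values rather than percentages of a request. A minimal sketch, assuming hypothetical per-Pod requests_per_second readings and applying the same ratio formula (tolerance and min/max clamping are left out for brevity):

```python
import math


def replicas_for_average_value(per_pod_values, target_average):
    """Apply ceil(currentReplicas * (currentAverage / targetAverage)) to raw values."""
    current_replicas = len(per_pod_values)
    current_average = sum(per_pod_values) / current_replicas
    return math.ceil(current_replicas * (current_average / target_average))


# Three Pods averaging 1500 requests/second against a target of 1000 per Pod.
print(replicas_for_average_value([1400, 1600, 1500], 1000))  # 5
```

In effect the result is the total load divided by the per-Pod target, rounded up: 4500 requests per second at 1000 per Pod needs 5 Pods.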
Stabilization and avoiding "flapping"
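The HPA includes built-in stabilization logic to avoid rapidly scaling up and down in response to short-lived metric spikes. In autoscaling/v2, stabilization windows are configurable per direction through the behavior field (behavior.scaleUp and behavior.scaleDown, each with stabilizationWindowSeconds). By default, scale-up takes effect immediately while scale-down waits out a 300-second window, so a brief traffic blip can add replicas quickly but won't be followed by an equally hasty scale-down moments later, which would add churn without real benefit.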
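The effect of a scale-down stabilization window can be illustrated with a small simulation. This is a simplified sketch, not the controller's implementation: the class name is hypothetical, the 300-second window is the default scale-down value, and the model assumes the controller acts on the highest recommendation seen within the window, so scale-ups pass through immediately while scale-downs are held back.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold the replica count at the highest recommendation seen in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommended):
        self._history.append((now, recommended))
        # Forget recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
# (time in seconds, raw recommendation): a brief spike to 6 at t=15.
for t, raw in [(0, 4), (15, 6), (30, 4), (300, 4), (330, 4)]:
    print(t, raw, stabilizer.stabilize(t, raw))
```

In this run the spike at t=15 raises the count to 6 at once, the count stays at 6 through t=300 even though the raw recommendation is back to 4, and it drops to 4 only at t=330, once the spike has left the window.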
Knowing that the HPA's CPU-based scaling target is relative to the Pod's requested CPU (not its limit, and not the node's total capacity) is a specific, easily tested detail that distinguishes surface familiarity from hands-on experience configuring and tuning an HPA.