What is the Horizontal Pod Autoscaler, and how does it decide when to scale?

6 min · intermediate · horizontal-pod-autoscaler · autoscaling · scaling

Quick Answer

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of replicas of a Deployment, ReplicaSet, or StatefulSet based on observed metrics compared against a target you configure. The metrics are most commonly average CPU or memory utilization across the Pods, but can also be custom or external metrics served through the metrics APIs. The HPA periodically checks current metric values, computes the replica count needed to bring the metric back toward the target, and updates the workload's replica count accordingly, with no manual scaling required.

Detailed Answer

Defining an HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # target: keep average CPU usage around 60% of requested CPU

This targets the web Deployment and scales its replica count between 2 and 10, aiming to keep average CPU utilization across its Pods near 60% of requested CPU. The percentage is relative to the request, not the limit, which is why setting sensible CPU requests (covered in the requests/limits question) is a prerequisite for the HPA to make sensible decisions at all.
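To make "relative to the request" concrete, here is a minimal Python sketch of how a utilization percentage can be derived from per-Pod usage and requests. The function name is hypothetical, and summing usage and requests across Pods is a simplifying assumption about how the controller aggregates, not a copy of its code.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """Utilization as a percentage of requested CPU, aggregated across Pods.

    Both lists hold one value per Pod, in millicores. Limits and node
    capacity play no part in the calculation.
    """
    total_request = sum(request_millicores)
    if total_request == 0:
        # Without CPU requests there is nothing to compute a percentage against.
        raise ValueError("Pods must declare CPU requests for a Utilization target")
    return 100 * sum(usage_millicores) / total_request


# Three Pods, each requesting 500m, currently using 300m, 450m, and 600m.
print(cpu_utilization_percent([300, 450, 600], [500, 500, 500]))  # 90.0
```

Doubling every request to 1000m with the same usage would halve the reported utilization to 45%, which is why poorly chosen requests distort scaling decisions.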

The basic algorithm

The HPA controller periodically (by default, every 15 seconds) queries the metrics API for the current average utilization across the target's Pods, and computes a desired replica count using roughly:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))

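If current average CPU utilization is 90% against a 60% target with 4 current replicas, the result is ceil(4 * (90/60)) = 6, so the HPA scales up to 6 replicas. Spreading the same load across more Pods should bring the average back down toward the target. The result is always kept within the minReplicas/maxReplicas range, and by default the controller skips scaling altogether when the ratio of current to target is within 10% of 1.0.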
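The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the controller's actual code: the function name and signature are hypothetical, the 0.1 tolerance mirrors the controller's documented default, and details such as unready Pods or missing metrics are ignored.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified replica calculation for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))   # 6
print(desired_replicas(4, 63, 60, min_replicas=2, max_replicas=10))   # 4 (within tolerance)
print(desired_replicas(8, 120, 60, min_replicas=2, max_replicas=10))  # 10 (capped at max)
```

The first call reproduces the worked example: 90% against a 60% target with 4 replicas gives 6.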

Requires the metrics-server (or another metrics API) to be running

The HPA doesn't collect metrics itself. It queries the metrics APIs, which are typically served by the metrics-server add-on for basic CPU/memory metrics, or by a custom/external metrics adapter (often backed by Prometheus) for anything beyond basic resource utilization. Without metrics-server or an equivalent installed and functioning in the cluster, an HPA configured for CPU/memory has no data to act on and won't scale at all. This is a common early "why isn't autoscaling working" gotcha on a newly set up cluster.

Scaling on custom and external metrics

metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Beyond basic CPU/memory, the HPA (autoscaling/v2) supports scaling on custom application-level metrics (e.g., requests per second, queue depth) exposed through a custom metrics adapter, and on external metrics that originate outside the cluster entirely (e.g., the depth of an external cloud message queue). This lets scaling decisions follow the metric that reflects the workload's real bottleneck rather than being limited to CPU/memory alone.
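With an AverageValue target like the requests_per_second example above, the comparison uses raw per-Pod values rather than a percentage of requests. The following Python sketch shows how the same ratio formula plays out in that case; the function name is hypothetical, and bounds, tolerance, and stabilization are deliberately left out.

```python
import math


def replicas_for_average_value(per_pod_values, target_average):
    """Replica count that brings the per-Pod average down to the target.

    ceil(n * (average / target)) simplifies to ceil(total / target),
    because average = total / n.
    """
    return math.ceil(sum(per_pod_values) / target_average)


# Three Pods serving 1500, 1400, and 1600 requests per second,
# against a target of 1000 per Pod.
print(replicas_for_average_value([1500, 1400, 1600], 1000))  # 5
```

Five replicas would serve the same 4500 requests per second at about 900 each, just under the configured target.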

Stabilization and avoiding "flapping"

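The HPA includes built-in stabilization logic, in the form of configurable stabilization windows, to avoid rapidly scaling up and down in response to short-lived metric changes. By default, scale-downs are held back by a 300-second window while scale-ups take effect immediately; both can be tuned per HPA under spec.behavior in autoscaling/v2. Without stabilization, a brief traffic blip could trigger a scale-up followed by an equally hasty scale-down moments later, adding churn without real benefit.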
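One way to picture a scale-down stabilization window is as a rolling record of recent recommendations from which the highest value wins. The Python sketch below illustrates that idea only; the class and method names are hypothetical, timestamps are plain seconds, and the 300-second default is an assumption matching the commonly documented scale-down window.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold back scale-downs until a lower value has persisted for the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommended_replicas) pairs

    def stabilize(self, now, recommended):
        self._history.append((now, recommended))
        # Forget recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # Acting on the highest recent recommendation delays scale-downs
        # but lets scale-ups through immediately.
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.stabilize(now=0, recommended=6))    # 6
print(stabilizer.stabilize(now=60, recommended=3))   # 6 (spike still inside the window)
print(stabilizer.stabilize(now=301, recommended=3))  # 3 (spike has aged out)
```

The design choice here is asymmetry: a higher recommendation is applied at once, while a lower one has to persist for the whole window before replicas are removed.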

Knowing that the HPA's CPU utilization target is relative to the Pod's requested CPU (not its limit, and not the node's total capacity) is a specific, easily tested detail. It separates surface familiarity from hands-on experience configuring and tuning an HPA.