What's the difference between resource requests and limits?

A **request** is the amount of CPU/memory the scheduler guarantees is reserved for a container when deciding which node to place it on — it's the minimum a container is assumed to need. A **limit** is the maximum a container is allowed to consume at runtime — exceeding a CPU limit causes throttling, while exceeding a memory limit causes the container to be killed (OOMKilled). Requests drive scheduling decisions; limits drive runtime enforcement — and they're allowed to differ, which is exactly what creates the different Quality of Service classes.

What are Quality of Service (QoS) classes in Kubernetes?

Kubernetes assigns every Pod one of three QoS classes based on how its containers' resource requests and limits relate to each other: **Guaranteed** (every container has CPU and memory limits, with requests equal to those limits; the strongest protection), **Burstable** (at least one container has a CPU or memory request or limit set, but the Pod doesn't meet the Guaranteed criteria; some protection), or **BestEffort** (no requests or limits set at all; no protection). This classification directly influences which Pods the kubelet evicts first when a node comes under memory pressure.

What is node affinity/anti-affinity, and how does it differ from pod affinity/anti-affinity?

Node affinity/anti-affinity constrains which **nodes** a Pod can be scheduled onto, based on node labels (e.g., "only nodes with an SSD," "prefer nodes in this availability zone"). Pod affinity/anti-affinity constrains scheduling based on **other Pods already running** on candidate nodes (e.g., "schedule near Pods from the same application, for locality," or "never schedule two replicas of this Pod on the same node, for availability"). Both come in a `required` (hard constraint — must be satisfied) and `preferred` (soft constraint — best-effort) variant.

What are taints and tolerations, and how do they work together?

A taint is applied to a **node**, repelling Pods from being scheduled there unless they explicitly tolerate it. A toleration is applied to a **Pod**, allowing (but not forcing) it to be scheduled onto nodes with a matching taint. This is the inverse of affinity: affinity is about a Pod attracting itself to certain nodes; taints/tolerations are about a node repelling Pods unless they've explicitly opted in — commonly used to reserve specialized nodes (GPU nodes, control-plane nodes) for only the specific workloads that need them.

What is the Horizontal Pod Autoscaler, and how does it decide when to scale?

The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of replicas in a Deployment/ReplicaSet/StatefulSet based on observed metrics (by default, average CPU or memory utilization across the Pods, but also custom or external metrics via the metrics APIs) compared against a target you configure. It periodically checks current metric values, computes the replica count needed to bring the metric back toward the target, and adjusts the controller's replica count accordingly — all without a human manually scaling anything.

What's the difference between the Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and Cluster Autoscaler?

The **Horizontal Pod Autoscaler (HPA)** changes the *number* of Pod replicas based on observed load. The **Vertical Pod Autoscaler (VPA)** changes the resource requests/limits *of individual Pods* based on their observed actual usage over time, recommending or automatically applying more appropriate CPU/memory sizing. The **Cluster Autoscaler** changes the *number of nodes* in the cluster itself, adding nodes when Pods can't be scheduled due to insufficient capacity, and removing underutilized nodes — operating one layer below the other two, at the infrastructure level rather than the Pod level.

What is Pod priority and preemption?

A PriorityClass assigns a numeric priority value to Pods; when the scheduler can't find room for a high-priority Pod, it can **preempt** (evict) lower-priority Pods on some node to free up enough resources for it, rather than leaving the high-priority Pod stuck Pending. This ensures genuinely critical workloads get scheduled even under resource contention, at the cost of the evicted lower-priority Pods needing to be rescheduled elsewhere (or wait) — it's specifically a mechanism for resolving scheduling conflicts, not a general-purpose importance label.

How does the Kubernetes scheduler decide which node to place a pod on?

The scheduler processes each unscheduled Pod through two phases: **filtering** (eliminating every node that can't satisfy the Pod's hard requirements — insufficient resources, taints without matching tolerations, failing required node/pod affinity rules) to get a list of feasible nodes, then **scoring** (ranking the remaining feasible nodes using a set of weighted priority functions — spreading Pods evenly, honoring preferred affinity, minimizing resource fragmentation, and more) to pick the single best node among those that qualified.

What happens when a pod exceeds its memory limit vs. its CPU limit?

Exceeding a **memory** limit gets the container killed immediately by the kernel's out-of-memory mechanism (OOMKilled) — memory usage can't be gracefully "slowed down," so the only enforcement mechanism is termination. Exceeding a **CPU** limit does *not* kill the container — the kernel's CFS (Completely Fair Scheduler) quota mechanism simply throttles it, allowing it less CPU time than it's trying to use, so the process keeps running, just slower, with no crash or restart involved.

Scheduling and Resource Management

Requests and limits, QoS, affinity, taints and tolerations, autoscaling, and how the scheduler places Pods.

Defining both

spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "250m"       # 0.25 CPU cores, guaranteed available
          memory: "256Mi"
        limits:
          cpu: "500m"        # can burst up to 0.5 CPU cores, then gets throttled
          memory: "512Mi"    # hard ceiling -- exceeding this gets the container killed

Requests — what the scheduler reserves

The scheduler only places a Pod on a node that has enough unreserved capacity to satisfy the sum of all its containers' requests — it doesn't look at a node's actual current usage, only at how much has already been requested by other Pods already scheduled there. This is a deliberate, conservative guarantee: if your Pod requests 256Mi of memory, Kubernetes guarantees a node with at least that much capacity available, regardless of what else happens to be running.

Node with 4Gi total memory, already running Pods requesting 3Gi total
   -> only 1Gi of "requestable" capacity remains, regardless of how much
      of the already-running Pods' memory is ACTUALLY being used right now
   -> a new Pod requesting 1.5Gi will NOT be scheduled here, even if the
      node's actual current usage is well under 4Gi
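
The same arithmetic can be written as a small check. This is a minimal sketch, not the scheduler's real code: the function names are hypothetical, only memory is modelled, and only the Mi and Gi suffixes are parsed.

```python
def parse_memory_mi(quantity: str) -> float:
    """Convert a memory quantity such as "256Mi" or "1.5Gi" to mebibytes.

    Only the Mi and Gi suffixes are handled in this sketch.
    """
    if quantity.endswith("Gi"):
        return float(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return float(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity}")


def fits_by_requests(node_allocatable: str, existing_requests: list[str], new_request: str) -> bool:
    """Return True if the new request fits in the node's unreserved memory.

    Actual usage is deliberately not an input: only requests count.
    """
    reserved = sum(parse_memory_mi(q) for q in existing_requests)
    unreserved = parse_memory_mi(node_allocatable) - reserved
    return parse_memory_mi(new_request) <= unreserved


# The example above: a 4Gi node with 3Gi already requested.
print(fits_by_requests("4Gi", ["1Gi", "2Gi"], "1.5Gi"))  # False
print(fits_by_requests("4Gi", ["1Gi", "2Gi"], "1Gi"))    # True
```

Note that nothing in the function asks how much memory the running Pods are using right now, which is the whole point of the example.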

Limits — what's enforced at runtime

A limit is a hard ceiling enforced while the container is actually running, independent of scheduling. The kubelet and container runtime configure it as cgroup settings, and the Linux kernel does the enforcing:

CPU limit exceeded: the container is throttled (its CPU time is capped via the Linux kernel's CFS quota mechanism) — it keeps running, just slower, never killed for this reason alone.
Memory limit exceeded: the container is OOMKilled (terminated by the kernel's out-of-memory mechanism) — memory can't be "throttled" the way CPU can, since there's no meaningful way to make an over-budget memory allocation just "run slower."

Why requests and limits are allowed to differ

Setting a limit higher than the request lets a container burst above its guaranteed baseline when the node happens to have spare, unreserved capacity available, which is useful for workloads with variable, spiky resource needs that don't need their peak usage permanently reserved. But it also means a node can be overcommitted: the sum of all limits on a node can exceed the node's actual total capacity, since not every container is expected to hit its limit simultaneously. If several containers do burst simultaneously and collectively exceed the node's real capacity, something has to give: contended CPU is simply shared out more thinly, while memory has to be reclaimed by evicting or killing something. Which Pods are sacrificed first is governed by Quality of Service classes (see that question), derived directly from how requests and limits relate to each other.

What happens with no requests/limits set at all

A container with neither specified is treated as needing essentially nothing for scheduling purposes (it can be scheduled anywhere, even a nearly-full node) and has no enforced ceiling at runtime — it can consume as much of the node's resources as are physically available, potentially starving other, properly-configured workloads on the same node. This is almost always a misconfiguration in production — every container should have explicit requests (so the scheduler makes sound placement decisions) and, in most cases, limits (so one misbehaving container can't take down everything else sharing its node).

Set requests based on realistic, measured typical usage (not a guess), and set limits based on the worst-case acceptable ceiling for that workload — too tight a memory limit causes unnecessary OOMKills under normal, expected load spikes; too loose (or absent) a limit risks one container starving its node-mates. This is a tuning process that benefits from real monitoring data, not a one-time guess made at initial deployment.

Related Resources

Kubernetes: Resource Management for Pods and Containers

The three classes, defined by requests vs. limits

Guaranteed — every container in the Pod has both CPU and memory requests and limits specified, and for every container, request equals limit:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # identical to request
    memory: "512Mi"   # identical to request

Burstable: at least one container has a CPU or memory request or limit set, but the Pod doesn't meet the strict "every container, request equals limit for both resources" bar required for Guaranteed. For example:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"        # higher than request -- allows bursting
    memory: "512Mi"    # higher than request

BestEffort — no requests or limits specified at all, for any container in the Pod:

resources: {}     # nothing set
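
The three definitions above can be condensed into one classification function. This is a rough sketch under stated assumptions: `qos_class` and the dict layout are hypothetical, and quantities are compared as plain strings rather than parsed values.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers: list[dict]) -> str:
    """Classify a Pod from its containers' resources.

    Each container is a dict like
    {"requests": {"cpu": "250m"}, "limits": {"cpu": "500m"}}.
    Quantities are compared as plain strings, so "500m" and "0.5"
    would wrongly count as different here; a real implementation
    compares parsed values.
    """
    any_set = False
    all_guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in RESOURCES:
            limit = limits.get(res)
            # An omitted request defaults to the limit when a limit is set.
            request = requests.get(res, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


guaranteed = {"requests": {"cpu": "500m", "memory": "512Mi"},
              "limits": {"cpu": "500m", "memory": "512Mi"}}
burstable = {"requests": {"cpu": "250m", "memory": "256Mi"},
             "limits": {"cpu": "500m", "memory": "512Mi"}}
print(qos_class([guaranteed]))  # Guaranteed
print(qos_class([burstable]))   # Burstable
print(qos_class([{}]))          # BestEffort
```

Because the class belongs to the Pod, a single container that falls short of the Guaranteed bar makes the whole Pod Burstable.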

Why this classification exists: eviction priority under memory pressure

When a node runs low on memory, the kubelet proactively evicts Pods to reclaim resources before the node becomes so overloaded it risks crashing entirely, and it doesn't evict randomly. As a rule of thumb the order is: BestEffort Pods first, then Burstable Pods whose actual usage exceeds their requests (worst offenders first), and Guaranteed Pods last, since their usage can never exceed their requests (request equals limit). Strictly, the kubelet does not sort on the QoS class itself: it ranks Pods by whether usage exceeds requests, then by Pod priority, then by how far usage exceeds requests, so a Burstable Pod that stays within its requests is protected alongside Guaranteed Pods.

Node under memory pressure:
  1. Evict BestEffort Pods first (no guarantees were ever made to them)
  2. Evict Burstable Pods exceeding their requests, worst offenders first
  3. Guaranteed Pods are evicted only if the situation is still critical
     after 1 and 2 -- they were never over-consuming relative to their promise
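
To make the ordering concrete, here is a simplified sketch of the three steps above. It is an illustration only: the `PodUsage` type and `eviction_order` function are hypothetical, only memory is modelled, and Pod priority (which the real kubelet also takes into account) is left out.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    qos: str          # "Guaranteed", "Burstable", or "BestEffort"
    request_mi: int   # memory request in Mi (0 for BestEffort)
    usage_mi: int     # current memory usage in Mi


def eviction_order(pods: list[PodUsage]) -> list[str]:
    """Return Pod names from first-evicted to last-evicted."""

    def sort_key(p: PodUsage):
        over = p.usage_mi - p.request_mi
        if p.qos == "BestEffort":
            group = 0                      # step 1: no guarantees were made
        elif p.qos == "Burstable" and over > 0:
            group = 1                      # step 2: over their requests
        else:
            group = 2                      # step 3: within their requests
        # Within steps 1 and 2, the worst offenders go first.
        return (group, -over if group < 2 else 0)

    return [p.name for p in sorted(pods, key=sort_key)]


pods = [
    PodUsage("db", "Guaranteed", 1024, 900),
    PodUsage("api", "Burstable", 256, 400),
    PodUsage("batch", "BestEffort", 0, 300),
    PodUsage("worker", "Burstable", 256, 500),
    PodUsage("cache", "Burstable", 512, 300),
]
print(eviction_order(pods))  # ['batch', 'worker', 'api', 'db', 'cache']
```

The `cache` Pod is Burstable but stays under its request, so this sketch groups it with the protected Pods rather than with the offenders.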

What this means practically for workload design

Critical, latency-sensitive workloads (a production database, a payment-processing service) should be Guaranteed — setting request equal to limit trades away burst flexibility for the strongest protection against being evicted when the node is under pressure.
Typical application workloads with somewhat variable but bounded resource needs are usually Burstable — a reasonable middle ground, getting some scheduling guarantee while still allowing headroom for occasional spikes.
BestEffort should essentially never be used deliberately in production — it's what you get by forgetting to set requests/limits, not a class you should intentionally target; it offers zero protection and is the first thing sacrificed under any resource pressure.

A common mistake

Teams sometimes assume setting a generous memory limit alone is protective. But if the request is set lower than the limit, or only some resources or containers have limits, the Pod lands in Burstable, not Guaranteed, and can be evicted before properly configured Guaranteed Pods even if it's "only" using resources within its stated limit. (If a limit is set and the matching request is omitted entirely, Kubernetes copies the limit into the request, so that omission does not by itself break Guaranteed status.) QoS class is determined by the relationship between requests and limits, not by the limit's absolute value alone.

Being able to state precisely which combination of requests/limits produces each QoS class — not just the three names — and connecting that directly to eviction order under memory pressure demonstrates real operational understanding of why this classification exists, not just textbook recall.

Related Resources

Kubernetes: Pod Quality of Service Classes

Node affinity — based on node labels

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
      preferredDuringSchedulingIgnoredDuringExecution:    # soft preference
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]

This says: this Pod must land on a node labeled disktype=ssd (a hard requirement — the Pod won't be scheduled at all if no such node has room), and, among nodes satisfying that, prefer (but don't require) one in zone us-east-1a. This is a more expressive successor to the simpler nodeSelector field, supporting richer expressions (In, NotIn, Exists, etc.) and the required/preferred distinction.

Pod affinity/anti-affinity — based on co-located Pods

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["cache"]
          topologyKey: "kubernetes.io/hostname"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["web"]
          topologyKey: "kubernetes.io/hostname"

This example combines both: pod affinity requires this Pod to be scheduled on a node that already has a Pod labeled app=cache running on it (useful for co-locating an application with a local cache for lower latency); pod anti-affinity requires this Pod to avoid nodes that already have another Pod labeled app=web (i.e., don't put two replicas of the same "web" application on the same node — a common high-availability pattern, so a single node failure can't take down multiple replicas of the same critical service at once).

The topologyKey — defining what "together" means

topologyKey determines the granularity of "together": kubernetes.io/hostname means "same node specifically"; topology.kubernetes.io/zone would mean "same availability zone" (a looser, zone-level notion of togetherness/separation). Anti-affinity keyed on zone rather than hostname is a common pattern for spreading replicas across failure domains larger than a single node, protecting against a whole zone going down, not just one machine.
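
The effect of the topology key is easiest to see in a small model of a required anti-affinity check. This is a sketch only: the function name, node names, and data layout are hypothetical, and treating a node without the topology label as never conflicting is a simplification made for this example.

```python
def violates_anti_affinity(candidate_node, nodes, running_pods, match_labels, topology_key):
    """Return True if placing a Pod on candidate_node would break a required
    anti-affinity rule.

    nodes:        {node_name: {label_key: label_value}}
    running_pods: list of (node_name, pod_labels) for Pods already scheduled
    """
    domain = nodes[candidate_node].get(topology_key)
    if domain is None:
        return False  # simplification for this sketch
    for node_name, pod_labels in running_pods:
        same_domain = nodes[node_name].get(topology_key) == domain
        matches = all(pod_labels.get(k) == v for k, v in match_labels.items())
        if same_domain and matches:
            return True
    return False


nodes = {
    "n1": {"kubernetes.io/hostname": "n1", "topology.kubernetes.io/zone": "a"},
    "n2": {"kubernetes.io/hostname": "n2", "topology.kubernetes.io/zone": "a"},
    "n3": {"kubernetes.io/hostname": "n3", "topology.kubernetes.io/zone": "b"},
}
running = [("n1", {"app": "web"})]
rule = {"app": "web"}

# Hostname-level: only the node already running a web Pod is excluded.
print(violates_anti_affinity("n2", nodes, running, rule, "kubernetes.io/hostname"))       # False
# Zone-level: every node in the same zone is excluded too.
print(violates_anti_affinity("n2", nodes, running, rule, "topology.kubernetes.io/zone"))  # True
```

Switching the key from hostname to zone changes nothing about the rule itself, only how large the "same place" is.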

Required vs. preferred — hard vs. soft constraints

Both affinity types support requiredDuringSchedulingIgnoredDuringExecution (a hard constraint — the Pod simply won't be scheduled if it can't be satisfied) and preferredDuringSchedulingIgnoredDuringExecution (a soft, weighted preference — the scheduler tries to satisfy it, but will still schedule the Pod elsewhere if it can't). The verbose naming itself is informative: "IgnoredDuringExecution" means these rules are only checked at scheduling time — if labels change after the Pod is already running such that the rule would no longer be satisfied, the already-running Pod isn't evicted retroactively.

Why anti-affinity for high availability is a very common real pattern

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: "kubernetes.io/hostname"

Using preferred (rather than required) anti-affinity for spreading replicas across nodes is a common, pragmatic middle ground — you get the availability benefit of spreading replicas across different nodes/zones under normal conditions, without the risk of Pods becoming entirely unschedulable during a genuine capacity crunch where a hard requirement couldn't be satisfied (e.g., a small cluster or a zone outage leaving too few eligible nodes).

Being able to distinguish node affinity (Pod vs. node labels) from pod affinity (Pod vs. other Pods) precisely, and knowing when required vs. preferred and hostname vs. zone-level topology keys are the right choice, demonstrates real scheduling design experience beyond just knowing the YAML fields exist.

Related Resources

Kubernetes: Assigning Pods to Nodes

Applying a taint to a node

kubectl taint nodes gpu-node-1 dedicated=gpu-workloads:NoSchedule

This taint (key=dedicated, value=gpu-workloads, effect=NoSchedule) means: no Pod will be scheduled onto this node unless it carries a matching toleration. Ordinary Pods, with no toleration specified, simply won't be placed here, even if the node has ample free CPU/memory — the taint overrides normal scheduling based purely on resource fit.

Adding a matching toleration to a Pod

spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu-workloads"
      effect: "NoSchedule"
  containers:
    - name: ml-training
      image: ml-trainer:1.0

This Pod tolerates the dedicated=gpu-workloads:NoSchedule taint, meaning it's now eligible to be scheduled on gpu-node-1. But a toleration only removes the repulsion; it doesn't attract the Pod there, so the Pod may still land on any untainted node. If you want ML workloads to actually be placed on GPU nodes (not merely allowed there), combine this toleration with node affinity (see that question) targeting nodes labeled for GPU capability. Taints/tolerations and affinity are complementary and commonly used together.
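
The matching rule between a toleration and a taint can be sketched in a few lines. The function names and dict layout here are hypothetical; the rules modelled are the documented ones (operator defaults to Equal, an Exists toleration with no key matches every taint, and an empty effect matches every effect).

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # With Exists, an empty key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations: list[dict], node_taints: list[dict]) -> bool:
    """Return True if no hard taint on the node is left untolerated.

    PreferNoSchedule is only a soft preference, so it never blocks here.
    """
    for taint in node_taints:
        if taint["effect"] == "PreferNoSchedule":
            continue
        if not any(tolerates(t, taint) for t in pod_tolerations):
            return False
    return True


gpu_taint = {"key": "dedicated", "value": "gpu-workloads", "effect": "NoSchedule"}
ml_toleration = {"key": "dedicated", "operator": "Equal",
                 "value": "gpu-workloads", "effect": "NoSchedule"}

print(schedulable([], [gpu_taint]))               # False: ordinary Pod is repelled
print(schedulable([ml_toleration], [gpu_taint]))  # True: allowed, but not attracted
```

The second call returning True only means the node is no longer ruled out; nothing in this check prefers the GPU node over any other.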

The three taint effects

| Effect | Behavior for non-tolerating Pods |
| --- | --- |
| `NoSchedule` | New Pods won't be scheduled here; already-running Pods are unaffected |
| `PreferNoSchedule` | The scheduler tries to avoid placing new Pods here, but it's a soft preference, not a hard rule |
| `NoExecute` | New Pods won't be scheduled here, and existing Pods already running here without a matching toleration are evicted |

NoExecute is the strongest effect: it doesn't just prevent future scheduling, it actively removes Pods that are already there and don't tolerate it. This is exactly the mechanism used, for example, when a node becomes NotReady. The control plane automatically applies a NoExecute taint for the not-ready condition; Pods with no matching toleration are evicted right away, while Pods that tolerate it with a tolerationSeconds value stay bound for that long before being evicted. By default Kubernetes adds such tolerations to Pods with tolerationSeconds set to 300, which is where the familiar five-minute delay comes from.

Common real-world uses

Reserving specialized hardware — tainting GPU nodes so only ML/GPU-requiring workloads (which explicitly tolerate the taint) land there, keeping expensive specialized nodes from being consumed by ordinary workloads.
Control-plane node protection — control-plane nodes are commonly tainted (node-role.kubernetes.io/control-plane:NoSchedule) to keep ordinary application workloads off them by default; only specifically-tolerating Pods (often infrastructure DaemonSets — see that question) run there.
Automatic node-condition taints — Kubernetes itself automatically applies taints for conditions like node.kubernetes.io/not-ready, node.kubernetes.io/memory-pressure, and similar, which is how the system automatically starts repelling (and, for NoExecute, evicting) Pods from an unhealthy node without needing an administrator to intervene manually.

The key distinction from affinity, restated

Node affinity is something a Pod declares about which nodes it wants. Taints are something a node declares about which Pods it's willing to accept. They solve related but inverted problems, and real cluster designs commonly combine both — a taint to keep ordinary workloads off a specialized node by default, plus affinity on the specialized workload's Pods to actively steer them onto that same node.

Related Resources

Kubernetes: Taints and Tolerations

Defining an HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # target: keep average CPU usage around 60% of requested CPU

This targets the web Deployment, and will scale its replica count between 2 and 10, aiming to keep average CPU utilization across all its Pods near 60% of each Pod's requested CPU (note: this is relative to the request, not the limit — which is exactly why setting sensible CPU requests, as covered in the requests/limits question, is a prerequisite for the HPA to make sensible decisions at all).

The basic algorithm

The HPA controller periodically (by default, every 15 seconds) queries the metrics API for the current average utilization across the target's Pods, and computes a desired replica count using roughly:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))

If current average CPU utilization is 90% against a 60% target, with 4 current replicas: ceil(4 * (90/60)) = 6 — the HPA scales up to 6 replicas, aiming to bring the average back down toward the target once load is spread across more Pods.
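
The formula and the worked example translate directly into code. This is a minimal sketch: `desired_replicas` is a hypothetical helper, and the clamp to minReplicas/maxReplicas is included because the HPA never goes outside those bounds. Refinements the real controller applies, such as a tolerance band around the target, are left out.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Apply ceil(currentReplicas * current / target), then clamp to the bounds."""
    raw = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(max_replicas, raw))


# The example above: 4 replicas at 90% average utilization against a 60% target.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))  # 6
```

Because the ratio is computed against the target, halving the load on the same 4 replicas (30% against 60%) gives ceil(4 * 0.5) = 2, a scale-down.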

Requires the metrics-server (or another metrics API) to be running

The HPA doesn't collect metrics itself — it queries the Metrics API (typically served by the metrics-server add-on for basic CPU/memory metrics, or a custom/external metrics adapter, often backed by Prometheus, for anything beyond basic resource utilization). Without metrics-server (or an equivalent) installed and functioning in the cluster, an HPA configured for CPU/memory has no data to act on and won't scale at all — a common early "why isn't autoscaling working" gotcha for a newly-set-up cluster.

Scaling on custom and external metrics

metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Beyond basic CPU/memory, the HPA (autoscaling/v2) supports scaling on custom application-level metrics (e.g., requests-per-second, queue depth) exposed through a custom metrics adapter, or external metrics from outside the cluster entirely (e.g., the depth of an external cloud message queue) — letting scaling decisions reflect the metric that actually matters most for that specific workload's real bottleneck, rather than being limited to CPU/memory alone.

Stabilization and avoiding "flapping"

The HPA includes built-in stabilization logic (configurable stabilization windows) to avoid rapidly scaling up and down in response to short-lived metric spikes — without this, a brief traffic blip could otherwise trigger a scale-up immediately followed by an equally hasty scale-down moments later, adding churn without real benefit.
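
One way to picture a scale-down stabilization window: rather than acting on the latest recommendation, act on the highest one seen within the window, so a brief dip cannot shrink the workload. The sketch below models that idea only; the function name and the (timestamp, replicas) history format are assumptions for illustration.

```python
def stabilized_scale_down(recommendations: list[tuple[int, int]], now: int,
                          window_seconds: int) -> int:
    """Return the highest recommendation inside the window ending at `now`.

    recommendations: (timestamp_seconds, desired_replicas) pairs; at least
    one entry is assumed to fall inside the window.
    """
    recent = [replicas for ts, replicas in recommendations if now - ts <= window_seconds]
    return max(recent)


history = [(0, 6), (60, 6), (120, 3), (180, 3)]

# With a 300-second window the earlier "6" still counts, so nothing shrinks yet.
print(stabilized_scale_down(history, now=180, window_seconds=300))  # 6
# With a 60-second window only the recent, lower recommendations remain.
print(stabilized_scale_down(history, now=180, window_seconds=60))   # 3
```

A longer window trades slower cost savings for less churn; a shorter one does the opposite.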

Knowing that the HPA's CPU-based scaling target is relative to the Pod's requested CPU (not its limit, and not the node's total capacity) is a specific, easily-tested detail that separates surface familiarity from someone who's actually configured and tuned an HPA in practice.

Related Resources

Kubernetes: Horizontal Pod Autoscaling

Three autoscalers, three different axes

Cluster Autoscaler   -> adds/removes NODES (infrastructure capacity)
        │
        ▼
Horizontal Pod Autoscaler  -> adds/removes REPLICAS of a given workload
        │
        ▼
Vertical Pod Autoscaler    -> resizes the REQUESTS/LIMITS of each Pod

Horizontal Pod Autoscaler (HPA) — more/fewer copies

Covered in depth in the previous question — scales the number of Pod replicas of a Deployment/StatefulSet up or down based on observed metrics. Good for stateless workloads that scale well by adding more identical, load-balanced instances.

Vertical Pod Autoscaler (VPA) — bigger/smaller individual Pods

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Auto"   # or "Off" (recommendation-only) / "Initial"

The VPA observes each Pod's actual historical CPU/memory usage and recommends (or, in Auto mode, actually applies) better-fitting requests/limits values — solving the very real problem of engineers guessing at reasonable request/limit values upfront and then never revisiting them as an application's actual usage pattern becomes clearer over time. Applying a VPA recommendation typically requires recreating the Pod (you can't resize a running container's cgroup limits without at least a restart, in most configurations), which is an important operational difference from the HPA's non-disruptive replica changes.

A critical incompatibility to know: running HPA (on CPU/memory metrics) and VPA on the same workload simultaneously is explicitly not recommended and can conflict — both are reacting to the same underlying resource-usage signal but taking different kinds of action (more replicas vs. resized replicas), and can end up fighting each other's decisions. If both scaling dimensions matter for a workload, VPA is often used for right-sizing on a slower, calmer cadence, while HPA (ideally on a non-resource metric like requests-per-second) handles reactive scaling.

Cluster Autoscaler — more/fewer nodes

The HPA and VPA both operate within whatever node capacity currently exists in the cluster — neither can help if there simply isn't enough total cluster capacity to schedule more or bigger Pods. The Cluster Autoscaler operates one level below both: it watches for Pods that are Pending because no existing node has room for them, and — on a supported cloud provider — provisions new nodes to accommodate them; conversely, it identifies significantly underutilized nodes and, if their workloads can be safely rescheduled elsewhere (respecting PodDisruptionBudgets — see that question), removes them to save cost.

HPA decides "I need 3 more replicas of this Deployment"
   → Scheduler tries to place them
   → No existing node has enough free capacity
   → Cluster Autoscaler notices Pods stuck Pending due to insufficient
     resources, and provisions a new node
   → Scheduler places the pending Pods onto the new node
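
The flow above can be imitated with a toy capacity estimate. To be clear about what this is: a first-fit packing sketch with a hypothetical `nodes_to_add` function, modelling memory only and one uniform new-node size. It is not the Cluster Autoscaler's actual algorithm, which simulates scheduling against real node groups.

```python
def nodes_to_add(pending_requests_mi: list[int], free_per_node_mi: list[int],
                 new_node_mi: int) -> int:
    """Estimate how many new nodes are needed to place the pending Pods.

    pending_requests_mi: memory requests of Pods stuck Pending
    free_per_node_mi:    unreserved memory on each existing node
    new_node_mi:         allocatable memory of one newly provisioned node
    """
    free = list(free_per_node_mi)
    added = 0
    for req in sorted(pending_requests_mi, reverse=True):
        for i, available in enumerate(free):
            if req <= available:
                free[i] -= req          # fits on an existing (or already added) node
                break
        else:
            if req > new_node_mi:
                continue                # too big for any new node: stays Pending
            free.append(new_node_mi - req)
            added += 1
    return added


# Three pending 512Mi Pods; existing nodes have 256Mi and 600Mi unreserved.
print(nodes_to_add([512, 512, 512], [256, 600], new_node_mi=1024))  # 1
```

One Pod fits on the existing node with 600Mi free, and the other two share a single new 1024Mi node.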

How the three commonly work together

A well-configured cluster typically layers all three: HPA reacts quickly to load by adjusting replica counts, Cluster Autoscaler reacts to the resulting capacity needs by adjusting the number of nodes, and VPA (used more cautiously, often in recommendation-only mode, or on workloads not also using HPA) periodically right-sizes each Pod's actual requests/limits based on real observed usage — together giving a system that scales along all three axes (replica count, per-Pod size, and total node capacity) with minimal manual intervention.

Precisely articulating that these three operate on three genuinely different axes — Pod count, individual Pod size, and node count — and knowing the HPA/VPA conflict caveat, demonstrates real production autoscaling experience rather than just knowing three acronyms exist.

Related Resources

Kubernetes: Cluster Autoscaler

Defining priority classes

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "Batch jobs, safe to preempt"

spec:
  priorityClassName: high-priority
  containers:
    - name: critical-service
      image: myapp:1.0

What happens under resource contention

When the scheduler evaluates a Pod with priorityClassName: high-priority and finds no node has enough free capacity to place it, it doesn't simply leave the Pod Pending (as would happen without priority/preemption configured) — instead, it looks for a node where evicting one or more lower-priority Pods would free up enough room, and if it finds one, evicts those lower-priority Pods (respecting their termination grace periods and, ideally, their PodDisruptionBudgets, though preemption can override a PDB if truly necessary) to make room for the higher-priority Pod.

Node is full, running several "low-priority" batch Pods.
A new "high-priority" Pod can't be scheduled anywhere else.
   → Scheduler preempts (evicts) enough low-priority Pods on this node
     to free sufficient capacity
   → The high-priority Pod is scheduled into the freed space
   → The preempted low-priority Pods are terminated; replacements created by
     their controllers (e.g., a Job or Deployment) wait Pending until capacity
     is available again
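
Victim selection on a single node can be sketched as follows. This is a deliberately simplified model: `pick_victims` is a hypothetical helper, only memory is considered, and the real scheduler additionally compares candidate nodes and tries to avoid violating PodDisruptionBudgets.

```python
def pick_victims(node_free_mi: int, running: list[tuple[str, int, int]],
                 incoming_request_mi: int, incoming_priority: int):
    """Choose Pods to preempt on one node, lowest priority first.

    running: (name, priority, request_mi) for each Pod on the node.
    Returns a list of victim names ([] if the Pod already fits), or None
    if preempting every lower-priority Pod would still not free enough.
    """
    if incoming_request_mi <= node_free_mi:
        return []
    candidates = sorted((p for p in running if p[1] < incoming_priority),
                        key=lambda p: p[1])
    victims = []
    freed = node_free_mi
    for name, _priority, request_mi in candidates:
        victims.append(name)
        freed += request_mi
        if freed >= incoming_request_mi:
            return victims
    return None


running = [("batch-a", 1000, 400), ("batch-b", 1000, 400), ("api", 1000000, 800)]

print(pick_victims(100, running, incoming_request_mi=700, incoming_priority=1000000))
# ['batch-a', 'batch-b']
print(pick_victims(100, running, incoming_request_mi=1200, incoming_priority=1000000))
# None
```

The `api` Pod is never a candidate because its priority is not lower than the incoming Pod's, which is why the second call cannot succeed.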

Why this matters: guaranteeing critical workloads actually run

Without priority/preemption, resource contention is resolved purely by scheduling order and luck — a critical production service could, in principle, be stuck Pending indefinitely behind a cluster full of lower-importance batch jobs that happened to claim capacity first. Priority and preemption give the scheduler an explicit mechanism to say "this matters more, and if something has to give, it should be something less important" — a real operational safeguard, not just a label for humans to read.

Important caveats

Preemption isn't guaranteed to succeed — if evicting every lower-priority Pod on every node still wouldn't free enough capacity for the high-priority Pod (e.g., it needs more resources than any single node has, period), preemption can't help, and the Pod remains Pending regardless of its priority.
Preempted Pods don't get a persistent, tracked "we owe you a slot" guarantee. They are terminated, and any replacements their controllers create go back through normal scheduling, competing for whatever capacity exists at that point (possibly getting preempted again themselves if a still-higher-priority Pod later needs the space).
A globalDefault: true PriorityClass applies automatically to any Pod that doesn't specify one explicitly — worth being deliberate about, since an unconfigured cluster otherwise treats every Pod as equal priority (effectively no preemption benefit for anything).

Reserve priority classes for genuinely meaningful tiers (e.g., "critical production," "standard production," "best-effort batch") rather than a large number of finely-graded levels that are hard to reason about — and always design PriorityClasses alongside PodDisruptionBudgets and resource requests/limits, since all three interact in determining what actually gets evicted and when under real contention.

Related Resources

Kubernetes: Pod Priority and Preemption

Phase 1: Filtering — eliminate infeasible nodes

The scheduler starts with every node in the cluster and filters out any that can't satisfy the Pod's hard requirements:

Insufficient resources: does the node have enough unreserved CPU/memory to satisfy the Pod's requests (see the requests/limits question)?
Taints without a matching toleration: is the node tainted in a way this Pod doesn't tolerate (see that question)?
Required node affinity: does the node have the labels this Pod's requiredDuringSchedulingIgnoredDuringExecution node affinity demands?
Required pod affinity/anti-affinity: does placing this Pod here satisfy (or violate) its hard pod affinity/anti-affinity rules relative to Pods already running in the same topology domain (the node itself, or a wider domain such as its zone, depending on the rule's topologyKey)?
Volume/topology constraints: can the required storage actually be attached to this node (relevant for volumes with zone/node topology restrictions; see the StorageClass question)?
Port conflicts, node selectors, and several other basic feasibility checks.

After filtering, what remains is the set of feasible nodes — any node that could technically host this Pod. If this set is empty, the Pod stays Pending (potentially triggering the Cluster Autoscaler, or preemption, if configured — see those questions).
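
To make the filtering checks concrete, here is a sketch of a Pod whose spec exercises several of them at once. The Pod name, image, taint key/value (dedicated=gpu), and the disktype label are hypothetical; only the field names come from the Kubernetes API.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer                      # hypothetical workload
  labels:
    app: trainer
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0   # placeholder image
      resources:
        requests:                    # resource check: node needs this much unreserved
          cpu: "2"
          memory: "4Gi"
  tolerations:                       # taint check: allows nodes tainted dedicated=gpu:NoSchedule
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:                    # required node affinity: node must carry disktype=ssd
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
    podAntiAffinity:                 # required anti-affinity: no other app=trainer Pod on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: trainer
          topologyKey: kubernetes.io/hostname
```

A node that fails any one of these checks is dropped from the feasible set, regardless of how well it would have scored.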

Phase 2: Scoring — rank the feasible nodes

Every feasible node is then scored using a set of weighted scoring functions (unrelated to Pod priority, despite sometimes being called "priority functions"), and the highest-scoring node is chosen. Common scoring factors include:

Resource balance — preferring nodes that would end up with a more balanced ratio of CPU-to-memory usage after placement, avoiding one resource being nearly exhausted while another is idle.
Spreading — preferring to spread Pods of the same Deployment/Service across different nodes (related to, but distinct from, explicit pod anti-affinity — this is a softer, built-in default tendency).
Preferred affinity/anti-affinity — honoring preferredDuringSchedulingIgnoredDuringExecution rules, weighted by their configured weight.
Image locality — mildly preferring a node that already has the Pod's container image cached locally, avoiding a fresh image pull.

The scheduler sums these weighted scores and picks the node with the highest total — ties are broken pseudo-randomly, to avoid the scheduler always favoring the exact same node in ambiguous cases.
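
The soft rules that feed into scoring can be sketched as follows. The weights (80 and 20), the zone name, and the app label are assumptions chosen for illustration; weights must fall in the range 1 to 100.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                          # hypothetical workload
  labels:
    app: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0   # placeholder image
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80                 # strong preference for nodes in zone-a
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["zone-a"]
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 20                 # weaker preference to avoid nodes already running app=web
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname
```

Because both rules are preferences, a node that satisfies neither is still feasible; it simply receives a lower score from these two factors.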

The scheduler only decides — it doesn't execute

Once the scheduler picks a node, it writes that decision back to the API server by binding the Pod to the node, which sets the Pod's spec.nodeName. The kubelet on that node (see the fundamentals topic) then notices the assignment and starts the containers via the container runtime. The scheduler's job ends at the decision; it has no further involvement in running anything.

Extensibility: scheduler plugins and custom schedulers

The Kubernetes scheduler is built on a pluggable framework (the Scheduling Framework), which lets organizations customize or extend filtering and scoring with custom plugins. It is also possible to run an entirely separate custom scheduler alongside the default one; a Pod opts into it by setting schedulerName. Both options serve needs the default algorithm doesn't natively address, such as batch/HPC-style gang scheduling, where a whole group of Pods must be scheduled together or not at all.
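
Opting a Pod into a non-default scheduler takes a single field. In this sketch the scheduler name my-batch-scheduler is hypothetical and assumes such a scheduler has been deployed in the cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                       # hypothetical workload
spec:
  schedulerName: my-batch-scheduler        # assumed name of a separately deployed scheduler
  containers:
    - name: worker
      image: registry.example.com/batch-worker:1.0   # placeholder image
```

If no scheduler with that name is running, nothing picks the Pod up and it stays Pending, since the default scheduler ignores Pods addressed to another scheduler.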

Describing the two distinct phases — filter (hard, binary feasibility) then score (soft, weighted ranking) — rather than treating scheduling as one single opaque step, shows real understanding of how the scheduler actually reasons about placement, and explains why a Pod can be "feasible" on many nodes but still consistently land on a particular one.

Related Resources

Kubernetes: Kubernetes Scheduler

Why memory and CPU are fundamentally different resources to limit

Memory is incompressible — a process either has the memory it needs allocated, or it doesn't; there's no meaningful way to give a process "half the memory, but slower." CPU, by contrast, is compressible — a process can be given less CPU time per unit of wall-clock time and simply run more slowly, without necessarily crashing. This fundamental difference is exactly why Kubernetes (via the Linux kernel's own mechanisms) enforces the two limit types completely differently.

Memory limit exceeded — OOMKilled

resources:
  limits:
    memory: "256Mi"

If a container's memory usage exceeds its limit, the Linux kernel's OOM (Out-Of-Memory) killer terminates a process inside the container's cgroup with SIGKILL (typically the main process, which ends the container). There is no warning period and no graceful degradation. kubectl describe pod then shows the container's last state as Terminated with reason OOMKilled and a non-zero exit code (typically 137, which is 128 + 9, the signal number of SIGKILL).

kubectl describe pod my-pod
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137

Depending on the Pod's restartPolicy, the kubelet then restarts the container — if the underlying cause (a memory leak, or a limit set genuinely too low for the workload's real needs) isn't addressed, this produces a repeating cycle of OOMKill-then-restart, visible as a CrashLoopBackOff (see the troubleshooting topic).

CPU limit exceeded — throttled, not killed

resources:
  limits:
    cpu: "500m"     # 0.5 CPU cores

If a container tries to use more CPU than its limit allows, the kernel's CFS quota mechanism (which Kubernetes configures via cgroups behind the scenes) restricts how much CPU time the container's processes may consume within each scheduling period. With the default 100 ms period, a 500m limit allows 50 ms of CPU time per period; once that is used up, the processes are paused until the next period begins. They keep running, but with less CPU time than they are asking for, so everything the container does slows down. There's no crash, no restart, and nothing in kubectl describe pod to indicate a problem; the only symptom is an application running slower than expected.

Why this asymmetry catches people off guard

An application slowly leaking memory will eventually get OOMKilled and restarted: a visible, loud, easy-to-notice failure mode. An application that's CPU-throttled just becomes quietly slower, often showing up as elevated request latency or timeouts with no corresponding Kubernetes-level event, which can send an on-call engineer looking everywhere except at CPU throttling metrics. Checking for throttling specifically (via metrics such as container_cpu_cfs_throttled_periods_total, exposed by cAdvisor and commonly scraped by Prometheus) is an essential, often-overlooked step when diagnosing mysterious latency in a containerized application.
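
One way to make throttling visible is a Prometheus alerting rule on the ratio of throttled CFS periods to total CFS periods. The sketch below assumes cAdvisor metrics are already being scraped with namespace, pod, and container labels; the alert name, the 25% threshold, and the 15-minute duration are arbitrary starting points, not established guidance.

```yaml
groups:
  - name: cpu-throttling
    rules:
      - alert: ContainerCPUThrottlingHigh       # hypothetical alert name
        # Fraction of CFS periods in which the container was throttled.
        expr: |
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_throttled_periods_total[5m])
          )
          /
          sum by (namespace, pod, container) (
            rate(container_cpu_cfs_periods_total[5m])
          )
          > 0.25
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Container throttled in more than 25% of CFS periods"
```

A ratio is used rather than the raw counter because a busy container accumulates throttled periods faster simply by running longer; the fraction shows how often the limit is actually being hit.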

Set memory limits carefully and somewhat generously relative to realistic peak usage, since the consequence of getting it wrong (a hard kill) is abrupt. CPU limits can be set more conservatively if bursty behavior is expected and acceptable, since the consequence of getting it wrong (throttling) is a graceful, if sometimes invisible, degradation rather than a crash. Either way, what closes the loop on whether your limits match the workload's real behavior is monitoring both OOMKill events and CPU throttling metrics, not guessing at reasonable-sounding numbers.
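
As a sketch of that guidance, the Pod below assumes a workload whose observed peak memory is around 400Mi and whose steady-state CPU use is around a quarter of a core. Those figures, the Pod name, and the image are invented for illustration; real values should come from monitoring.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server-example           # hypothetical workload
spec:
  containers:
    - name: app
      image: registry.example.com/api:1.0   # placeholder image
      resources:
        requests:
          cpu: "250m"                # assumed steady-state usage
          memory: "384Mi"
        limits:
          cpu: "500m"                # bursts above this are throttled, not killed
          memory: "512Mi"            # headroom above the assumed ~400Mi peak; exceeding it is an OOMKill
```

Because requests and limits differ here, the Pod falls into the Burstable QoS class.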

Related Resources

Kubernetes: Resource Management for Pods and Containers