Defining both
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "250m"       # 0.25 CPU cores, guaranteed available
          memory: "256Mi"
        limits:
          cpu: "500m"       # can burst up to 0.5 CPU cores, then gets throttled
          memory: "512Mi"   # hard ceiling -- exceeding this gets the container killed
Requests — what the scheduler reserves
The scheduler places a Pod only on a node with enough unreserved capacity to satisfy the sum of all its containers' requests. It does not look at the node's actual current usage, only at how much of the node's allocatable capacity has already been requested by other Pods scheduled there. This is a deliberately conservative guarantee: if your Pod requests 256Mi of memory, it is placed only on a node that still has at least 256Mi unrequested, regardless of what else is running there.
Node with 4Gi total memory, already running Pods requesting 3Gi total
-> only 1Gi of "requestable" capacity remains, regardless of how much
of the already-running Pods' memory is ACTUALLY being used right now
-> a new Pod requesting 1.5Gi will NOT be scheduled here, even if the
node's actual current usage is well under 4Gi
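To make "the sum of all its containers' requests" concrete, here is a minimal sketch of a two-container Pod. The Pod name, the sidecar, and its image are hypothetical and used only for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar          # hypothetical name
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
    - name: log-shipper           # hypothetical sidecar
      image: log-shipper:1.0      # hypothetical image
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
# For placement, the scheduler treats this Pod as needing
# 250m + 100m = 350m CPU and 256Mi + 128Mi = 384Mi memory
# of unreserved capacity on a single node.
```

The two containers cannot be split across nodes, so a node with 300m of unreserved CPU is not a candidate even though each container would fit there on its own.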
Limits — what's enforced at runtime
A limit is a hard ceiling enforced while the container is running, independent of scheduling. The kubelet and container runtime configure it as Linux cgroup settings, and the kernel does the actual enforcing:
- CPU limit exceeded: the container is throttled (its CPU time is capped via the Linux kernel's CFS quota mechanism) — it keeps running, just slower, never killed for this reason alone.
- Memory limit exceeded: the container is OOMKilled (terminated by the kernel's out-of-memory mechanism) — memory can't be "throttled" the way CPU can, since there's no meaningful way to make an over-budget memory allocation just "run slower."
Why requests and limits are allowed to differ
Setting a limit higher than the request lets a container burst above its guaranteed baseline when the node happens to have spare, unreserved capacity available — useful for workloads with variable, spiky resource needs that don't need their peak usage permanently reserved. But it also means a node can be overcommitted: the sum of all limits on a node can exceed the node's actual total capacity, since not every container is expected to hit its limit simultaneously. If several containers do burst simultaneously and collectively exceed the node's real capacity, something has to give — which containers get throttled or killed first is governed by Quality of Service classes (see that question), derived directly from how requests and limits relate to each other.
What happens with no requests/limits set at all
A container with neither specified is treated as needing essentially nothing for scheduling purposes (it can be scheduled anywhere, even a nearly-full node) and has no enforced ceiling at runtime — it can consume as much of the node's resources as are physically available, potentially starving other, properly-configured workloads on the same node. This is almost always a misconfiguration in production — every container should have explicit requests (so the scheduler makes sound placement decisions) and, in most cases, limits (so one misbehaving container can't take down everything else sharing its node).
Set requests based on realistic, measured typical usage (not a guess), and set limits based on the worst-case acceptable ceiling for that workload — too tight a memory limit causes unnecessary OOMKills under normal, expected load spikes; too loose (or absent) a limit risks one container starving its node-mates. This is a tuning process that benefits from real monitoring data, not a one-time guess made at initial deployment.
The three classes, defined by requests vs. limits
Guaranteed — every container in the Pod has both CPU and memory requests and limits specified, and for every container, request equals limit:
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"       # identical to request
    memory: "512Mi"   # identical to request
Burstable: at least one container has a CPU or memory request or limit set, but the Pod doesn't meet the strict "every container, request equals limit for both resources" bar required for Guaranteed. For example:
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"       # higher than request -- allows bursting
    memory: "512Mi"   # higher than request
BestEffort — no requests or limits specified at all, for any container in the Pod:
resources: {} # nothing set
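Because the class is assigned to the Pod as a whole, a single under-specified container is enough to change it. The sketch below assumes a hypothetical sidecar (its name and image are illustrative only) that sets requests but no limits.

```yaml
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "500m"             # request == limit for this container
          memory: "512Mi"
    - name: metrics-sidecar       # hypothetical sidecar
      image: metrics-sidecar:1.0  # hypothetical image
      resources:
        requests:
          cpu: "50m"
          memory: "64Mi"
        # no limits here, so the Pod as a whole is Burstable,
        # even though the "app" container alone would qualify for Guaranteed
```

The assigned class can be read back from the Pod's status.qosClass field once the Pod has been created.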
Why this classification exists: eviction priority under memory pressure
When a node runs low on memory, the kubelet proactively evicts Pods to reclaim resources before the node becomes so overloaded that it risks crashing entirely, and it doesn't evict randomly. In practice the order is: BestEffort Pods first, then Burstable Pods whose actual usage exceeds their requests (ranked by Pod priority and then by how far over their requests they are), and Guaranteed Pods last. Guaranteed Pods are evicted only as a last resort because, with request equal to limit, they can never use more than they requested.
Node under memory pressure:
1. Evict BestEffort Pods first (no guarantees were ever made to them)
2. Evict Burstable Pods exceeding their requests, worst offenders first
3. Guaranteed Pods are evicted only if the situation is still critical
after 1 and 2 -- they were never over-consuming relative to their promise
What this means practically for workload design
- Critical, latency-sensitive workloads (a production database, a payment-processing service) should be Guaranteed — setting request equal to limit trades away burst flexibility for the strongest protection against being evicted when the node is under pressure.
- Typical application workloads with somewhat variable but bounded resource needs are usually Burstable — a reasonable middle ground, getting some scheduling guarantee while still allowing headroom for occasional spikes.
- BestEffort should essentially never be used deliberately in production — it's what you get by forgetting to set requests/limits, not a class you should intentionally target; it offers zero protection and is the first thing sacrificed under any resource pressure.
A common mistake
Teams sometimes assume that a generous memory limit alone is protective. But if the request is set lower than the limit, the Pod lands in Burstable, not Guaranteed, and once its usage exceeds that low request it becomes an eviction candidate ahead of properly configured Guaranteed Pods, even if it's "only" using resources within its stated limit. (If the request is omitted entirely while a limit is set, Kubernetes defaults the request to the limit, so the scheduler then reserves the full limit.) QoS class is determined by the relationship between requests and limits, not by the limit's absolute value alone.
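The two fragments below sketch how similar-looking limits can produce different classes. They assume no LimitRange or other admission-time default is altering the values, and that every container in the Pod is configured the same way.

```yaml
# Fragment 1: generous limit, low request -> Burstable
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "1"
    memory: "2Gi"
---
# Fragment 2: limits only, requests omitted.
# Kubernetes copies each limit into the matching request,
# so request == limit and the Pod can qualify as Guaranteed.
resources:
  limits:
    cpu: "1"
    memory: "2Gi"
```

The second form is better protected against eviction, but the scheduler now reserves the full 1 CPU and 2Gi for it, so fewer such Pods fit on each node.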
Being able to state precisely which combination of requests/limits produces each QoS class — not just the three names — and connecting that directly to eviction order under memory pressure demonstrates real operational understanding of why this classification exists, not just textbook recall.
Node affinity — based on node labels
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft preference
        - weight: 80
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]
This says: this Pod must land on a node labeled disktype=ssd (a hard requirement — the Pod won't be scheduled at all if no such node has room), and, among nodes satisfying that, prefer (but don't require) one in zone us-east-1a. This is a more expressive successor to the simpler nodeSelector field, supporting richer expressions (In, NotIn, Exists, etc.) and the required/preferred distinction.
Pod affinity/anti-affinity — based on co-located Pods
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["cache"]
          topologyKey: "kubernetes.io/hostname"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values: ["web"]
          topologyKey: "kubernetes.io/hostname"
This example combines both: pod affinity requires this Pod to be scheduled on a node that already has a Pod labeled app=cache running on it (useful for co-locating an application with a local cache for lower latency); pod anti-affinity requires this Pod to avoid nodes that already have another Pod labeled app=web (i.e., don't put two replicas of the same "web" application on the same node — a common high-availability pattern, so a single node failure can't take down multiple replicas of the same critical service at once).
The topologyKey — defining what "together" means
topologyKey determines the granularity of "together": kubernetes.io/hostname means "same node specifically", while topology.kubernetes.io/zone means "same availability zone" (a looser, zone-level notion of togetherness/separation). Anti-affinity keyed on zone rather than hostname is a common pattern for spreading replicas across failure domains larger than a single node, protecting against a whole zone going down, not just one machine.
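As a sketch of the zone-level variant, the fragment below reuses the app=web label from the earlier example and changes only the topology key. It assumes the nodes carry the standard topology.kubernetes.io/zone label, which cloud providers usually set but which is not guaranteed on every cluster.

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web
          # "together" now means "in the same zone", not "on the same node"
          topologyKey: "topology.kubernetes.io/zone"
```

As a hard requirement this allows at most one app=web Pod per zone, so a Deployment with more replicas than zones would leave the extra replicas Pending.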
Required vs. preferred — hard vs. soft constraints
Both affinity types support requiredDuringSchedulingIgnoredDuringExecution (a hard constraint — the Pod simply won't be scheduled if it can't be satisfied) and preferredDuringSchedulingIgnoredDuringExecution (a soft, weighted preference — the scheduler tries to satisfy it, but will still schedule the Pod elsewhere if it can't). The verbose naming itself is informative: "IgnoredDuringExecution" means these rules are only checked at scheduling time — if labels change after the Pod is already running such that the rule would no longer be satisfied, the already-running Pod isn't evicted retroactively.
Why anti-affinity for high availability is a very common real pattern
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web
        topologyKey: "kubernetes.io/hostname"
Using preferred (rather than required) anti-affinity for spreading replicas across nodes is a common, pragmatic middle ground — you get the availability benefit of spreading replicas across different nodes/zones under normal conditions, without the risk of Pods becoming entirely unschedulable during a genuine capacity crunch where a hard requirement couldn't be satisfied (e.g., a small cluster or a zone outage leaving too few eligible nodes).
Being able to distinguish node affinity (Pod vs. node labels) from pod affinity (Pod vs. other Pods) precisely, and knowing when required vs. preferred and hostname vs. zone-level topology keys are the right choice, demonstrates real scheduling design experience beyond just knowing the YAML fields exist.
Applying a taint to a node
kubectl taint nodes gpu-node-1 dedicated=gpu-workloads:NoSchedule
This taint (key=dedicated, value=gpu-workloads, effect=NoSchedule) means: no Pod will be scheduled onto this node unless it carries a matching toleration. Ordinary Pods, with no toleration specified, simply won't be placed here, even if the node has ample free CPU/memory — the taint overrides normal scheduling based purely on resource fit.
Adding a matching toleration to a Pod
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu-workloads"
      effect: "NoSchedule"
  containers:
    - name: ml-training
      image: ml-trainer:1.0
This Pod tolerates the dedicated=gpu-workloads:NoSchedule taint, meaning it's now eligible to be scheduled on gpu-node-1. But a toleration only removes the repulsion; it doesn't attract the Pod there, and the Pod can still land on any untainted node. If you want ML workloads to land only on GPU nodes (not merely be allowed there), combine this toleration with node affinity (see that question) targeting nodes labeled for GPU capability. Taints/tolerations and affinity are complementary and commonly used together.
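A minimal sketch of that combination follows. It assumes the GPU nodes also carry a node label dedicated=gpu-workloads. That label is hypothetical here: a taint does not create a label, so it would have to be applied to the nodes separately.

```yaml
spec:
  tolerations:                    # allows the Pod onto the tainted node
    - key: "dedicated"
      operator: "Equal"
      value: "gpu-workloads"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:                 # restricts the Pod to the labeled nodes
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated    # hypothetical node label, not the taint itself
                operator: In
                values: ["gpu-workloads"]
  containers:
    - name: ml-training
      image: ml-trainer:1.0
```

The toleration answers "may this Pod run there?" and the affinity answers "must this Pod run there?", so dropping either half reopens one of the two gaps.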
The three taint effects
| Effect | Behavior for non-tolerating Pods |
|---|---|
| NoSchedule | New Pods won't be scheduled here; already-running Pods are unaffected |
| PreferNoSchedule | The scheduler tries to avoid placing new Pods here, but it's a soft preference, not a hard rule |
| NoExecute | New Pods won't be scheduled here, and existing Pods already running here without a matching toleration are evicted |
NoExecute is the strongest effect: it doesn't just prevent future scheduling, it actively removes Pods that are already there and don't tolerate it. This is the mechanism used, for example, when a node becomes NotReady. The control plane automatically applies the node.kubernetes.io/not-ready taint with the NoExecute effect, and Pods are evicted once their toleration for it expires. By default Kubernetes adds such a toleration, with tolerationSeconds set to 300, to Pods that don't define their own, so eviction starts after about five minutes; setting tolerationSeconds on the Pod changes that grace period.
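As a sketch of tuning that grace period, the tolerations below shorten it to 60 seconds for both the not-ready and the unreachable condition taints. The value 60 is an arbitrary illustration, not a recommendation.

```yaml
spec:
  tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 60       # evict 60s after the taint appears
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 60
```

A shorter window gets replacements scheduled sooner after a real node failure, at the cost of more evictions during brief network interruptions. Leaving tolerationSeconds out keeps the Pod bound to the node for as long as the taint remains.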
Common real-world uses
- Reserving specialized hardware: tainting GPU nodes so only ML/GPU-requiring workloads (which explicitly tolerate the taint) land there, keeping expensive specialized nodes from being consumed by ordinary workloads.
- Control-plane node protection: control-plane nodes are commonly tainted (node-role.kubernetes.io/control-plane:NoSchedule) to keep ordinary application workloads off them by default; only specifically-tolerating Pods (often infrastructure DaemonSets, see that question) run there.
- Automatic node-condition taints: Kubernetes itself automatically applies taints for conditions like node.kubernetes.io/not-ready, node.kubernetes.io/memory-pressure, and similar, which is how the system automatically starts repelling (and, for NoExecute, evicting) Pods from an unhealthy node without needing an administrator to intervene manually.
The key distinction from affinity, restated
Node affinity is something a Pod declares about which nodes it wants. Taints are something a node declares about which Pods it's willing to accept. They solve related but inverted problems, and real cluster designs commonly combine both — a taint to keep ordinary workloads off a specialized node by default, plus affinity on the specialized workload's Pods to actively steer them onto that same node.
Defining an HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # target: keep average CPU usage around 60% of requested CPU
This targets the web Deployment, and will scale its replica count between 2 and 10, aiming to keep average CPU utilization across all its Pods near 60% of each Pod's requested CPU (note: this is relative to the request, not the limit — which is exactly why setting sensible CPU requests, as covered in the requests/limits question, is a prerequisite for the HPA to make sensible decisions at all).
The basic algorithm
The HPA controller periodically (by default, every 15 seconds) queries the metrics API for the current average utilization across the target's Pods, and computes a desired replica count using roughly:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))
If current average CPU utilization is 90% against a 60% target, with 4 current replicas: ceil(4 * (90/60)) = 6 — the HPA scales up to 6 replicas, aiming to bring the average back down toward the target once load is spread across more Pods.
Requires the metrics-server (or another metrics API) to be running
The HPA doesn't collect metrics itself — it queries the Metrics API (typically served by the metrics-server add-on for basic CPU/memory metrics, or a custom/external metrics adapter, often backed by Prometheus, for anything beyond basic resource utilization). Without metrics-server (or an equivalent) installed and functioning in the cluster, an HPA configured for CPU/memory has no data to act on and won't scale at all — a common early "why isn't autoscaling working" gotcha for a newly-set-up cluster.
Scaling on custom and external metrics
metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
Beyond basic CPU/memory, the HPA (autoscaling/v2) supports scaling on custom application-level metrics (e.g., requests-per-second, queue depth) exposed through a custom metrics adapter, or external metrics from outside the cluster entirely (e.g., the depth of an external cloud message queue) — letting scaling decisions reflect the metric that actually matters most for that specific workload's real bottleneck, rather than being limited to CPU/memory alone.
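For the external case, a metrics entry might look roughly like the sketch below. The metric name and the queue label are hypothetical. What is actually available depends entirely on which external metrics adapter is installed and what it exposes.

```yaml
metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready    # hypothetical metric from an adapter
        selector:
          matchLabels:
            queue: "orders"           # hypothetical label identifying the queue
      target:
        type: AverageValue
        averageValue: "30"            # aim for about 30 queued messages per replica
```

With an AverageValue target the metric's total is divided by the current replica count before being compared to the target, so a backlog of 300 messages would call for roughly 10 replicas.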
Stabilization and avoiding "flapping"
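The HPA includes built-in stabilization logic (configurable stabilization windows) to avoid rapidly scaling up and down in response to short-lived metric spikes. By default, scale-down uses a 300-second stabilization window, during which the controller acts on the highest replica count it recommended within that window, while scale-up takes effect immediately. Without this, a brief traffic blip could trigger a scale-up immediately followed by an equally hasty scale-down moments later, adding churn without real benefit.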
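These windows are set through the behavior field of an autoscaling/v2 HPA. The sketch below spells out the default window values explicitly and adds one rate-limiting policy. The policy numbers are illustrative assumptions, not recommendations.

```yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait for 5 minutes of consistently lower recommendations
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of current replicas...
          periodSeconds: 60             # ...per 60-second period
    scaleUp:
      stabilizationWindowSeconds: 0     # react to load increases immediately
```

Keeping scale-up immediate while slowing scale-down reflects the usual asymmetry: running a few extra replicas for a while is cheap, while being short of capacity during a spike is not.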
Knowing that the HPA's CPU-based scaling target is relative to the Pod's requested CPU (not its limit, and not the node's total capacity) is a specific, easily-tested detail that separates surface familiarity from someone who's actually configured and tuned an HPA in practice.