What causes a pod to be OOMKilled, and how do you diagnose and fix it?
Quick Answer
A Pod is OOMKilled when a container's memory usage exceeds its configured memory limit or, less commonly, when the node itself runs out of memory, which can hit containers that are still under their individual limits. In both cases the kernel's OOM killer terminates the process, and Kubernetes reports `OOMKilled` as the termination reason. To diagnose, confirm the reason with `kubectl describe pod`, then compare actual memory usage over time (from `kubectl top` or a metrics/monitoring system) against the configured limit. To fix, either repair a genuine memory leak in the application or, if the usage is legitimate, raise the limit to a realistic, measured value.
Detailed Answer
Confirming OOMKilled is actually the cause
kubectl describe pod my-app-xyz
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Exit Code: 137 (128 + 9, where 9 is SIGKILL) combined with Reason: OOMKilled confirms this specific cause. The reason is the decisive part: exit code 137 only says the process received SIGKILL, which can also happen for other reasons, whereas a generic application crash typically shows Reason: Error with the application's own exit code. This distinction matters because the fix is completely different depending on which one occurred.
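The 128 + signal arithmetic above can be checked with a short sketch. The helper name `describe_exit_code` is hypothetical and not part of any Kubernetes tooling; it only applies the common shell convention that exit codes above 128 encode a fatal signal.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name} (128 + {signum})"
    if exit_code == 0:
        return "exited cleanly"
    return f"application exited with error status {exit_code}"


print(describe_exit_code(137))
```

The exit code alone only identifies the signal; the `OOMKilled` reason is still what ties the SIGKILL to memory.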
The two different scenarios that both produce OOMKilled
- Container exceeded its own memory limit: the most common case. The container's cgroup memory usage crossed the configured `resources.limits.memory` value, and the kernel killed it specifically for exceeding that container-level boundary.
- The node itself ran out of memory: less common, but possible if requests/limits across the node are poorly configured (heavy overcommitment) or the kubelet's node-level memory-pressure eviction didn't act quickly enough. In this case even a container technically under its own individual limit can be killed as the node tries to reclaim memory.
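As a starting point for telling the two scenarios apart, the sketch below lists OOMKilled containers next to their configured memory limit. It assumes the pod object has the shape returned by `kubectl get pod <name> -o json`; the function name and the sample pod are hypothetical, and the result is a hint to investigate further, not proof of either scenario.

```python
from typing import Dict, List


def oom_killed_containers(pod: Dict) -> List[Dict]:
    """List containers whose current or previous termination reason is OOMKilled."""
    # Map container name -> configured memory limit (None when no limit is set).
    limits = {
        c["name"]: c.get("resources", {}).get("limits", {}).get("memory")
        for c in pod.get("spec", {}).get("containers", [])
    }
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for state_key in ("state", "lastState"):
            terminated = status.get(state_key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                findings.append({
                    "container": status["name"],
                    "exit_code": terminated.get("exitCode"),
                    "restarts": status.get("restartCount", 0),
                    "memory_limit": limits.get(status["name"]),
                })
                break  # report each container once
    return findings


# Hypothetical, trimmed-down pod object for illustration only.
sample_pod = {
    "spec": {"containers": [
        {"name": "app", "resources": {"limits": {"memory": "512Mi"}}},
        {"name": "sidecar"},
    ]},
    "status": {"containerStatuses": [
        {"name": "app", "restartCount": 3, "state": {"running": {}},
         "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
        {"name": "sidecar", "restartCount": 0, "state": {"running": {}},
         "lastState": {}},
    ]},
}

print(oom_killed_containers(sample_pod))
```

A kill on a container whose usage was close to its limit points toward the first scenario; a kill on a container with no limit, or with usage well under its limit, makes node-level memory exhaustion worth checking.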
Diagnosing whether this is a real leak or just an under-provisioned limit
kubectl top pod my-app-xyz --containers
Or, better, look at the memory usage trend over time in a real monitoring system (Prometheus/Grafana) rather than a single snapshot.
A graph that climbs steadily and never plateaus, tracking time since the container started rather than request volume, strongly suggests a genuine memory leak in the application. A leak will eventually hit any limit you set, no matter how generous, and needs an actual code fix.
A graph that climbs with load and levels off near the configured limit, with kills at peaks because the steady-state working set needs slightly more than the limit allows, suggests the limit is simply set too low for legitimate usage. The fix here is raising the limit to a value with reasonable headroom above observed real usage, not a code change.
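The leak-versus-plateau distinction can be roughed out numerically. The sketch below is a deliberately crude heuristic, assuming evenly spaced memory samples (in MiB) exported from a monitoring system; the function name and the 2% tolerance are illustrative assumptions, and it does not check correlation with request volume, which still needs a look at the graphs.

```python
def classify_memory_trend(samples_mib, flat_tolerance=0.02):
    """Compare the average of the last quarter of samples with the quarter before it.

    Returns "still-growing" if memory rose by more than flat_tolerance
    (as a fraction) between those two windows, otherwise "plateaued".
    """
    if len(samples_mib) < 8:
        raise ValueError("need at least 8 samples")
    quarter = len(samples_mib) // 4
    previous = samples_mib[-2 * quarter:-quarter]
    latest = samples_mib[-quarter:]
    previous_avg = sum(previous) / len(previous)
    latest_avg = sum(latest) / len(latest)
    if previous_avg <= 0:
        raise ValueError("memory samples must be positive")
    growth = (latest_avg - previous_avg) / previous_avg
    return "still-growing" if growth > flat_tolerance else "plateaued"


# Synthetic series: one that never stops climbing, one that levels off at 400 MiB.
leak_like = [100 + 10 * i for i in range(40)]
plateau_like = [min(100 + 20 * i, 400) for i in range(40)]

print(classify_memory_trend(leak_like))
print(classify_memory_trend(plateau_like))
```

A "still-growing" result over a long, low-traffic window is the pattern that most resembles a leak; over a short window it may only reflect warm-up or a load ramp.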
Fixing a genuine memory leak
This requires actual application-level investigation — heap dumps/profiling tools appropriate to the language runtime (e.g., a Java heap dump analyzed with a profiler, Node.js's --inspect and Chrome DevTools memory profiling, Python's tracemalloc) to identify what's actually accumulating and never being released. Kubernetes-level tooling can tell you that memory is growing unboundedly and when the kill happens, but not why the application's own code is holding onto memory it should have freed — that's an application-level debugging problem layered on top of the Kubernetes-level symptom.
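For the Python case, a minimal `tracemalloc` sketch looks like the following. The `handle_request` function and `_leaky_cache` list are hypothetical stand-ins for application code that accumulates state; the snapshot comparison is the part that carries over to a real service.

```python
import tracemalloc

_leaky_cache = []  # hypothetical stand-in for state that is never released


def handle_request(payload_size=10_000):
    """Simulate a request handler that keeps a reference to every payload."""
    _leaky_cache.append(bytearray(payload_size))


def top_growth(limit=3):
    """Return the source lines whose allocated memory grew most between two snapshots."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(200):
        handle_request()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


for stat in top_growth():
    print(stat)
```

The first entry should point at the line inside `handle_request` that appends to the cache, which is the kind of lead a real investigation would follow back into application code.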
Fixing an under-provisioned limit
resources:
  requests:
    memory: "512Mi"   # raised to reflect realistic steady-state usage
  limits:
    memory: "768Mi"   # some headroom above typical peak, not unlimited
Raise the limit based on actual measured usage data, not a guess. Also consider whether a Vertical Pod Autoscaler (see the scheduling topic), run in recommendation-only mode (`updateMode: "Off"`), could produce right-sizing recommendations from real historical usage, replacing manual tuning each time.
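One way to turn measured usage into a limit is sketched below. The 25% headroom and the rounding to 64 MiB steps are illustrative assumptions, not Kubernetes recommendations, and the function name and sample peaks are hypothetical.

```python
import math


def recommend_memory_limit_mib(peak_samples_mib, headroom=0.25, round_to=64):
    """Suggest a memory limit: highest observed peak plus headroom, rounded up."""
    if not peak_samples_mib:
        raise ValueError("need at least one observed peak")
    raw = max(peak_samples_mib) * (1 + headroom)
    return int(math.ceil(raw / round_to) * round_to)


# Hypothetical daily peak readings (MiB) taken from a monitoring system.
observed_peaks = [540, 575, 610, 598]
print(f"{recommend_memory_limit_mib(observed_peaks)}Mi")
```

With these sample peaks the suggestion comes out at 768Mi. The observation window should cover real peak load, otherwise the result is still a guess with extra arithmetic.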
A subtlety worth knowing: OOMKilled doesn't always mean CrashLoopBackOff
A single OOMKill, if the container then starts fine and runs stably afterward, just shows up as one restart with OOMKilled as the previous state — it only becomes a CrashLoopBackOff if the container keeps hitting the same memory ceiling repeatedly, shortly after each restart. Seeing one isolated OOMKill in history is a signal worth investigating but not necessarily an active incident; a repeating pattern of OOMKills is the more urgent case demanding immediate action.
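The isolated-versus-repeating distinction can be expressed as a small triage rule. This is a sketch under assumed thresholds (three kills within one hour counts as repeating); the function name and the thresholds are hypothetical and should be tuned to the workload.

```python
from datetime import datetime, timedelta


def oomkill_urgency(kill_times, now, window=timedelta(hours=1), repeat_threshold=3):
    """Classify OOMKill history as "none", "isolated", or "repeating"."""
    recent = [t for t in kill_times if timedelta(0) <= now - t <= window]
    if len(recent) >= repeat_threshold:
        return "repeating"
    if kill_times:
        return "isolated"
    return "none"


now = datetime(2024, 1, 1, 12, 0)
one_old_kill = [now - timedelta(days=2)]
burst = [now - timedelta(minutes=m) for m in (5, 15, 40)]

print(oomkill_urgency(one_old_kill, now))
print(oomkill_urgency(burst, now))
```

An "isolated" result still deserves a look at the memory trend; a "repeating" result is the pattern that tends to end in CrashLoopBackOff.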