What causes a pod to be OOMKilled, and how do you diagnose and fix it?

Detailed Answer

Confirming OOMKilled is actually the cause

kubectl describe pod my-app-xyz
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137

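Exit Code: 137 is 128 + 9, where 9 is SIGKILL, so on its own it only says that the process was killed with SIGKILL. It is Reason: OOMKilled that identifies the out-of-memory killer as the sender. A container that crashed on its own typically shows Reason: Error with a different exit code, and a container force-killed for some other reason (for example, one that failed a liveness probe and did not shut down within its grace period) can also exit with 137 but will not carry the OOMKilled reason. This distinction matters because the fix is completely different depending on which one occurred.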
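The 128 + signal-number arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: describe_exit_code is a hypothetical helper name, and it assumes a POSIX system where signal 9 is SIGKILL.

```python
import signal

def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {exit_code}"

print(describe_exit_code(137))
print(describe_exit_code(1))
```

The exit code identifies only the signal. The Reason field is still needed to tell who sent it.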

The two different scenarios that both produce OOMKilled

Container exceeded its own memory limit: the most common case. The container's cgroup memory usage reached the configured resources.limits.memory value, and the kernel's OOM killer terminated a process inside that cgroup to enforce the container-level boundary.
The node itself ran out of memory: less common, but possible when requests/limits across the node are poorly configured (heavy overcommitment) and the kubelet's node-level memory-pressure eviction did not act quickly enough. In this case the kernel OOM killer picks victims node-wide, so even a container that is under its own individual limit can be killed to reclaim memory for the node.

Diagnosing whether this is a real leak or just an under-provisioned limit

kubectl top pod my-app-xyz --containers

Note that kubectl top shows only a single snapshot (and requires metrics-server to be installed). It is better to look at the memory usage trend over time in a monitoring system such as Prometheus/Grafana.

A graph that climbs steadily and never plateaus, tracking time since the container started rather than request volume, strongly suggests a genuine memory leak in the application. A leak will eventually hit any limit you set, no matter how generous, and needs an actual code fix.

A graph that climbs with load and then levels off right around the configured limit, with kills happening only at peaks, suggests the limit is simply set too low for the application's legitimate steady-state working set. The fix here is raising the limit to a value with reasonable headroom above observed real usage, not a code change.
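To make the "keeps climbing" versus "levels off" distinction concrete, here is a rough heuristic sketch in Python. Everything in it is an assumption made for illustration: the function name classify_trend, the 5% tolerance, the use of the last third of the window, and the sample readings (which are invented, not measurements). It looks only at the shape of the curve; checking whether growth tracks uptime or request volume still requires comparing against traffic data.

```python
def classify_trend(samples_mib, plateau_tolerance=0.05):
    """Compare growth in the last third of the window with overall growth.

    samples_mib: memory readings in MiB, oldest first, evenly spaced in time.
    """
    if len(samples_mib) < 6:
        return "not enough data"
    total_growth = samples_mib[-1] - samples_mib[0]
    if total_growth <= 0:
        return "flat or shrinking"
    third = len(samples_mib) // 3
    late_growth = samples_mib[-1] - samples_mib[-third]
    # If the final third adds almost nothing to the total growth,
    # usage has levelled off.
    if late_growth <= plateau_tolerance * total_growth:
        return "plateau"
    return "still growing"

# Invented example readings, in MiB
leak_like = [100, 150, 200, 250, 300, 350, 400, 450, 500]
plateau_like = [100, 200, 300, 380, 400, 405, 406, 405, 406]

print(classify_trend(leak_like))
print(classify_trend(plateau_like))
```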

Fixing a genuine memory leak

This requires actual application-level investigation — heap dumps/profiling tools appropriate to the language runtime (e.g., a Java heap dump analyzed with a profiler, Node.js's --inspect and Chrome DevTools memory profiling, Python's tracemalloc) to identify what's actually accumulating and never being released. Kubernetes-level tooling can tell you that memory is growing unboundedly and when the kill happens, but not why the application's own code is holding onto memory it should have freed — that's an application-level debugging problem layered on top of the Kubernetes-level symptom.

Fixing an under-provisioned limit

resources:
  requests:
    memory: "512Mi"    # raised to reflect realistic steady-state usage
  limits:
    memory: "768Mi"    # some headroom above typical peak, not unlimited

Raise the limit based on actual measured usage data, not a guess. Also consider whether a Vertical Pod Autoscaler (see the scheduling topic), run in recommendation-only mode (updateMode: "Off"), could help right-size this from real historical usage rather than manual tuning each time.
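The "measured usage plus headroom" idea can be written down as simple arithmetic. The sketch below is illustrative only: the 25% headroom, the rounding to a multiple of 64Mi, and the 600 MiB observed peak are all assumed values chosen for the example, not recommendations or measurements.

```python
import math

def suggest_memory_limit_mib(observed_peak_mib, headroom=0.25, round_to_mib=64):
    """Observed peak plus fractional headroom, rounded up to a tidy multiple."""
    if observed_peak_mib <= 0:
        raise ValueError("observed_peak_mib must be positive")
    target = observed_peak_mib * (1 + headroom)
    return math.ceil(target / round_to_mib) * round_to_mib

# Assumed peak of 600 MiB taken from monitoring (hypothetical number)
limit = suggest_memory_limit_mib(600)
print(f'memory: "{limit}Mi"')
```

The right headroom depends on how spiky the workload is, so treat the result as a starting point to validate against monitoring data.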

A subtlety worth knowing: OOMKilled doesn't always mean CrashLoopBackOff

A single OOMKill, if the container then starts fine and runs stably afterward, just shows up as one restart with OOMKilled as the previous state — it only becomes a CrashLoopBackOff if the container keeps hitting the same memory ceiling repeatedly, shortly after each restart. Seeing one isolated OOMKill in history is a signal worth investigating but not necessarily an active incident; a repeating pattern of OOMKills is the more urgent case demanding immediate action.
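The same distinction can be read programmatically from the pod's status, for example from the output of kubectl get pod my-app-xyz -o json. The sketch below assumes that JSON has already been loaded into a dict. The summarize_oom helper is a hypothetical name, and sample_pod is a hand-written, trimmed-down example rather than real cluster output.

```python
def summarize_oom(pod: dict) -> list:
    """Report, per container, whether the last exit was an OOMKill
    and whether the container is currently crash-looping."""
    results = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        last = cs.get("lastState", {}).get("terminated", {})
        waiting = cs.get("state", {}).get("waiting", {})
        results.append({
            "container": cs["name"],
            "restarts": cs.get("restartCount", 0),
            "last_oomkilled": last.get("reason") == "OOMKilled",
            "crash_looping": waiting.get("reason") == "CrashLoopBackOff",
        })
    return results

# Hand-written example: one past OOMKill, container now running normally
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "my-app",
                "restartCount": 1,
                "state": {"running": {}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    }
}

for row in summarize_oom(sample_pod):
    print(row)
```

A row with last_oomkilled set and crash_looping unset, with a low restart count, matches the isolated case described above. A climbing restart count together with crash_looping is the urgent one.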