Observability and Troubleshooting

Why three different probes exist

"Is this container healthy" turns out to have several distinct meanings, and conflating them causes real production problems — a container that's alive but not yet ready to serve traffic (still loading a large cache) shouldn't be killed, and a slow-starting container shouldn't be judged against the same timing as a fully-warmed-up one.

Liveness probe — is this container still working?

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

If this probe fails failureThreshold times in a row, the kubelet kills the container and restarts it (subject to the Pod's restartPolicy). This is meant to catch a process that is technically still running but stuck in a broken state it can't recover from on its own (deadlocked, stuck in an infinite loop), where a restart is the appropriate remedy. A liveness probe should fail only for problems a restart would actually fix. Checking a downstream dependency (like a database connection) in a liveness probe is a common and dangerous misconfiguration: when the database is down, the container is restarted endlessly for a problem that restarting can't solve, when it should simply have been marked not-ready.

Readiness probe — is this container currently able to serve traffic?

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5

If this probe fails, the Pod is removed from the Service's endpoints (see the networking topic) — traffic stops being routed to it — but the container is left running, not restarted. This is the correct mechanism for temporary, self-resolving unavailability: warming up a cache on startup, briefly reconnecting to a dependency, or gracefully draining in-flight requests before shutdown. This is also exactly the mechanism that makes rolling updates safe (see the workload controllers topic) — a new Pod only starts receiving traffic once its readiness probe passes.
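The effect of a failing readiness probe can be observed from the command line. The following is a minimal sketch rather than a definitive procedure: the function names are made up for illustration, and it assumes the Service is backed by EndpointSlices carrying the standard kubernetes.io/service-name label.

```shell
# Print the Pod's Ready condition ("True" or "False") as reported by the API server.
pod_ready() {
  # $1 = Pod name
  kubectl get pod "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# List the EndpointSlices backing a Service. A Pod that fails its readiness
# probe stops being listed as a ready endpoint while its container keeps running.
service_endpoints() {
  # $1 = Service name
  kubectl get endpointslices -l "kubernetes.io/service-name=$1"
}

# Usage (hypothetical names):
#   pod_ready my-app-xyz
#   service_endpoints my-app-service
```

If pod_ready prints False while the container's restart count stays flat, the readiness probe is doing its job: traffic is withheld, but nothing is being killed.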

Startup probe — protects slow-starting containers

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10       # allows up to 300 seconds (30 x 10) for startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10

While a startup probe is configured and hasn't yet succeeded, liveness and readiness probes are disabled entirely. This exists specifically for applications with a slow, variable startup time (a large in-memory cache warm-up, a JVM application with a long class-loading phase). Without it, a liveness probe's tighter timing, tuned for steady-state health checking, would kill the container while it was still in its legitimate startup phase. Once the startup probe succeeds, the liveness probe takes over with its normal timing.
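The timing budget of a probe is simple arithmetic, sketched below. The helper name probe_budget is hypothetical, and the formula is an approximation that ignores timeoutSeconds and scheduling jitter.

```shell
# Approximate worst-case time (seconds) before a probe's failures trigger action:
# initialDelaySeconds + failureThreshold * periodSeconds.
probe_budget() {
  # $1 = failureThreshold, $2 = periodSeconds, $3 = initialDelaySeconds (default 0)
  echo $(( ${3:-0} + $1 * $2 ))
}

probe_budget 30 10      # startup probe above: prints 300
probe_budget 3 10 15    # liveness probe from the first example: prints 45
```

The contrast between the two numbers is the point: the startup probe tolerates minutes of slowness exactly once, while the liveness probe reacts to a hung process in well under a minute.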

Side-by-side summary

Probe     | Checks                             | On failure                                     | Typical use
----------|------------------------------------|------------------------------------------------|------------------------------------------------------------
Liveness  | Is the process still functioning   | Kill and restart the container                 | Detecting deadlocks/unrecoverable internal states
Readiness | Can it currently serve traffic     | Remove from Service endpoints (no restart)     | Temporary unavailability, warm-up, graceful shutdown
Startup   | Has the (slow) startup completed   | Delays liveness/readiness checks until success | Slow-starting applications, avoiding premature liveness kills

Why getting this wrong causes real incidents

The single most common probe misconfiguration is a liveness probe that's too strict or that checks the wrong thing (a downstream dependency instead of the process's own health). This produces exactly the symptom covered in the CrashLoopBackOff troubleshooting question: a container repeatedly killed and restarted for a condition that restarting can never fix. It often makes an already-degraded situation (a slow dependency) actively worse by adding restart churn on top of it.

What CrashLoopBackOff actually means

The container is starting, then exiting (crashing, or being killed) repeatedly, and Kubernetes is deliberately backing off between restart attempts (waiting progressively longer — 10s, 20s, 40s, up to a cap around 5 minutes) rather than restarting instantly and indefinitely, which would otherwise hammer the node with a tight restart loop.

kubectl get pods
# NAME        READY   STATUS             RESTARTS   AGE
# my-app-xyz  0/1     CrashLoopBackOff   7          12m
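The backoff progression is easy to reproduce as arithmetic. The sketch below assumes the delay starts at 10 seconds, doubles after each crash, and is capped at 300 seconds, matching the figures quoted above; the exact cap and reset behavior belong to the kubelet, so treat the numbers as illustrative. The function name is hypothetical.

```shell
# Print the first N restart delays in seconds: start at 10, double each time,
# and never exceed the 300-second cap.
backoff_schedule() {
  # $1 = number of restarts to show
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$1" ]; do
    out="${out:+$out }$delay"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

This is why a Pod with 7 restarts after 12 minutes, as in the listing above, is plausible: most of that time was spent waiting, not running.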

Step 1: check the previous (crashed) container's logs

kubectl logs my-app-xyz --previous

This is the single most important first command — --previous retrieves logs from the last terminated instance of the container, which usually contains the actual error message explaining why it crashed (a stack trace, a missing environment variable error, a failed database connection). Without --previous, kubectl logs shows the current (possibly still-starting, possibly not-yet-logged-anything) attempt, which may be empty or unhelpful.

Step 2: check describe for exit code and events

kubectl describe pod my-app-xyz

Look specifically at:

  • Last State: Terminated, Reason, Exit Code — a specific exit code narrows down the cause considerably: 0 (clean exit — odd for a container that's supposed to run forever, might indicate the main process finished and exited normally when it shouldn't have), 1 (generic application error), 137 (128+SIGKILL — often an OOMKill, see that question, or a liveness probe failure), 143 (128+SIGTERM — graceful termination, possibly from a liveness probe or a manual action).
  • Events at the bottom — often shows directly whether a liveness probe is failing and killing the container (Liveness probe failed: ...), which points straight at a probe misconfiguration rather than the application itself being broken.
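The exit-code rules of thumb above can be captured in a small lookup. This is only a sketch with a made-up function name; the descriptions mirror the list above, and the fallback relies on the shell convention that codes above 128 mean "128 plus a signal number".

```shell
# Translate a container exit code into a first hypothesis about the cause.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "clean exit: main process finished on its own" ;;
    1)   echo "generic application error" ;;
    137) echo "SIGKILL (128+9): possible OOMKill or forced kill" ;;
    143) echo "SIGTERM (128+15): termination was requested" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $(( code - 128 ))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

explain_exit_code 137
explain_exit_code 139   # prints: killed by signal 11
```

The output is a starting hypothesis only; the Reason field and the Events section decide which interpretation actually applies.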

Common root causes, roughly in order of frequency

  1. Application error on startup — a missing environment variable, a bad configuration file, an unhandled exception during initialization. The --previous logs should show this directly.
  2. Misconfigured liveness probe — the probe is checking something that isn't actually indicative of a fatal problem (e.g., a downstream dependency being briefly unavailable), causing an otherwise-healthy container to be killed repeatedly (see the probes question).
  3. Missing dependency/resource — the application can't reach a required database, another service, or a mounted ConfigMap/Secret that doesn't exist or is misnamed.
  4. OOMKilled (see that question) — the container is repeatedly exceeding its memory limit; describe pod will show OOMKilled as the termination reason distinctly from a generic crash.
  5. Immediate exit due to incorrect container command/entrypoint — e.g., a container built to run a one-shot script rather than a long-running server process, used incorrectly in a Deployment (which expects the main process to keep running).

When logs alone aren't enough

kubectl exec -it my-app-xyz -- /bin/sh    # only works while the container is actually running

If the crash happens too fast to exec into the container, consider temporarily overriding the Pod's command to something that keeps it alive long enough to investigate interactively (command: ["sleep", "3600"], in a debug copy of the manifest — never in the real production manifest), or use kubectl debug (a newer, purpose-built command for attaching an ephemeral debug container to a running or crashing Pod) to get a shell alongside the problematic container without needing to modify its spec at all.
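One way to get the "debug copy with an overridden command" without hand-editing a manifest is kubectl debug with --copy-to, sketched below. The function names and the "-debug" suffix are illustrative choices, and the sketch assumes the image ships a shell at sh; for shell-less images an ephemeral debug container is the better route.

```shell
# Create a copy of a crashing Pod in which the named container runs an
# interactive shell instead of its normal command. The original Pod is untouched.
debug_copy() {
  # $1 = Pod name, $2 = name of the container whose command is replaced
  kubectl debug "$1" -it --copy-to="$1-debug" --container="$2" -- sh
}

# Remove the copy afterwards; it is a standalone Pod, not managed by any controller.
debug_cleanup() {
  # $1 = original Pod name
  kubectl delete pod "$1-debug"
}

# Usage (hypothetical names):
#   debug_copy my-app-xyz app
#   debug_cleanup my-app-xyz
```

Because the copy is separate from the workload's manifest, there is no risk of the sleep-or-shell override leaking into production.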

Naming --previous specifically (rather than just "check the logs") is a strong, concrete signal of hands-on debugging experience — it's the detail that trips up people who've only read about Kubernetes without actually having debugged a real crash loop.

ImagePullBackOff — the kubelet can't pull the image

kubectl describe pod my-app-xyz
# Events:
#   Warning  Failed     kubelet  Failed to pull image "myapp:1.0.": rpc error: ...
#   Warning  BackOff    kubelet  Back-off pulling image "myapp:1.0."

Common causes, in rough order of frequency:

  1. Typo in the image name or tag — a trailing period (as in the example above — myapp:1.0. instead of myapp:1.0), a misspelled repository name, or a tag that was never actually pushed. kubectl describe pod shows the exact image string the kubelet tried to pull, and the exact error the registry returned — often enough to spot the typo immediately.
  2. Private registry requiring authentication — if the image is in a private registry and no credentials are configured, the pull fails with an authentication/authorization error. Fix by creating a docker-registry Secret and referencing it via imagePullSecrets in the Pod spec (or the associated ServiceAccount, so every Pod using that ServiceAccount picks it up automatically).
spec:
  imagePullSecrets:
    - name: my-registry-credentials
  containers:
    - name: app
      image: private-registry.example.com/myapp:1.0
  3. Network connectivity issue: the node genuinely can't reach the registry (a firewall rule, a DNS resolution problem, the registry being down). Worth checking directly from a node or a debug Pod if credentials and image name both check out.
  4. Rate limiting: some public registries (notably Docker Hub) impose pull rate limits per IP/account; a burst of Pod creations across many nodes can occasionally hit this, especially without an authenticated account configured for higher limits.
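For the private-registry case, the credentials Secret and the ServiceAccount association can be set up roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the credentials are read from REGISTRY_USER and REGISTRY_PASSWORD environment variables, which are placeholders invented for this example.

```shell
# Create a docker-registry Secret holding the pull credentials.
create_pull_secret() {
  # $1 = Secret name, $2 = registry host
  kubectl create secret docker-registry "$1" \
    --docker-server="$2" \
    --docker-username="$REGISTRY_USER" \
    --docker-password="$REGISTRY_PASSWORD"
}

# Attach the Secret to a ServiceAccount so that Pods using it pick the
# credentials up without listing imagePullSecrets themselves.
attach_pull_secret() {
  # $1 = ServiceAccount name, $2 = Secret name
  kubectl patch serviceaccount "$1" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$2\"}]}"
}

# Usage (hypothetical names):
#   create_pull_secret my-registry-credentials private-registry.example.com
#   attach_pull_secret default my-registry-credentials
```

Only Pods created after the patch see the change, so an existing Pod stuck in ImagePullBackOff has to be recreated.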

Pending — the scheduler can't place the Pod anywhere

kubectl describe pod my-app-xyz
# Events:
#   Warning  FailedScheduling  default-scheduler  0/5 nodes are available:
#     3 Insufficient memory, 2 node(s) had taint {dedicated: gpu}, that the pod didn't tolerate.

The Events section of describe pod is the essential first stop — it states, in plain language, exactly why every node was rejected during scheduling's filtering phase (see the scheduling question). Common reasons:

  1. Insufficient resources — no node has enough unreserved CPU/memory to satisfy the Pod's requests; either the cluster genuinely needs more capacity (Cluster Autoscaler should address this automatically if configured — see that question), or the Pod's requests are set unrealistically high.
  2. Unsatisfied taints/tolerations or node affinity — the Pod requires something (a specific node label, tolerance for a taint) that no current node provides.
  3. Volume/topology issues — a PersistentVolumeClaim can't be bound or provisioned (e.g., a StorageClass misconfiguration, or a zone-topology mismatch between where the volume was created and where the Pod could be scheduled — see the StorageClass question).
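  4. PodDisruptionBudget or admission webhook rejection: neither normally produces a Pending Pod. A PodDisruptionBudget only limits voluntary evictions, and an admission webhook that rejects the request prevents the Pod from being created at all, so the error appears in the events of the owning ReplicaSet rather than on a Pod. Worth checking when an expected Pod is missing entirely rather than Pending.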
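A few read-only queries cover most of the reasons listed above. The function names are hypothetical and the sketch assumes nothing beyond standard kubectl selectors and output formats.

```shell
# Pods whose phase is Pending, across all namespaces. Pods stuck in
# ImagePullBackOff also report phase Pending, so they appear here too.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Scheduler rejections recorded in a namespace, oldest first.
scheduling_failures() {
  # $1 = namespace
  kubectl get events -n "$1" --field-selector=reason=FailedScheduling \
    --sort-by=.lastTimestamp
}

# Taint keys per node, to compare against the Pod's tolerations.
node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# Usage:
#   pending_pods
#   scheduling_failures production
#   node_taints
```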

The universal first diagnostic command

kubectl describe pod <pod-name>

For both ImagePullBackOff and Pending, this single command's Events section is almost always where the actual, specific, human-readable reason lives — the general debugging instinct should always be "read the events before guessing," rather than jumping straight to speculation about what might be wrong.

kubectl describe — the essential first command

kubectl describe pod my-app-xyz

Shows the Pod's full spec, its current status/conditions (see the pod-lifecycle question), each container's current and last state (with exit codes/reasons), and — critically — the Events section at the bottom, which is a chronological log of everything that's happened to this specific object recently (scheduling decisions, probe failures, image pull attempts). This should almost always be the very first command run when investigating any Pod problem.

kubectl logs — application-level output

kubectl logs my-app-xyz                      # current container's stdout/stderr
kubectl logs my-app-xyz --previous            # the LAST TERMINATED instance's logs (essential for crash loops)
kubectl logs my-app-xyz -c sidecar-container   # a specific container, for multi-container Pods
kubectl logs my-app-xyz --since=10m            # only recent logs, useful on a noisy long-running app
kubectl logs -f my-app-xyz                     # follow/stream logs live

kubectl exec — an interactive shell inside the container

kubectl exec -it my-app-xyz -- /bin/sh

Lets you poke around inside a currently running container directly — check environment variables, test network connectivity to a dependency, inspect mounted config files. Only useful if the container stays up long enough to attach to (not helpful for a container that crashes within milliseconds of starting).

kubectl debug — attaching to a Pod without modifying it

kubectl debug -it my-app-xyz --image=busybox --target=<container-name>   # --target names a container in the Pod, not the Pod itself

A more modern alternative for cases where exec isn't sufficient — e.g., the container image itself has no shell at all (common for minimal/distroless production images), or the Pod is crashing too fast to exec into. This attaches an ephemeral debug container to the existing Pod (sharing its network/process namespace, depending on flags), letting you investigate using a full-featured debug image without altering the original Pod's spec.

kubectl get events — cluster/namespace-wide recent activity

kubectl get events --sort-by=.lastTimestamp -n production

Useful when you're not yet sure which specific object is actually at fault — shows recent events across the whole namespace (or cluster, with -A), which can reveal a problem at a different layer than the one you started investigating (e.g., you're looking at a Pod, but the real root cause event was a failed PVC provisioning or a node becoming NotReady).

kubectl top — current resource usage

kubectl top pod my-app-xyz
kubectl top node

Requires metrics-server (or an equivalent) to be running in the cluster (see the HPA question) — shows current, real-time CPU/memory usage, useful for confirming whether a Pod is actually approaching its resource limits (a lead-in to investigating OOMKills or CPU throttling — see those questions) without needing a full metrics/monitoring stack for a quick, immediate check.

The general debugging workflow this toolkit supports

  1. kubectl get pods — spot which Pod(s) are unhealthy and their current status/phase.
  2. kubectl describe pod — get the specific reason (events, conditions, container states).
  3. kubectl logs (with --previous if relevant) — get the application's own explanation, if it logged one.
  4. kubectl exec/kubectl debug — interactively investigate further if logs/describe aren't sufficient.
  5. kubectl get events/kubectl top — widen the investigation if the root cause seems to be somewhere other than the Pod itself.
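The first steps of this sequence can be bundled into one helper for a quick first pass. This is a convenience sketch, not a standard tool: triage_pod is a made-up name, the namespace defaults to "default", and the 50-line log tail is an arbitrary choice.

```shell
# Run the read-only triage steps for one Pod, in the order described above.
triage_pod() {
  # $1 = Pod name, $2 = namespace (default: "default")
  pod=$1
  ns=${2:-default}
  # Step 1: current status and node placement.
  kubectl get pod "$pod" -n "$ns" -o wide
  # Step 2: conditions, container states, and events.
  kubectl describe pod "$pod" -n "$ns"
  # Step 3: logs of the last terminated instance, if there was one.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 \
    || echo "(no previous container instance)"
  # Step 5: events that reference this Pod, oldest first.
  kubectl get events -n "$ns" --field-selector="involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Usage (hypothetical names):
#   triage_pod my-app-xyz production
```

Step 4 (exec or debug) is left out on purpose, since it is interactive and only worth reaching for once the read-only output has been exhausted.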

Being fluent with this sequence — not just knowing the commands exist, but knowing the order and reason to reach for each — is what separates real hands-on troubleshooting experience from surface familiarity with kubectl's command list.

Confirming OOMKilled is actually the cause

kubectl describe pod my-app-xyz
# Last State:  Terminated
#   Reason:    OOMKilled
#   Exit Code: 137

Exit Code: 137 (128 + 9, where 9 is SIGKILL) combined with Reason: OOMKilled confirms this specific cause definitively. The Reason field is the part that matters: exit code 137 alone only says the process received SIGKILL, which can also follow a failed liveness probe, while a generic application crash typically shows Reason: Error with a different exit code. This distinction matters because the fix is completely different depending on which one occurred.
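The same fields can be pulled out directly instead of being read from the describe output, which is handy in scripts. A sketch with a hypothetical function name; it prints one line per container, with empty fields for containers that have never terminated.

```shell
# Print "<container>\t<reason>\t<exit code>" of the last terminated instance
# for every container in the Pod.
last_termination() {
  # $1 = Pod name
  kubectl get pod "$1" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage (hypothetical name):
#   last_termination my-app-xyz
```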

The two different scenarios that both produce OOMKilled

  1. Container exceeded its own memory limit — the most common case; the container's cgroup memory usage crossed the configured resources.limits.memory value, and the kernel killed it specifically for exceeding that container-level boundary.
  2. The node itself ran out of memory — less common, but possible if requests/limits across the node are poorly configured (heavy overcommitment) or the kubelet's node-level memory-pressure eviction didn't act quickly enough; in this case even a container technically under its own individual limit can be killed as part of the node trying to reclaim memory generally.

Diagnosing whether this is a real leak or just an under-provisioned limit

kubectl top pod my-app-xyz --containers

Or, better, look at the memory usage trend over time in a real monitoring system (Prometheus/Grafana) rather than a single snapshot:

  • A graph that climbs steadily and never plateaus, correlating with time since the container started (not with request volume), strongly suggests a genuine memory leak in the application. It will eventually hit any limit you set, no matter how generous, and needs an actual code fix.
  • A graph that climbs with load and levels off right around the configured limit (or gets killed just as it approaches its plateau) suggests the limit is simply set too low for the application's legitimate, steady-state working set. The fix here is raising the limit to a value with reasonable headroom above observed real usage, not a code change.
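When only a list of exported samples is at hand, the "still climbing or levelled off" question can be approximated numerically. The following is a deliberately crude heuristic, not an established method: the function name, the quarter-window comparison, and the 5% default tolerance are all assumptions made for this sketch, and it is no substitute for looking at the graph.

```shell
# Read one memory sample (MiB) per line on stdin, oldest first. Compare the
# last quarter of samples with the quarter before it and report whether usage
# is still rising by more than the tolerance (percent, default 5).
memory_trend() {
  awk -v tol="${1:-5}" '
    { s[NR] = $1 }
    END {
      q = int(NR / 4)
      if (q < 1) { print "insufficient-data"; exit }
      for (i = NR - 2 * q + 1; i <= NR - q; i++) prev += s[i]
      for (i = NR - q + 1; i <= NR; i++) last += s[i]
      if (prev == 0) { print "insufficient-data"; exit }
      growth = (last - prev) / prev * 100
      if (growth > tol) print "still-growing"
      else print "plateaued"
    }'
}

printf '%s\n' 100 200 300 400 500 600 700 800 | memory_trend
printf '%s\n' 100 300 500 500 500 500 500 500 | memory_trend
```

The first series keeps rising and is reported as still-growing; the second flattens out and is reported as plateaued. A "still-growing" result on samples taken under constant load is the pattern that points toward a leak.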

Fixing a genuine memory leak

This requires actual application-level investigation — heap dumps/profiling tools appropriate to the language runtime (e.g., a Java heap dump analyzed with a profiler, Node.js's --inspect and Chrome DevTools memory profiling, Python's tracemalloc) to identify what's actually accumulating and never being released. Kubernetes-level tooling can tell you that memory is growing unboundedly and when the kill happens, but not why the application's own code is holding onto memory it should have freed — that's an application-level debugging problem layered on top of the Kubernetes-level symptom.

Fixing an under-provisioned limit

resources:
  requests:
    memory: "512Mi"    # raised to reflect realistic steady-state usage
  limits:
    memory: "768Mi"    # some headroom above typical peak, not unlimited

Raise the limit based on actual measured usage data, not a guess — and consider whether a Vertical Pod Autoscaler (see the scheduling topic), run in recommendation mode, could help right-size this automatically based on real historical usage rather than manual tuning each time.
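Deriving the limit from a measured baseline is plain arithmetic. In the sketch below the function name and the 50% default headroom are assumptions chosen to reproduce the example figures above, not a recommendation; the right headroom depends on how spiky the workload is.

```shell
# Suggest a memory limit (MiB) as measured usage plus a headroom percentage.
suggest_limit_mib() {
  # $1 = measured usage in MiB, $2 = headroom percent (default 50)
  echo $(( $1 * (100 + ${2:-50}) / 100 ))
}

suggest_limit_mib 512      # prints 768, the values used in the manifest above
suggest_limit_mib 600 25   # prints 750
```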

A subtlety worth knowing: OOMKilled doesn't always mean CrashLoopBackOff

A single OOMKill, if the container then starts fine and runs stably afterward, just shows up as one restart with OOMKilled as the previous state — it only becomes a CrashLoopBackOff if the container keeps hitting the same memory ceiling repeatedly, shortly after each restart. Seeing one isolated OOMKill in history is a signal worth investigating but not necessarily an active incident; a repeating pattern of OOMKills is the more urgent case demanding immediate action.