Why three different probes exist
"Is this container healthy" turns out to have several distinct meanings, and conflating them causes real production problems — a container that's alive but not yet ready to serve traffic (still loading a large cache) shouldn't be killed, and a slow-starting container shouldn't be judged against the same timing as a fully-warmed-up one.
Liveness probe — is this container still working?
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
If this probe fails failureThreshold times in a row, the kubelet kills the container and restarts it (subject to the Pod's restartPolicy). With the values above, that is roughly 30 seconds of consecutive failures (3 x 10s). This is meant to catch a process that is technically still running but has reached a broken state it can't recover from on its own (deadlocked, stuck in an infinite loop), where a restart is the appropriate remedy. A liveness probe should fail only for problems a restart would actually fix. A liveness probe that checks a downstream dependency (like a database connection) is a common and dangerous misconfiguration: it causes the container to be restarted endlessly for a problem that restarting can't solve (the database being down), when the right response is to mark it not-ready.
Readiness probe — is this container currently able to serve traffic?
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
If this probe fails, the Pod is removed from the Service's endpoints (see the networking topic) — traffic stops being routed to it — but the container is left running, not restarted. This is the correct mechanism for temporary, self-resolving unavailability: warming up a cache on startup, briefly reconnecting to a dependency, or gracefully draining in-flight requests before shutdown. This is also exactly the mechanism that makes rolling updates safe (see the workload controllers topic) — a new Pod only starts receiving traffic once its readiness probe passes.
Startup probe — protects slow-starting containers
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10   # allows up to 300 seconds (30 x 10) for startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
While a startup probe is configured and hasn't yet succeeded, liveness and readiness probes are disabled entirely. This exists specifically for applications with a slow, variable startup time (a large in-memory cache warm-up, a JVM application with a long class-loading phase). Without it, a liveness probe's tighter timing, tuned for steady-state health checking, would kill the container while it is still in its legitimate startup phase. If the startup probe itself never succeeds within failureThreshold x periodSeconds, the kubelet kills the container and applies the Pod's restartPolicy.
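The startup budget is plain arithmetic, and it can help to compute it explicitly when tuning values. A minimal sketch, assuming a hypothetical helper name (startup_budget_seconds) and treating initialDelaySeconds as an optional third argument; it ignores timeoutSeconds, so the result is an approximate upper bound rather than an exact figure:

```shell
# Hypothetical helper: approximate worst-case seconds a container gets to
# start before the startup probe gives up and the kubelet restarts it.
# Usage: startup_budget_seconds <failureThreshold> <periodSeconds> [initialDelaySeconds]
startup_budget_seconds() {
  failure_threshold=$1
  period_seconds=$2
  initial_delay=${3:-0}   # optional; defaults to 0 when not given
  echo $(( initial_delay + failure_threshold * period_seconds ))
}

startup_budget_seconds 30 10   # the startupProbe values shown above; prints 300
```

If measured startup times regularly come close to this number, raising failureThreshold is usually a safer lever than loosening the liveness probe.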
Side-by-side summary
| Probe | Checks | On failure | Typical use |
|---|---|---|---|
| Liveness | Is the process still functioning | Kill and restart the container | Detecting deadlocks/unrecoverable internal states |
| Readiness | Can it currently serve traffic | Remove from Service endpoints (no restart) | Temporary unavailability, warm-up, graceful shutdown |
| Startup | Has the (slow) startup completed | Liveness/readiness checks are held off until it succeeds; the container is killed and restarted if it never succeeds within the threshold | Slow-starting applications, avoiding premature liveness kills |
Why getting this wrong causes real incidents
The single most common probe misconfiguration is a liveness probe that's too strict or that checks the wrong thing (a downstream dependency instead of the process's own health). This produces exactly the symptom covered in the CrashLoopBackOff troubleshooting question: a container repeatedly killed and restarted for a condition restarting can never fix. It often makes an already-degraded situation (a slow dependency) actively worse by adding restart churn on top of it.
What CrashLoopBackOff actually means
The container is starting, then exiting (crashing, or being killed) repeatedly, and Kubernetes is deliberately backing off between restart attempts (waiting progressively longer — 10s, 20s, 40s, up to a cap around 5 minutes) rather than restarting instantly and indefinitely, which would otherwise hammer the node with a tight restart loop.
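To make the back-off concrete, the sketch below prints the delay before each successive restart. It assumes the 10-second base, doubling, and 5-minute (300s) cap described above; the function name is hypothetical, and this models the schedule only, not the kubelet's actual implementation:

```shell
# Hypothetical helper: print the back-off delay (seconds) before each of the
# first N restarts, assuming base 10s, doubling each time, capped at 300s.
backoff_schedule() (
  restarts=$1
  delay=10
  cap=300
  i=0
  out=""
  while [ "$i" -lt "$restarts" ]; do
    out="$out${out:+ }$delay"        # append, space-separated
    delay=$(( delay * 2 ))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
)

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

This is why a Pod with 7 restarts, as in the example output, can easily be more than ten minutes old.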
kubectl get pods
# NAME         READY   STATUS             RESTARTS   AGE
# my-app-xyz   0/1     CrashLoopBackOff   7          12m
Step 1: check the previous (crashed) container's logs
kubectl logs my-app-xyz --previous
This is the single most important first command — --previous retrieves logs from the last terminated instance of the container, which usually contains the actual error message explaining why it crashed (a stack trace, a missing environment variable error, a failed database connection). Without --previous, kubectl logs shows the current (possibly still-starting, possibly not-yet-logged-anything) attempt, which may be empty or unhelpful.
Step 2: check describe for exit code and events
kubectl describe pod my-app-xyz
Look specifically at:
- Last State: Terminated, Reason, Exit Code: a specific exit code narrows down the cause considerably: 0 (clean exit, which is odd for a container that's supposed to run forever and might indicate the main process finished and exited normally when it shouldn't have), 1 (generic application error), 137 (128 + 9, SIGKILL: often an OOMKill, see that question, or a liveness probe failure), 143 (128 + 15, SIGTERM: graceful termination, possibly from a liveness probe or a manual action).
- Events at the bottom: often shows directly whether a liveness probe is failing and killing the container (Liveness probe failed: ...), which points straight at a probe misconfiguration rather than the application itself being broken.
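The exit-code arithmetic above (128 plus the signal number) is easy to encode. A minimal sketch with a hypothetical function name; it only knows the two signals discussed here and gives a first guess, not a diagnosis:

```shell
# Hypothetical helper: translate a container exit code into a first guess.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$(( code - 128 ))            # codes above 128 encode a fatal signal
    case "$sig" in
      9)  name=SIGKILL ;;
      15) name=SIGTERM ;;
      *)  name=unknown ;;
    esac
    echo "killed by signal $sig ($name)"
  else
    echo "application error (exit $code)"
  fi
}

explain_exit_code 137   # prints: killed by signal 9 (SIGKILL)
```

The code itself can be read without scanning describe output, for example with kubectl get pod my-app-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}', assuming the container of interest is the first one in the Pod.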
Common root causes, roughly in order of frequency
- Application error on startup: a missing environment variable, a bad configuration file, an unhandled exception during initialization. The --previous logs should show this directly.
- Misconfigured liveness probe: the probe is checking something that isn't actually indicative of a fatal problem (e.g., a downstream dependency being briefly unavailable), causing an otherwise-healthy container to be killed repeatedly (see the probes question).
- Missing dependency/resource: the application can't reach a required database, another service, or a mounted ConfigMap/Secret that doesn't exist or is misnamed.
- OOMKilled (see that question): the container is repeatedly exceeding its memory limit; describe pod will show OOMKilled as the termination reason, distinct from a generic crash.
- Immediate exit due to incorrect container command/entrypoint: e.g., a container built to run a one-shot script rather than a long-running server process, used incorrectly in a Deployment (which expects the main process to keep running).
When logs alone aren't enough
kubectl exec -it my-app-xyz -- /bin/sh # only works if the container is currently running long enough
If the crash happens too fast to exec into the container, there are two options. One is to temporarily override the Pod's command with something that keeps it alive long enough to investigate interactively (command: ["sleep", "3600"]) in a debug copy of the manifest, never in the real production manifest. The other is kubectl debug, a newer, purpose-built command that attaches an ephemeral debug container to a running or crashing Pod, giving you a shell alongside the problematic container without modifying its spec at all.
Naming --previous specifically (rather than just "check the logs") is a strong, concrete signal of hands-on debugging experience — it's the detail that trips up people who've only read about Kubernetes without actually having debugged a real crash loop.
ImagePullBackOff — the kubelet can't pull the image
kubectl describe pod my-app-xyz
# Events:
# Warning Failed kubelet Failed to pull image "myapp:1.0.": rpc error: ...
# Warning BackOff kubelet Back-off pulling image "myapp:1.0."
Common causes, in rough order of frequency:
- Typo in the image name or tag: a trailing period (as in the example above, myapp:1.0. instead of myapp:1.0), a misspelled repository name, or a tag that was never actually pushed. kubectl describe pod shows the exact image string the kubelet tried to pull, and the exact error the registry returned, which is often enough to spot the typo immediately.
- Private registry requiring authentication: if the image is in a private registry and no credentials are configured, the pull fails with an authentication/authorization error. Fix by creating a docker-registry Secret and referencing it via imagePullSecrets in the Pod spec (or the associated ServiceAccount, so every Pod using that ServiceAccount picks it up automatically).
spec:
  imagePullSecrets:
    - name: my-registry-credentials
  containers:
    - name: app
      image: private-registry.example.com/myapp:1.0
- Network connectivity issue — the node genuinely can't reach the registry (a firewall rule, a DNS resolution problem, the registry being down) — worth checking directly from a node or a debug Pod if credentials and image name both check out.
- Rate limiting — some public registries (notably Docker Hub) impose pull rate limits per IP/account; a burst of Pod creations across many nodes can occasionally hit this, especially without an authenticated account configured for higher limits.
Pending — the scheduler can't place the Pod anywhere
kubectl describe pod my-app-xyz
# Events:
# Warning FailedScheduling default-scheduler 0/5 nodes are available:
# 3 Insufficient memory, 2 node(s) had taint {dedicated: gpu}, that the pod didn't tolerate.
The Events section of describe pod is the essential first stop — it states, in plain language, exactly why every node was rejected during scheduling's filtering phase (see the scheduling question). Common reasons:
- Insufficient resources — no node has enough unreserved CPU/memory to satisfy the Pod's requests; either the cluster genuinely needs more capacity (Cluster Autoscaler should address this automatically if configured — see that question), or the Pod's requests are set unrealistically high.
- Unsatisfied taints/tolerations or node affinity — the Pod requires something (a specific node label, tolerance for a taint) that no current node provides.
- Volume/topology issues — a PersistentVolumeClaim can't be bound or provisioned (e.g., a StorageClass misconfiguration, or a zone-topology mismatch between where the volume was created and where the Pod could be scheduled — see the StorageClass question).
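- PodDisruptionBudget or admission webhook rejection: uncommon for a purely Pending state (an admission rejection normally stops the Pod from being created at all, and a PodDisruptionBudget governs evictions rather than scheduling), but worth checking if the Events mention an admission controller rejecting the request outright.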
The universal first diagnostic command
kubectl describe pod <pod-name>
For both ImagePullBackOff and Pending, this single command's Events section is almost always where the actual, specific, human-readable reason lives — the general debugging instinct should always be "read the events before guessing," rather than jumping straight to speculation about what might be wrong.
kubectl describe — the essential first command
kubectl describe pod my-app-xyz
Shows the Pod's full spec, its current status/conditions (see the pod-lifecycle question), each container's current and last state (with exit codes/reasons), and — critically — the Events section at the bottom, which is a chronological log of everything that's happened to this specific object recently (scheduling decisions, probe failures, image pull attempts). This should almost always be the very first command run when investigating any Pod problem.
kubectl logs — application-level output
kubectl logs my-app-xyz # current container's stdout/stderr
kubectl logs my-app-xyz --previous # the LAST TERMINATED instance's logs (essential for crash loops)
kubectl logs my-app-xyz -c sidecar-container # a specific container, for multi-container Pods
kubectl logs my-app-xyz --since=10m # only recent logs, useful on a noisy long-running app
kubectl logs -f my-app-xyz # follow/stream logs live
kubectl exec — an interactive shell inside the container
kubectl exec -it my-app-xyz -- /bin/sh
Lets you poke around inside a currently running container directly — check environment variables, test network connectivity to a dependency, inspect mounted config files. Only useful if the container stays up long enough to attach to (not helpful for a container that crashes within milliseconds of starting).
kubectl debug — attaching to a Pod without modifying it
kubectl debug -it my-app-xyz --image=busybox --target=<container-name>   # --target takes a container name, not the Pod name
A more modern alternative for cases where exec isn't sufficient — e.g., the container image itself has no shell at all (common for minimal/distroless production images), or the Pod is crashing too fast to exec into. This attaches an ephemeral debug container to the existing Pod (sharing its network/process namespace, depending on flags), letting you investigate using a full-featured debug image without altering the original Pod's spec.
kubectl get events — cluster/namespace-wide recent activity
kubectl get events --sort-by=.lastTimestamp -n production
Useful when you're not yet sure which specific object is actually at fault — shows recent events across the whole namespace (or cluster, with -A), which can reveal a problem at a different layer than the one you started investigating (e.g., you're looking at a Pod, but the real root cause event was a failed PVC provisioning or a node becoming NotReady).
kubectl top — current resource usage
kubectl top pod my-app-xyz
kubectl top node
Requires metrics-server (or an equivalent) to be running in the cluster (see the HPA question) — shows current, real-time CPU/memory usage, useful for confirming whether a Pod is actually approaching its resource limits (a lead-in to investigating OOMKills or CPU throttling — see those questions) without needing a full metrics/monitoring stack for a quick, immediate check.
The general debugging workflow this toolkit supports
1. kubectl get pods: spot which Pod(s) are unhealthy and their current status/phase.
2. kubectl describe pod: get the specific reason (events, conditions, container states).
3. kubectl logs (with --previous if relevant): get the application's own explanation, if it logged one.
4. kubectl exec / kubectl debug: interactively investigate further if logs/describe aren't sufficient.
5. kubectl get events / kubectl top: widen the investigation if the root cause seems to be somewhere other than the Pod itself.
Being fluent with this sequence — not just knowing the commands exist, but knowing the order and reason to reach for each — is what separates real hands-on troubleshooting experience from surface familiarity with kubectl's command list.
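The non-interactive parts of this workflow can be strung together in a small wrapper. This is a sketch under stated assumptions: triage_pod and the KUBECTL override are hypothetical names introduced here, the namespace defaults to default, and the interactive step (exec/debug) and kubectl top are left out on purpose:

```shell
# Hypothetical wrapper around the non-interactive triage steps.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands
# without touching a cluster.
triage_pod() {
  pod=$1
  ns=${2:-default}
  k=${KUBECTL:-kubectl}

  "$k" get pod "$pod" -n "$ns"          # step 1: status at a glance
  "$k" describe pod "$pod" -n "$ns"     # step 2: events, conditions, exit codes
  # step 3: logs of the last terminated instance; fall back to the current one
  "$k" logs "$pod" -n "$ns" --previous || "$k" logs "$pod" -n "$ns"
  # step 5: namespace-wide events, oldest first
  "$k" get events -n "$ns" --sort-by=.lastTimestamp
}

# Example: triage_pod my-app-xyz production
```

Keeping the commands in this order mirrors the reasoning: cheap status first, then the object's own events, then the application's explanation, then the wider namespace.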
Confirming OOMKilled is actually the cause
kubectl describe pod my-app-xyz
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Exit Code: 137 (128 + 9, where 9 is SIGKILL) combined with Reason: OOMKilled confirms this specific cause definitively — distinguishing it from a generic application crash, which would show a different exit code and reason. This distinction matters because the fix is completely different depending on which one occurred.
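When checking many Pods, it can be handy to pull just those two fields out of the describe output. A rough sketch, assuming the field layout shown above (a Last State line followed by Reason and Exit Code lines); last_termination is a hypothetical name, and it reports only the first container that has a Last State:

```shell
# Hypothetical filter: read `kubectl describe pod` output on stdin and print
# "<Reason> <ExitCode>" for the first container with a Last State block.
last_termination() {
  awk '
    /Last State:/           { in_last = 1; next }              # enter the block
    in_last && /Reason:/    { reason = $2 }                    # e.g. OOMKilled
    in_last && /Exit Code:/ { print reason, $3; exit }         # e.g. 137
  '
}

# Example: kubectl describe pod my-app-xyz | last_termination
```

For the output shown above this would print OOMKilled 137; a generic crash would show a different reason and exit code.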
The two different scenarios that both produce OOMKilled
- Container exceeded its own memory limit: the most common case; the container's cgroup memory usage crossed the configured resources.limits.memory value, and the kernel killed it specifically for exceeding that container-level boundary.
- The node itself ran out of memory: less common, but possible if requests/limits across the node are poorly configured (heavy overcommitment) or the kubelet's node-level memory-pressure eviction didn't act quickly enough; in this case even a container technically under its own individual limit can be killed as part of the node trying to reclaim memory generally.
Diagnosing whether this is a real leak or just an under-provisioned limit
kubectl top pod my-app-xyz --containers
Or, better, look at the memory usage trend over time in a real monitoring system (Prometheus/Grafana) rather than a single snapshot. A graph that climbs steadily and never plateaus, correlating with time since the container started rather than with request volume, strongly suggests a genuine memory leak in the application: something that will eventually hit any limit you set, no matter how generous, and needs an actual code fix. A graph that climbs with load and tops out right at the configured limit (the container is killed before it can reach the plateau it would otherwise settle at, just above that limit) suggests the limit is simply too low for the application's legitimate steady-state working set. The fix there is raising the limit to a value with reasonable headroom above observed real usage, not a code change.
Fixing a genuine memory leak
This requires actual application-level investigation — heap dumps/profiling tools appropriate to the language runtime (e.g., a Java heap dump analyzed with a profiler, Node.js's --inspect and Chrome DevTools memory profiling, Python's tracemalloc) to identify what's actually accumulating and never being released. Kubernetes-level tooling can tell you that memory is growing unboundedly and when the kill happens, but not why the application's own code is holding onto memory it should have freed — that's an application-level debugging problem layered on top of the Kubernetes-level symptom.
Fixing an under-provisioned limit
resources:
  requests:
    memory: "512Mi"   # raised to reflect realistic steady-state usage
  limits:
    memory: "768Mi"   # some headroom above typical peak, not unlimited
Raise the limit based on actual measured usage data, not a guess — and consider whether a Vertical Pod Autoscaler (see the scheduling topic), run in recommendation mode, could help right-size this automatically based on real historical usage rather than manual tuning each time.
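One way to turn a measured peak into a limit is to apply a fixed percentage of headroom. The sketch below is illustrative only: limit_with_headroom is a hypothetical helper, and the 50% figure is an assumption chosen to reproduce the 512Mi to 768Mi ratio in the example above, not a recommendation:

```shell
# Hypothetical helper: observed peak (MiB) plus a percentage of headroom,
# rounded up to a whole MiB and printed in Kubernetes quantity form.
# Usage: limit_with_headroom <peak_mib> <headroom_percent>
limit_with_headroom() {
  peak_mib=$1
  headroom_pct=$2
  echo "$(( (peak_mib * (100 + headroom_pct) + 99) / 100 ))Mi"
}

limit_with_headroom 512 50   # prints 768Mi
```

The peak should come from a monitoring system over a representative period rather than from a single kubectl top reading.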
A subtlety worth knowing: OOMKilled doesn't always mean CrashLoopBackOff
A single OOMKill, if the container then starts fine and runs stably afterward, just shows up as one restart with OOMKilled as the previous state — it only becomes a CrashLoopBackOff if the container keeps hitting the same memory ceiling repeatedly, shortly after each restart. Seeing one isolated OOMKill in history is a signal worth investigating but not necessarily an active incident; a repeating pattern of OOMKills is the more urgent case demanding immediate action.