Tell me about a time you diagnosed and fixed a production issue in a Kubernetes cluster.

Detailed Answer

This is a behavioral question with real technical substance. The interviewer wants a specific story that demonstrates hands-on Kubernetes debugging experience, not a generic or hypothetical account.

A strong structure (STAR-shaped, with real technical detail in the Action)

Situation: Open with a specific symptom. "A subset of our API's Pods began intermittently returning 503s around 2am" is far stronger than "there was an issue with our cluster." Specificity signals a real memory, not a fabricated example.

Task: What was actually at stake, and why the urgency mattered (customer-facing impact, an SLA at risk, a deployment that needed to be rolled back or fixed forward).

Action — the technical depth belongs here:

What was the first thing checked, and why that first? ("I started with kubectl get pods and noticed several Pods showing READY 0/1 while still Running — that told me it was a readiness issue, not a crash, so I didn't waste time chasing CrashLoopBackOff-style causes.")
What did deeper investigation reveal? ("kubectl describe pod showed the readiness probe was timing out, and kubectl logs showed the application was blocking on a slow downstream database query that had started spiking around the same time.")
What was the actual root cause? (a genuine chain of causation, e.g., a missing database index caused slow queries, which caused readiness probe timeouts, which pulled Pods out of the Service endpoints, which left too little capacity for the remaining load and produced 503s.)
What was the fix, and why that fix specifically? ("We added the missing index as an immediate fix, and separately opened a follow-up to tune the readiness probe's timeout, since it was more sensitive to transient slowness than it needed to be.")
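The first triage step above (telling a readiness problem apart from a crash using the READY and STATUS columns) can be sketched in a few lines of Python. This is a minimal illustration, not a real tool: the sample text and Pod names are hypothetical, and the parser assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...).

```python
from dataclasses import dataclass

# Illustrative sample only: Pod names and values are hypothetical,
# laid out in the default `kubectl get pods` column order.
SAMPLE_OUTPUT = """\
NAME                  READY   STATUS             RESTARTS   AGE
api-7d9c5b6f4-abcde   1/1     Running            0          3d
api-7d9c5b6f4-fghij   0/1     Running            0          3d
api-7d9c5b6f4-klmno   0/1     CrashLoopBackOff   12         3d
"""


@dataclass
class PodRow:
    name: str
    ready: int   # containers currently ready
    total: int   # containers in the Pod
    status: str


def parse_pods(text: str) -> list[PodRow]:
    """Parse the first three columns of `kubectl get pods`-style text."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        ready, total = fields[1].split("/")
        rows.append(PodRow(fields[0], int(ready), int(total), fields[2]))
    return rows


def triage(pod: PodRow) -> str:
    """Rough first-pass classification, mirroring the reasoning above."""
    if pod.status == "Running" and pod.ready < pod.total:
        # Process is up but failing its readiness check: look at the probe
        # and at what the application is waiting on, not at crash causes.
        return "readiness"
    if pod.status == "Running":
        return "healthy"
    return "not-running"


if __name__ == "__main__":
    for pod in parse_pods(SAMPLE_OUTPUT):
        print(pod.name, triage(pod))
```

In an interview the point is the reasoning rather than the script: "Running but 0/1" sends you to `kubectl describe pod` and the probe configuration, while a crash status sends you to previous container logs.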

Result: Concrete, measurable outcome — "503 rate dropped from 4% back to baseline within 10 minutes of the index being added, and we haven't seen a recurrence in the 3 months since." Specific numbers and a real timeframe are far more convincing than "it got fixed."

What separates a strong answer from a weak one

Weak: "A pod was down, so I restarted it and it was fine." (No real diagnostic depth, doesn't demonstrate understanding of why, sounds generic.)
Strong: Names specific kubectl commands and what each one's output actually told you, traces a genuine causal chain across layers (application → readiness probe → Service endpoints → traffic), and explains the reasoning connecting each step to the next.

Common technical themes worth having a real story ready for

Good candidates include anything from this stack's observability/troubleshooting topic (CrashLoopBackOff, OOMKilled, a Pod that's Running but not receiving traffic, a slow rollout stuck on a failing readiness probe) or something from the networking or scheduling topics (a NetworkPolicy unexpectedly blocking traffic, a Pod stuck Pending due to resource contention). Being able to go a couple of "why" questions deeper into whichever story you tell, beyond the surface-level fix, is what distinguishes real production experience from a rehearsed account.
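One way to rehearse these themes is to write down, for each symptom, the first command you would reach for. The lookup below is a hypothetical study aid, not an official runbook: the symptom labels and the choice of first commands are assumptions, and `<pod>`, `<service>`, `<deployment>`, and `<policy>` are placeholders.

```python
# Hypothetical study aid: symptom labels and command choices are one
# reasonable starting point, not an authoritative runbook.
FIRST_CHECKS = {
    "CrashLoopBackOff": [
        "kubectl logs <pod> --previous",   # logs from the crashed container
        "kubectl describe pod <pod>",      # exit code and recent events
    ],
    "OOMKilled": [
        "kubectl describe pod <pod>",      # last state and memory limits
    ],
    "Running but no traffic": [
        "kubectl get endpoints <service>", # is the Pod listed behind the Service?
        "kubectl describe pod <pod>",      # readiness probe results
    ],
    "Rollout stuck": [
        "kubectl rollout status deployment/<deployment>",
        "kubectl describe pod <pod>",
    ],
    "Pending": [
        "kubectl describe pod <pod>",      # scheduling events
    ],
    "NetworkPolicy blocking traffic": [
        "kubectl get networkpolicy",
        "kubectl describe networkpolicy <policy>",
    ],
}

# Fallback when the symptom is not in the table: start from recent events.
DEFAULT_CHECK = ["kubectl get events --sort-by=.lastTimestamp"]


def first_checks(symptom: str) -> list[str]:
    """Return the commands to start with for a given symptom label."""
    return FIRST_CHECKS.get(symptom, DEFAULT_CHECK)


if __name__ == "__main__":
    for cmd in first_checks("Running but no traffic"):
        print(cmd)
```

For each entry, be ready to say what the output told you and what you checked next; that follow-through is what the "why" questions probe.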

Preparing for this question

Have at least one real story ready, complete with the actual commands you ran and what they showed. Even a modest incident from a smaller project counts, as long as it demonstrates a methodical diagnostic process.