How do you approach debugging a production issue in a Java application you didn't write?
Quick Answer
Start by reproducing or narrowing down the failure from logs, metrics, and error messages before touching code; check for recent deploys/config changes as the most likely cause; use a debugger, thread dump (jstack), or heap dump (jmap) if the symptom is a hang, deadlock, or memory issue; form a hypothesis, verify it with the smallest possible test, and fix the root cause rather than the symptom, adding a regression test once resolved.
Detailed Answer
A structured approach, regardless of who wrote the original code:
-
Gather evidence before touching code. Logs, error messages, stack traces, metrics/dashboards, and recent deploy/config-change history usually narrow the problem space enormously before you write a single line — a huge fraction of production incidents trace back to a recent change.
-
Reproduce, or narrow the conditions, if at all possible. A reliably reproducible bug is far easier to fix than one you can only observe in production; even a partial repro (same input shape, same load pattern) helps.
-
Match the tool to the symptom:
- A hang or high CPU: take a thread dump (
jstack <pid>, orkill -3for a plain stack trace to stdout) — look for threads stuck in the same call, or an explicit deadlock report. - Memory growth / OutOfMemoryError: a heap dump (
jmap -dump) analyzed in a tool like Eclipse MAT, looking at what's retaining the most memory and why it's still reachable. - A specific logic bug: attach a remote debugger if the environment allows it, or add targeted logging around the suspected code path and redeploy to a staging/canary environment.
- A hang or high CPU: take a thread dump (
-
Form a hypothesis, then verify it minimally — resist the urge to change several things at once; change one variable, observe, and confirm before moving to a fix.
-
Fix the root cause, not just the symptom, and add a regression test (or at least a monitoring alert) so the same failure mode is caught automatically next time.
-
Communicate clearly — especially for an unfamiliar codebase, documenting what you learned about the system along the way (in comments, a wiki, or a postmortem) pays off for the next person who touches it.