How do you approach debugging a production issue in a Java application you didn't write?

7 minintermediatedebuggingproductionbehavioral

Quick Answer

Start by reproducing or narrowing down the failure from logs, metrics, and error messages before touching code; check for recent deploys/config changes as the most likely cause; use a debugger, thread dump (jstack), or heap dump (jmap) if the symptom is a hang, deadlock, or memory issue; form a hypothesis, verify it with the smallest possible test, and fix the root cause rather than the symptom, adding a regression test once resolved.

Detailed Answer

A structured approach, regardless of who wrote the original code:

  1. Gather evidence before touching code. Logs, error messages, stack traces, metrics/dashboards, and recent deploy/config-change history usually narrow the problem space enormously before you write a single line — a huge fraction of production incidents trace back to a recent change.

  2. Reproduce, or narrow the conditions, if at all possible. A reliably reproducible bug is far easier to fix than one you can only observe in production; even a partial repro (same input shape, same load pattern) helps.

  3. Match the tool to the symptom:

    • A hang or high CPU: take a thread dump (jstack <pid>, or kill -3 for a plain stack trace to stdout) — look for threads stuck in the same call, or an explicit deadlock report.
    • Memory growth / OutOfMemoryError: a heap dump (jmap -dump) analyzed in a tool like Eclipse MAT, looking at what's retaining the most memory and why it's still reachable.
    • A specific logic bug: attach a remote debugger if the environment allows it, or add targeted logging around the suspected code path and redeploy to a staging/canary environment.
  4. Form a hypothesis, then verify it minimally — resist the urge to change several things at once; change one variable, observe, and confirm before moving to a fix.

  5. Fix the root cause, not just the symptom, and add a regression test (or at least a monitoring alert) so the same failure mode is caught automatically next time.

  6. Communicate clearly — especially for an unfamiliar codebase, documenting what you learned about the system along the way (in comments, a wiki, or a postmortem) pays off for the next person who touches it.