How would you diagnose and fix a slow query in production?
Quick Answer
Get the actual execution plan (`EXPLAIN ANALYZE`) rather than guessing, compare estimated vs. actual row counts to spot stale statistics, check for missing/unused indexes and non-sargable predicates, and look for join-order or row-explosion issues. Fix the highest-leverage cause first — usually a missing index or a rewritten predicate — verify with the plan again, and only then consider heavier changes like denormalization, caching, or schema changes.
Detailed Answer
A structured approach, roughly in order of "cheapest to check, highest signal first":
1. Confirm it's actually the query, not something else
Rule out lock contention (is the query waiting on another transaction's lock, not actually doing slow work?), connection pool exhaustion, or network latency before assuming the query plan itself is the problem. Check pg_stat_activity (PostgreSQL) or equivalent to see if the query is active and burning CPU/IO, or idle in transaction/blocked waiting on a lock.
2. Get the real execution plan
EXPLAIN ANALYZE
SELECT ...; -- the actual slow query, with representative parameter values
Never guess from just reading the SQL — the optimizer's actual chosen plan (scan types, join order, join algorithm) is often surprising.
3. Compare estimated vs. actual row counts
A large estimate/actual gap at any node is the single most common root cause of a bad plan — it means the optimizer is working from wrong information. This usually points to stale statistics (ANALYZE/UPDATE STATISTICS) or a predicate the optimizer inherently can't estimate well (e.g., a correlated condition across two columns it assumes are independent).
4. Look for the classic culprits, in the plan
- Full table scan on a large table where you expected an index seek — check for a missing index, or a non-sargable predicate (function wrapping the column — see that question) defeating an existing one.
- Nested loop with a high iteration count on an inner side that isn't indexed — O(n×m) behavior hiding inside what looks like a normal join.
- Unnecessary sort — often removable if an index already provides the required order for
ORDER BY. - Row-count explosion mid-plan — a sign of an unintended one-to-many join fan-out (see the join explosion question) before aggregation.
5. Apply the smallest fix that addresses the actual bottleneck
- Missing index → add the specific index the plan is missing, verify with
EXPLAIN ANALYZEagain that it's actually chosen and helps. - Non-sargable predicate → rewrite the predicate, or add a matching expression index.
- Stale statistics → refresh them; sometimes this alone fixes the plan.
- Genuine algorithmic/schema issue (deep pagination via large
OFFSET, unnecessary joins, a report needing full aggregation from raw rows) → consider keyset pagination, denormalization, a materialized view, or caching — but only after confirming a simpler index/predicate fix isn't sufficient.
6. Re-measure, don't assume
Always re-run EXPLAIN ANALYZE (and ideally a real load test) after the fix to confirm the plan actually changed and performance actually improved — it's common to "fix" a query in a way that helps the specific parameter values tested but doesn't generalize, or to add an index that the optimizer still doesn't choose to use.
What interviewers are listening for
A methodical, plan-driven process (measure → hypothesize → verify) rather than jumping straight to "add an index" or "add a cache" without first confirming what's actually happening in the plan. Mentioning statistics staleness and sargability specifically signals real production experience, not just textbook knowledge.