What's the difference between database statistics and indexes in query optimization?

5 minadvancedindexingstatisticsquery-optimizer

Quick Answer

An **index** is a physical data structure that speeds up finding specific rows. **Statistics** are metadata the optimizer uses to *decide* whether and how to use an index in the first place — value distributions, distinct counts, histograms, and table/index sizes. An index with stale or missing statistics can be effectively invisible to a good query plan, even though the index itself is perfectly healthy, because the optimizer is reasoning from an outdated or wrong picture of the data.

Detailed Answer

The relationship

An index is the structure the optimizer could use; statistics are the evidence the optimizer uses to decide whether using it is actually a good idea, and to estimate how many rows each step of a plan will produce. The optimizer is a cost-based planner: it needs quantitative estimates (row counts, selectivity) to compare candidate plans, and those estimates come entirely from statistics, not from the index structure itself.

What statistics typically capture

  • Total row count and table size.
  • Number of distinct values per column (used to estimate selectivity — see that question).
  • A histogram of value frequency buckets, so the optimizer can estimate WHERE price BETWEEN 10 AND 50 more accurately than assuming a uniform distribution across the whole range.
  • Correlation between a column's order and physical row order (helps estimate range-scan I/O cost).

How they get out of date

-- Bulk load that drastically shifts the data distribution
INSERT INTO orders SELECT * FROM legacy_orders;  -- adds 10 million rows

-- If statistics aren't refreshed, the optimizer still thinks the table
-- is small and the old value distribution still holds -- and may pick
-- a plan (e.g., a nested loop join) that was fine for the old size but
-- is disastrous at the new size.

Most engines auto-update statistics based on a percentage-of-rows-changed threshold (PostgreSQL's autovacuum/autoanalyze, SQL Server's auto-update statistics), but large, fast bulk operations can outrun that threshold's responsiveness, or auto-stats can be disabled in performance-sensitive environments and only run on a schedule.

Manually refreshing statistics

-- PostgreSQL
ANALYZE orders;

-- SQL Server
UPDATE STATISTICS orders;

-- MySQL
ANALYZE TABLE orders;

This is a standard, safe, low-risk first troubleshooting step when a previously-fine query suddenly gets slow after a large data change — often cheaper and faster to try than rewriting the query or adding new indexes, and frequently the actual root cause.

A candidate who says "just add an index" without mentioning statistics is missing half the picture — a perfectly good index can be ignored by the optimizer if the statistics backing its cost estimate are stale, and conversely, refreshing statistics can sometimes fix a bad plan with zero schema changes at all.

Related Resources