Indexing and Query Performance

Difficulty

Without an index, finding a row requires a full table scan — reading every page of the table to check each row against the condition, which is O(n).

What an index actually is

An index is a redundant, auxiliary copy of one or more columns' worth of data, stored in a structure optimized for searching — most commonly a B-tree (or its variant, the B+tree, which most real databases actually use). Maintaining it costs extra storage and slows down writes slightly (every INSERT/UPDATE/DELETE must also update every affected index), in exchange for much faster reads on the indexed column(s).

B-tree structure

                    [ 50 ]
                 /          \
          [ 20, 35 ]      [ 70, 90 ]
          /   |   \        /   |   \
      leaf  leaf  leaf  leaf  leaf  leaf   <- sorted, linked leaf pages
       (actual key values + row pointers)
  • Internal (non-leaf) nodes hold routing keys used purely to decide which child to descend into — they don't necessarily correspond to real rows.
  • Leaf nodes hold the actual indexed key values in sorted order, each paired with a pointer to the row's physical location (a "row ID"/"ctid"/"RID," depending on engine) — or, for a clustered index, the leaf is the row itself.
  • Leaf pages are typically linked together in a doubly-linked list, so once you've found your starting point, a range scan (BETWEEN, >, <, ORDER BY) can walk forward/backward through sorted leaves without re-descending the tree each time.

Why B-trees specifically (not a plain binary tree)

A B-tree is wide and shallow rather than narrow and deep — each node holds many keys (often hundreds, sized to match a disk page), so even a huge table needs only 3-4 levels of tree traversal to reach any leaf. This matters because each level touched is potentially a disk I/O (or at least a cache-line/page fetch); minimizing tree height directly minimizes I/O for a lookup. A plain binary tree, by contrast, would need log₂(n) levels — vastly more for large n — because each node holds only one key.

What a B-tree index is good at, and not good at

  • Excellent for: equality lookups (WHERE id = 5), range queries (WHERE price BETWEEN 10 AND 50), sorted retrieval (ORDER BY indexed_col), and prefix matching (WHERE name LIKE 'Smith%').
  • Not useful for: conditions that don't preserve the sorted-order relationship, like WHERE name LIKE '%Smith' (leading wildcard) or applying a function to the indexed column (WHERE UPPER(email) = ...) unless a matching functional/expression index exists — see the sargability question.

Alternatives worth knowing exist

Hash indexes (equality-only, no range support, no sort order — see the hash-vs-B-tree question), GiST/GIN indexes (full-text search, geometric data, JSONB containment in PostgreSQL), and bitmap indexes (low-cardinality columns in analytical/OLAP systems) all trade the B-tree's general-purpose versatility for better performance on a narrower class of queries.

Clustered index — defines physical row order

The table's rows are physically stored sorted by the clustered index's key. There can be only one per table, because rows can only be physically ordered one way.

-- SQL Server: explicit clustered index, often on the primary key by default
CREATE CLUSTERED INDEX ix_orders_id ON orders(id);

When you look up by the clustered key, the engine finds the leaf page and that page is the full row — no second lookup needed.

MySQL/InnoDB specifics: every InnoDB table always has a clustered index — if you declare a PRIMARY KEY, that becomes the clustered index; if you don't, InnoDB creates a hidden 6-byte row ID and clusters on that internally. This is unlike SQL Server, where a table can be a "heap" with no clustered index at all.

PostgreSQL specifics: PostgreSQL doesn't maintain clustering automatically — CLUSTER table USING index physically reorders the table once, but subsequent inserts don't preserve that order unless you re-run CLUSTER. So PostgreSQL tables are effectively heap-organized by default, and "clustered index" in the SQL Server/MySQL sense doesn't map directly.

Non-clustered (secondary) index — a separate lookup structure

CREATE INDEX ix_orders_customer_id ON orders(customer_id);

This index's leaf nodes store customer_id values in sorted order, each paired with a pointer to the actual row — in SQL Server, a row locator (physical page/slot); in InnoDB, the value of the clustered key (id), since InnoDB's secondary indexes always store the primary key rather than a physical address (so that the clustered index can be reorganized without invalidating every secondary index).

The "bookmark lookup" / "key lookup" cost

Querying by a non-clustered index column, but selecting other columns not in that index, requires two steps: (1) traverse the secondary index to find the pointer/key, then (2) go to the clustered index (or heap) to fetch the full row. This second step — a key lookup or bookmark lookup — is where a covering index (see next question) helps by avoiding it entirely.

-- Needs a key lookup: ix_orders_customer_id doesn't contain 'total'
SELECT total FROM orders WHERE customer_id = 42;

Choose the clustered index key (often the primary key) based on the most common range-scan/sequential access pattern for that table — e.g., an auto-incrementing id or a time-ordered column, since sequential inserts into a clustered index avoid the page-splitting overhead that inserting into the middle of a random-order clustered key causes. Add non-clustered indexes for other frequently-filtered/joined columns.

The problem a covering index solves

CREATE INDEX ix_orders_customer_id ON orders(customer_id);

SELECT id, customer_id, total FROM orders WHERE customer_id = 42;

The index on customer_id quickly finds matching rows, but total isn't in that index — so for every matching row, the engine must do an extra key/bookmark lookup back to the full table (or clustered index) just to fetch total. On a query returning many rows, that's many extra random I/Os.

Making the index cover the query

-- Option A: add 'total' as a regular index column
CREATE INDEX ix_orders_customer_covering ON orders(customer_id, total);

-- Option B (SQL Server): INCLUDE non-key columns at the leaf level only
CREATE INDEX ix_orders_customer_covering ON orders(customer_id) INCLUDE (total);

-- Option B (PostgreSQL): the equivalent is INCLUDE in CREATE INDEX (v11+)
CREATE INDEX ix_orders_customer_covering ON orders(customer_id) INCLUDE (total);

Now the same query can be answered entirely from the index's leaf pages — an index-only scan — with no trip to the base table at all, because every column the query needs (customer_id to filter, total to return; id is implicitly available via the clustering key) is present in the index.

Key columns vs INCLUDE(d)/STORING columns

Putting total directly in the index key (Option A) makes it part of the sort order too, which enlarges the key, affects ORDER BY/range-scan usefulness, and duplicates it into every level of the B-tree. An INCLUDE/STORING clause (Option B) stores the extra column only at the leaf level, not in internal nodes, and doesn't affect sort order — generally the more efficient choice when the extra column is only needed for the SELECT list, not for filtering or sorting.

Caveats

  • Covering indexes trade write cost and storage for read speed — every additional column stored means more data duplicated and updated on every write. Don't cover every possible query; reserve this for genuinely hot, high-value queries.
  • SELECT * defeats covering indexes almost by definition — the engine can't predict which columns you'll need in the future, and a covering index for "all columns" is just... the whole table. Covering indexes work best paired with narrow, deliberate SELECT lists.
  • PostgreSQL's index-only scans additionally require the visibility map to confirm a page's rows are all visible to the current transaction — under heavy write/vacuum churn, PostgreSQL can silently fall back to a regular index scan (with key lookups) even when the index technically covers the query.
CREATE INDEX ix_orders_customer_status_date
ON orders(customer_id, status, order_date);

The phone book analogy

Think of this index like a phone book sorted by (last name, first name). You can efficiently find "everyone named Smith," or "everyone named Smith, John" — but you cannot efficiently find "everyone whose first name is John" without scanning the whole book, because first names aren't sorted independent of last name.

Which queries this index helps

-- Efficient: uses the full 3-column prefix
WHERE customer_id = 42 AND status = 'shipped' AND order_date > '2024-01-01'

-- Efficient: uses the leading 2-column prefix (order_date unconstrained is fine)
WHERE customer_id = 42 AND status = 'shipped'

-- Efficient: uses just the leading column
WHERE customer_id = 42

-- NOT efficiently helped: doesn't start with customer_id
WHERE status = 'shipped'
WHERE order_date > '2024-01-01'

-- Partially helped: customer_id can use the index, but status is skipped
-- (this is a "range then unordered" scenario -- order_date range breaks the
-- ability to also use status as a further seek predicate in most engines)
WHERE customer_id = 42 AND order_date > '2024-01-01'

Equality columns before range columns

A useful rule of thumb when deciding column order: put columns used with equality (=) before columns used with a range (>, <, BETWEEN, LIKE 'prefix%'). Once the index encounters a range condition, it can still narrow down using that range, but everything after it in the index can no longer be used to further narrow the search within that range efficiently — the equality columns should exhaust their filtering power first.

-- Good: status (equality) before order_date (range)
CREATE INDEX ix_orders_status_date ON orders(status, order_date);
WHERE status = 'shipped' AND order_date > '2024-01-01'   -- both columns pull weight

-- Worse: order_date (range) before status (equality)
CREATE INDEX ix_orders_date_status ON orders(order_date, status);
WHERE status = 'shipped' AND order_date > '2024-01-01'   -- status can't narrow further after the range scan begins

Design composite indexes around your actual query patterns, leading with the column(s) most consistently filtered by equality across your hottest queries. It's common (and fine) to need multiple composite indexes with different column orderings if your application has several distinct hot query shapes against the same table — but each additional index has a write-cost tradeoff, so don't create one per query without checking for meaningful overlap first.

Definitions

  • Cardinality: the count of distinct values in a column. gender (2-3 values) has low cardinality; email (nearly all unique) has high cardinality.
  • Selectivity: distinct_values / total_rows. A selectivity close to 1 means most values are unique (highly selective — a typical WHERE col = x matches very few rows); a selectivity close to 0 means values repeat heavily (poorly selective — a typical WHERE col = x matches a large fraction of the table).

Why the optimizer cares

An index lookup for a value that matches, say, 1% of rows is a huge win over a full scan — jump straight to the relevant rows. But an index lookup for a value that matches 50% of rows can be worse than a full scan: each matched row potentially requires a separate random-access page fetch (via a key lookup, unless it's a covering or clustered index), whereas a sequential full scan reads pages in order, which is much friendlier to disk/OS-level read-ahead and caching.

-- is_active: 95% of rows are TRUE, 5% are FALSE -- very low selectivity for TRUE
CREATE INDEX ix_users_active ON users(is_active);

SELECT * FROM users WHERE is_active = true;
-- Optimizer likely IGNORES the index and does a full table scan --
-- fetching 95% of the table via random-access index lookups would be slower.

SELECT * FROM users WHERE is_active = false;
-- Optimizer likely USES the index here -- only 5% of rows match.

This is why the same index, on the same column, can be used for one query and ignored for another — the optimizer's decision depends on the specific value's estimated selectivity, not just whether an index technically exists.

How the optimizer knows selectivity

Databases maintain statistics — histograms and distinct-value counts per column, refreshed periodically (ANALYZE in PostgreSQL, auto-updated stats in SQL Server/MySQL, or manually triggered). Stale statistics (e.g., after a bulk load that drastically changes the data distribution) are a common real-world cause of the optimizer picking a bad plan — it's reasoning from an outdated picture of the data. Running ANALYZE (PostgreSQL/MySQL) or UPDATE STATISTICS (SQL Server) after major data changes is a standard troubleshooting step when a previously-fast query suddenly gets a bad plan.

Low-cardinality columns (booleans, small enums) are usually poor standalone index candidates — an index rarely helps if a typical query still matches a large fraction of the table. They can still be useful as a secondary column in a composite index (e.g., (customer_id, is_active)) where the leading column already narrows things down enough that the low-cardinality column just adds a bit more precision within an already-small result set.