What is index selectivity/cardinality, and why does the optimizer care?

Detailed Answer

Definitions

Cardinality: the count of distinct values in a column. gender (2-3 values) has low cardinality; email (nearly all unique) has high cardinality.
Selectivity: distinct_values / total_rows. A selectivity close to 1 means most values are unique (highly selective — a typical WHERE col = x matches very few rows); a selectivity close to 0 means values repeat heavily (poorly selective — a typical WHERE col = x matches a large fraction of the table).

Why the optimizer cares

An index lookup for a value that matches, say, 1% of rows is a huge win over a full scan — jump straight to the relevant rows. But an index lookup for a value that matches 50% of rows can be worse than a full scan: each matched row potentially requires a separate random-access page fetch (via a key lookup, unless it's a covering or clustered index), whereas a sequential full scan reads pages in order, which is much friendlier to disk/OS-level read-ahead and caching.

-- is_active: 95% of rows are TRUE, 5% are FALSE -- very low selectivity for TRUE
CREATE INDEX ix_users_active ON users(is_active);

SELECT * FROM users WHERE is_active = true;
-- Optimizer likely IGNORES the index and does a full table scan --
-- fetching 95% of the table via random-access index lookups would be slower.

SELECT * FROM users WHERE is_active = false;
-- Optimizer likely USES the index here -- only 5% of rows match.

This is why the same index, on the same column, can be used for one query and ignored for another — the optimizer's decision depends on the specific value's estimated selectivity, not just whether an index technically exists.

How the optimizer knows selectivity

Databases maintain statistics — histograms and distinct-value counts per column, refreshed periodically (ANALYZE in PostgreSQL, auto-updated stats in SQL Server/MySQL, or manually triggered). Stale statistics (e.g., after a bulk load that drastically changes the data distribution) are a common real-world cause of the optimizer picking a bad plan — it's reasoning from an outdated picture of the data. Running ANALYZE (PostgreSQL/MySQL) or UPDATE STATISTICS (SQL Server) after major data changes is a standard troubleshooting step when a previously-fast query suddenly gets a bad plan.

Low-cardinality columns (booleans, small enums) are usually poor standalone index candidates — an index rarely helps if a typical query still matches a large fraction of the table. They can still be useful as a secondary column in a composite index (e.g., (customer_id, is_active)) where the leading column already narrows things down enough that the low-cardinality column just adds a bit more precision within an already-small result set.

What is index selectivity/cardinality, and why does the optimizer care?

Quick Answer

Detailed Answer

Definitions

Why the optimizer cares

How the optimizer knows selectivity

Related Resources

Related Questions