What's the difference between DISTINCT and GROUP BY?

3 min · beginner · sql-basics · distinct · group-by

Quick Answer

`DISTINCT` removes duplicate rows from the final result set based on all selected columns. `GROUP BY` buckets rows into groups based on specified columns, primarily so aggregate functions (`COUNT`, `SUM`, `AVG`, etc.) can be computed per group — deduplication of the grouping columns is a side effect, not its primary purpose. If you're not calling an aggregate function, `SELECT DISTINCT col FROM t` and `SELECT col FROM t GROUP BY col` return the same rows, and the optimizer often executes them identically.

Detailed Answer

-- DISTINCT: unique combinations of the selected columns
SELECT DISTINCT department FROM employees;

-- GROUP BY: same result here, but built for aggregation
SELECT department FROM employees GROUP BY department;

-- GROUP BY's real purpose: per-group aggregates
SELECT department, COUNT(*), AVG(salary)
FROM employees
GROUP BY department;
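To check the equivalence of the first two queries on real rows, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The sample rows and the `name` and `salary` columns are assumptions made up for illustration; only the `employees` table and `department` column come from the queries above. Two rows with a NULL department are included because both forms collapse NULLs into a single row.

```python
import sqlite3

# Hypothetical sample data; only the table and column names used in the
# queries above are taken from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Engineering", 120000),
        ("Ben", "Engineering", 100000),
        ("Cy", "Sales", 70000),
        ("Di", None, 50000),  # NULL department
        ("Ed", None, 60000),  # second NULL department
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT department FROM employees"
).fetchall()
grouped_rows = conn.execute(
    "SELECT department FROM employees GROUP BY department"
).fetchall()

# Row order is not guaranteed without ORDER BY, so compare as sets.
same_rows = set(distinct_rows) == set(grouped_rows)
print(same_rows)
```

Comparing as sets is deliberate: neither query promises an ordering, so a list comparison could fail even when the rows match.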

Where they diverge

DISTINCT applies to the whole select list at once. It cannot deduplicate on only one of several selected columns, and it cannot compute a value that summarizes a group of rows:

-- Returns one row per DISTINCT (department, job_title) combination
SELECT DISTINCT department, job_title FROM employees;

GROUP BY lets you select the grouping column(s) plus arbitrary aggregate expressions over each group — something DISTINCT simply cannot do:

-- One row per (department, job_title), with aggregates computed over each group
SELECT department, job_title, COUNT(*) AS headcount, MAX(salary) AS top_salary
FROM employees
GROUP BY department, job_title;
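The contrast between the two queries in this section can be seen side by side in a small sketch, again using Python's `sqlite3` with an in-memory table. The rows are invented for illustration, and `ORDER BY` is added only so the output is deterministic.

```python
import sqlite3

# Hypothetical rows: two developers share a (department, job_title) pair.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (department TEXT, job_title TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Engineering", "Developer", 100000),
        ("Engineering", "Developer", 120000),
        ("Engineering", "Manager", 150000),
        ("Sales", "Rep", 70000),
    ],
)

# DISTINCT: only the unique (department, job_title) pairs.
combos = conn.execute(
    "SELECT DISTINCT department, job_title FROM employees "
    "ORDER BY department, job_title"
).fetchall()

# GROUP BY: the same pairs, plus per-group aggregates.
summary = conn.execute(
    "SELECT department, job_title, COUNT(*) AS headcount, "
    "MAX(salary) AS top_salary "
    "FROM employees GROUP BY department, job_title "
    "ORDER BY department, job_title"
).fetchall()

for row in summary:
    print(row)
```

Both queries yield the same three key pairs; only the GROUP BY version can also report that the Engineering/Developer group has two members and what its top salary is.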

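Selecting a column that is neither aggregated nor listed in GROUP BY (and is not functionally dependent on the grouping columns) is ambiguous, because a group may contain several different values for it. Most engines reject such a query; MySQL does so when ONLY_FULL_GROUP_BY is enabled, which is the default since MySQL 5.7.5. In older or lenient MySQL modes the query runs and silently returns an arbitrary value from the group, a frequent source of subtly wrong reports.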
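SQLite is another engine that accepts such a query rather than rejecting it, which makes it convenient for reproducing the problem. The sketch below, run through Python's `sqlite3` on invented sample rows, shows the ambiguous form next to two common fixes: aggregating the extra column, or adding it to GROUP BY. The `name` column is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Engineering", 120000),
        ("Ben", "Engineering", 100000),
        ("Cy", "Sales", 70000),
    ],
)

# Ambiguous: `name` is neither grouped nor aggregated. SQLite runs this and
# takes `name` from some row of each group; which row is not specified.
ambiguous = conn.execute(
    "SELECT department, name, COUNT(*) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

# Fix 1: say which value you want by aggregating the column.
fixed_aggregate = conn.execute(
    "SELECT department, MIN(name), COUNT(*) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()

# Fix 2: make the column part of the grouping key.
fixed_grouped = conn.execute(
    "SELECT department, name, COUNT(*) FROM employees "
    "GROUP BY department, name ORDER BY department, name"
).fetchall()

print(ambiguous)
print(fixed_aggregate)
print(fixed_grouped)
```

The two fixes answer different questions: the first keeps one row per department, while the second changes the grain of the result to one row per (department, name).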

Performance

When there's no aggregate function involved, SELECT DISTINCT col FROM t and SELECT col FROM t GROUP BY col typically produce identical execution plans — both need some form of sort or hash-based deduplication, and query optimizers usually recognize the equivalence. Don't assume one is inherently faster than the other without checking EXPLAIN on your specific engine and data.
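As one way to run that check, the sketch below prints SQLite's plan for both forms using `EXPLAIN QUERY PLAN` through Python's `sqlite3`. The helper `plan_details` and the index `idx_employees_department` are hypothetical names introduced here. Plan text differs between SQLite versions and between engines (other systems have their own `EXPLAIN` syntax), so the output is meant to be read, not relied on.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the 'detail' text of each step in SQLite's query plan for `sql`."""
    # Each EXPLAIN QUERY PLAN row has four columns; the last one is the
    # human-readable description of the step.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
# Hypothetical index, added so the planner has an ordered access path.
conn.execute("CREATE INDEX idx_employees_department ON employees (department)")

distinct_plan = plan_details(conn, "SELECT DISTINCT department FROM employees")
group_by_plan = plan_details(
    conn, "SELECT department FROM employees GROUP BY department"
)

print("DISTINCT:", distinct_plan)
print("GROUP BY:", group_by_plan)
```

If the two printed plans match on your engine and data, the choice between the forms is about readability; if they differ, measure before drawing conclusions.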