What is SQL, and what do DDL, DML, DCL, and TCL stand for?

SQL (Structured Query Language) is the standard declarative language for defining and manipulating relational data. Its statements fall into four categories: **DDL** (Data Definition Language — `CREATE`, `ALTER`, `DROP`) changes schema structure; **DML** (Data Manipulation Language — `SELECT`, `INSERT`, `UPDATE`, `DELETE`) reads and writes rows; **DCL** (Data Control Language — `GRANT`, `REVOKE`) manages permissions; **TCL** (Transaction Control Language — `COMMIT`, `ROLLBACK`, `SAVEPOINT`) manages transaction boundaries.

What is the logical order of execution of a SQL SELECT statement's clauses?

SQL is declarative, but the engine conceptually processes clauses as: `FROM`/`JOIN` → `WHERE` → `GROUP BY` → `HAVING` → `SELECT` (including window functions) → `DISTINCT` → `ORDER BY` → `LIMIT`/`OFFSET`. This is why you can't reference a `SELECT` alias in `WHERE`, but you can in `ORDER BY` — the alias doesn't exist yet when `WHERE` runs.

What's the difference between WHERE and HAVING?

`WHERE` filters individual rows *before* grouping/aggregation happens, and cannot reference aggregate functions. `HAVING` filters *groups* after `GROUP BY` has run, and is specifically for conditions on aggregates like `COUNT(*)` or `SUM(x)`. If a query has no `GROUP BY`, `HAVING` treats the whole result set as one group.

How does NULL and three-valued logic work in SQL?

SQL uses three-valued logic: `TRUE`, `FALSE`, and `UNKNOWN`. Any comparison involving `NULL` (e.g., `x = NULL`, `x <> NULL`) evaluates to `UNKNOWN`, not `TRUE` or `FALSE` — which is why you must use `IS NULL`/`IS NOT NULL` instead of `= NULL`. Rows are only included by `WHERE`/`HAVING` when the condition is `TRUE`; both `FALSE` and `UNKNOWN` exclude the row, which trips people up in `NOT IN` queries containing NULLs.

What's the difference between DELETE, TRUNCATE, and DROP?

`DELETE` removes rows one at a time (optionally filtered by `WHERE`), is fully transactional/logged, and fires triggers. `TRUNCATE` deallocates all rows at once by resetting the table's storage, is much faster, typically resets identity/auto-increment counters, but can't be filtered and often can't be rolled back or may implicitly commit depending on the engine. `DROP` removes the entire table object — data, schema, indexes, constraints — permanently.

What's the difference between UNION and UNION ALL?

Both combine the result sets of two or more `SELECT` statements with the same number/type of columns. `UNION` removes duplicate rows across the combined set, which requires an implicit sort or hash-based dedup step. `UNION ALL` keeps every row, including duplicates, and is significantly cheaper since it skips deduplication. Prefer `UNION ALL` whenever you know the inputs are already disjoint or duplicates are acceptable.

What do INTERSECT and EXCEPT (or MINUS) do?

`INTERSECT` returns only rows that appear in *both* result sets (deduplicated). `EXCEPT` (called `MINUS` in Oracle) returns rows from the first query that do *not* appear in the second. Both require the same column count/types as `UNION`, and both deduplicate by default. They're less commonly supported and used than joins/`EXISTS`, but are the most direct way to express set comparisons.

What's the difference between CHAR, VARCHAR, and TEXT data types?

`CHAR(n)` is fixed-length — the engine pads shorter values with trailing spaces up to `n`, always uses `n` characters of storage. `VARCHAR(n)` is variable-length up to a max of `n`, storing only the actual content plus a small length prefix. `TEXT` (or `VARCHAR(MAX)`/unbounded `VARCHAR` depending on engine) stores arbitrarily long variable-length data, sometimes with different storage/indexing behavior than bounded `VARCHAR`.

How do you write a CASE expression, and where can it be used?

`CASE` is SQL's inline conditional expression — it evaluates a sequence of `WHEN condition THEN result` branches (optionally a final `ELSE`) and returns a single value. It can appear anywhere a value expression is allowed: `SELECT`, `WHERE`, `ORDER BY`, `GROUP BY`, and inside aggregate functions for conditional aggregation.

What's the difference between a subquery, a derived table, and a CTE?

A **subquery** is any `SELECT` nested inside another statement (in `WHERE`, `SELECT`, or `FROM`). A **derived table** is specifically a subquery used in the `FROM` clause, given an alias, and treated as a temporary named result set. A **CTE** (`WITH name AS (...)`) is a named, top-level query that can be referenced one or more times in the main query, and can optionally be recursive — mostly a readability/reusability improvement over a derived table, though optimizer behavior around materialization varies by engine.

What is a correlated subquery, and how does it differ from a non-correlated one?

A **non-correlated subquery** runs once, independently, and its result is reused for every row of the outer query. A **correlated subquery** references a column from the outer query, so it conceptually re-executes once per outer row (though optimizers often rewrite it into a join to avoid literally running it row by row). Correlated subqueries are essential for per-row comparisons ("find the latest order for each customer") but can be a performance trap if the optimizer can't rewrite them.

What's the difference between DISTINCT and GROUP BY?

`DISTINCT` removes duplicate rows from the final result set based on all selected columns. `GROUP BY` buckets rows into groups based on specified columns, primarily so aggregate functions (`COUNT`, `SUM`, `AVG`, etc.) can be computed per group — deduplication of the grouping columns is a side effect, not its primary purpose. If you're not calling an aggregate function, `SELECT DISTINCT col FROM t` and `SELECT col FROM t GROUP BY col` return the same rows, and the optimizer often executes them identically.

SQL Fundamentals and Query Basics

Core SQL syntax, statement categories, NULL handling, and the building blocks every query is made of.

SQL is split into sub-languages by what kind of change a statement makes, and that split matters in practice because it determines things like transaction behavior, required privileges, and whether an operation can be rolled back.

The four categories

Category	Full name	Example statements	What it affects
DDL	Data Definition Language	`CREATE TABLE`, `ALTER TABLE`, `DROP TABLE`, `TRUNCATE`	Schema/structure (tables, indexes, constraints)
DML	Data Manipulation Language	`SELECT`, `INSERT`, `UPDATE`, `DELETE`	Row-level data
DCL	Data Control Language	`GRANT`, `REVOKE`	Permissions and access control
TCL	Transaction Control Language	`COMMIT`, `ROLLBACK`, `SAVEPOINT`, `SET TRANSACTION`	Transaction boundaries

-- DDL: defines structure
CREATE TABLE accounts (
    id SERIAL PRIMARY KEY,
    balance NUMERIC(12,2) NOT NULL DEFAULT 0
);

-- DML: manipulates rows
INSERT INTO accounts (balance) VALUES (100.00);
UPDATE accounts SET balance = balance - 50 WHERE id = 1;

-- DCL: controls access
GRANT SELECT, INSERT ON accounts TO app_user;

-- TCL: controls the transaction
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
COMMIT;

Why the distinction matters

Some widely used databases auto-commit DDL, or implicitly commit any open transaction before running it. Running ALTER TABLE mid-transaction in MySQL, for example, causes an implicit commit (Oracle behaves similarly), so you can't roll a schema change back the way you can an UPDATE. PostgreSQL is a notable exception: it supports transactional DDL, so a CREATE TABLE inside a BEGIN...ROLLBACK block really does disappear. SQL Server and SQLite also allow most DDL inside a transaction.
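
A minimal sketch of the rollback behavior described above, assuming PostgreSQL. The table name `ddl_demo` is hypothetical. On MySQL the CREATE TABLE would commit implicitly and the ROLLBACK would not undo it.

```sql
-- PostgreSQL: DDL participates in the surrounding transaction.
BEGIN;

CREATE TABLE ddl_demo (id INTEGER PRIMARY KEY);   -- hypothetical table
INSERT INTO ddl_demo (id) VALUES (1);

ROLLBACK;

-- Both the row and the table itself were rolled back, so this
-- statement now fails because ddl_demo does not exist.
SELECT * FROM ddl_demo;
```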

DCL statements are also typically not transactional in the same sense — permission changes often take effect immediately and aren't undone by ROLLBACK in many engines. Knowing which bucket a statement falls into tells you whether you can safely wrap it in a transaction for an atomic migration, or whether you need a different rollback strategy (e.g., a paired "down" migration script).

Related Resources

SQL Command Categories (PostgreSQL docs)

You write a SELECT statement top to bottom, but the database evaluates it in a different logical order. Understanding that order explains several rules that otherwise look arbitrary.

The logical order

1. FROM / JOIN     -- build the base row set
2. WHERE           -- filter individual rows
3. GROUP BY        -- bucket remaining rows into groups
4. HAVING          -- filter groups
5. SELECT          -- compute output expressions (incl. window functions)
6. DISTINCT        -- remove duplicate output rows
7. ORDER BY        -- sort the result
8. LIMIT / OFFSET  -- take a slice

Note this is the logical order — real optimizers reorder physical execution (e.g., pushing a WHERE predicate down before a join, or picking an index that satisfies ORDER BY for free) as long as the observable result is identical.

Why this explains common gotchas

You can't use a SELECT alias in WHERE:

-- Fails: "total" doesn't exist yet when WHERE is evaluated
SELECT price * quantity AS total FROM orders WHERE total > 100;

-- Works: wrap the query in a derived table or CTE and filter on the alias in the outer query (or repeat the expression in WHERE)
SELECT * FROM (
    SELECT price * quantity AS total FROM orders
) t WHERE total > 100;

But you can use it in ORDER BY — by the time ORDER BY runs, SELECT has already executed and the alias exists:

SELECT price * quantity AS total FROM orders ORDER BY total DESC;

WHERE can't filter on aggregates, HAVING can:

-- Fails: WHERE runs before GROUP BY, so COUNT(*) doesn't exist yet
SELECT customer_id, COUNT(*) FROM orders WHERE COUNT(*) > 5 GROUP BY customer_id;

-- Correct: HAVING runs after grouping
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id HAVING COUNT(*) > 5;

Window functions see SELECT-time data, not raw rows — they run after WHERE/GROUP BY/HAVING but before DISTINCT/ORDER BY/LIMIT, which is why you generally can't reference a window function's result directly in the same SELECT's WHERE clause (you need to wrap it in a subquery or CTE and filter in the outer query instead).
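
The wrap-and-filter approach for window functions might look like the sketch below. The `orders` table here is a hypothetical three-column version made up for illustration, and the syntax assumes an engine with window-function and CTE support.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total       NUMERIC(10,2) NOT NULL
);

INSERT INTO orders (order_id, customer_id, total) VALUES
    (1, 10, 50.00),
    (2, 10, 75.00),
    (3, 20, 20.00);

-- Not allowed: a window function cannot appear in WHERE.
-- SELECT order_id FROM orders
-- WHERE SUM(total) OVER (PARTITION BY customer_id) > 100;

-- Allowed: compute the window function in a CTE, filter in the outer query.
WITH with_totals AS (
    SELECT
        order_id,
        customer_id,
        total,
        SUM(total) OVER (PARTITION BY customer_id) AS customer_total
    FROM orders
)
SELECT order_id, customer_id, total, customer_total
FROM with_totals
WHERE customer_total > 100;
-- Returns orders 1 and 2 (customer 10 totals 125.00); customer 20 totals only 20.00.
```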

Related Resources

PostgreSQL: The SELECT Statement

Both clauses filter rows, but at different stages of query processing (see the logical execution order: FROM → WHERE → GROUP BY → HAVING → SELECT).

WHERE: filters rows before grouping

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'   -- filters individual employee rows first
GROUP BY department;

HAVING: filters groups after aggregation

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 75000;       -- filters the resulting department groups

Combining both

SELECT department, COUNT(*) AS headcount
FROM employees
WHERE status = 'active'           -- row-level filter first
GROUP BY department
HAVING COUNT(*) >= 10;            -- group-level filter second

The rule of thumb

Use WHERE for conditions on raw column values that exist before grouping.
Use HAVING for conditions on the result of an aggregate function.
Trying to put an aggregate condition in WHERE (WHERE COUNT(*) > 5) fails, because COUNT(*) doesn't exist until GROUP BY has executed.
Filtering earlier with WHERE is usually at least as efficient as filtering later with HAVING, because WHERE reduces the row set before the (often expensive) grouping and aggregation work happens. Some optimizers push such conditions down automatically, but as a rule, put any condition that doesn't involve an aggregate in WHERE and reserve HAVING for aggregate conditions.
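
To make the last point concrete, here is a small sketch with a hypothetical `employees` table. Both queries return the same row, but only the second states the filter before grouping. Whether the first is actually slower depends on whether the optimizer pushes the condition down.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    department VARCHAR(50) NOT NULL,
    salary     NUMERIC(10,2) NOT NULL
);

INSERT INTO employees (id, department, salary) VALUES
    (1, 'Sales',       60000),
    (2, 'Sales',       70000),
    (3, 'Engineering', 90000);

-- Legal, but as written every department is grouped and averaged
-- before all but one group is thrown away.
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING department = 'Sales';

-- Same result, with the non-aggregate condition in WHERE.
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE department = 'Sales'
GROUP BY department;
-- Both return one row: Sales with an average of 65000.
```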

Related Resources

MySQL: SELECT ... HAVING

NULL represents "unknown" or "missing," not a value like zero or an empty string — and that distinction drives SQL's three-valued logic.

The three truth values

Any predicate evaluates to TRUE, FALSE, or UNKNOWN. NULL compared to anything — including another NULL — produces UNKNOWN:

SELECT NULL = NULL;      -- UNKNOWN (not TRUE!)
SELECT NULL <> NULL;     -- UNKNOWN
SELECT 5 = NULL;         -- UNKNOWN
SELECT 5 <> NULL;        -- UNKNOWN

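WHERE and HAVING only keep rows where the condition is TRUE, so UNKNOWN is treated the same as FALSE for filtering purposes. The two differ under negation, though: NOT FALSE is TRUE, but NOT UNKNOWN is still UNKNOWN, which is what drives the NOT IN trap below.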

Testing for NULL correctly

-- Wrong: always UNKNOWN, matches nothing
SELECT * FROM users WHERE middle_name = NULL;

-- Correct
SELECT * FROM users WHERE middle_name IS NULL;
SELECT * FROM users WHERE middle_name IS NOT NULL;

The classic NOT IN trap

This is one of the most common NULL bugs in production SQL:

-- If banned_ids contains even one NULL, this returns ZERO rows,
-- because "x NOT IN (1, 2, NULL)" expands to
-- "x <> 1 AND x <> 2 AND x <> NULL", and the last comparison is UNKNOWN,
-- which poisons the whole AND chain to UNKNOWN.
SELECT * FROM users WHERE id NOT IN (SELECT banned_id FROM bans);

Fix it with NOT EXISTS (which handles NULLs correctly) or by filtering NULLs out of the subquery explicitly:

SELECT * FROM users u
WHERE NOT EXISTS (SELECT 1 FROM bans b WHERE b.banned_id = u.id);
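
The second fix, filtering NULLs out of the subquery, could be sketched as follows. The tables are cut-down, hypothetical versions of `users` and `bans` with just enough data to reproduce the trap.

```sql
-- Hypothetical tables and data for illustration.
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE bans  (banned_id INTEGER);          -- nullable on purpose

INSERT INTO users (id) VALUES (1), (2), (3);
INSERT INTO bans (banned_id) VALUES (2), (NULL);

-- Returns no rows: for id 2 the test is FALSE, and for ids 1 and 3
-- the comparison with NULL makes it UNKNOWN.
SELECT id FROM users
WHERE id NOT IN (SELECT banned_id FROM bans);

-- Returns 1 and 3: the NULL is removed before NOT IN sees it.
SELECT id FROM users
WHERE id NOT IN (SELECT banned_id FROM bans WHERE banned_id IS NOT NULL);
```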

Useful NULL-handling functions

COALESCE(a, b, c)     -- returns the first non-NULL argument
NULLIF(a, b)          -- returns NULL if a = b, otherwise returns a

-- Avoid division by zero producing an error, return NULL instead
SELECT revenue / NULLIF(units_sold, 0) AS avg_price FROM sales;

-- Provide a default for a possibly-NULL column
SELECT COALESCE(nickname, first_name) AS display_name FROM users;

Also remember that aggregate functions like SUM, AVG, COUNT(column) all ignore NULLs silently — COUNT(*) counts rows, but COUNT(column) counts only non-NULL values in that column, which is a frequent source of off-by-some-amount bugs.
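
A short sketch of how the counts diverge, using a hypothetical `reviews` table in which one rating is missing:

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE reviews (
    id     INTEGER PRIMARY KEY,
    rating INTEGER               -- NULL means "not rated yet"
);

INSERT INTO reviews (id, rating) VALUES (1, 4), (2, NULL), (3, 2);

SELECT
    COUNT(*)      AS total_rows,   -- 3: counts rows
    COUNT(rating) AS rated_rows,   -- 2: the NULL rating is skipped
    SUM(rating)   AS rating_sum,   -- 6
    AVG(rating)   AS avg_rating    -- 3: averaged over the 2 rated rows, not 6 / 3 = 2
FROM reviews;
```

If unrated rows should count as zero instead, wrap the column first, for example AVG(COALESCE(rating, 0)).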

Related Resources

PostgreSQL: Handling NULL Values

All three remove data or structure, but at very different granularities and costs.

	`DELETE`	`TRUNCATE`	`DROP`
Removes	Matching rows (or all rows)	All rows	The whole table object
Can filter with `WHERE`	Yes	No	N/A
Speed	Slow for large tables (row-by-row logging)	Fast (deallocates pages)	Fast
Fires triggers	Yes	Usually no	No
Resets identity/auto-increment	No	Usually yes	N/A (table gone)
Transactional / rollback-able	Yes, fully	Depends on engine (PostgreSQL: yes; MySQL/InnoDB: implicit commit)	Depends on engine (same caveat)
Table structure afterward	Unchanged, empty or filtered	Unchanged, empty	Table no longer exists
Category	DML	DDL (in most engines)	DDL

-- DELETE: row-level, filterable, fully logged
DELETE FROM orders WHERE status = 'cancelled';

-- TRUNCATE: removes everything, resets the table's storage
TRUNCATE TABLE orders;

-- DROP: the table itself is gone
DROP TABLE orders;

Why TRUNCATE is faster

DELETE scans and removes rows individually, writing an entry to the transaction/redo log per row (or per page) so it can be rolled back and so triggers can fire per row. TRUNCATE instead deallocates the data pages that back the table wholesale — it's closer to a DDL operation than a DML one, which is why many engines treat it as non-transactional or auto-committing.

Need to remove a subset of rows, want triggers to fire, or need it fully rollback-able mid-transaction → DELETE.
Need to empty a whole table fast (e.g., clearing a staging table between ETL runs) and don't need row-level rollback → TRUNCATE.
Need to remove the table definition entirely, including its indexes and constraints → DROP.
Be careful with TRUNCATE on tables referenced by foreign keys — most engines refuse to truncate a table that's the target of an active foreign key from another table unless you cascade or disable the constraint first.
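
As an illustration of the foreign-key caveat, here is a sketch assuming PostgreSQL syntax. The two tables are hypothetical, and other engines offer different escape hatches.

```sql
-- PostgreSQL; hypothetical parent/child tables.
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (id)
);

-- Rejected: orders has a foreign key that points at customers.
-- TRUNCATE TABLE customers;

-- Accepted, but it also empties every table that references customers
-- (here: orders), so use it deliberately.
TRUNCATE TABLE customers CASCADE;

-- Alternative: name both tables explicitly in one statement.
TRUNCATE TABLE orders, customers;
```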

Related Resources

PostgreSQL: TRUNCATE

-- UNION: deduplicates the combined result set
SELECT city FROM current_customers
UNION
SELECT city FROM past_customers;

-- UNION ALL: keeps every row, including duplicates
SELECT city FROM current_customers
UNION ALL
SELECT city FROM past_customers;

Requirements

Every SELECT in the union must produce the same number of columns, in the same order, with compatible data types. Column names in the output come from the first SELECT.

Performance

UNION has to identify and remove duplicates, which typically means sorting the combined result or building a hash set — an extra pass that costs memory and CPU proportional to the result size. UNION ALL just concatenates the result sets with no extra work.

Rule of thumb: default to UNION ALL unless you specifically need deduplication. A common mistake is reflexively using UNION "to be safe" on two queries that can never produce overlapping rows (e.g., querying two mutually exclusive partitions), paying for a sort that can never actually remove anything.

Ordering the final result

ORDER BY can only appear once, at the end, and applies to the combined result:

SELECT city, 'current' AS source FROM current_customers
UNION ALL
SELECT city, 'past' AS source FROM past_customers
ORDER BY city;

Related Resources

PostgreSQL: Combining Queries

INTERSECT and EXCEPT are the other two set operators alongside UNION, each with the same column-count/type requirements.

-- INTERSECT: rows present in BOTH result sets
SELECT customer_id FROM orders_2023
INTERSECT
SELECT customer_id FROM orders_2024;
-- customers who ordered in both years

-- EXCEPT (Oracle: MINUS): rows in the first set but NOT the second
SELECT customer_id FROM orders_2023
EXCEPT
SELECT customer_id FROM orders_2024;
-- customers who ordered in 2023 but not in 2024

Equivalent using joins/EXISTS

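These set operators are often more readable than the join-based equivalent, and many optimizers produce comparable plans for both forms. The rewrites below match the set operators only when customer_id is never NULL: INTERSECT and EXCEPT treat two NULLs as the same value, whereas the = comparison inside EXISTS never matches a NULL.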

-- Equivalent to the INTERSECT above
SELECT DISTINCT o23.customer_id
FROM orders_2023 o23
WHERE EXISTS (SELECT 1 FROM orders_2024 o24 WHERE o24.customer_id = o23.customer_id);

-- Equivalent to the EXCEPT above
SELECT DISTINCT o23.customer_id
FROM orders_2023 o23
WHERE NOT EXISTS (SELECT 1 FROM orders_2024 o24 WHERE o24.customer_id = o23.customer_id);

Support across engines

PostgreSQL, SQL Server, SQLite: INTERSECT and EXCEPT
Oracle: INTERSECT and MINUS; EXCEPT is accepted as a synonym for MINUS only from Oracle 21c onward
MySQL: added INTERSECT and EXCEPT in 8.0.31+; earlier versions require rewriting as JOIN/EXISTS

Both operators deduplicate by default. Like UNION, they have an ALL variant in the SQL standard (INTERSECT ALL, EXCEPT ALL) that preserves duplicate counts using multiset semantics, but support is narrower: PostgreSQL and MySQL 8.0.31+ implement it, while SQL Server and SQLite do not. These variants are used far less often.
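
The difference between set and multiset semantics is easiest to see on a tiny example. This sketch assumes an engine that implements the ALL variants (such as PostgreSQL). The tables `a` and `b` are hypothetical.

```sql
-- Hypothetical tables and data for illustration.
CREATE TABLE a (x INTEGER);
CREATE TABLE b (x INTEGER);

INSERT INTO a (x) VALUES (1), (1), (1), (2);
INSERT INTO b (x) VALUES (1), (1), (3);

-- Set semantics: a single row, x = 1.
SELECT x FROM a INTERSECT SELECT x FROM b;

-- Multiset semantics: 1 appears min(3, 2) = 2 times.
SELECT x FROM a INTERSECT ALL SELECT x FROM b;

-- Set semantics: a single row, x = 2 (every 1 is removed).
SELECT x FROM a EXCEPT SELECT x FROM b;

-- Multiset semantics: 1 appears 3 - 2 = 1 time, plus the row x = 2.
SELECT x FROM a EXCEPT ALL SELECT x FROM b;
```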

Related Resources

PostgreSQL: Combining Queries

CREATE TABLE example (
    country_code CHAR(2),      -- always exactly 2 chars, e.g. 'US', 'GB'
    username     VARCHAR(50),  -- up to 50 chars, stores only what's used
    biography    TEXT          -- arbitrarily long
);

CHAR(n) — fixed length

Always consumes storage for exactly n characters; shorter values are right-padded with spaces (trailing spaces are typically stripped on read, depending on engine).
Best for values that are genuinely always the same length: fixed codes like ISO country codes, US state abbreviations, MD5 hex hashes.
Slightly faster comparisons in some engines because rows are uniformly sized, simplifying row offset math — but this rarely matters in practice compared to correct data modeling.

VARCHAR(n) — variable length, bounded

Stores only the actual bytes used, plus 1–2 bytes of length prefix.
The (n) is a maximum, enforced at insert/update time — it's a constraint, not pre-allocated storage.
The right default for most string columns: names, emails, addresses, titles.

TEXT / unbounded — variable length, no practical cap

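PostgreSQL: TEXT has no length limit and no performance penalty versus VARCHAR(n); both use the same underlying storage (with TOAST for large values). The PostgreSQL documentation notes that there is no performance difference between the character types, and a common community recommendation is TEXT plus a CHECK constraint when a length limit is needed, since a constraint is easier to change later than a column type.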
MySQL/SQL Server: TEXT (or VARCHAR(MAX)/NVARCHAR(MAX) in SQL Server, where the legacy TEXT type is deprecated) historically had different storage (often off-page/BLOB-like) and couldn't always be indexed the same way as VARCHAR, or required a prefix index. This has narrowed in modern versions but still varies, so check your engine's docs before assuming an unbounded type behaves identically to a bounded VARCHAR(n).
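
For the PostgreSQL case, a TEXT column with an explicit length rule might be declared as in this sketch. The table and constraint names are hypothetical.

```sql
-- PostgreSQL: unbounded TEXT plus an explicit, named length rule.
CREATE TABLE products (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
        CONSTRAINT products_name_length CHECK (char_length(name) <= 100)
);

-- Raising the limit later means swapping the constraint,
-- not changing the column's type.
ALTER TABLE products DROP CONSTRAINT products_name_length;
ALTER TABLE products
    ADD CONSTRAINT products_name_length CHECK (char_length(name) <= 200);
```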

Pick VARCHAR(n) with a sensible max when there's a genuine business-rule length limit (e.g., a 100-character product name) so the constraint documents intent and catches bad data early. Use TEXT for genuinely unbounded content (article bodies, JSON blobs, logs). Avoid CHAR(n) unless the value truly always has that exact length — using it for a "mostly short" string wastes storage and requires careful trimming on comparison.

Related Resources

PostgreSQL: Character Types

Basic syntax

SELECT
    order_id,
    CASE
        WHEN total > 1000 THEN 'large'
        WHEN total > 100  THEN 'medium'
        ELSE 'small'
    END AS order_size
FROM orders;

Conditions are checked top to bottom; the first matching WHEN wins. ELSE is optional — if omitted and no branch matches, the result is NULL.

There's also a shorter "simple CASE" form for equality checks against a single expression:

SELECT
    CASE status
        WHEN 'A' THEN 'Active'
        WHEN 'I' THEN 'Inactive'
        ELSE 'Unknown'
    END AS status_label
FROM users;

Conditional aggregation — the most powerful use case

Combining CASE with an aggregate lets you pivot conditional counts/sums into columns without a separate query per condition:

SELECT
    department,
    COUNT(*) AS total_employees,
    SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS female_count,
    SUM(CASE WHEN salary > 100000 THEN 1 ELSE 0 END) AS high_earners,
    AVG(CASE WHEN hire_date >= '2023-01-01' THEN salary END) AS avg_new_hire_salary
FROM employees
GROUP BY department;

Note the last example: AVG(CASE WHEN ... THEN salary END) (no ELSE) — rows that don't match the condition contribute NULL, and aggregates like AVG/SUM/COUNT(column) ignore NULLs, so this correctly computes the average only over matching rows without needing a separate WHERE-filtered query.

In ORDER BY — custom sort order

SELECT * FROM tickets
ORDER BY
    CASE priority
        WHEN 'critical' THEN 1
        WHEN 'high'     THEN 2
        WHEN 'medium'   THEN 3
        ELSE 4
    END;

This is the standard trick for sorting by a business-meaningful order that doesn't match alphabetical or numeric order of the raw column.

In WHERE / GROUP BY

CASE can also drive filtering or bucketing logic:

SELECT
    CASE WHEN age < 18 THEN 'minor' WHEN age < 65 THEN 'adult' ELSE 'senior' END AS age_bucket,
    COUNT(*)
FROM people
GROUP BY 1;  -- PostgreSQL, MySQL, and SQLite allow grouping by output column position; SQL Server requires repeating the expression
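
The filtering side can be sketched the same way. In this example a hypothetical `orders` table carries a `region` column, and the CASE expression picks a different threshold per region inside WHERE.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    region   VARCHAR(10) NOT NULL,
    total    NUMERIC(10,2) NOT NULL
);

INSERT INTO orders (order_id, region, total) VALUES
    (1, 'EU',  40.00),
    (2, 'EU',  60.00),
    (3, 'US',  60.00),
    (4, 'US', 150.00);

-- Keep orders above a region-specific threshold: 50 in the EU, 100 elsewhere.
SELECT order_id, region, total
FROM orders
WHERE total > CASE WHEN region = 'EU' THEN 50 ELSE 100 END;
-- Returns orders 2 and 4.
```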

Related Resources

PostgreSQL: CASE

These three terms describe overlapping ways of nesting one query inside another, distinguished mainly by where they appear and how reusable they are.

Subquery — the general term

Any SELECT nested inside another SQL statement:

-- In WHERE (scalar/list subquery)
SELECT * FROM employees WHERE department_id IN (SELECT id FROM departments WHERE region = 'EU');

-- In SELECT (scalar subquery: must return one column and at most one row)
SELECT name, (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count
FROM customers c;

Derived table — a subquery used as a table

When a subquery appears in FROM and is given an alias, it's specifically called a derived table (or "inline view"):

SELECT dept_avg.department, dept_avg.avg_salary
FROM (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) AS dept_avg          -- <-- the alias is mandatory in most engines
WHERE dept_avg.avg_salary > 80000;

Derived tables must be aliased, are scoped only to that one query, and can't be referenced more than once without repeating the whole subquery.

CTE — a named, top-level, (optionally) reusable query

WITH dept_avg AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d ON e.department = d.department
WHERE e.salary > d.avg_salary;

Advantages over a derived table:

Readability — complex queries read top-to-bottom instead of nesting inward.
Reusability within one statement — the same CTE can be joined against multiple times without repeating its definition.
Recursion — WITH RECURSIVE lets a CTE reference itself, which is impossible with a plain derived table (used for hierarchies/graphs — see the recursive CTE question).
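
The reusability point can be illustrated with a sketch in which one CTE is referenced twice in the same statement, something a derived table could only do by repeating its definition. The `employees` table and its data are hypothetical.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    department VARCHAR(50) NOT NULL,
    salary     NUMERIC(10,2) NOT NULL
);

INSERT INTO employees (id, department, salary) VALUES
    (1, 'Sales',        60000),
    (2, 'Sales',        70000),
    (3, 'Engineering',  90000),
    (4, 'Engineering', 110000),
    (5, 'Support',      40000);

-- dept_avg is defined once and used twice: as the row source and
-- again inside the scalar subquery.
WITH dept_avg AS (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT d.department, d.avg_salary
FROM dept_avg d
WHERE d.avg_salary > (SELECT AVG(avg_salary) FROM dept_avg);
-- Returns Engineering only: its average (100000) is the only one above
-- the average of the three department averages (about 68333).
```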

Materialization caveat

Historically, PostgreSQL always materialized (fully computed and stored) CTE results before using them, which could hurt performance versus an equivalent derived table the optimizer could inline and push predicates into. As of PostgreSQL 12, CTEs are inlined by default (like a derived table) unless marked MATERIALIZED, referenced multiple times, or recursive. Always check your engine's current CTE optimization behavior rather than assuming — it's one of the more version-sensitive areas of SQL.
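
In PostgreSQL 12 and later the choice can be made explicit per CTE. The following is a sketch of the syntax only. The `events` table is hypothetical, and whether either hint helps depends on the data and the plan.

```sql
-- PostgreSQL 12+; hypothetical table.
CREATE TABLE events (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL
);

-- Force materialization: the CTE is computed once and acts as an
-- optimization fence, so the outer id = 42 filter is not pushed into it.
WITH logins AS MATERIALIZED (
    SELECT id, kind FROM events WHERE kind = 'login'
)
SELECT id, kind FROM logins WHERE id = 42;

-- Request inlining even though the CTE is referenced twice, which would
-- otherwise cause it to be materialized.
WITH logins AS NOT MATERIALIZED (
    SELECT id, kind FROM events WHERE kind = 'login'
)
SELECT a.id AS first_id, b.id AS next_id
FROM logins a
JOIN logins b ON b.id = a.id + 1;
```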

Related Resources

PostgreSQL: WITH Queries (CTEs)

Non-correlated subquery

Self-contained — it doesn't reference anything from the outer query, so it can be evaluated exactly once:

SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

The inner SELECT AVG(salary) FROM employees computes one number, once, regardless of how many outer rows there are.

Correlated subquery

References a column from the outer query, so its result depends on which outer row is currently being evaluated:

SELECT e.name, e.salary, e.department
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e2.department = e.department   -- <-- correlated: references outer e.department
);

Here, the inner query's result (average salary) is different for every department, so conceptually it must be recomputed for each outer row.

Common use case: "latest/top row per group"

SELECT o.*
FROM orders o
WHERE o.order_date = (
    SELECT MAX(o2.order_date)
    FROM orders o2
    WHERE o2.customer_id = o.customer_id   -- correlated
);

This finds each customer's most recent order (or several, if a customer has more than one order on that latest date). The pattern is hard to express without a correlated subquery, a window function (ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC)), or a GROUP BY + join.
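
The window-function alternative mentioned above could be written roughly as follows. The table is a hypothetical three-column `orders`. The extra `order_id DESC` tiebreaker is an assumption added here so that exactly one row per customer is returned even when dates tie.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date  DATE NOT NULL
);

INSERT INTO orders (order_id, customer_id, order_date) VALUES
    (1, 10, '2024-01-05'),
    (2, 10, '2024-03-01'),
    (3, 20, '2024-02-10');

SELECT order_id, customer_id, order_date
FROM (
    SELECT
        order_id,
        customer_id,
        order_date,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC, order_id DESC   -- tiebreaker: assumption
        ) AS rn
    FROM orders
) ranked
WHERE rn = 1;
-- Returns order 2 for customer 10 and order 3 for customer 20.
```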

Performance implications

A naive mental model — "the subquery runs once per outer row" — describes correctness, not necessarily execution. Modern optimizers frequently rewrite correlated subqueries into semi-joins or anti-joins (especially ones using EXISTS/IN/NOT EXISTS), avoiding literal row-by-row re-execution. But not all correlated subqueries can be rewritten this way, and on an engine/query shape where the optimizer can't flatten it, you do get an actual nested-loop-style re-execution per outer row — which is O(n×m) and can be dramatically slower than an equivalent join or window function on large tables.

Practical guidance: correlated subqueries are fine and often the clearest way to express "compare each row to something computed from its own group." But when you hit a performance problem with one on a large table, check the execution plan first — a rewrite to a JOIN with GROUP BY, or a window function, is often available and can be an order of magnitude faster.
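
As one possible rewrite, the "salary above the department average" query from earlier can be expressed as a join against a grouped derived table. The table and data are hypothetical. Whether this actually beats the correlated form depends on the engine, so compare the plans.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    name       VARCHAR(50) NOT NULL,
    department VARCHAR(50) NOT NULL,
    salary     NUMERIC(10,2) NOT NULL
);

INSERT INTO employees (id, name, department, salary) VALUES
    (1, 'Ann', 'Sales',        60000),
    (2, 'Bob', 'Sales',        70000),
    (3, 'Cy',  'Engineering',  90000),
    (4, 'Di',  'Engineering', 110000);

-- The per-department average is computed once per department,
-- then joined back to the individual rows.
SELECT e.name, e.salary, e.department
FROM employees e
JOIN (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) d ON d.department = e.department
WHERE e.salary > d.avg_salary;
-- Returns Bob (70000 > 65000) and Di (110000 > 100000).
```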

Related Resources

PostgreSQL: Subquery Expressions

-- DISTINCT: unique combinations of the selected columns
SELECT DISTINCT department FROM employees;

-- GROUP BY: same result here, but built for aggregation
SELECT department FROM employees GROUP BY department;

-- GROUP BY's real purpose: per-group aggregates
SELECT department, COUNT(*), AVG(salary)
FROM employees
GROUP BY department;

Where they diverge

DISTINCT applies to every selected column together — it can't aggregate, and it can't return a value for one row that summarizes a group of others:

-- Returns one row per DISTINCT (department, job_title) combination
SELECT DISTINCT department, job_title FROM employees;

GROUP BY lets you select the grouping column(s) plus arbitrary aggregate expressions over each group — something DISTINCT simply cannot do:

SELECT department, job_title, COUNT(*) AS headcount, MAX(salary) AS top_salary
FROM employees
GROUP BY department, job_title;

Selecting a non-aggregated, non-grouped column alongside GROUP BY violates the functional-dependency rule. Most engines reject the query (MySQL does so when ONLY_FULL_GROUP_BY is enabled, the default since 5.7), but older or lenient MySQL modes silently return an arbitrary value from the group, a frequent source of subtly wrong reports.

Performance

When there's no aggregate function involved, SELECT DISTINCT col FROM t and SELECT col FROM t GROUP BY col typically produce identical execution plans — both need some form of sort or hash-based deduplication, and query optimizers usually recognize the equivalence. Don't assume one is inherently faster than the other without checking EXPLAIN on your specific engine and data.
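
A quick way to check is to ask the engine for both plans. This sketch assumes PostgreSQL or MySQL, where a plain EXPLAIN prefix is accepted (SQLite uses EXPLAIN QUERY PLAN instead). The table is hypothetical, and the output format differs per engine, so none is shown.

```sql
-- Hypothetical table for illustration.
CREATE TABLE employees (
    id         INTEGER PRIMARY KEY,
    department VARCHAR(50) NOT NULL
);

-- Compare the two plans side by side; on many engines they come out the same.
EXPLAIN SELECT DISTINCT department FROM employees;
EXPLAIN SELECT department FROM employees GROUP BY department;
```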

Related Resources

PostgreSQL: GROUP BY and HAVING