What are window functions, and how do they differ from GROUP BY aggregation?

A window function computes a value across a set of related rows (a "window") *without collapsing them into a single output row* — unlike `GROUP BY`, which reduces N rows to one row per group. Syntax uses `OVER (...)` after an aggregate or ranking function, optionally with `PARTITION BY` (grouping without collapsing) and `ORDER BY` (defining row order within the window, enabling running calculations).

Explain the differences between ROW_NUMBER(), RANK(), and DENSE_RANK()

All three assign a position based on `ORDER BY` within a window, but they handle ties differently. `ROW_NUMBER()` gives every row a unique, sequential number regardless of ties (breaking them arbitrarily). `RANK()` gives tied rows the same rank, then skips ahead so that the next distinct value's rank matches its row position: after two rows tied at 2, the next rank is 4. `DENSE_RANK()` also gives tied rows the same rank, but never skips subsequent rank numbers.

What is a recursive CTE, and when would you use one?

A recursive CTE (`WITH RECURSIVE`) lets a query reference itself, repeatedly building on prior results until no new rows are produced — the standard way to traverse hierarchical or graph-like data (org charts, category trees, bill-of-materials, dependency graphs) that a fixed number of joins can't handle because the depth is unknown or variable.

What is a materialized view, and how does it differ from a regular view?

A regular `VIEW` is just a saved, named query: it has no storage of its own and re-executes its underlying query every time it's referenced. A `MATERIALIZED VIEW` stores the query's result set physically on disk, so reading it is as fast as reading a regular table, but the stored data goes stale until explicitly (or, on some engines, automatically) refreshed. The tradeoff is always the same: a materialized view buys read speed at the price of staleness and refresh cost, while a regular view is always current and costs nothing to maintain but pays the full underlying query cost on every read.

How would you pivot rows into columns in SQL?

The portable, engine-agnostic technique is conditional aggregation: `GROUP BY` the row-identifying column, and use `CASE WHEN` inside an aggregate function (like `SUM` or `MAX`) once per desired output column. Some engines also offer dedicated syntax, such as SQL Server's `PIVOT` operator or `crosstab()` in PostgreSQL's `tablefunc` extension, but conditional aggregation works everywhere and needs no extension. Like the dedicated syntax, it requires the pivoted values to be known when the query is written.

What do LAG() and LEAD() do, and what problems do they solve?

`LAG(column, n)` returns the value of `column` from `n` rows *before* the current row (within the window's partition/order); `LEAD(column, n)` returns the value from `n` rows *after*. Both default to `n = 1` and accept an optional default value for when there's no such row (e.g., the first row has no previous row). They solve row-to-row comparison problems — period-over-period change, detecting gaps, sequencing — without a self-join.

What's the difference between a view and a table, and when do views help or hurt performance?

A table stores actual data; a view is a saved, named `SELECT` statement with no storage of its own — querying a view is equivalent to running its underlying query, substituted inline wherever the view is referenced. Views help readability, security (restricting which columns/rows a user can see), and centralizing business logic, but provide no performance benefit by themselves — a view built on an expensive, unindexed join is exactly as slow as running that join directly, every time.

How do you compute a running total or moving average in SQL?

Use a window function's aggregate form with an explicit frame clause: `SUM(x) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` for a running total, or `AVG(x) OVER (ORDER BY date ROWS BETWEEN N PRECEDING AND CURRENT ROW)` for a moving average over the current row plus the N rows before it (N + 1 rows in total). The frame clause is what makes this different from a simple partitioned aggregate: it defines exactly which neighboring rows are included relative to the current row.

Advanced SQL: Window Functions, CTEs, and Views

Analytical SQL — window functions, recursive queries, pivoting, and views vs. materialized views.

The key distinction from GROUP BY

-- GROUP BY: collapses many rows into one row per department
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
-- Output: one row PER department -- individual employee rows are gone.

-- Window function: keeps every employee row, adds a computed column
SELECT name, department, salary,
       AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary
FROM employees;
-- Output: one row PER EMPLOYEE, each annotated with their department's average.

This is the core value proposition: you can compare an individual row to an aggregate of its group (salary vs. dept_avg_salary) in the same row, something plain GROUP BY can't do without a self-join or subquery.
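As a quick check of the contrast above, here is a minimal runnable sketch using Python's built-in sqlite3 module (assuming a bundled SQLite of 3.25 or newer, which is when window functions arrived). The three sample employees and the diff_from_avg column are illustrative assumptions, not data from this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Ann', 'Eng', 120), ('Bo', 'Eng', 100), ('Cy', 'Ops', 80);
""")

# GROUP BY: one row per department, individual rows are gone.
grouped = conn.execute("""
    SELECT department, AVG(salary)
    FROM employees
    GROUP BY department
    ORDER BY department
""").fetchall()

# Window function: every employee row survives, annotated with the group average.
windowed = conn.execute("""
    SELECT name, salary,
           AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
           salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
    FROM employees
    ORDER BY name
""").fetchall()

print(grouped)   # [('Eng', 110.0), ('Ops', 80.0)]
print(windowed)  # [('Ann', 120.0, 110.0, 10.0), ('Bo', 100.0, 110.0, -10.0), ('Cy', 80.0, 80.0, 0.0)]
```

Two rows come back from the grouped query and three from the windowed one, which is the whole difference in miniature.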

Anatomy of the OVER clause

function_name(...) OVER (
    [PARTITION BY column1, column2, ...]   -- groups rows, like GROUP BY, but doesn't collapse them
    [ORDER BY column3, ...]                -- defines row order within each partition
    [ROWS/RANGE BETWEEN ... AND ...]       -- defines the "frame" -- which rows within the partition to include
)

PARTITION BY: optional; without it, the whole result set is one partition. Divides rows into groups for the function to operate over, analogous to GROUP BY but without reducing row count.
ORDER BY (inside OVER): defines the order used by ranking functions (ROW_NUMBER, RANK), by offset functions (LAG, LEAD), and by framed aggregates such as running totals. This is independent of the query's outer ORDER BY.
Frame clause (ROWS BETWEEN ...): defines exactly which rows within the partition an aggregate (or FIRST_VALUE/LAST_VALUE) sees relative to the current row (e.g., "from the start of the partition to the current row," for a running total). Ranking functions and LAG/LEAD ignore the frame.

Common window functions

ROW_NUMBER() OVER (...)         -- sequential number, no ties
RANK() OVER (...)               -- rank with gaps after ties
DENSE_RANK() OVER (...)         -- rank without gaps after ties
LAG(col, n) OVER (...)          -- value from n rows before the current row
LEAD(col, n) OVER (...)         -- value from n rows after the current row
SUM(col) OVER (...)             -- running/partitioned sum (not collapsed)
AVG(col) OVER (...)             -- running/partitioned average
FIRST_VALUE(col) OVER (...)     -- first value in the frame

Why this matters

Window functions are the standard, efficient way to express "rank within group," "running total," "percent of group total," or "compare to previous row" — all queries that, before window functions existed, required awkward self-joins or correlated subqueries (see the correlated subquery and self-join questions) that were both harder to read and often slower. They execute logically after WHERE/GROUP BY/HAVING but before ORDER BY/LIMIT in the query's execution order (see the execution order question), which is why you generally can't filter directly on a window function's result in the same query's WHERE clause without wrapping it in a subquery or CTE.

Related Resources

PostgreSQL: Window Functions

Side-by-side example

SELECT
    name, score,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
    RANK()       OVER (ORDER BY score DESC) AS rnk,        -- "rank" is a reserved word in some engines (e.g., MySQL 8)
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM contestants;

name	score	row_num	rnk	dense_rnk
Alice	95	1	1	1
Bob	90	2	2	2
Carol	90	3	2	2
Dave	85	4	4	3

ROW_NUMBER(): Bob and Carol are tied at 90, but get distinct numbers (2, 3) anyway — the tie is broken arbitrarily (or deterministically, if you add more ORDER BY columns to fully disambiguate). No two rows ever share a ROW_NUMBER().
RANK(): Bob and Carol both get rank 2 (tied), and Dave — the next distinct value — gets rank 4, skipping 3 entirely, because two rows already "used up" ranks 2 and 3.
DENSE_RANK(): Bob and Carol both get rank 2, and Dave gets rank 3 — no gap, because dense rank only increments for each distinct value encountered, not for each row.
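The table above can be reproduced with a small sketch in Python's sqlite3 module (assuming SQLite 3.25+). The extra `name` tie-breaker in the ROW_NUMBER() ordering is an addition for this example so that the output is deterministic; without it, Bob and Carol could swap row numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contestants (name TEXT, score INTEGER);
    INSERT INTO contestants VALUES
        ('Alice', 95), ('Bob', 90), ('Carol', 90), ('Dave', 85);
""")

rows = conn.execute("""
    SELECT name, score,
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num,  -- name breaks the tie
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk
    FROM contestants
    ORDER BY row_num
""").fetchall()

for row in rows:
    print(row)
# ('Alice', 95, 1, 1, 1)
# ('Bob', 90, 2, 2, 2)
# ('Carol', 90, 3, 2, 2)
# ('Dave', 85, 4, 4, 3)
```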

When to use which

ROW_NUMBER(): when you need a strictly unique sequential identifier per row regardless of ties — e.g., picking exactly one "first" row per group (deduplication, pagination, "keep only the latest record per customer").
RANK(): when ties should share a rank, and you want the rank to reflect "how many rows are strictly better," including the tie count — e.g., a leaderboard where two people tied for 2nd place means the next person really is in "4th place" by count.
DENSE_RANK(): when ties should share a rank, but you want ranks to be a compact, gapless sequence — e.g., "top 3 distinct price tiers," where you care about distinct values, not row counts.

A common use: "top N per group"

SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
) ranked
WHERE rn <= 3;   -- top 3 highest-paid employees PER department

This pattern (ROW_NUMBER() + PARTITION BY + filter in an outer query) is one of the most common real-world uses of window functions, and a frequent live-coding interview exercise.
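The same pattern with rn = 1 covers the "keep only the latest record per customer" case mentioned earlier. A minimal sketch in Python's sqlite3 module (assuming SQLite 3.25+); the orders table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-05', 50), (1, '2024-03-01', 75),
        (2, '2024-02-10', 20), (2, '2024-02-11', 30), (2, '2024-01-01', 10);
""")

# Number each customer's orders newest-first, then keep only row 1 per customer.
latest = conn.execute("""
    SELECT customer_id, order_date, total
    FROM (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY order_date DESC) AS rn
        FROM orders
    ) AS ranked
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(latest)  # [(1, '2024-03-01', 75.0), (2, '2024-02-11', 30.0)]
```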

Related Resources

PostgreSQL: Window Functions

The problem: unknown-depth hierarchies

-- employees: id, name, manager_id

Finding "all of Alice's direct reports" is a simple join. Finding "everyone in Alice's entire management chain below her, at any depth" can't be expressed with a fixed number of joins, because the org chart's depth varies and isn't known in advance.

Anatomy of a recursive CTE

WITH RECURSIVE subordinates AS (
    -- Base case (anchor member): the starting row(s)
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE id = 1   -- Alice's id

    UNION ALL

    -- Recursive member: references the CTE itself
    SELECT e.id, e.name, e.manager_id, s.depth + 1
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;

How it executes conceptually:

1. Run the base case (anchor) once. This seeds the initial working set (Alice herself).
2. Run the recursive member using only the newest rows added in the previous iteration, producing a new batch of rows (Alice's direct reports).
3. Repeat step 2, each time joining against only the previous iteration's new rows, until an iteration produces zero new rows, then stop.
4. UNION ALL all iterations' results together as the final output.
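To see those iterations in the output, the sketch below runs the same kind of query on a tiny org chart using Python's sqlite3 module (SQLite supports WITH RECURSIVE). The six employees are invented for the example; the depth column shows which iteration produced each row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Carol', 1),
        (4, 'Dave', 2), (5, 'Erin', 4), (6, 'Frank', NULL);
""")

rows = conn.execute("""
    WITH RECURSIVE subordinates AS (
        SELECT id, name, 1 AS depth FROM employees WHERE id = 1   -- anchor: Alice
        UNION ALL
        SELECT e.id, e.name, s.depth + 1                          -- recursive member
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.id
    )
    SELECT name, depth FROM subordinates ORDER BY depth, name
""").fetchall()

print(rows)
# [('Alice', 1), ('Bob', 2), ('Carol', 2), ('Dave', 3), ('Erin', 4)]
```

Frank has no manager and is not under Alice, so he never enters the working set.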

Termination

The recursion must eventually stop producing new rows, or it runs forever (or until an engine-enforced recursion limit is hit — PostgreSQL doesn't have a hard limit by default and will happily loop indefinitely on a cyclic graph without one; SQL Server defaults to a 100-level MAXRECURSION limit specifically to guard against this). For genuinely cyclic data (e.g., a graph where cycles are possible, unlike a strict tree), you must explicitly track visited nodes and exclude them to avoid infinite recursion:

WITH RECURSIVE paths AS (
    SELECT start_node, end_node, ARRAY[start_node, end_node] AS visited   -- record both endpoints, or the first end_node could be revisited
    FROM edges WHERE start_node = 'A'

    UNION ALL

    SELECT e.start_node, e.end_node, p.visited || e.end_node
    FROM edges e
    JOIN paths p ON e.start_node = p.end_node
    WHERE NOT e.end_node = ANY(p.visited)   -- prevents revisiting a node, avoiding infinite loops
)
SELECT * FROM paths;
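The array syntax above is PostgreSQL-specific. As a rough, runnable equivalent, the sketch below uses Python's sqlite3 module and swaps the array for a comma-delimited string of visited nodes checked with instr(); that substitution, and the four sample edges (which contain the cycle A → B → C → A), are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (start_node TEXT, end_node TEXT);
    INSERT INTO edges VALUES ('A', 'B'), ('B', 'C'), ('C', 'A'), ('C', 'D');
""")

rows = conn.execute("""
    WITH RECURSIVE paths AS (
        -- visited is a string like ',A,B,' so each node can be matched as ',X,'
        SELECT start_node, end_node,
               ',' || start_node || ',' || end_node || ',' AS visited
        FROM edges WHERE start_node = 'A'
        UNION ALL
        SELECT e.start_node, e.end_node, p.visited || e.end_node || ','
        FROM edges e
        JOIN paths p ON e.start_node = p.end_node
        WHERE instr(p.visited, ',' || e.end_node || ',') = 0   -- skip already-visited nodes
    )
    SELECT end_node, visited FROM paths ORDER BY length(visited), end_node
""").fetchall()

print(rows)
# [('B', ',A,B,'), ('C', ',A,B,C,'), ('D', ',A,B,C,D,')]
```

The edge C → A is rejected because A is already in the visited string, so the query terminates despite the cycle.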

Common use cases

Org charts / management chains (as above).
Category/product hierarchies (find all subcategories under "Electronics," arbitrarily nested).
Bill-of-materials explosions (a product made of sub-assemblies, made of sub-sub-assemblies...).
Graph traversal — shortest/all paths between two nodes, dependency resolution.

Performance note

Recursive CTEs can be slow on deep or wide hierarchies since each level is a fresh join pass. For read-heavy, rarely-changing hierarchies, two common denormalizations trade write complexity for much faster reads: a materialized path (each row stores its full ancestor path, e.g. '1/2/4') or a closure table (precomputed ancestor/descendant pairs, maintained on write). Either is worth mentioning as the production alternative when a recursive CTE becomes a bottleneck.

Related Resources

PostgreSQL: Recursive Queries

Regular view — a saved query, no storage

CREATE VIEW active_customer_totals AS
SELECT c.id, c.name, SUM(o.total) AS lifetime_total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.status = 'active'
GROUP BY c.id, c.name;

SELECT * FROM active_customer_totals WHERE lifetime_total > 1000;

Querying the view re-runs the underlying JOIN/GROUP BY every single time — it's purely a naming/abstraction convenience (hiding query complexity, centralizing a business definition, or restricting column access for security) with zero performance benefit over just writing the underlying query directly. It's always exactly as current as the base tables, because there's no cached/stored copy at all.

Materialized view — stored, physical results

CREATE MATERIALIZED VIEW active_customer_totals_mv AS
SELECT c.id, c.name, SUM(o.total) AS lifetime_total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.status = 'active'
GROUP BY c.id, c.name;

-- Reading this is now a simple table scan/index lookup -- no join/aggregation cost
SELECT * FROM active_customer_totals_mv WHERE lifetime_total > 1000;

-- But it's now a snapshot -- must be explicitly refreshed to reflect new data
REFRESH MATERIALIZED VIEW active_customer_totals_mv;

By default, REFRESH in PostgreSQL takes a lock that blocks concurrent reads during the refresh; REFRESH MATERIALIZED VIEW CONCURRENTLY (which requires a unique index on the view) avoids that at the cost of a slower refresh.
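The staleness behavior can be illustrated without PostgreSQL. SQLite has no materialized views, so the sketch below (Python's sqlite3 module) emulates one with a plain snapshot table and a hypothetical refresh_totals_mv helper that does a full rebuild; the table names and data are assumptions for the example, not an engine feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 100), (1, 50), (2, 70);

    -- Regular view: re-runs the aggregation on every read.
    CREATE VIEW totals_v AS
        SELECT customer_id, SUM(total) AS lifetime_total
        FROM orders GROUP BY customer_id;

    -- Emulated materialized view: a stored snapshot of the same query.
    CREATE TABLE totals_mv AS SELECT * FROM totals_v;
""")

def refresh_totals_mv(conn):
    """Hypothetical full refresh: rebuild the snapshot from the view."""
    with conn:  # one transaction, so readers never see a half-empty snapshot
        conn.execute("DELETE FROM totals_mv")
        conn.execute("INSERT INTO totals_mv SELECT * FROM totals_v")

def total_for(source, customer_id):
    # source is one of the two fixed names above, never user input
    return conn.execute(
        f"SELECT lifetime_total FROM {source} WHERE customer_id = ?", (customer_id,)
    ).fetchone()[0]

conn.execute("INSERT INTO orders VALUES (1, 25)")   # new data arrives

view_before = total_for("totals_v", 1)    # view sees the new order immediately
mv_before = total_for("totals_mv", 1)     # snapshot is stale
refresh_totals_mv(conn)
mv_after = total_for("totals_mv", 1)      # snapshot catches up after refresh

print(view_before, mv_before, mv_after)   # 175.0 150.0 175.0
```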

Key tradeoffs

	Regular view	Materialized view
Storage	None (just a saved query)	Full physical copy of the result set
Read speed	Same as running the underlying query	Fast — like reading a table
Data freshness	Always current	Stale until refreshed
Write/refresh cost	None	Refresh re-runs the whole query (or does incremental refresh, if supported)
Can be indexed	No, but the underlying query can use base-table indexes	Yes — indexes can be created directly on the materialized view itself

When to use each

Regular view: encapsulating a complex or security-sensitive query behind a simple name, with no performance goal — freshness matters more than read speed, or the underlying query is cheap enough that re-running it is a non-issue.
Materialized view: an expensive aggregation/join that's read far more often than the underlying data changes — dashboards, reporting rollups, expensive analytics queries — where a controlled amount of staleness (refreshed hourly, nightly, or on a trigger) is an acceptable tradeoff for dramatically faster reads.

Refresh strategies

Scheduled (cron + REFRESH MATERIALIZED VIEW), triggered (refresh after a batch ETL job completes), or, in engines/extensions that support it, incremental refresh (only recomputing the delta rather than the whole view) — full refresh is the simplest but scales poorly if the underlying query is expensive and the view is large.

Related Resources

PostgreSQL: Materialized Views

The starting shape: long/narrow data

-- sales: region, quarter, amount
-- ('East', 'Q1', 100), ('East', 'Q2', 150), ('West', 'Q1', 80), ('West', 'Q2', 120)

Goal: turn this into one row per region, with a column per quarter.

Approach: conditional aggregation (portable across all major engines)

SELECT
    region,
    SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
    SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2,
    SUM(CASE WHEN quarter = 'Q3' THEN amount ELSE 0 END) AS q3,
    SUM(CASE WHEN quarter = 'Q4' THEN amount ELSE 0 END) AS q4
FROM sales
GROUP BY region;

region	q1	q2	q3	q4
East	100	150	0	0
West	80	120	0	0

This works in every SQL engine, doesn't require any extension, and is the pattern most interviewers expect as the "default" answer, since it demonstrates understanding of conditional aggregation rather than reliance on engine-specific syntax.
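For a quick confirmation, here is the same query run against the four sample rows using Python's sqlite3 module; only the choice of SQLite as the engine is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('East', 'Q1', 100), ('East', 'Q2', 150),
        ('West', 'Q1', 80),  ('West', 'Q2', 120);
""")

rows = conn.execute("""
    SELECT
        region,
        SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
        SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2,
        SUM(CASE WHEN quarter = 'Q3' THEN amount ELSE 0 END) AS q3,
        SUM(CASE WHEN quarter = 'Q4' THEN amount ELSE 0 END) AS q4
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('East', 100, 150, 0, 0), ('West', 80, 120, 0, 0)]
```

Dropping the ELSE 0 branch would yield NULL instead of 0 for quarters with no rows, which is sometimes the more honest output.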

Engine-specific dedicated syntax

SQL Server's PIVOT:

SELECT region, [Q1], [Q2], [Q3], [Q4]
FROM sales
PIVOT (SUM(amount) FOR quarter IN ([Q1], [Q2], [Q3], [Q4])) AS p;

PostgreSQL's crosstab() (requires the tablefunc extension):

CREATE EXTENSION IF NOT EXISTS tablefunc;

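-- Two-argument form: the second query lists the categories explicitly, so a
-- region with a missing quarter gets NULL in that column instead of having
-- its later values shifted left.
SELECT * FROM crosstab(
    'SELECT region, quarter, amount FROM sales ORDER BY 1, 2',
    $$VALUES ('Q1'), ('Q2'), ('Q3'), ('Q4')$$
) AS ct(region text, q1 numeric, q2 numeric, q3 numeric, q4 numeric);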

The key limitation: pivoted columns must generally be known ahead of time

All three approaches require you to know the distinct values that will become columns (Q1..Q4) at query-authoring time — SQL is a fixed-schema language, so a query can't natively produce a variable number of output columns based on data it hasn't seen yet. If the pivoted values are truly dynamic (unknown until runtime), you need to generate the SQL dynamically in application code or via dynamic SQL (EXECUTE/sp_executesql) — a real limitation worth mentioning, since it's a common follow-up question ("what if you don't know the quarters in advance?").
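One way to handle the "unknown quarters" follow-up is to generate the conditional-aggregation query in application code. The sketch below does that with Python's sqlite3 module: it first reads the distinct quarter values, then builds one SUM(CASE ...) per value. The quote_ident helper and the sample data are assumptions for this example; the compared values are bound as parameters, and only the column aliases are interpolated (after quoting).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('East', 'Q1', 100), ('East', 'Q2', 150),
        ('West', 'Q1', 80),  ('West', 'Q2', 120), ('West', 'Q3', 60);
""")

def quote_ident(name):
    """Hypothetical helper: quote a value for use as an SQL identifier."""
    return '"' + name.replace('"', '""') + '"'

# Step 1: discover the pivot values at runtime.
quarters = [r[0] for r in conn.execute(
    "SELECT DISTINCT quarter FROM sales ORDER BY quarter")]

# Step 2: build one conditional aggregate per discovered value.
columns = ", ".join(
    f"SUM(CASE WHEN quarter = ? THEN amount ELSE 0 END) AS {quote_ident(q.lower())}"
    for q in quarters
)
sql = f"SELECT region, {columns} FROM sales GROUP BY region ORDER BY region"

cur = conn.execute(sql, quarters)
header = [d[0] for d in cur.description]
rows = cur.fetchall()

print(header)  # ['region', 'q1', 'q2', 'q3']
print(rows)    # [('East', 100, 150, 0), ('West', 80, 120, 60)]
```

The output shape now depends on the data, which is exactly what a single static SQL statement cannot express.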

Related Resources

SQL Server: PIVOT and UNPIVOT

Basic usage

SELECT
    sale_date,
    amount,
    LAG(amount, 1) OVER (ORDER BY sale_date) AS prev_day_amount,
    LEAD(amount, 1) OVER (ORDER BY sale_date) AS next_day_amount
FROM daily_sales;

sale_date	amount	prev_day_amount	next_day_amount
Jan 1	100	NULL	120
Jan 2	120	100	90
Jan 3	90	120	NULL

The first row's prev_day_amount and the last row's next_day_amount are NULL by default, since there's no row before/after them — you can supply an explicit default instead: LAG(amount, 1, 0) OVER (...).
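A small sketch of both behaviors, run with Python's sqlite3 module (assuming SQLite 3.25+): LAG is given an explicit default of 0, while LEAD is left with the implicit NULL. The ISO-formatted dates are an assumption so that text ordering matches date ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100), ('2024-01-02', 120), ('2024-01-03', 90);
""")

rows = conn.execute("""
    SELECT sale_date, amount,
           LAG(amount, 1, 0) OVER (ORDER BY sale_date) AS prev_amount,  -- default 0
           LEAD(amount)      OVER (ORDER BY sale_date) AS next_amount   -- default NULL
    FROM daily_sales
    ORDER BY sale_date
""").fetchall()

for row in rows:
    print(row)
# ('2024-01-01', 100, 0, 120)
# ('2024-01-02', 120, 100, 90)
# ('2024-01-03', 90, 120, None)
```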

Common use case: period-over-period change

SELECT
    sale_date,
    amount,
    amount - LAG(amount) OVER (ORDER BY sale_date) AS change_from_yesterday,
    ROUND(100.0 * (amount - LAG(amount) OVER (ORDER BY sale_date))
          / NULLIF(LAG(amount) OVER (ORDER BY sale_date), 0), 1) AS pct_change
FROM daily_sales;

This is an extremely common reporting/dashboard pattern ("day-over-day," "month-over-month" change) and, before window functions, required a self-join on date = date - 1 (see the self-join question) — much more awkward, and fragile if there are gaps in the dates (a self-join on exact date offsets silently produces NULL/missing rows for any gap, whereas LAG always looks at the actual previous row, regardless of any gap in the underlying dates).
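The gap behavior is easy to demonstrate. In the sketch below (Python's sqlite3 module, SQLite 3.25+), the sample data deliberately has no row for January 3; the self-join on "previous calendar day" loses the comparison for January 4, while LAG compares against the actual previous row. Table contents and the delta alias are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100), ('2024-01-02', 120), ('2024-01-04', 90);
""")

# Pre-window-function approach: join each day to the exact previous calendar day.
self_join = conn.execute("""
    SELECT c.sale_date, c.amount - p.amount AS delta
    FROM daily_sales c
    LEFT JOIN daily_sales p ON p.sale_date = date(c.sale_date, '-1 day')
    ORDER BY c.sale_date
""").fetchall()

# Window approach: compare to whatever row actually came before.
with_lag = conn.execute("""
    SELECT sale_date, amount - LAG(amount) OVER (ORDER BY sale_date) AS delta
    FROM daily_sales
    ORDER BY sale_date
""").fetchall()

print(self_join)  # [('2024-01-01', None), ('2024-01-02', 20), ('2024-01-04', None)]
print(with_lag)   # [('2024-01-01', None), ('2024-01-02', 20), ('2024-01-04', -30)]
```

Which answer is right depends on the question: LAG gives "change since the last recorded day", which is not the same as "change since yesterday" when days are missing.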

Per-partition usage: compare within a group

SELECT
    customer_id, order_date, total,
    LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_date,
    order_date - LAG(order_date) OVER (PARTITION BY customer_id ORDER BY order_date) AS days_since_last_order
FROM orders;

PARTITION BY customer_id ensures LAG only looks at the same customer's previous row, not just the previous row in the whole table — essential whenever the row-to-row comparison should be scoped per group.

Detecting gaps or sequence breaks

SELECT id, prev_id
FROM (
    SELECT id, LAG(id) OVER (ORDER BY id) AS prev_id
    FROM records
) AS with_prev
WHERE id - prev_id > 1;   -- a gap: the IDs between prev_id and id are missing

The filter sits in an outer query because WHERE can't reference a window function directly (see the execution-order question). This pattern finds missing IDs in a sequence, a common data-quality check.
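A runnable version of the gap check, using a CTE as the wrapper and Python's sqlite3 module (SQLite 3.25+). The sample IDs and the first_missing/last_missing output columns are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY);
    INSERT INTO records VALUES (1), (2), (3), (6), (7), (10);
""")

gaps = conn.execute("""
    WITH with_prev AS (
        SELECT id, LAG(id) OVER (ORDER BY id) AS prev_id
        FROM records
    )
    SELECT prev_id + 1 AS first_missing, id - 1 AS last_missing
    FROM with_prev
    WHERE id - prev_id > 1          -- filtering happens outside the window step
    ORDER BY first_missing
""").fetchall()

print(gaps)  # [(4, 5), (8, 9)]
```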

Related Resources

PostgreSQL: Window Functions

A view is not a cached result

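CREATE VIEW expensive_report AS
SELECT c.id, c.name, SUM(o.total) AS total_spent, COUNT(*) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
-- (joining order_items here would repeat each order once per item,
--  inflating both SUM(o.total) and COUNT(*))
GROUP BY c.id, c.name;   -- group by id so customers who share a name aren't merged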

SELECT * FROM expensive_report is functionally identical to pasting the entire SELECT ... GROUP BY inline — the optimizer typically inlines the view's definition into the outer query and optimizes the whole thing together (this is sometimes literally called "view merging"). There's no stored data, no caching — every read pays the full cost of the underlying joins and aggregation. This is the single most common misconception about views: creating one does not make a slow query fast.

Where views genuinely help

Readability/abstraction — hiding a complex join behind a simple name that application code (or analysts) can query without understanding the full underlying schema.
Security — granting access to a view that exposes only certain columns/rows, without granting direct table access:

CREATE VIEW public_employee_directory AS
SELECT name, department, work_email FROM employees;   -- omits salary, ssn, home_address

GRANT SELECT ON public_employee_directory TO reporting_role;

Centralizing business logic — if "active customer" has a specific, non-obvious definition, defining it once in a view avoids every query re-implementing (and potentially getting slightly wrong) the same filter logic.
Backward compatibility during schema migrations — a view can present an old column/table shape backed by a newer underlying schema, buying time to migrate consumers.

Where views can hurt performance

Nested views (a view built on top of another view, built on top of another view) can produce surprisingly complex, hard-to-optimize query plans, since each layer adds more joins/subqueries for the optimizer to reason about — sometimes defeating predicate pushdown that would otherwise let the optimizer filter early.
A view gives a false sense that "the hard part is already solved," leading developers to add further filters/joins on top of an already-expensive view without realizing the full cost still applies underneath.

The real performance tool: materialized views

If the goal is genuinely to avoid re-computing an expensive query on every read, a materialized view (see that question) — not a regular view — is the correct tool, since it actually stores the result physically.

Related Resources

PostgreSQL: CREATE VIEW

Running total

SELECT
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM daily_sales;

sale_date	amount	running_total
Jan 1	100	100
Jan 2	120	220
Jan 3	90	310

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means "every row from the very start of the partition up through the current row", the textbook definition of a running total. Note that adding ORDER BY inside OVER (...) with no explicit frame does not default to this frame: the SQL-standard default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result only when the ORDER BY values are unique. With ties (two rows on the same sale_date), the RANGE default includes all tied rows at once, so each of them shows the same total. Writing the ROWS frame explicitly is clearer and avoids that surprise.

Moving average over a fixed window (e.g., 7-day)

SELECT
    sale_date,
    amount,
    AVG(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7day
FROM daily_sales;

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW means "the current row plus the 6 rows before it": 7 rows total, a rolling 7-period average. Note that this counts rows, not calendar days. If daily_sales has a missing date (no row for that day), the 7-row frame spans more than 7 calendar days. If you need true calendar-based windows regardless of gaps, use RANGE instead of ROWS with an actual interval (RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW); this works only in engines that support RANGE with an interval offset, such as PostgreSQL 11 and later.
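The rows-versus-calendar-days point can be seen on a small scale. The sketch below (Python's sqlite3 module, SQLite 3.25+) uses a 3-row frame instead of 7 to keep the data short, and adds a window_start column, an assumption for this example, showing the earliest date inside each frame. The sample data skips January 3 and 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100), ('2024-01-02', 120),
        ('2024-01-05', 90),  ('2024-01-06', 110);
""")

rows = conn.execute("""
    SELECT sale_date, amount,
           ROUND(AVG(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS moving_avg_3row,
           MIN(sale_date) OVER (
               ORDER BY sale_date
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)     AS window_start
    FROM daily_sales
    ORDER BY sale_date
""").fetchall()

for row in rows:
    print(row)
# ('2024-01-01', 100, 100.0, '2024-01-01')
# ('2024-01-02', 120, 110.0, '2024-01-01')
# ('2024-01-05', 90, 103.33, '2024-01-01')
# ('2024-01-06', 110, 106.67, '2024-01-02')
```

The third row's 3-row frame reaches back to January 1, so it covers five calendar days rather than three.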

Running total per group

SELECT
    customer_id, order_date, amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS customer_running_total
FROM orders;

PARTITION BY customer_id resets the running total independently for each customer — critical when the running calculation should be scoped per group rather than across the whole table.

ROWS vs RANGE — the subtle distinction

ROWS counts physical rows in the frame; RANGE (with ORDER BY) counts logical peer groups — rows with the same ORDER BY value are treated as a single unit, which matters if your ORDER BY column has ties (e.g., multiple sales on the exact same timestamp) and you want them all included or excluded together rather than split arbitrarily by physical row order.
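A short sketch makes the difference concrete (Python's sqlite3 module, SQLite 3.25+). Two sales share the 10:00 timestamp; the table and its values are invented for the example, and the ROWS window adds amount as a tie-breaker so its output is deterministic. The third computed column omits the frame to show which behavior the default follows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sold_at TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('09:00', 10), ('10:00', 20), ('10:00', 30), ('11:00', 40);
""")

rows = conn.execute("""
    SELECT sold_at, amount,
           SUM(amount) OVER (ORDER BY sold_at, amount
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  AS rows_total,
           SUM(amount) OVER (ORDER BY sold_at
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_total,
           SUM(amount) OVER (ORDER BY sold_at)                    AS default_total
    FROM sales
    ORDER BY sold_at, amount
""").fetchall()

for row in rows:
    print(row)
# ('09:00', 10, 10, 10, 10)
# ('10:00', 20, 30, 60, 60)
# ('10:00', 30, 60, 60, 60)
# ('11:00', 40, 100, 100, 100)
```

Under RANGE, both 10:00 rows are peers and see the same total (60); under ROWS, each row gets its own running value. The frameless column matches RANGE.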

Why this beats the pre-window-function alternative

Before window functions, a running total required a correlated subquery re-summing from the start on every row (SELECT SUM(amount) FROM t t2 WHERE t2.date <= t.date). Without help from the optimizer, that is an O(n²) pattern for n rows, since each row triggers its own re-scan. A window function needs only one sort (or an index scan that is already in the right order) followed by a single pass, so O(n log n) or better depending on the engine's implementation. That is a meaningful, measurable performance difference on large tables.
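The two formulations return the same rows, which the sketch below checks on three sample days using Python's sqlite3 module (SQLite 3.25+). It shows equivalence only; the data set is far too small, and is an assumption of this example, to say anything about timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER);
    INSERT INTO daily_sales VALUES
        ('2024-01-01', 100), ('2024-01-02', 120), ('2024-01-03', 90);
""")

# Old style: every outer row re-sums all rows up to its own date.
correlated = conn.execute("""
    SELECT t.sale_date,
           (SELECT SUM(t2.amount) FROM daily_sales t2
             WHERE t2.sale_date <= t.sale_date) AS running_total
    FROM daily_sales t
    ORDER BY t.sale_date
""").fetchall()

# Window style: one ordered pass with an explicit ROWS frame.
windowed = conn.execute("""
    SELECT sale_date,
           SUM(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM daily_sales
    ORDER BY sale_date
""").fetchall()

print(windowed)                # [('2024-01-01', 100), ('2024-01-02', 220), ('2024-01-03', 310)]
print(correlated == windowed)  # True
```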

Related Resources

PostgreSQL: Window Function Frame Clauses