SQL Fundamentals and Query Basics

Difficulty

SQL is split into sub-languages by what kind of change a statement makes, and that split matters in practice because it determines things like transaction behavior, required privileges, and whether an operation can be rolled back.

The four categories

CategoryFull nameExample statementsWhat it affects
DDLData Definition LanguageCREATE TABLE, ALTER TABLE, DROP TABLE, TRUNCATESchema/structure (tables, indexes, constraints)
DMLData Manipulation LanguageSELECT, INSERT, UPDATE, DELETERow-level data
DCLData Control LanguageGRANT, REVOKEPermissions and access control
TCLTransaction Control LanguageCOMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTIONTransaction boundaries
-- DDL: defines structure
CREATE TABLE accounts (
    id SERIAL PRIMARY KEY,
    balance NUMERIC(12,2) NOT NULL DEFAULT 0
);

-- DML: manipulates rows
INSERT INTO accounts (balance) VALUES (100.00);
UPDATE accounts SET balance = balance - 50 WHERE id = 1;

-- DCL: controls access
GRANT SELECT, INSERT ON accounts TO app_user;

-- TCL: controls the transaction
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
COMMIT;

Why the distinction matters

Most production databases auto-commit DDL (or even implicitly commit any open transaction before running it) — running ALTER TABLE mid-transaction in MySQL, for example, causes an implicit commit, so you can't roll a schema change back the way you can an UPDATE. PostgreSQL is a notable exception: it supports transactional DDL, so a CREATE TABLE inside a BEGIN...ROLLBACK block really does disappear.

DCL statements are also typically not transactional in the same sense — permission changes often take effect immediately and aren't undone by ROLLBACK in many engines. Knowing which bucket a statement falls into tells you whether you can safely wrap it in a transaction for an atomic migration, or whether you need a different rollback strategy (e.g., a paired "down" migration script).

You write a SELECT statement top to bottom, but the database evaluates it in a different logical order. Understanding that order explains several rules that otherwise look arbitrary.

The logical order

1. FROM / JOIN     -- build the base row set
2. WHERE           -- filter individual rows
3. GROUP BY        -- bucket remaining rows into groups
4. HAVING          -- filter groups
5. SELECT          -- compute output expressions (incl. window functions)
6. DISTINCT        -- remove duplicate output rows
7. ORDER BY        -- sort the result
8. LIMIT / OFFSET  -- take a slice

Note this is the logical order — real optimizers reorder physical execution (e.g., pushing a WHERE predicate down before a join, or picking an index that satisfies ORDER BY for free) as long as the observable result is identical.

Why this explains common gotchas

You can't use a SELECT alias in WHERE:

-- Fails: "total" doesn't exist yet when WHERE is evaluated
SELECT price * quantity AS total FROM orders WHERE total > 100;

-- Works: recompute the expression, or move the filter to HAVING with a subquery/CTE
SELECT * FROM (
    SELECT price * quantity AS total FROM orders
) t WHERE total > 100;

But you can use it in ORDER BY — by the time ORDER BY runs, SELECT has already executed and the alias exists:

SELECT price * quantity AS total FROM orders ORDER BY total DESC;

WHERE can't filter on aggregates, HAVING can:

-- Fails: WHERE runs before GROUP BY, so COUNT(*) doesn't exist yet
SELECT customer_id, COUNT(*) FROM orders WHERE COUNT(*) > 5 GROUP BY customer_id;

-- Correct: HAVING runs after grouping
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id HAVING COUNT(*) > 5;

Window functions see SELECT-time data, not raw rows — they run after WHERE/GROUP BY/HAVING but before DISTINCT/ORDER BY/LIMIT, which is why you generally can't reference a window function's result directly in the same SELECT's WHERE clause (you need to wrap it in a subquery or CTE and filter in the outer query instead).

Both clauses filter rows, but at different stages of query processing (see the logical execution order: FROM → WHERE → GROUP BY → HAVING → SELECT).

WHERE: filters rows before grouping

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'   -- filters individual employee rows first
GROUP BY department;

HAVING: filters groups after aggregation

SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 75000;       -- filters the resulting department groups

Combining both

SELECT department, COUNT(*) AS headcount
FROM employees
WHERE status = 'active'           -- row-level filter first
GROUP BY department
HAVING COUNT(*) >= 10;            -- group-level filter second

The rule of thumb

  • Use WHERE for conditions on raw column values that exist before grouping.
  • Use HAVING for conditions on the result of an aggregate function.
  • Trying to put an aggregate condition in WHERE (WHERE COUNT(*) > 5) fails, because COUNT(*) doesn't exist until GROUP BY has executed.
  • Filtering earlier with WHERE is almost always more efficient than filtering later with HAVING, because WHERE reduces the row set before the (often expensive) grouping and aggregation work happens. Never use HAVING for a condition that could be expressed in WHERE.

Related Resources

NULL represents "unknown" or "missing," not a value like zero or an empty string — and that distinction drives SQL's three-valued logic.

The three truth values

Any predicate evaluates to TRUE, FALSE, or UNKNOWN. NULL compared to anything — including another NULL — produces UNKNOWN:

SELECT NULL = NULL;      -- UNKNOWN (not TRUE!)
SELECT NULL <> NULL;     -- UNKNOWN
SELECT 5 = NULL;         -- UNKNOWN
SELECT 5 <> NULL;        -- UNKNOWN

WHERE and HAVING only keep rows where the condition is TRUEUNKNOWN is treated the same as FALSE for filtering purposes (but differently for NOT, see below).

Testing for NULL correctly

-- Wrong: always UNKNOWN, matches nothing
SELECT * FROM users WHERE middle_name = NULL;

-- Correct
SELECT * FROM users WHERE middle_name IS NULL;
SELECT * FROM users WHERE middle_name IS NOT NULL;

The classic NOT IN trap

This is the single most common NULL bug in production SQL:

-- If banned_ids contains even one NULL, this returns ZERO rows,
-- because "x NOT IN (1, 2, NULL)" expands to
-- "x <> 1 AND x <> 2 AND x <> NULL", and the last comparison is UNKNOWN,
-- which poisons the whole AND chain to UNKNOWN.
SELECT * FROM users WHERE id NOT IN (SELECT banned_id FROM bans);

Fix it with NOT EXISTS (which handles NULLs correctly) or by filtering NULLs out of the subquery explicitly:

SELECT * FROM users u
WHERE NOT EXISTS (SELECT 1 FROM bans b WHERE b.banned_id = u.id);

Useful NULL-handling functions

COALESCE(a, b, c)     -- returns the first non-NULL argument
NULLIF(a, b)          -- returns NULL if a = b, otherwise returns a
-- Avoid division by zero producing an error, return NULL instead
SELECT revenue / NULLIF(units_sold, 0) AS avg_price FROM sales;

-- Provide a default for a possibly-NULL column
SELECT COALESCE(nickname, first_name) AS display_name FROM users;

Also remember that aggregate functions like SUM, AVG, COUNT(column) all ignore NULLs silently — COUNT(*) counts rows, but COUNT(column) counts only non-NULL values in that column, which is a frequent source of off-by-some-amount bugs.

All three remove data or structure, but at very different granularities and costs.

DELETETRUNCATEDROP
RemovesMatching rows (or all rows)All rowsThe whole table object
Can filter with WHEREYesNoN/A
SpeedSlow for large tables (row-by-row logging)Fast (deallocates pages)Fast
Fires triggersYesUsually noNo
Resets identity/auto-incrementNoUsually yesN/A (table gone)
Transactional / rollback-ableYes, fullyDepends on engine (PostgreSQL: yes; MySQL/InnoDB: implicit commit)Depends on engine (same caveat)
Table structure afterwardUnchanged, empty or filteredUnchanged, emptyTable no longer exists
CategoryDMLDDL (in most engines)DDL
-- DELETE: row-level, filterable, fully logged
DELETE FROM orders WHERE status = 'cancelled';

-- TRUNCATE: removes everything, resets the table's storage
TRUNCATE TABLE orders;

-- DROP: the table itself is gone
DROP TABLE orders;

Why TRUNCATE is faster

DELETE scans and removes rows individually, writing an entry to the transaction/redo log per row (or per page) so it can be rolled back and so triggers can fire per row. TRUNCATE instead deallocates the data pages that back the table wholesale — it's closer to a DDL operation than a DML one, which is why many engines treat it as non-transactional or auto-committing.

  • Need to remove a subset of rows, want triggers to fire, or need it fully rollback-able mid-transaction → DELETE.
  • Need to empty a whole table fast (e.g., clearing a staging table between ETL runs) and don't need row-level rollback → TRUNCATE.
  • Need to remove the table definition entirely, including its indexes and constraints → DROP.
  • Be careful with TRUNCATE on tables referenced by foreign keys — most engines refuse to truncate a table that's the target of an active foreign key from another table unless you cascade or disable the constraint first.

Related Resources