Explain INNER, LEFT, RIGHT, and FULL OUTER JOIN with examples

`INNER JOIN` returns only rows with a match in both tables. `LEFT JOIN` returns every row from the left table plus matching right-table columns (or `NULL` if no match). `RIGHT JOIN` is the mirror image. `FULL OUTER JOIN` returns all rows from both tables, filling in `NULL` wherever a match is missing on either side.

What is a CROSS JOIN, and when would you use one?

A `CROSS JOIN` produces the Cartesian product of two tables — every row of the first paired with every row of the second, with no join condition. It's rarely used to combine unrelated business data (row counts multiply, N×M), but is genuinely useful for generating combinations: date × store, size × color, or a numbers/calendar table joined against real data to fill gaps.

What is a self-join, and what's a practical use case?

A self-join joins a table to itself, using table aliases to distinguish the two "copies." It's the standard way to compare rows within the same table to each other — classic examples are employee/manager hierarchies, finding pairs (duplicate detection), or comparing a row to the previous/next row by some ordering.

EXISTS vs IN vs JOIN — when do you use each?

`JOIN` combines and returns columns from both tables and is right when you need data from both sides. `IN` checks membership against a subquery's result list; it is simple, but its negation is a trap (`NOT IN` returns no rows at all if the list contains a `NULL`), and some older optimizers handled large lists poorly. `EXISTS` checks only for the *presence* of at least one matching row and can stop as soon as it finds one, making it well-suited (and NULL-safe) for filtering on existence without needing any columns from the other table. Modern optimizers often produce the same plan for all three when used equivalently, but the NULL-safety and clarity differences still matter.

How do you find duplicate rows in a table using SQL?

Group by the column(s) that define a "duplicate" and filter with `HAVING COUNT(*) > 1`. To also see or delete the individual duplicate rows (not just the duplicate key), join back to the base table or use a window function like `ROW_NUMBER() OVER (PARTITION BY ...)` and filter to rows where the row number is greater than 1.

How do you find rows in one table that have no match in another (anti-join)?

Use a `LEFT JOIN` and filter for `WHERE right_table.key IS NULL` (a classic "left anti-join"), or use `NOT EXISTS` with a correlated subquery — the latter is generally preferred because it's NULL-safe and its intent ("no matching row exists") is more explicit than a `LEFT JOIN`/`IS NULL` combination.

What happens to NULLs in a join condition?

Join conditions (for `INNER`, `LEFT`, `RIGHT`, and `FULL` joins alike) are almost always written with `=` in the `ON` clause, and `NULL = NULL` evaluates to `UNKNOWN`, not `TRUE`, so rows with `NULL` in the join column never match each other, even if both sides are `NULL`. This differs from `GROUP BY` and `DISTINCT`, which treat all `NULL`s as a single value for grouping purposes. `UNIQUE` constraints follow yet another rule: most engines let a unique column hold many `NULL`s without violating uniqueness.

What is a join explosion (Cartesian-like row multiplication), and how do you avoid it?

A join explosion happens when a join condition matches more rows than expected — most often because you're joining against a table where the "key" isn't actually unique per row of the other side — causing rows to multiply and aggregates (like `SUM`) to be inflated. It's avoided by joining on truly unique keys, aggregating one side *before* joining, or using `DISTINCT`/window functions to deduplicate after the fact.

How would you write a query to find the Nth highest salary?

The most robust modern approach uses `DENSE_RANK()` in a window function and filters for rank = N, which correctly handles ties. (`RANK()` skips ranks after a tie, so there may be no row at rank N at all.) Older approaches use `LIMIT`/`OFFSET` with `DISTINCT` and `ORDER BY DESC`, or a correlated subquery counting how many distinct salaries are greater. Window functions are generally preferred for clarity and correct tie handling.

Joins and Set-Based Querying

Combining rows across tables — join types, anti-joins, and the classic query patterns interviewers love to ask.


Given two tables:

-- customers: id, name
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')

-- orders: id, customer_id, total
-- (101, 1, 50), (102, 1, 75), (103, 2, 20)
-- Note: Carol (id 3) has no orders, and every order references an existing customer
-- (there are no orphan orders, e.g. none with a nonexistent customer_id 9).

INNER JOIN — only matching rows

SELECT c.name, o.total
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;

Result: Alice/50, Alice/75, Bob/20 — Carol is excluded entirely because she has no matching order row.

LEFT JOIN — all of the left table, matches or NULL

SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id;

Result: Alice/50, Alice/75, Bob/20, Carol/NULL — every customer appears at least once, even with no orders.

RIGHT JOIN — all of the right table, matches or NULL

SELECT c.name, o.total
FROM customers c
RIGHT JOIN orders o ON o.customer_id = c.id;

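Result: Alice/50, Alice/75, Bob/20 (the same as the INNER JOIN here, because every order has a matching customer; an order without one would appear as NULL/total).

A RIGHT JOIN is functionally identical to swapping the tables and using LEFT JOIN. Most style guides prefer always writing LEFT JOIN and reordering the FROM/JOIN tables instead, since it reads more consistently across a codebase.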

FULL OUTER JOIN — everything from both sides

SELECT c.name, o.total
FROM customers c
FULL OUTER JOIN orders o ON o.customer_id = c.id;

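Result with the sample data: Alice/50, Alice/75, Bob/20, Carol/NULL. Every customer appears (even unmatched ones, like Carol), and so would every order (even one belonging to a since-deleted customer, if such a row existed); NULL fills in whichever side lacks a match. MySQL has no native FULL OUTER JOIN; it's emulated with LEFT JOIN UNION RIGHT JOIN. Plain UNION also removes duplicate output rows, so when genuinely identical rows must be preserved, the usual alternative is UNION ALL with an anti-join on the second half.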
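
The emulation idea can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module purely as a convenient test harness, adds a hypothetical orphan order (id 104, customer_id 9) that is not in the sample data above, and uses the UNION ALL plus anti-join variant rather than LEFT JOIN UNION RIGHT JOIN, so treat it as one possible formulation rather than the canonical one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    -- Order 104 is a hypothetical orphan: customer 9 does not exist.
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 75), (103, 2, 20), (104, 9, 99);
""")

# First half: every customer, with order totals where they exist.
# Second half: only the orders that found no customer (the anti-join).
full_outer_rows = conn.execute("""
    SELECT c.name, o.total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    UNION ALL
    SELECT NULL, o.total
    FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
""").fetchall()

for row in full_outer_rows:
    print(row)
```

Carol shows up with a NULL total and the orphan order shows up with a NULL name, which is the behavior a native FULL OUTER JOIN would give for this data.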

Picking the right one

Need only rows that exist in both → INNER JOIN.
Need every row from a "primary" table regardless of whether related data exists → LEFT JOIN (e.g., all customers, with order totals if any).
Need a full reconciliation between two sets, including orphans on either side → FULL OUTER JOIN (common in data-quality/migration checks).

Related Resources

PostgreSQL: Joins Between Tables


SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;

If sizes has 4 rows and colors has 6 rows, this returns 24 rows — every combination, with no matching condition at all.

Legitimate use cases

Generating a full combination matrix — e.g., every product variant (size × color) that could exist, even before any have been created:

SELECT p.product_id, s.size, c.color
FROM products p
CROSS JOIN sizes s
CROSS JOIN colors c;

Filling gaps with a calendar/numbers table — a very common reporting pattern. Suppose you want a row for every day in a range, even days with zero sales:

SELECT d.day, st.id AS store_id, COALESCE(SUM(s.amount), 0) AS total_sales
FROM generate_series('2024-01-01'::date, '2024-01-31'::date, '1 day') AS d(day)
CROSS JOIN stores st
LEFT JOIN sales s ON s.sale_date = d.day AND s.store_id = st.id
GROUP BY d.day, st.id;

This cross-joins a generated date series against every store, then left-joins actual sales — guaranteeing every (day, store) pair appears in the output, with zero-sales days showing 0 instead of being silently missing.
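
generate_series is PostgreSQL-specific. As a rough sketch of the same gap-filling pattern on an engine without it, the example below builds the calendar with a recursive CTE in SQLite, driven from Python's sqlite3 module. The three-day range, the two stores, and the sales rows are all made-up illustration data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (id INTEGER PRIMARY KEY);
    CREATE TABLE sales (store_id INTEGER, sale_date TEXT, amount INTEGER);
    INSERT INTO stores VALUES (1), (2);
    INSERT INTO sales VALUES
        (1, '2024-01-01', 10), (1, '2024-01-01', 5), (2, '2024-01-03', 7);
""")

# The recursive CTE plays the role of the calendar table: one row per day.
gap_filled_rows = conn.execute("""
    WITH RECURSIVE days(day) AS (
        SELECT '2024-01-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < '2024-01-03'
    )
    SELECT d.day, st.id, COALESCE(SUM(s.amount), 0) AS total_sales
    FROM days d
    CROSS JOIN stores st
    LEFT JOIN sales s ON s.sale_date = d.day AND s.store_id = st.id
    GROUP BY d.day, st.id
    ORDER BY d.day, st.id
""").fetchall()

for row in gap_filled_rows:
    print(row)
```

Three days times two stores gives six output rows, four of them with a total of 0 that a plain GROUP BY over sales alone would have omitted.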

The accidental CROSS JOIN

The more important interview point: an accidental Cartesian product from a missing or wrong join condition is one of the most common real-world SQL bugs:

-- BUG: no ON clause and old-style comma syntax with no WHERE — this is
-- effectively a CROSS JOIN, silently multiplying every order by every customer.
SELECT o.id, c.name
FROM orders o, customers c;

If orders has 10,000 rows and customers has 5,000, that's 50 million rows, usually surfacing as a query that "hangs" or a report with wildly inflated totals rather than an obvious error. Always double-check that every table in a join has an explicit, correct join condition.

Related Resources

PostgreSQL: CROSS JOIN


A self-join is just a regular join where both sides of the FROM/JOIN reference the same table, distinguished by aliases.

Classic example: employee/manager hierarchy

-- employees: id, name, manager_id (references employees.id)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

LEFT JOIN (rather than INNER JOIN) matters here so the CEO — who has no manager, i.e. manager_id IS NULL — still appears in the result, with manager as NULL instead of being dropped.

Finding duplicate rows

SELECT a.id, b.id, a.email
FROM users a
JOIN users b ON a.email = b.email AND a.id < b.id;

The a.id < b.id condition both prevents matching a row with itself (a.id = b.id) and prevents each duplicate pair from showing up twice (once as a, b and once as b, a).

Comparing adjacent rows (before window functions)

-- For each day, show today's sales vs. yesterday's sales, per store
SELECT today.store_id, today.sale_date, today.amount, yesterday.amount AS prev_day_amount
FROM sales today
LEFT JOIN sales yesterday
    ON yesterday.store_id = today.store_id
    AND yesterday.sale_date = today.sale_date - INTERVAL '1 day';

This is a real, still-valid pattern, though modern SQL usually solves it more cleanly with the LAG() window function (see the window functions topic) — the self-join version is worth knowing because not every engine/version supports window functions, and it demonstrates the underlying relational logic explicitly.
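
To make the comparison concrete, here is a small sketch that runs both versions side by side in SQLite via Python's sqlite3 module (window functions need SQLite 3.25 or newer). The data is hypothetical and deliberately has a missing day, because the two versions are not strictly equivalent: the self-join looks for the previous calendar day, while LAG() takes the previous row that exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store_id INTEGER, sale_date TEXT, amount INTEGER);
    -- Hypothetical data: store 1 has no row for 2024-01-03.
    INSERT INTO sales VALUES
        (1, '2024-01-01', 100), (1, '2024-01-02', 120), (1, '2024-01-04', 90);
""")

# Self-join: match each row to the row dated exactly one day earlier.
self_join_rows = conn.execute("""
    SELECT today.sale_date, today.amount, yesterday.amount
    FROM sales today
    LEFT JOIN sales yesterday
        ON yesterday.store_id = today.store_id
        AND yesterday.sale_date = date(today.sale_date, '-1 day')
    ORDER BY today.sale_date
""").fetchall()

# LAG(): take the previous row per store in date order, whatever its date.
lag_rows = conn.execute("""
    SELECT sale_date, amount,
           LAG(amount) OVER (PARTITION BY store_id ORDER BY sale_date)
    FROM sales
    ORDER BY sale_date
""").fetchall()

print(self_join_rows)
print(lag_rows)
```

For 2024-01-04 the self-join reports NULL (there is no row for the 3rd), while LAG() reports 120 from the 2nd. Which answer is right depends on whether "previous" means the previous day or the previous recorded sale.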

Key mechanics

Always alias both references to the table — referencing an unaliased self-joined table is ambiguous and will error.
Watch for whether you want INNER or LEFT JOIN — a self-join on a nullable foreign key (like manager_id) needs LEFT JOIN to avoid silently dropping rows with no self-reference.

Related Resources

PostgreSQL: Table Aliases


All three can express "give me rows in A related to rows in B," but they answer different underlying questions.

JOIN — when you need columns from both tables

SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id;

If the goal is to display or aggregate data from both tables, JOIN is the natural tool — IN/EXISTS can't return columns from the subquery side.

IN — membership against a list

SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);

Simple and readable for "customers who have at least one qualifying order," when you don't need any order columns in the output. Danger: if the subquery can return a NULL (e.g., customer_id is nullable and some order rows have it unset) and you're using NOT IN, the entire result silently becomes empty due to three-valued logic (see the NULL question) — this is one of the most common real-world SQL correctness bugs.

EXISTS — pure existence check, NULL-safe

SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.total > 1000
);

EXISTS only cares whether the subquery returns any row — the SELECT 1 is idiomatic because the actual selected value is irrelevant. It's NULL-safe by construction and, critically, the safe choice for negation:

-- Safe even if orders.customer_id can be NULL
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Unsafe: silently returns zero rows if any orders.customer_id is NULL
SELECT * FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);
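
The difference between the two negations is easy to reproduce. The sketch below uses Python's sqlite3 module with made-up data in which one order has a NULL customer_id; the table contents are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    -- Order 102 has no customer recorded (NULL).
    INSERT INTO orders VALUES (101, 1), (102, NULL);
""")

not_exists_names = [name for (name,) in conn.execute("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.id
""")]

not_in_names = [name for (name,) in conn.execute("""
    SELECT c.name FROM customers c
    WHERE c.id NOT IN (SELECT customer_id FROM orders)
    ORDER BY c.id
""")]

print(not_exists_names)  # ['Bob', 'Carol']
print(not_in_names)      # []
```

With the NULL present, `c.id NOT IN (1, NULL)` is never TRUE: for Bob and Carol the comparison against NULL is UNKNOWN, so the whole predicate is UNKNOWN and the row is filtered out.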

Duplication behavior

A JOIN can multiply output rows if a customer has multiple matching orders (one output row per match), which then requires DISTINCT or aggregation to get back to one row per customer. IN/EXISTS never duplicate the outer row regardless of how many subquery rows match, since they're boolean filters, not row-multiplying joins — this is often the deciding factor when you just need a yes/no filter, not order details.
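
A quick way to see the duplication difference is to run both forms against the sample customers and orders tables from the first question. This sketch does that with Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 75), (103, 2, 20);
""")

# JOIN: one output row per matching order, so Alice appears twice.
join_names = [name for (name,) in conn.execute("""
    SELECT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""")]

# EXISTS: a yes/no filter, so each customer appears at most once.
exists_names = [name for (name,) in conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.id
""")]

print(join_names)    # ['Alice', 'Alice', 'Bob']
print(exists_names)  # ['Alice', 'Bob']
```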

Need columns from the related table → JOIN.
Need a plain existence/membership filter and the subquery result can't contain NULL, or you're doing a positive check → IN is fine and often reads well for short static lists.
Need a negative existence check (NOT ...), or the subquery column might contain NULL → always use NOT EXISTS, never NOT IN.

Related Resources

PostgreSQL: Subquery Expressions


Step 1: find which values are duplicated

SELECT email, COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

This tells you which emails are duplicated and how many times, but not the individual row IDs.

Step 2 (if you need the actual duplicate rows): join back

SELECT u.*
FROM users u
JOIN (
    SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
) dupes ON dupes.email = u.email
ORDER BY u.email;

A cleaner approach with window functions: identify which copy to keep

The most common real task isn't just "find duplicates" — it's "keep one copy and delete the rest." ROW_NUMBER() handles both:

SELECT id, email, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users;
-- rn = 1 is the row you'd keep (lowest id) per email; rn > 1 are duplicates

Deleting the duplicates, keeping the earliest row per email:

DELETE FROM users
WHERE id IN (
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
        FROM users
    ) t
    WHERE rn > 1
);

(Some engines, like PostgreSQL, let you write this more directly with a CTE and DELETE ... USING.)

Preventing future duplicates

Finding and cleaning duplicates is a one-time fix — the durable fix is a UNIQUE constraint (or unique index) on the column(s) that must not repeat:

ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);

Note you generally have to clean up existing duplicates before this constraint can be added, since the database will refuse to create a unique index over data that already violates it.
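
The ordering constraint (clean up first, then add uniqueness) can be checked with a small sketch. It uses SQLite through Python's sqlite3 module, so it creates a unique index instead of running ALTER TABLE ... ADD CONSTRAINT, which SQLite does not support; the users rows are invented for illustration, and the window function needs SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users VALUES
        (1, 'a@example.com'), (2, 'b@example.com'),
        (3, 'a@example.com'), (4, 'a@example.com');
""")

# Attempt 1: the data still contains duplicates, so the index is refused.
try:
    conn.execute("CREATE UNIQUE INDEX uq_users_email ON users (email)")
    index_created_before_cleanup = True
except sqlite3.DatabaseError:
    index_created_before_cleanup = False

# Cleanup: keep the lowest id per email, delete the rest.
conn.execute("""
    DELETE FROM users
    WHERE id IN (
        SELECT id FROM (
            SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
            FROM users
        ) t
        WHERE rn > 1
    )
""")

# Attempt 2: no duplicates remain, so the index can be created.
conn.execute("CREATE UNIQUE INDEX uq_users_email ON users (email)")
remaining_users = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()

print(index_created_before_cleanup)
print(remaining_users)
```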

Related Resources

PostgreSQL: Window Functions


An anti-join answers "give me rows from A that have no corresponding row in B" — e.g., customers who have never placed an order.

Approach 1: LEFT JOIN + IS NULL

SELECT c.*
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;

The LEFT JOIN keeps every customer even without a match, filling unmatched columns with NULL; filtering WHERE o.id IS NULL then keeps only the customers where no match occurred.

Approach 2: NOT EXISTS (generally preferred)

SELECT c.*
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

Why NOT EXISTS is usually the better choice

Clarity of intent — NOT EXISTS directly states "no matching row exists"; LEFT JOIN ... IS NULL requires the reader to infer the anti-join pattern from a filter condition on an outer-joined column.
NULL-safety — if you instead reached for NOT IN (SELECT customer_id FROM orders), a single NULL in orders.customer_id silently breaks the whole query (see the NULL/three-valued-logic question); NOT EXISTS has no such trap.
No risk of picking the wrong column to check — with LEFT JOIN ... IS NULL, you must check a column that's guaranteed NOT NULL on the right side (like its primary key); accidentally checking a nullable right-side column produces wrong results when a match exists but that particular column happens to be NULL.

Most modern optimizers (PostgreSQL, SQL Server) recognize both patterns and produce the same anti-join execution plan (typically a hash anti-join or nested-loop anti-join), so performance is usually a non-issue — the choice comes down to correctness and readability, and NOT EXISTS wins on both.
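
The wrong-column pitfall is easiest to see with data. In this sketch (Python's sqlite3 module), orders has a hypothetical nullable shipped_at column that is not part of the sample schema above; Bob has an order that simply has not shipped yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    -- shipped_at is a hypothetical nullable column added for this example.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, shipped_at TEXT);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, '2024-01-05'), (102, 2, NULL);
""")

def names(sql):
    """Run a single-column query and return its values as a list."""
    return [name for (name,) in conn.execute(sql)]

# Correct: test the right side's primary key, which is never NULL on a match.
by_primary_key = names("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.id IS NULL ORDER BY c.id
""")

# Wrong: shipped_at can be NULL even when an order matched.
by_nullable_column = names("""
    SELECT c.name FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.shipped_at IS NULL ORDER BY c.id
""")

by_not_exists = names("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.id
""")

print(by_primary_key)      # ['Carol']
print(by_nullable_column)  # ['Bob', 'Carol']
print(by_not_exists)       # ['Carol']
```

Bob is wrongly reported as having no orders in the second query, while NOT EXISTS has no column to get wrong.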

Related Resources

Use the Index, Luke: Anti-Joins


-- orders: id, customer_id
-- (1, 5), (2, NULL), (3, NULL)

SELECT * FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
WHERE a.id <> b.id;

You might expect rows 2 and 3 to match each other since they both have customer_id = NULL — but they don't, because NULL = NULL is UNKNOWN, and ON (like WHERE) only keeps rows where the condition evaluates to TRUE.

Why this matters in practice

This is the correct, standard SQL behavior, and it's usually what you want — two "unknown" values shouldn't be treated as definitely equal. But it surprises people who expect join semantics to match GROUP BY/DISTINCT semantics, where multiple NULLs in a grouping column are, by convention, grouped together as if equal:

SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
-- all NULL customer_id rows land in ONE group together

Similarly, most engines' UNIQUE constraints treat multiple NULLs as not violating uniqueness (i.e., you can insert many rows with NULL in a unique column) — again, a different rule than the join comparison.
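
The three different NULL rules can be observed side by side. This sketch runs them in SQLite through Python's sqlite3 module, using the orders rows from above plus a hypothetical coupons table with a unique column; other engines may differ on the UNIQUE rule, as noted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 5), (2, NULL), (3, NULL);
    -- Hypothetical table: a UNIQUE column that receives two NULLs.
    CREATE TABLE coupons (code TEXT UNIQUE);
    INSERT INTO coupons VALUES (NULL), (NULL);
""")

# Join with "=": the two NULL rows never match each other.
join_pairs = conn.execute("""
    SELECT a.id, b.id FROM orders a
    JOIN orders b ON a.customer_id = b.customer_id
    WHERE a.id <> b.id
""").fetchall()

# GROUP BY: both NULL rows land in a single group.
groups = conn.execute("""
    SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id
""").fetchall()

# UNIQUE (in SQLite): both NULL inserts above were accepted.
null_coupon_count = conn.execute(
    "SELECT COUNT(*) FROM coupons WHERE code IS NULL"
).fetchone()[0]

print(join_pairs)         # []
print(sorted(groups, key=repr))
print(null_coupon_count)  # 2
```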

If you actually need NULL = NULL to match

Use IS NOT DISTINCT FROM (standard SQL, supported by PostgreSQL and others) instead of =:

SELECT * FROM a
JOIN b ON a.key IS NOT DISTINCT FROM b.key;

IS [NOT] DISTINCT FROM is a NULL-safe equality comparison — NULL IS NOT DISTINCT FROM NULL evaluates to TRUE. MySQL has an equivalent NULL-safe operator, <=>. This is important for tables where a foreign key column can legitimately be NULL and you actually want those rows to match each other during a join (rare, but it does come up in data-reconciliation queries).
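
As a hedged illustration of a NULL-safe join, the sketch below uses SQLite via Python's sqlite3 module. SQLite spells NULL-safe equality as the binary IS operator (it also accepts IS NOT DISTINCT FROM from version 3.39 on), so IS is used here to keep the example runnable on older builds; in PostgreSQL the ON clause would read `a.customer_id IS NOT DISTINCT FROM b.customer_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 5), (2, NULL), (3, NULL);
""")

# Ordinary equality: NULL = NULL is UNKNOWN, so no pair is produced.
pairs_with_equals = conn.execute("""
    SELECT a.id, b.id FROM orders a
    JOIN orders b ON a.customer_id = b.customer_id
    WHERE a.id < b.id
""").fetchall()

# NULL-safe equality: the two NULL rows now match each other.
pairs_null_safe = conn.execute("""
    SELECT a.id, b.id FROM orders a
    JOIN orders b ON a.customer_id IS b.customer_id
    WHERE a.id < b.id
""").fetchall()

print(pairs_with_equals)  # []
print(pairs_null_safe)    # [(2, 3)]
```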

Related Resources

PostgreSQL: IS DISTINCT FROM


The classic real-world bug: you join orders to order_items to compute a per-order total, but then also join to order_notes (also one-to-many) in the same query — and every order's line items get multiplied by its note count.

-- orders (1 row per order) x order_items (many rows per order)
--        x order_notes (many rows per order)
SELECT o.id, SUM(oi.price)
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN order_notes  n ON n.order_id  = o.id     -- BUG: multiplies rows
GROUP BY o.id;

If an order has 3 line items and 2 notes, this produces 3 × 2 = 6 rows for that order before the GROUP BY/SUM even runs, so SUM(oi.price) counts each item's price twice (once per note) and the total is doubled; with 3 notes it would be tripled. This is far more dangerous than an obvious CROSS JOIN because the query looks correct and returns plausible-looking numbers.
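
The arithmetic can be checked directly. This sketch builds exactly that case (one order, 3 items, 2 notes) in SQLite through Python's sqlite3 module; the prices and note texts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, price INTEGER);
    CREATE TABLE order_notes (id INTEGER PRIMARY KEY, order_id INTEGER, note TEXT);
    INSERT INTO orders VALUES (1);
    INSERT INTO order_items VALUES (1, 1, 10), (2, 1, 20), (3, 1, 30);
    INSERT INTO order_notes VALUES (1, 1, 'gift wrap'), (2, 1, 'call first');
""")

# Both one-to-many joins in one query: 3 items x 2 notes = 6 joined rows.
joined_row_count, inflated_total = conn.execute("""
    SELECT COUNT(*), SUM(oi.price)
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    JOIN order_notes n ON n.order_id = o.id
    WHERE o.id = 1
""").fetchone()

# Join-free reference query on the items table alone.
true_total = conn.execute(
    "SELECT SUM(price) FROM order_items WHERE order_id = 1"
).fetchone()[0]

print(joined_row_count, inflated_total, true_total)  # 6 120 60
```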

Fixes

1. Aggregate each one-to-many side independently before joining:

SELECT o.id,
       items.total_price,
       notes.note_count
FROM orders o
LEFT JOIN (SELECT order_id, SUM(price) AS total_price FROM order_items GROUP BY order_id) items
    ON items.order_id = o.id
LEFT JOIN (SELECT order_id, COUNT(*) AS note_count FROM order_notes GROUP BY order_id) notes
    ON notes.order_id = o.id;

2. Use COUNT(DISTINCT ...) as a (partial) safety net — this fixes counting but not SUM, so it's not a general solution:

SELECT o.id, COUNT(DISTINCT oi.id) AS item_count, COUNT(DISTINCT n.id) AS note_count
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN order_notes  n ON n.order_id  = o.id
GROUP BY o.id;

3. Split into separate queries when the two one-to-many relationships genuinely don't need to be correlated in a single result row.
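
Fixes 1 and 2 can be compared on the same small dataset. The sketch below (Python's sqlite3 module, made-up rows: one order with 3 items totalling 60 and 2 notes) shows that COUNT(DISTINCT ...) repairs the counts while SUM stays inflated, whereas pre-aggregating each side gives the right numbers for both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, price INTEGER);
    CREATE TABLE order_notes (id INTEGER PRIMARY KEY, order_id INTEGER, note TEXT);
    INSERT INTO orders VALUES (1);
    INSERT INTO order_items VALUES (1, 1, 10), (2, 1, 20), (3, 1, 30);
    INSERT INTO order_notes VALUES (1, 1, 'gift wrap'), (2, 1, 'call first');
""")

# Fix 2: DISTINCT counts are correct, but the SUM is still doubled.
count_distinct_row = conn.execute("""
    SELECT o.id, COUNT(DISTINCT oi.id), COUNT(DISTINCT n.id), SUM(oi.price)
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    JOIN order_notes n ON n.order_id = o.id
    GROUP BY o.id
""").fetchone()

# Fix 1: aggregate each one-to-many side first, then join 1:1.
pre_aggregated_row = conn.execute("""
    SELECT o.id, items.total_price, notes.note_count
    FROM orders o
    LEFT JOIN (SELECT order_id, SUM(price) AS total_price
               FROM order_items GROUP BY order_id) items
        ON items.order_id = o.id
    LEFT JOIN (SELECT order_id, COUNT(*) AS note_count
               FROM order_notes GROUP BY order_id) notes
        ON notes.order_id = o.id
""").fetchone()

print(count_distinct_row)   # (1, 3, 2, 120)
print(pre_aggregated_row)   # (1, 60, 2)
```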

The general rule

Before joining, ask: "is this join guaranteed to match at most one row?" Unless every join in the query matches at most one row per row on the other side (or each "many" side is aggregated first), any SUM/COUNT/AVG computed after multiple one-to-many joins is suspect and should be double-checked against a simpler, join-free query on a small sample.

Related Resources

Use the Index, Luke: Join Operations


This is one of the most frequently asked "classic" SQL interview questions, largely because there are several valid approaches with different tie-handling behavior.

Approach 1: DENSE_RANK() — handles ties correctly (recommended)

SELECT DISTINCT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 3;   -- 3rd highest *distinct* salary

DENSE_RANK() assigns the same rank to tied values and doesn't skip ranks afterward — so if two employees are tied for 2nd highest, the next distinct salary is still rank 3 (unlike RANK(), which would skip to rank 4).
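
The difference between the two ranking functions shows up as soon as there is a tie. This sketch uses SQLite (3.25 or newer for window functions) through Python's sqlite3 module, with four hypothetical employees where two are tied for 2nd place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ann', 100), ('Ben', 90), ('Cat', 90), ('Dan', 80);
""")

ranked = conn.execute("""
    SELECT name, salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk,
           RANK()       OVER (ORDER BY salary DESC) AS rnk
    FROM employees
    ORDER BY salary DESC, name
""").fetchall()

def third_highest(rank_function):
    """Return distinct salaries whose rank is 3 under the given function."""
    return conn.execute(f"""
        SELECT DISTINCT salary FROM (
            SELECT salary, {rank_function}() OVER (ORDER BY salary DESC) AS r
            FROM employees
        ) ranked
        WHERE r = 3
    """).fetchall()

print(ranked)
print(third_highest("DENSE_RANK"))  # [(80,)]
print(third_highest("RANK"))        # []
```

With RANK() the tied pair consumes ranks 2 and 3, so Dan lands on rank 4 and a filter on rank 3 finds nothing.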

Approach 2: OFFSET/FETCH with DISTINCT

SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
OFFSET 2 ROWS FETCH NEXT 1 ROWS ONLY;   -- skip top 2, take the next 1 = 3rd highest

(In MySQL/PostgreSQL, the equivalent is LIMIT 1 OFFSET 2.) This works but is less explicit about tie handling, and OFFSET-based pagination degrades in performance on large tables since the engine still has to sort/scan through the skipped rows.

Approach 3: correlated subquery (portable to older engines without window functions)

SELECT DISTINCT salary
FROM employees e1
WHERE 2 = (
    SELECT COUNT(DISTINCT salary) FROM employees e2 WHERE e2.salary > e1.salary
);

This reads as "salary such that exactly 2 distinct salaries are greater than it" — i.e., the 3rd highest.

Why DISTINCT matters in every version

Without DISTINCT (or DENSE_RANK's tie-aware grouping), the "2nd highest salary" query using plain LIMIT 1 OFFSET 1 would return a duplicate of the highest salary if two employees are tied for 1st — a subtle bug that only shows up with tied data, making it worth explicitly discussing tie-handling behavior in an interview rather than assuming any one approach is obviously correct.
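
The tie bug is simple to reproduce. In this sketch (Python's sqlite3 module, hypothetical data with two employees tied for the top salary), the "2nd highest salary" is asked for three ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES ('Ann', 100), ('Ben', 100), ('Cat', 90);
""")

# Plain LIMIT/OFFSET: skips one *row*, not one distinct salary.
without_distinct = conn.execute("""
    SELECT salary FROM employees
    ORDER BY salary DESC LIMIT 1 OFFSET 1
""").fetchall()

# DISTINCT first, then skip: skips one distinct salary.
with_distinct = conn.execute("""
    SELECT DISTINCT salary FROM employees
    ORDER BY salary DESC LIMIT 1 OFFSET 1
""").fetchall()

# Correlated subquery: exactly one distinct salary is greater.
correlated = conn.execute("""
    SELECT DISTINCT salary FROM employees e1
    WHERE 1 = (
        SELECT COUNT(DISTINCT salary) FROM employees e2
        WHERE e2.salary > e1.salary
    )
""").fetchall()

print(without_distinct)  # [(100,)]
print(with_distinct)     # [(90,)]
print(correlated)        # [(90,)]
```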

Related Resources

PostgreSQL: Window Functions