Joins and Set-Based Querying

Difficulty

Given two tables:

-- customers: id, name
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')

-- orders: id, customer_id, total
-- (101, 1, 50), (102, 1, 75), (103, 2, 20)
-- Note: Carol (id 3) has no orders, and there's no order for a nonexistent customer_id 9.

INNER JOIN — only matching rows

SELECT c.name, o.total
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;

Result: Alice/50, Alice/75, Bob/20 — Carol is excluded entirely because she has no matching order row.

LEFT JOIN — all of the left table, matches or NULL

SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id;

Result: Alice/50, Alice/75, Bob/20, Carol/NULL — every customer appears at least once, even with no orders.

RIGHT JOIN — all of the right table, matches or NULL

SELECT c.name, o.total
FROM customers c
RIGHT JOIN orders o ON o.customer_id = c.id;

Functionally identical to swapping the tables and using LEFT JOIN. Most style guides prefer always writing LEFT JOIN and reordering the FROM/JOIN tables instead, since it reads more consistently across a codebase.

FULL OUTER JOIN — everything from both sides

SELECT c.name, o.total
FROM customers c
FULL OUTER JOIN orders o ON o.customer_id = c.id;

Result: every customer (even unmatched, like Carol) and every order (even one belonging to a since-deleted customer, if such a row existed) — NULL fills in whichever side lacks a match. MySQL has no native FULL OUTER JOIN; it's emulated with LEFT JOIN UNION RIGHT JOIN.

Picking the right one

  • Need only rows that exist in both → INNER JOIN.
  • Need every row from a "primary" table regardless of whether related data exists → LEFT JOIN (e.g., all customers, with order totals if any).
  • Need a full reconciliation between two sets, including orphans on either side → FULL OUTER JOIN (common in data-quality/migration checks).
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;

If sizes has 4 rows and colors has 6 rows, this returns 24 rows — every combination, with no matching condition at all.

Legitimate use cases

Generating a full combination matrix — e.g., every product variant (size × color) that could exist, even before any have been created:

SELECT p.product_id, s.size, c.color
FROM products p
CROSS JOIN sizes s
CROSS JOIN colors c;

Filling gaps with a calendar/numbers table — a very common reporting pattern. Suppose you want a row for every day in a range, even days with zero sales:

SELECT d.day, COALESCE(SUM(s.amount), 0) AS total_sales
FROM generate_series('2024-01-01'::date, '2024-01-31'::date, '1 day') AS d(day)
CROSS JOIN stores st
LEFT JOIN sales s ON s.sale_date = d.day AND s.store_id = st.id
GROUP BY d.day, st.id;

This cross-joins a generated date series against every store, then left-joins actual sales — guaranteeing every (day, store) pair appears in the output, with zero-sales days showing 0 instead of being silently missing.

The accidental CROSS JOIN

The more important interview point: an accidental Cartesian product from a missing or wrong join condition is one of the most common real-world SQL bugs:

-- BUG: no ON clause and old-style comma syntax with no WHERE — this is
-- effectively a CROSS JOIN, silently multiplying every order by every customer.
SELECT o.id, c.name
FROM orders o, customers c;

If orders has 10,000 rows and customers has 5,000, that's 50 million rows — usually surfacing as a query that "hangs" or a report with wildly inflated totals, not an obvious error. Always double check that every table in a join has an explicit, correct join condition.

Related Resources

A self-join is just a regular join where both sides of the FROM/JOIN reference the same table, distinguished by aliases.

Classic example: employee/manager hierarchy

-- employees: id, name, manager_id (references employees.id)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

LEFT JOIN (rather than INNER JOIN) matters here so the CEO — who has no manager, i.e. manager_id IS NULL — still appears in the result, with manager as NULL instead of being dropped.

Finding duplicate rows

SELECT a.id, b.id, a.email
FROM users a
JOIN users b ON a.email = b.email AND a.id < b.id;

The a.id < b.id condition both prevents matching a row with itself (a.id = b.id) and prevents each duplicate pair from showing up twice (once as a, b and once as b, a).

Comparing adjacent rows (before window functions)

-- For each day, show today's sales vs. yesterday's sales, per store
SELECT today.store_id, today.sale_date, today.amount, yesterday.amount AS prev_day_amount
FROM sales today
LEFT JOIN sales yesterday
    ON yesterday.store_id = today.store_id
    AND yesterday.sale_date = today.sale_date - INTERVAL '1 day';

This is a real, still-valid pattern, though modern SQL usually solves it more cleanly with the LAG() window function (see the window functions topic) — the self-join version is worth knowing because not every engine/version supports window functions, and it demonstrates the underlying relational logic explicitly.

Key mechanics

  • Always alias both references to the table — referencing an unaliased self-joined table is ambiguous and will error.
  • Watch for whether you want INNER or LEFT JOIN — a self-join on a nullable foreign key (like manager_id) needs LEFT JOIN to avoid silently dropping rows with no self-reference.

Related Resources

All three can express "give me rows in A related to rows in B," but they answer different underlying questions.

JOIN — when you need columns from both tables

SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id;

If the goal is to display or aggregate data from both tables, JOIN is the natural tool — IN/EXISTS can't return columns from the subquery side.

IN — membership against a list

SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);

Simple and readable for "customers who have at least one qualifying order," when you don't need any order columns in the output. Danger: if the subquery can return a NULL (e.g., customer_id is nullable and some order rows have it unset) and you're using NOT IN, the entire result silently becomes empty due to three-valued logic (see the NULL question) — this is one of the most common real-world SQL correctness bugs.

EXISTS — pure existence check, NULL-safe

SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.total > 1000
);

EXISTS only cares whether the subquery returns any row — the SELECT 1 is idiomatic because the actual selected value is irrelevant. It's NULL-safe by construction and, critically, the safe choice for negation:

-- Safe even if orders.customer_id can be NULL
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Unsafe: silently returns zero rows if any orders.customer_id is NULL
SELECT * FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);

Duplication behavior

A JOIN can multiply output rows if a customer has multiple matching orders (one output row per match), which then requires DISTINCT or aggregation to get back to one row per customer. IN/EXISTS never duplicate the outer row regardless of how many subquery rows match, since they're boolean filters, not row-multiplying joins — this is often the deciding factor when you just need a yes/no filter, not order details.

  • Need columns from the related table → JOIN.
  • Need a plain existence/membership filter and the subquery result can't contain NULL, or you're doing a positive check → IN is fine and often reads well for short static lists.
  • Need a negative existence check (NOT ...), or the subquery column might contain NULL → always use NOT EXISTS, never NOT IN.

Step 1: find which values are duplicated

SELECT email, COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

This tells you which emails are duplicated and how many times, but not the individual row IDs.

Step 2 (if you need the actual duplicate rows): join back

SELECT u.*
FROM users u
JOIN (
    SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
) dupes ON dupes.email = u.email
ORDER BY u.email;

A cleaner approach with window functions: identify which copy to keep

The most common real task isn't just "find duplicates" — it's "keep one copy and delete the rest." ROW_NUMBER() handles both:

SELECT id, email, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users;
-- rn = 1 is the row you'd keep (lowest id) per email; rn > 1 are duplicates

Deleting the duplicates, keeping the earliest row per email:

DELETE FROM users
WHERE id IN (
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
        FROM users
    ) t
    WHERE rn > 1
);

(Some engines, like PostgreSQL, let you write this more directly with a CTE and DELETE ... USING.)

Preventing future duplicates

Finding and cleaning duplicates is a one-time fix — the durable fix is a UNIQUE constraint (or unique index) on the column(s) that must not repeat:

ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);

Note you generally have to clean up existing duplicates before this constraint can be added, since the database will refuse to create a unique index over data that already violates it.