Given two tables:
-- customers: id, name
-- (1, 'Alice'), (2, 'Bob'), (3, 'Carol')
-- orders: id, customer_id, total
-- (101, 1, 50), (102, 1, 75), (103, 2, 20)
-- Note: Carol (id 3) has no orders, and there's no order for a nonexistent customer_id 9.
INNER JOIN — only matching rows
SELECT c.name, o.total
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;
Result: Alice/50, Alice/75, Bob/20 — Carol is excluded entirely because she has no matching order row.
LEFT JOIN — all of the left table, matches or NULL
SELECT c.name, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id;
Result: Alice/50, Alice/75, Bob/20, Carol/NULL — every customer appears at least once, even with no orders.
RIGHT JOIN — all of the right table, matches or NULL
SELECT c.name, o.total
FROM customers c
RIGHT JOIN orders o ON o.customer_id = c.id;
Functionally identical to swapping the tables and using LEFT JOIN. Most style guides prefer always writing LEFT JOIN and reordering the FROM/JOIN tables instead, since it reads more consistently across a codebase.
FULL OUTER JOIN — everything from both sides
SELECT c.name, o.total
FROM customers c
FULL OUTER JOIN orders o ON o.customer_id = c.id;
Result: every customer (even unmatched, like Carol) and every order (even one belonging to a since-deleted customer, if such a row existed) — NULL fills in whichever side lacks a match. MySQL has no native FULL OUTER JOIN; it's emulated with LEFT JOIN UNION RIGHT JOIN.
Picking the right one
- Need only rows that exist in both →
INNER JOIN. - Need every row from a "primary" table regardless of whether related data exists →
LEFT JOIN(e.g., all customers, with order totals if any). - Need a full reconciliation between two sets, including orphans on either side →
FULL OUTER JOIN(common in data-quality/migration checks).
Related Resources
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;
If sizes has 4 rows and colors has 6 rows, this returns 24 rows — every combination, with no matching condition at all.
Legitimate use cases
Generating a full combination matrix — e.g., every product variant (size × color) that could exist, even before any have been created:
SELECT p.product_id, s.size, c.color
FROM products p
CROSS JOIN sizes s
CROSS JOIN colors c;
Filling gaps with a calendar/numbers table — a very common reporting pattern. Suppose you want a row for every day in a range, even days with zero sales:
SELECT d.day, COALESCE(SUM(s.amount), 0) AS total_sales
FROM generate_series('2024-01-01'::date, '2024-01-31'::date, '1 day') AS d(day)
CROSS JOIN stores st
LEFT JOIN sales s ON s.sale_date = d.day AND s.store_id = st.id
GROUP BY d.day, st.id;
This cross-joins a generated date series against every store, then left-joins actual sales — guaranteeing every (day, store) pair appears in the output, with zero-sales days showing 0 instead of being silently missing.
The accidental CROSS JOIN
The more important interview point: an accidental Cartesian product from a missing or wrong join condition is one of the most common real-world SQL bugs:
-- BUG: no ON clause and old-style comma syntax with no WHERE — this is
-- effectively a CROSS JOIN, silently multiplying every order by every customer.
SELECT o.id, c.name
FROM orders o, customers c;
If orders has 10,000 rows and customers has 5,000, that's 50 million rows — usually surfacing as a query that "hangs" or a report with wildly inflated totals, not an obvious error. Always double check that every table in a join has an explicit, correct join condition.
Related Resources
A self-join is just a regular join where both sides of the FROM/JOIN reference the same table, distinguished by aliases.
Classic example: employee/manager hierarchy
-- employees: id, name, manager_id (references employees.id)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
LEFT JOIN (rather than INNER JOIN) matters here so the CEO — who has no manager, i.e. manager_id IS NULL — still appears in the result, with manager as NULL instead of being dropped.
Finding duplicate rows
SELECT a.id, b.id, a.email
FROM users a
JOIN users b ON a.email = b.email AND a.id < b.id;
The a.id < b.id condition both prevents matching a row with itself (a.id = b.id) and prevents each duplicate pair from showing up twice (once as a, b and once as b, a).
Comparing adjacent rows (before window functions)
-- For each day, show today's sales vs. yesterday's sales, per store
SELECT today.store_id, today.sale_date, today.amount, yesterday.amount AS prev_day_amount
FROM sales today
LEFT JOIN sales yesterday
ON yesterday.store_id = today.store_id
AND yesterday.sale_date = today.sale_date - INTERVAL '1 day';
This is a real, still-valid pattern, though modern SQL usually solves it more cleanly with the LAG() window function (see the window functions topic) — the self-join version is worth knowing because not every engine/version supports window functions, and it demonstrates the underlying relational logic explicitly.
Key mechanics
- Always alias both references to the table — referencing an unaliased self-joined table is ambiguous and will error.
- Watch for whether you want
INNERorLEFT JOIN— a self-join on a nullable foreign key (likemanager_id) needsLEFT JOINto avoid silently dropping rows with no self-reference.
Related Resources
All three can express "give me rows in A related to rows in B," but they answer different underlying questions.
JOIN — when you need columns from both tables
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id;
If the goal is to display or aggregate data from both tables, JOIN is the natural tool — IN/EXISTS can't return columns from the subquery side.
IN — membership against a list
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 1000);
Simple and readable for "customers who have at least one qualifying order," when you don't need any order columns in the output. Danger: if the subquery can return a NULL (e.g., customer_id is nullable and some order rows have it unset) and you're using NOT IN, the entire result silently becomes empty due to three-valued logic (see the NULL question) — this is one of the most common real-world SQL correctness bugs.
EXISTS — pure existence check, NULL-safe
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.total > 1000
);
EXISTS only cares whether the subquery returns any row — the SELECT 1 is idiomatic because the actual selected value is irrelevant. It's NULL-safe by construction and, critically, the safe choice for negation:
-- Safe even if orders.customer_id can be NULL
SELECT * FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
-- Unsafe: silently returns zero rows if any orders.customer_id is NULL
SELECT * FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);
Duplication behavior
A JOIN can multiply output rows if a customer has multiple matching orders (one output row per match), which then requires DISTINCT or aggregation to get back to one row per customer. IN/EXISTS never duplicate the outer row regardless of how many subquery rows match, since they're boolean filters, not row-multiplying joins — this is often the deciding factor when you just need a yes/no filter, not order details.
- Need columns from the related table →
JOIN. - Need a plain existence/membership filter and the subquery result can't contain
NULL, or you're doing a positive check →INis fine and often reads well for short static lists. - Need a negative existence check (
NOT ...), or the subquery column might containNULL→ always useNOT EXISTS, neverNOT IN.
Related Resources
Step 1: find which values are duplicated
SELECT email, COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
This tells you which emails are duplicated and how many times, but not the individual row IDs.
Step 2 (if you need the actual duplicate rows): join back
SELECT u.*
FROM users u
JOIN (
SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1
) dupes ON dupes.email = u.email
ORDER BY u.email;
A cleaner approach with window functions: identify which copy to keep
The most common real task isn't just "find duplicates" — it's "keep one copy and delete the rest." ROW_NUMBER() handles both:
SELECT id, email, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users;
-- rn = 1 is the row you'd keep (lowest id) per email; rn > 1 are duplicates
Deleting the duplicates, keeping the earliest row per email:
DELETE FROM users
WHERE id IN (
SELECT id FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
) t
WHERE rn > 1
);
(Some engines, like PostgreSQL, let you write this more directly with a CTE and DELETE ... USING.)
Preventing future duplicates
Finding and cleaning duplicates is a one-time fix — the durable fix is a UNIQUE constraint (or unique index) on the column(s) that must not repeat:
ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);
Note you generally have to clean up existing duplicates before this constraint can be added, since the database will refuse to create a unique index over data that already violates it.