What are the normal forms (1NF, 2NF, 3NF, BCNF)?

**1NF** requires atomic column values and no repeating groups. **2NF** requires 1NF plus every non-key column depending on the *whole* primary key, not just part of a composite key. **3NF** requires 2NF plus no non-key column depending on another non-key column (no transitive dependencies). **BCNF** (Boyce-Codd) tightens 3NF further: every determinant must be a candidate key. Each level removes a specific class of update/insert/delete anomaly.

When would you intentionally denormalize a schema?

Denormalize when read performance matters more than write simplicity/consistency risk — typically for reporting/analytics tables, high-traffic read paths where joins are too expensive, or materializing computed aggregates. Common techniques: duplicating a rarely-changing column to avoid a join, storing a precomputed total, or flattening a hierarchy. The tradeoff is always the same: faster reads in exchange for update complexity and the risk of data drifting out of sync.

What's the difference between a primary key, a candidate key, and a foreign key?

A **candidate key** is any minimal set of columns that uniquely identifies a row — a table can have several. The **primary key** is the one candidate key chosen as the table's main identifier (enforced `NOT NULL` + `UNIQUE`, and typically what other tables reference). A **foreign key** is a column (or columns) in one table that references a primary/unique key in another table, enforcing referential integrity between them.

Surrogate key vs natural key — what are the tradeoffs?

A **natural key** is a column that has real-world business meaning (email, SSN, ISBN, order number). A **surrogate key** is an artificial, meaningless identifier generated purely for the database (auto-increment integer, UUID). Surrogate keys are immune to business-rule changes and are typically smaller/faster to index and join; natural keys are self-documenting and avoid an extra join to look up meaning, but they're at risk if the "unique, never-changes" assumption about them turns out to be wrong.

What database constraints are commonly used, and what does each enforce?

`NOT NULL` requires a value. `UNIQUE` disallows duplicate values (NULLs usually exempt). `PRIMARY KEY` combines `NOT NULL` + `UNIQUE` as the table's main identifier. `FOREIGN KEY` requires a value to exist in another table's key. `CHECK` enforces an arbitrary boolean expression per row. `DEFAULT` supplies a value when none is given. Together, these push data-validity rules into the database itself, rather than trusting every application layer to enforce them consistently.

How do you model a many-to-many relationship in a relational schema?

Relational tables can only express one-to-one and one-to-many relationships directly, so a many-to-many relationship requires a **junction table** (also called an associative or bridge table) with foreign keys to both sides, typically with a composite primary key (or a surrogate key plus a `UNIQUE` constraint on the pair). The junction table can also carry attributes specific to the relationship itself, like `enrolled_at` on a student/course enrollment.

What is referential integrity, and what do ON DELETE/ON UPDATE CASCADE, SET NULL, and RESTRICT do?

Referential integrity means a foreign key value always points to a row that actually exists (or is NULL, if nullable) — the database never allows a "dangling" reference. `ON DELETE`/`ON UPDATE` clauses define what happens to dependent rows when the referenced row is deleted or its key changes: `CASCADE` propagates the change/deletion, `SET NULL` nulls out the foreign key, `RESTRICT`/`NO ACTION` blocks the operation if dependents exist, and `SET DEFAULT` resets it to a default value.

What is an ER diagram, and how do you go from a conceptual model to a physical schema?

An Entity-Relationship (ER) diagram models the *business domain* — entities (things), their attributes, and the relationships between them (one-to-one, one-to-many, many-to-many) — independent of any specific database engine. Going to a physical schema means translating entities into tables, attributes into typed columns, and relationships into foreign keys or junction tables, then applying normalization, indexing, and engine-specific type choices.

How do you model inheritance/polymorphic relationships in a relational schema?

There's no native "inheritance" in relational modeling, so it's approximated with one of three patterns: **single table inheritance** (one wide table with a type discriminator column and nullable columns for subtype-specific fields), **class table inheritance** (a base table plus one table per subtype sharing the base's primary key), or **concrete table inheritance** (a fully separate table per subtype, duplicating shared columns). Each trades off query simplicity, storage efficiency, and referential integrity differently.

Data Modeling and Normalization

Designing relational schemas — normal forms, keys, constraints, and modeling relationships correctly.

Normalization is a sequence of rules for structuring tables so that each fact is stored exactly once, eliminating anomalies that occur when you insert, update, or delete data.

1NF — Atomic values, no repeating groups

Every column must hold a single, indivisible value — no comma-separated lists, no repeating column groups (phone1, phone2, phone3).

-- Violates 1NF: the phones column holds multiple values
| id | name  | phones             |
|----|-------|--------------------|
| 1  | Alice | 555-1234, 555-5678 |

-- 1NF: one row per phone number (the key becomes (id, phone), usually in a separate phones table)
| id | name  | phone    |
|----|-------|----------|
| 1  | Alice | 555-1234 |
| 1  | Alice | 555-5678 |

2NF — No partial dependency on a composite key

Applies when the primary key is composite (more than one column). Every non-key column must depend on the entire key, not just part of it.

-- order_items(order_id, product_id, product_name, quantity)
-- Primary key: (order_id, product_id)
-- Violation: product_name depends only on product_id, not on the full (order_id, product_id) key

Fix: move product_name to a separate products table keyed by product_id alone, and keep order_items holding only columns that truly depend on the composite key (like quantity).
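As a rough sketch of that fix in PostgreSQL-style DDL (the column types and the quantity check are assumptions for illustration, not part of the original example), the decomposition could look like this in a scratch database:

```sql
-- Product facts live once, keyed by product_id alone.
CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL
);

-- order_items keeps only columns that depend on the full (order_id, product_id) key.
CREATE TABLE order_items (
    order_id   INT NOT NULL,
    product_id INT NOT NULL REFERENCES products(product_id),
    quantity   INT NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)
);

-- The name is recovered with a join instead of being repeated on every order line.
SELECT oi.order_id, p.product_name, oi.quantity
FROM order_items oi
JOIN products p ON p.product_id = oi.product_id;
```

Renaming a product now touches one row in products rather than every order line that mentions it.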

3NF — No transitive dependency

A non-key column must depend on the key directly — not on another non-key column.

-- employees(id, name, department_id, department_name)
-- Violation: department_name depends on department_id, not directly on id (the key) —
-- this is a transitive dependency: id -> department_id -> department_name

Fix: move department_name into a departments table keyed by department_id; employees keeps only department_id as a foreign key.
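A minimal sketch of the decomposed tables, assuming simple integer keys and illustrative column sizes (run in a scratch database; the department id and new name in the UPDATE are made up):

```sql
CREATE TABLE departments (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL
);

CREATE TABLE employees (
    id            INT PRIMARY KEY,
    name          VARCHAR(100) NOT NULL,
    department_id INT REFERENCES departments(department_id)  -- the only department fact kept here
);

-- Renaming a department is now a single-row update, however many employees it has.
UPDATE departments
SET department_name = 'Platform Engineering'
WHERE department_id = 3;
```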

BCNF — Every determinant is a candidate key

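A stricter version of 3NF: for every non-trivial functional dependency A -> B, A must be a superkey (that is, it must contain a candidate key). 3NF permits one narrow exception that BCNF closes: a dependency whose right-hand side is itself part of some candidate key. That can only happen when a table has multiple overlapping composite candidate keys. Example: a table (student, course, instructor) where each instructor teaches only one course, a course can have multiple instructors, and each student has exactly one instructor per course: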

-- (student, course) is a candidate key; (student, instructor) is also a candidate key
-- but "instructor -> course" is a dependency where instructor is NOT a candidate key alone
-- This satisfies 3NF but violates BCNF, and can still produce redundancy/anomalies.
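One common way to reach BCNF here is to split the table along the offending dependency. The sketch below assumes text identifiers and made-up table names (instructor_courses, student_instructors); it is one possible decomposition, not the only one:

```sql
-- instructor -> course: each instructor teaches exactly one course.
CREATE TABLE instructor_courses (
    instructor VARCHAR(100) PRIMARY KEY,
    course     VARCHAR(100) NOT NULL
);

-- Which instructor each student studies with; the course follows from the instructor.
CREATE TABLE student_instructors (
    student    VARCHAR(100) NOT NULL,
    instructor VARCHAR(100) NOT NULL REFERENCES instructor_courses(instructor),
    PRIMARY KEY (student, instructor)
);

-- Reconstruct the original (student, course, instructor) rows with a join.
SELECT si.student, ic.course, si.instructor
FROM student_instructors si
JOIN instructor_courses ic ON ic.instructor = si.instructor;
```

The split is lossless, but it has a known cost: the rule "one instructor per student per course" can no longer be enforced with a simple key on a single table. That tradeoff is part of why many schemas stop at 3NF.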

Why this matters practically

Each level removes a class of redundancy, and with it some combination of these anomalies:

Update anomaly: a fact stored in multiple rows can go out of sync if only some copies are updated (e.g., department_name duplicated across every employee row).
Insert anomaly: you can't record a fact (e.g., a new department) until an unrelated fact (an employee in it) also exists.
Delete anomaly: deleting the last row referencing a fact accidentally deletes the fact itself (e.g., deleting the last employee in a department loses the department's name entirely).
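To make the update anomaly concrete, here is a small sketch using a hypothetical employees_flat table that keeps department_name on every employee row (the table name and sample data are invented for illustration):

```sql
CREATE TABLE employees_flat (
    id              INT PRIMARY KEY,
    name            VARCHAR(100) NOT NULL,
    department_id   INT NOT NULL,
    department_name VARCHAR(100) NOT NULL   -- duplicated on every employee row
);

INSERT INTO employees_flat VALUES
    (1, 'Alice', 10, 'Engineering'),
    (2, 'Bob',   10, 'Engineering');

-- Update anomaly: only one copy of the department's name gets changed.
UPDATE employees_flat SET department_name = 'Platform' WHERE id = 1;

-- Find departments whose name now disagrees between rows.
SELECT department_id, COUNT(DISTINCT department_name) AS distinct_names
FROM employees_flat
GROUP BY department_id
HAVING COUNT(DISTINCT department_name) > 1;
```

With the sample rows above, the final query reports department 10 with two distinct names, which is exactly the inconsistency a normalized departments table makes impossible.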

Most production schemas target 3NF as a practical default, reach for BCNF only when the overlapping-key scenario actually arises, and deliberately denormalize below 3NF only for specific, measured performance reasons (see the denormalization question).

Related Resources

Wikipedia: Database Normalization

Normalization optimizes for data integrity and minimal redundancy; denormalization trades some of that integrity for read performance. It's a deliberate, measured decision — not a shortcut for skipping schema design.

Common denormalization patterns

Duplicating a rarely-changing lookup value to avoid a join:

-- Instead of always joining orders -> customers for the customer's name,
-- store a snapshot of it directly on the order (useful when the "name at time of order" matters anyway).
ALTER TABLE orders ADD COLUMN customer_name_snapshot VARCHAR(200);

Precomputing an aggregate instead of recalculating it on every read:

-- Instead of SUM(order_items.price) on every page load,
-- maintain orders.total_amount, updated via trigger or application logic when items change.
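One possible shape for the trigger-based variant, sketched for PostgreSQL 11 or later. The table layouts, the function name refresh_order_total, and the trigger name are assumptions for illustration; the sketch recomputes the total from scratch and does not try to handle concurrent writers to the same order:

```sql
CREATE TABLE orders (
    id           SERIAL PRIMARY KEY,
    total_amount NUMERIC(12,2) NOT NULL DEFAULT 0   -- denormalized aggregate
);

CREATE TABLE order_items (
    id       SERIAL PRIMARY KEY,
    order_id INT NOT NULL REFERENCES orders(id),
    price    NUMERIC(10,2) NOT NULL
);

CREATE FUNCTION refresh_order_total() RETURNS trigger AS $$
BEGIN
    -- Recompute the previous parent on UPDATE/DELETE (covers an item moved between orders).
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        UPDATE orders
        SET total_amount = (SELECT COALESCE(SUM(oi.price), 0)
                            FROM order_items oi
                            WHERE oi.order_id = OLD.order_id)
        WHERE id = OLD.order_id;
    END IF;
    -- Recompute the current parent on INSERT/UPDATE.
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE orders
        SET total_amount = (SELECT COALESCE(SUM(oi.price), 0)
                            FROM order_items oi
                            WHERE oi.order_id = NEW.order_id)
        WHERE id = NEW.order_id;
    END IF;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_items_refresh_total
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION refresh_order_total();
```

Recomputing the full sum is slower than adding or subtracting a delta, but it is self-correcting: any write to an order's items brings the stored total back in line.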

Flattening a hierarchy for fast lookups (materialized path / closure table):

-- Instead of recursively walking a category tree on every query,
-- store a precomputed ancestor path: categories.path = '/1/14/152/'
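A sketch of how such a path column might be queried, assuming each path lists the ancestors plus the category's own id, as in the '/1/14/152/' example (the table layout and index name are illustrative):

```sql
CREATE TABLE categories (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    path VARCHAR(500) NOT NULL   -- e.g. '/1/14/152/'
);

-- In PostgreSQL, this operator class lets a B-tree index serve prefix LIKE
-- searches even when the database does not use the C collation.
CREATE INDEX categories_path_idx ON categories (path varchar_pattern_ops);

-- All descendants of category 14, with no recursion.
-- The pattern also matches 14's own path ('/1/14/'), so it is excluded explicitly.
SELECT id, name
FROM categories
WHERE path LIKE '/1/14/%'
  AND id <> 14;
```

The price is on the write side: moving a subtree means rewriting the path of every descendant.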

Materialized views — the database-native way to denormalize without hand-rolled duplication logic:

CREATE MATERIALIZED VIEW daily_revenue AS
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date;

REFRESH MATERIALIZED VIEW daily_revenue;  -- periodically, or on a trigger

When it's justified

Read-heavy, write-light workloads where the same expensive join/aggregation runs on nearly every request (product listing pages, dashboards).
Reporting/analytics/OLAP schemas (star/snowflake schemas), where normalized OLTP tables are intentionally flattened into fact/dimension tables for query simplicity and speed.
Historical/audit accuracy — sometimes you want a snapshot value that doesn't change even if the source row later does (an order's shipping address shouldn't retroactively change if the customer later edits their profile address).

The cost you're accepting

Every denormalized copy is a place data can drift out of sync if the write path that keeps it updated has a bug, or if a write path is later added that forgets about the duplicate. Mitigate this with database triggers, transactional application-level updates, or by treating denormalized structures as fully derived/rebuildable (like materialized views) rather than hand-maintained. Never denormalize by default — start normalized, and only denormalize a specific, measured hot path once you can show it's actually a bottleneck.
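A cheap safety net is a periodic reconciliation query that compares each stored copy with the value it was derived from. This sketch assumes the orders.total_amount and order_items.price columns from the precomputed-aggregate pattern above:

```sql
-- Orders whose stored total no longer matches the sum of their items.
SELECT o.id,
       o.total_amount                AS stored_total,
       COALESCE(SUM(oi.price), 0)    AS actual_total
FROM orders o
LEFT JOIN order_items oi ON oi.order_id = o.id
GROUP BY o.id, o.total_amount
-- IS DISTINCT FROM also flags a NULL stored total, which <> would silently skip.
HAVING o.total_amount IS DISTINCT FROM COALESCE(SUM(oi.price), 0);
```

An empty result means the denormalized column is in sync. Any rows returned point at a write path that skipped the update.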

Related Resources

PostgreSQL: Materialized Views

CREATE TABLE users (
    id       SERIAL PRIMARY KEY,          -- chosen primary key
    email    VARCHAR(255) UNIQUE NOT NULL, -- also a candidate key, just not chosen as primary
    username VARCHAR(50)  UNIQUE NOT NULL  -- another candidate key
);

CREATE TABLE orders (
    id      SERIAL PRIMARY KEY,
    user_id INT NOT NULL REFERENCES users(id)  -- foreign key
);

Candidate key

Any column (or minimal combination of columns) that could uniquely identify a row. users above has three candidate keys: id, email, and username — each alone is sufficient to find exactly one row, and none of them can be shrunk further and remain unique (that minimality requirement is what separates a candidate key from just "any unique combination").

Primary key

The one candidate key selected as the table's canonical identifier. Practical differences from other candidate keys:

Implicitly NOT NULL (a candidate key enforced only with UNIQUE can still allow NULL, depending on engine).
Used by default as the target of foreign keys from other tables.
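Often backs the table's clustering/physical storage order (for example in MySQL's InnoDB, and by default in SQL Server; PostgreSQL heap tables are not stored in primary-key order). See the clustered index question.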
A table can have only one primary key, but multiple other UNIQUE constraints (candidate keys) alongside it.

Foreign key

A column in the referencing table whose values must match an existing value in the referenced table's primary/unique key (or be NULL, if the column is nullable). It's the mechanism that enforces relationships between tables:

INSERT INTO orders (user_id) VALUES (999);
-- ERROR: violates foreign key constraint — no user with id 999 exists

Foreign keys also govern cascade behavior (ON DELETE CASCADE, ON DELETE SET NULL, ON DELETE RESTRICT) — see the referential integrity question for details.

Why the distinction matters in interviews

Interviewers use this question to check that you understand uniqueness (candidate key) is a broader concept than the chosen identifier (primary key), and that a foreign key isn't a special data type — it's a constraint enforcing that a value in one table must correspond to a real row in another.

Related Resources

PostgreSQL: Constraints

-- Surrogate key: meaningless, generated
CREATE TABLE products (
    id SERIAL PRIMARY KEY,        -- surrogate
    sku VARCHAR(50) UNIQUE        -- natural key, still enforced unique
);

-- Using a natural key directly as primary key
CREATE TABLE countries (
    iso_code CHAR(2) PRIMARY KEY  -- natural key: 'US', 'GB', 'DE'
);

Natural key

Pros:

Self-documenting — iso_code = 'US' is meaningful without a join.
No extra lookup needed if the business already has the value on hand.

Cons:

Business "invariants" break more often than expected — social security numbers get reissued in rare cases, email addresses get reused after account deletion, product SKUs get renumbered during a re-branding. Once a natural key is used as a foreign key target in many other tables, correcting it later means cascading updates everywhere.
Often wider/composite (e.g., a natural key might need multiple columns), which makes every foreign key referencing it wider too, increasing index size and join cost.

Surrogate key

Pros:

Guaranteed stable — it's meaningless, so there's never a business reason to change it.
Usually a single, small, fixed-width column (integer or UUID), which keeps foreign keys and their indexes compact and fast to join.
Decouples the database's internal identity from business rules, which can (and do) change.

Cons:

Meaningless in isolation — reading a raw id = 8271 tells you nothing without a lookup.
Still need a UNIQUE constraint on the natural key anyway if it must remain unique for business reasons (as in the sku example above) — the surrogate key doesn't eliminate the need to validate real-world uniqueness, it just avoids using that value as the relational identifier.

Auto-increment integer vs UUID as a surrogate

Auto-increment integer: compact (4–8 bytes), sequential, fast for B-tree index inserts (append-mostly), but reveals row count/creation order and doesn't work well for merging data generated across multiple independent systems (collisions).
UUID: globally unique without coordination (good for distributed/offline-generated IDs, merging data from multiple sources), but larger (16 bytes), and random UUIDs (v4) cause index fragmentation because inserts land at random points in a B-tree rather than appending at the end — UUIDv7 (time-ordered) mitigates this by keeping inserts roughly sequential while remaining globally unique.

Default to a surrogate primary key for internal relational integrity, and add a UNIQUE constraint on any natural-key column that truly must be unique for business reasons. This gives you the stability of a surrogate key without giving up validation of real-world uniqueness.
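A minimal sketch of that default with a UUID surrogate, assuming PostgreSQL 13 or later, where gen_random_uuid() is built in (older versions need the pgcrypto extension). The customers table and email value are illustrative:

```sql
CREATE TABLE customers (
    id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- surrogate: random (v4) UUID
    email VARCHAR(255) NOT NULL UNIQUE                 -- natural key, still validated
);

-- The application never supplies the id; the database generates it.
INSERT INTO customers (email)
VALUES ('alice@example.com')
RETURNING id;
```

If the email later changes, only this one row is updated; every foreign key pointing at customers.id stays untouched.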

Related Resources

PostgreSQL: Data Types (UUID)

-- Minimal categories table (assumed here so the foreign key below has a target)
CREATE TABLE categories (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE products (
    id          SERIAL PRIMARY KEY,                        -- PK: unique + not null
    sku         VARCHAR(50) NOT NULL UNIQUE,               -- required, no duplicates
    price       NUMERIC(10,2) NOT NULL CHECK (price >= 0), -- arbitrary boolean rule
    category_id INT REFERENCES categories(id),             -- FK: must exist in categories
    created_at  TIMESTAMP NOT NULL DEFAULT now(),          -- default value if omitted
    status      VARCHAR(20) NOT NULL DEFAULT 'active'
                CHECK (status IN ('active', 'discontinued', 'draft'))
);

NOT NULL

Rejects any insert or update that would leave the column NULL (omitting the column is fine when a DEFAULT supplies a value). The simplest and most impactful constraint: a large share of real-world "why is this NULL when it shouldn't be" bugs trace back to a NOT NULL constraint that should have been there from the start.

UNIQUE

Guarantees no two rows share the same value in that column (or combination of columns, for a composite UNIQUE constraint). Backed internally by a unique index. Most engines allow multiple NULLs in a UNIQUE column because NULL = NULL evaluates to UNKNOWN rather than TRUE under three-valued logic (see the NULL question), so one NULL isn't considered a duplicate of another. SQL Server is a notable exception: it allows only one NULL per unique column.

PRIMARY KEY

Shorthand for NOT NULL + UNIQUE, plus the semantic meaning of "this is the table's canonical row identifier" and (in most engines) the default target for foreign keys and often the physical clustering key.

FOREIGN KEY

Requires the column's value to match an existing value in the referenced table's primary/unique key, or be NULL if the column is nullable. Enforces relational integrity — you can't have an order pointing at a customer that doesn't exist.

CHECK

An arbitrary boolean expression evaluated per row on insert/update — the most flexible constraint, used for business rules a simple type/uniqueness check can't express (price >= 0, end_date > start_date, status IN (...)).
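As a small sketch of a multi-column rule, here is the end_date > start_date case on a hypothetical bookings table (the table, constraint name, and dates are made up for illustration):

```sql
CREATE TABLE bookings (
    id         SERIAL PRIMARY KEY,
    start_date DATE NOT NULL,
    end_date   DATE NOT NULL,
    CONSTRAINT bookings_dates_ordered CHECK (end_date > start_date)
);

-- Accepted: the range runs forward.
INSERT INTO bookings (start_date, end_date) VALUES ('2024-05-01', '2024-05-03');

-- Rejected: violates check constraint bookings_dates_ordered.
INSERT INTO bookings (start_date, end_date) VALUES ('2024-05-03', '2024-05-01');
```

A CHECK passes when its expression evaluates to NULL, so the NOT NULL constraints matter here: without them, a row with a missing end_date would slip through.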

DEFAULT

Not a validity constraint per se, but supplies a value automatically when a column is omitted from an INSERT — commonly used for created_at timestamps, or a sensible default status/flag.

Why enforce these in the database rather than just in application code

Constraints are the last line of defense against bad data. They catch bugs from any write path (a new microservice, a one-off migration script, a manual psql fix), whereas application-layer validation only protects the writers that remember to call it. They also let the query optimizer make stronger assumptions (e.g., a NOT NULL foreign key guarantees a matching row exists, which can simplify join planning).

Related Resources

PostgreSQL: Constraints

Consider students and courses: one student can take many courses, and one course can have many students — a genuine many-to-many.

The junction table

CREATE TABLE students (id SERIAL PRIMARY KEY, name VARCHAR(100));
CREATE TABLE courses  (id SERIAL PRIMARY KEY, title VARCHAR(200));

CREATE TABLE enrollments (
    student_id INT NOT NULL REFERENCES students(id),
    course_id  INT NOT NULL REFERENCES courses(id),
    enrolled_at TIMESTAMP NOT NULL DEFAULT now(),
    grade      CHAR(2),
    PRIMARY KEY (student_id, course_id)   -- composite PK: a student can't enroll in the same course twice
);

The composite primary key (student_id, course_id) both links the two tables and enforces that a given pairing can only exist once. If duplicate enrollments should be allowed (e.g., retaking a course in a different term), you'd instead use a surrogate id primary key plus a separate UNIQUE (student_id, course_id, term) constraint that includes whatever column distinguishes legitimate repeats.
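A sketch of that alternative definition, assuming a hypothetical term column and the students and courses tables above (it would replace the composite-key version of enrollments, not sit alongside it):

```sql
CREATE TABLE enrollments (
    id          SERIAL PRIMARY KEY,                 -- surrogate key
    student_id  INT NOT NULL REFERENCES students(id),
    course_id   INT NOT NULL REFERENCES courses(id),
    term        VARCHAR(20) NOT NULL,               -- hypothetical, e.g. '2024-fall'
    enrolled_at TIMESTAMP NOT NULL DEFAULT now(),
    grade       CHAR(2),
    UNIQUE (student_id, course_id, term)            -- one enrollment per student, course, and term
);
```

A student can now retake a course in a later term, while a duplicate enrollment within the same term is still rejected.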

Querying through the junction table

-- All courses a given student is enrolled in
SELECT c.title
FROM courses c
JOIN enrollments e ON e.course_id = c.id
WHERE e.student_id = 42;

-- All students enrolled in a given course, with their grade
SELECT s.name, e.grade
FROM students s
JOIN enrollments e ON e.student_id = s.id
WHERE e.course_id = 7;

Why the junction table is also the right place for relationship-specific attributes

A many-to-many relationship often has data that belongs to the pairing, not to either side alone — enrolled_at and grade above only make sense in the context of "this student in this course," not on students or courses individually. This is a strong signal you need a real junction table rather than trying to shoehorn the relationship into either parent table.

Indexing note

Beyond the composite primary key (which already indexes student_id first), it's usually worth adding a secondary index on course_id alone (or (course_id, student_id)) if you frequently query "students in this course" — a composite primary key (student_id, course_id) doesn't efficiently support lookups that start from course_id alone.
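For the enrollments table above, that secondary index could be declared as follows (the index name is an arbitrary choice):

```sql
-- Serves "students in this course" lookups; including student_id lets
-- simple membership queries be answered from the index alone.
CREATE INDEX enrollments_course_student_idx
    ON enrollments (course_id, student_id);
```

Whether the planner actually uses it depends on table size and statistics, so it is worth confirming with EXPLAIN on a realistic data set.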

Related Resources

PostgreSQL: Constraints (Composite keys)

CREATE TABLE departments (id SERIAL PRIMARY KEY, name VARCHAR(100));

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    department_id INT REFERENCES departments(id)
        ON DELETE SET NULL
        ON UPDATE CASCADE
);

The four referential actions

| Action | On DELETE of parent row | On UPDATE of parent key |
|--------|-------------------------|-------------------------|
| `CASCADE` | Delete dependent rows too | Update the FK value to match |
| `SET NULL` | Set the FK column to `NULL` | Set the FK column to `NULL` |
| `SET DEFAULT` | Set the FK column to its `DEFAULT` | Set the FK column to its `DEFAULT` |
| `RESTRICT` / `NO ACTION` | Block the delete if dependents exist | Block the update if dependents exist |

(RESTRICT and NO ACTION differ subtly: NO ACTION allows the check to be deferred until the end of the transaction in engines that support deferrable constraints, while RESTRICT is always checked immediately. Both reject the operation by default.)

Choosing the right action

CASCADE on DELETE — appropriate when the dependent rows have no meaning without the parent: deleting a blog_post should delete its comments.
SET NULL on DELETE — appropriate when the dependent row should survive but lose the association: deleting a department shouldn't delete its employees, just unassign them (requires the FK column to be nullable).
RESTRICT/NO ACTION (the default in most engines): the safest default for anything financially or legally significant. You generally don't want deleting a customer to silently cascade-delete years of order history. Force an explicit decision (archive first, reassign orders, etc.) rather than letting a delete cascade silently.
CASCADE on UPDATE is mostly relevant only if your primary keys can ever change (uncommon with surrogate keys, more relevant if you used a natural key that could be corrected, like a mistyped SKU).
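The first and third cases can be sketched side by side. The table and column names below are illustrative, and the blocks are meant for a scratch database:

```sql
-- Comments have no meaning without their post: cascade the delete.
CREATE TABLE blog_posts (
    id    SERIAL PRIMARY KEY,
    title VARCHAR(200) NOT NULL
);

CREATE TABLE comments (
    id      SERIAL PRIMARY KEY,
    post_id INT NOT NULL REFERENCES blog_posts(id) ON DELETE CASCADE,
    body    TEXT NOT NULL
);

-- Orders must outlive any casual customer delete: block it instead.
CREATE TABLE customers (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(id) ON DELETE RESTRICT
);

-- Removes the post and all of its comments in one statement.
DELETE FROM blog_posts WHERE id = 1;

-- Fails with a foreign key violation if customer 1 still has orders.
DELETE FROM customers WHERE id = 1;
```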

Why this matters beyond the syntax

Choosing CASCADE casually is one of the more dangerous schema decisions a team can make — a single DELETE FROM departments WHERE id = 5 can silently fan out and delete far more data than the person running it expects, with no natural "are you sure" checkpoint. Interviewers often use this question to probe whether you default to the safe option (RESTRICT, forcing deliberate handling) rather than reaching for CASCADE purely for developer convenience.

Related Resources

PostgreSQL: Foreign Keys

The three modeling levels

Conceptual model — high-level entities and relationships, no attributes or types yet: "a Customer places Orders," "an Order contains Products." Aimed at communicating with non-technical stakeholders.
Logical model (ER diagram) — adds attributes, keys, and cardinality (1:1, 1:N, N:M) to each entity/relationship, still independent of any specific database engine's syntax.
Physical model — the actual CREATE TABLE statements for a specific engine, with concrete data types, indexes, constraints, and partitioning decisions.

Translating ER concepts into tables

| ER concept | Physical schema equivalent |
|------------|----------------------------|
| Entity (e.g., `Customer`) | A table |
| Attribute (e.g., `email`) | A column, with a chosen data type |
| One-to-many relationship | A foreign key on the "many" side pointing to the "one" side's primary key |
| Many-to-many relationship | A junction table with foreign keys to both sides |
| One-to-one relationship | A foreign key on either side with a `UNIQUE` constraint (or merge into one table if the split has no real justification) |
| Weak entity (can't exist without its parent) | A table whose primary key includes (or is entirely derived from) the parent's key, often with `ON DELETE CASCADE` |
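The last two mappings are the ones people most often get wrong, so here is a sketch of each. The customer_profiles and customer_addresses tables are hypothetical, and the block assumes a scratch database:

```sql
CREATE TABLE customers (
    id    SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE
);

-- One-to-one: the UNIQUE foreign key allows at most one profile per customer.
CREATE TABLE customer_profiles (
    id          SERIAL PRIMARY KEY,
    customer_id INT NOT NULL UNIQUE REFERENCES customers(id),
    bio         TEXT
);

-- Weak entity: an address is identified only within its customer,
-- so the parent's key is part of the primary key and deletes cascade.
CREATE TABLE customer_addresses (
    customer_id INT NOT NULL REFERENCES customers(id) ON DELETE CASCADE,
    address_no  INT NOT NULL,            -- 1, 2, 3... per customer
    street      VARCHAR(200) NOT NULL,
    PRIMARY KEY (customer_id, address_no)
);
```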

Example walkthrough

Conceptual: "A Customer places many Orders; each Order contains many Products (and each Product can be in many Orders)."

CREATE TABLE customers (id SERIAL PRIMARY KEY, name VARCHAR(100));

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(id),  -- one-to-many
    ordered_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE products (id SERIAL PRIMARY KEY, name VARCHAR(200), price NUMERIC(10,2));

CREATE TABLE order_items (               -- junction table for the many-to-many
    order_id INT NOT NULL REFERENCES orders(id),
    product_id INT NOT NULL REFERENCES products(id),
    quantity INT NOT NULL DEFAULT 1,
    unit_price NUMERIC(10,2) NOT NULL,   -- price at time of order (denormalized, deliberately)
    PRIMARY KEY (order_id, product_id)
);

Practical process

Identify entities and their natural-language relationships with stakeholders — resist jumping straight to tables.
Assign cardinality to every relationship (1:1, 1:N, N:M) — this alone determines whether you need a foreign key or a junction table.
Normalize to at least 3NF as a starting point.
Choose primary keys (surrogate vs natural — see that question) and concrete data types.
Add constraints (NOT NULL, CHECK, UNIQUE) to encode business rules directly in the schema.
Only then consider indexes and any deliberate denormalization, based on known query patterns — not before the logical model is solid.

Related Resources

Wikipedia: Entity–Relationship Model

Suppose you have Vehicle as a concept, with subtypes Car (has num_doors) and Truck (has cargo_capacity). Relational databases have no built-in "subtype" concept, so you pick one of three patterns.

1. Single Table Inheritance (STI)

One table for all subtypes, with a discriminator column and nullable subtype-specific columns:

CREATE TABLE vehicles (
    id SERIAL PRIMARY KEY,
    type VARCHAR(20) NOT NULL,   -- discriminator: 'car' or 'truck'
    make VARCHAR(50),
    model VARCHAR(50),
    num_doors INT,               -- only meaningful when type = 'car'
    cargo_capacity NUMERIC(10,2) -- only meaningful when type = 'truck'
);

Pros: simplest queries (no joins to get a full record), easy to add a new shared field. Cons: many nullable columns that are only meaningful for some rows, no way to enforce "car rows must have num_doors set" with a plain NOT NULL (needs a CHECK tied to type), and the table grows wide as subtypes multiply.
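The CHECK tied to type could look roughly like this for the vehicles table above (the constraint name is an arbitrary choice, and the rule that the other subtype's column must be NULL is an assumption about the intended semantics):

```sql
ALTER TABLE vehicles
    ADD CONSTRAINT vehicles_subtype_columns CHECK (
        (type = 'car'   AND num_doors IS NOT NULL      AND cargo_capacity IS NULL)
     OR (type = 'truck' AND cargo_capacity IS NOT NULL AND num_doors IS NULL)
    );
```

Because type is NOT NULL, any value other than 'car' or 'truck' makes both branches false, so the same constraint also rejects unknown discriminator values. Each new subtype means editing this constraint, which is part of the maintenance cost of STI.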

2. Class Table Inheritance (CTI)

A base table plus one table per subtype, sharing the same primary key value:

CREATE TABLE vehicles (id SERIAL PRIMARY KEY, make VARCHAR(50), model VARCHAR(50));

CREATE TABLE cars (
    vehicle_id INT PRIMARY KEY REFERENCES vehicles(id),
    num_doors INT NOT NULL
);

CREATE TABLE trucks (
    vehicle_id INT PRIMARY KEY REFERENCES vehicles(id),
    cargo_capacity NUMERIC(10,2) NOT NULL
);

Pros: no wasted nullable columns, NOT NULL works normally within each subtype table, closely mirrors an OOP inheritance hierarchy. Cons: fetching a full car record requires a join across vehicles and cars; nothing in plain SQL stops a vehicle_id from (incorrectly) appearing in both cars and trucks without an extra constraint.
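One widely used form of that extra constraint repeats a type discriminator in each subtype table and references the (id, type) pair. The sketch below is a variation on the tables above, with the type column and its checks added as assumptions:

```sql
CREATE TABLE vehicles (
    id    SERIAL PRIMARY KEY,
    type  VARCHAR(20) NOT NULL CHECK (type IN ('car', 'truck')),
    make  VARCHAR(50),
    model VARCHAR(50),
    UNIQUE (id, type)                 -- lets subtype tables reference the pair
);

CREATE TABLE cars (
    vehicle_id INT PRIMARY KEY,
    type       VARCHAR(20) NOT NULL DEFAULT 'car' CHECK (type = 'car'),
    num_doors  INT NOT NULL,
    FOREIGN KEY (vehicle_id, type) REFERENCES vehicles (id, type)
);

CREATE TABLE trucks (
    vehicle_id     INT PRIMARY KEY,
    type           VARCHAR(20) NOT NULL DEFAULT 'truck' CHECK (type = 'truck'),
    cargo_capacity NUMERIC(10,2) NOT NULL,
    FOREIGN KEY (vehicle_id, type) REFERENCES vehicles (id, type)
);

-- Fetching a full car record still needs the join.
SELECT v.id, v.make, v.model, c.num_doors
FROM vehicles v
JOIN cars c ON c.vehicle_id = v.id;
```

A vehicle row has exactly one type, and each subtype row must match it, so the same vehicle_id can no longer appear in both cars and trucks.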

3. Concrete Table Inheritance

A fully separate, self-contained table per subtype, duplicating the shared columns:

CREATE TABLE cars (id SERIAL PRIMARY KEY, make VARCHAR(50), model VARCHAR(50), num_doors INT NOT NULL);
CREATE TABLE trucks (id SERIAL PRIMARY KEY, make VARCHAR(50), model VARCHAR(50), cargo_capacity NUMERIC(10,2) NOT NULL);

Pros: each table is simple and self-contained, no joins needed for a single subtype. Cons: querying "all vehicles regardless of type" needs a UNION, shared columns (make, model) are duplicated in every subtype's schema, and there's no single foreign-key target if some other table needs to reference "any vehicle."
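If a unified listing is occasionally needed, the UNION can be wrapped in a view over the cars and trucks tables above. The view name and the vehicle_type column are illustrative:

```sql
CREATE VIEW all_vehicles AS
SELECT 'car'   AS vehicle_type, id, make, model FROM cars
UNION ALL
SELECT 'truck' AS vehicle_type, id, make, model FROM trucks;

-- Each table has its own id sequence, so ids can collide across subtypes;
-- (vehicle_type, id) together identify a row in the view.
SELECT vehicle_type, id, make, model
FROM all_vehicles
WHERE make = 'Ford';
```

The view helps with reads only. It does not give other tables a single foreign-key target for "any vehicle".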

Few, rarely-changing subtypes with mostly shared columns → STI is often good enough and keeps queries simple.
Many subtypes with substantially different, non-overlapping columns, and you need NOT NULL correctness per subtype → CTI.
Subtypes that are queried almost entirely independently and never need a unified "any vehicle" view → concrete table inheritance.

This is also a place where document databases (see the NoSQL topic) sometimes fit more naturally than a relational schema, since a document can freely have subtype-specific fields without any of these three tradeoffs — worth mentioning if the interview also touches polyglot persistence.

Related Resources

Martin Fowler: Single Table Inheritance