Data Modeling and Normalization

Difficulty

Normalization is a sequence of rules for structuring tables so that each fact is stored exactly once, eliminating anomalies that occur when you insert, update, or delete data.

1NF — Atomic values, no repeating groups

Every column must hold a single, indivisible value — no comma-separated lists, no repeating column groups (phone1, phone2, phone3).

-- Violates 1NF: phones column holds multiple values
| id | name  | phones                |
|----|-------|------------------------|
| 1  | Alice | 555-1234, 555-5678     |

-- 1NF: one row per phone number
| id | name  | phone     |
|----|-------|-----------|
| 1  | Alice | 555-1234  |
| 1  | Alice | 555-5678  |

2NF — No partial dependency on a composite key

Applies when the primary key is composite (more than one column). Every non-key column must depend on the entire key, not just part of it.

-- order_items(order_id, product_id, product_name, quantity)
-- Primary key: (order_id, product_id)
-- Violation: product_name depends only on product_id, not on the full (order_id, product_id) key

Fix: move product_name to a separate products table keyed by product_id alone, and keep order_items holding only columns that truly depend on the composite key (like quantity).

3NF — No transitive dependency

A non-key column must depend on the key directly — not on another non-key column.

-- employees(id, name, department_id, department_name)
-- Violation: department_name depends on department_id, not directly on id (the key) —
-- this is a transitive dependency: id -> department_id -> department_name

Fix: move department_name into a departments table keyed by department_id; employees keeps only department_id as a foreign key.

BCNF — Every determinant is a candidate key

A stricter version of 3NF: for every functional dependency A -> B, A must be a candidate key. 3NF has a narrow exception BCNF closes — it can happen when a table has multiple overlapping composite candidate keys. Example: a table (student, course, instructor) where each instructor teaches only one course, but a course can have multiple instructors:

-- (student, course) is a candidate key; (student, instructor) is also a candidate key
-- but "instructor -> course" is a dependency where instructor is NOT a candidate key alone
-- This satisfies 3NF but violates BCNF, and can still produce redundancy/anomalies.

Why this matters practically

Each level removes a class of anomaly:

  • Update anomaly: a fact stored in multiple rows can go out of sync if only some copies are updated (e.g., department_name duplicated across every employee row).
  • Insert anomaly: you can't record a fact (e.g., a new department) until an unrelated fact (an employee in it) also exists.
  • Delete anomaly: deleting the last row referencing a fact accidentally deletes the fact itself (e.g., deleting the last employee in a department loses the department's name entirely).

Most production schemas target 3NF as a practical default, reaching for BCNF only when the overlapping-key scenario actually arises, and deliberately denormalizing past 3NF for specific, measured performance reasons (see the denormalization question).

Normalization optimizes for data integrity and minimal redundancy; denormalization trades some of that integrity for read performance. It's a deliberate, measured decision — not a shortcut for skipping schema design.

Common denormalization patterns

Duplicating a rarely-changing lookup value to avoid a join:

-- Instead of always joining orders -> customers for the customer's name,
-- store a snapshot of it directly on the order (useful when the "name at time of order" matters anyway).
ALTER TABLE orders ADD COLUMN customer_name_snapshot VARCHAR(200);

Precomputing an aggregate instead of recalculating it on every read:

-- Instead of SUM(order_items.price) on every page load,
-- maintain orders.total_amount, updated via trigger or application logic when items change.

Flattening a hierarchy for fast lookups (materialized path / closure table):

-- Instead of recursively walking a category tree on every query,
-- store a precomputed ancestor path: categories.path = '/1/14/152/'

Materialized views — the database-native way to denormalize without hand-rolled duplication logic:

CREATE MATERIALIZED VIEW daily_revenue AS
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date;

REFRESH MATERIALIZED VIEW daily_revenue;  -- periodically, or on a trigger

When it's justified

  • Read-heavy, write-light workloads where the same expensive join/aggregation runs on nearly every request (product listing pages, dashboards).
  • Reporting/analytics/OLAP schemas (star/snowflake schemas), where normalized OLTP tables are intentionally flattened into fact/dimension tables for query simplicity and speed.
  • Historical/audit accuracy — sometimes you want a snapshot value that doesn't change even if the source row later does (an order's shipping address shouldn't retroactively change if the customer later edits their profile address).

The cost you're accepting

Every denormalized copy is a place data can drift out of sync if the write path that keeps it updated has a bug, or if a write path is later added that forgets about the duplicate. Mitigate this with database triggers, transactional application-level updates, or by treating denormalized structures as fully derived/rebuildable (like materialized views) rather than hand-maintained. Never denormalize by default — start normalized, and only denormalize a specific, measured hot path once you can show it's actually a bottleneck.

CREATE TABLE users (
    id       SERIAL PRIMARY KEY,          -- chosen primary key
    email    VARCHAR(255) UNIQUE NOT NULL, -- also a candidate key, just not chosen as primary
    username VARCHAR(50)  UNIQUE NOT NULL  -- another candidate key
);

CREATE TABLE orders (
    id      SERIAL PRIMARY KEY,
    user_id INT NOT NULL REFERENCES users(id)  -- foreign key
);

Candidate key

Any column (or minimal combination of columns) that could uniquely identify a row. users above has three candidate keys: id, email, and username — each alone is sufficient to find exactly one row, and none of them can be shrunk further and remain unique (that minimality requirement is what separates a candidate key from just "any unique combination").

Primary key

The one candidate key selected as the table's canonical identifier. Practical differences from other candidate keys:

  • Implicitly NOT NULL (a candidate key enforced only with UNIQUE can still allow NULL, depending on engine).
  • Used by default as the target of foreign keys from other tables.
  • Often backs the table's clustering/physical storage order (see clustered index question).
  • A table can have only one primary key, but multiple other UNIQUE constraints (candidate keys) alongside it.

Foreign key

A column in the referencing table whose values must match an existing value in the referenced table's primary/unique key (or be NULL, if the column is nullable). It's the mechanism that enforces relationships between tables:

INSERT INTO orders (user_id) VALUES (999);
-- ERROR: violates foreign key constraint — no user with id 999 exists

Foreign keys also govern cascade behavior (ON DELETE CASCADE, ON DELETE SET NULL, ON DELETE RESTRICT) — see the referential integrity question for details.

Why the distinction matters in interviews

Interviewers use this question to check that you understand uniqueness (candidate key) is a broader concept than the chosen identifier (primary key), and that a foreign key isn't a special data type — it's a constraint enforcing that a value in one table must correspond to a real row in another.

Related Resources

-- Surrogate key: meaningless, generated
CREATE TABLE products (
    id SERIAL PRIMARY KEY,        -- surrogate
    sku VARCHAR(50) UNIQUE        -- natural key, still enforced unique
);

-- Using a natural key directly as primary key
CREATE TABLE countries (
    iso_code CHAR(2) PRIMARY KEY  -- natural key: 'US', 'GB', 'DE'
);

Natural key

Pros:

  • Self-documenting — iso_code = 'US' is meaningful without a join.
  • No extra lookup needed if the business already has the value on hand.

Cons:

  • Business "invariants" break more often than expected — social security numbers get reissued in rare cases, email addresses get reused after account deletion, product SKUs get renumbered during a re-branding. Once a natural key is used as a foreign key target in many other tables, correcting it later means cascading updates everywhere.
  • Often wider/composite (e.g., a natural key might need multiple columns), which makes every foreign key referencing it wider too, increasing index size and join cost.

Surrogate key

Pros:

  • Guaranteed stable — it's meaningless, so there's never a business reason to change it.
  • Usually a single, small, fixed-width column (integer or UUID), which keeps foreign keys and their indexes compact and fast to join.
  • Decouples the database's internal identity from business rules, which can (and do) change.

Cons:

  • Meaningless in isolation — reading a raw id = 8271 tells you nothing without a lookup.
  • Still need a UNIQUE constraint on the natural key anyway if it must remain unique for business reasons (as in the sku example above) — the surrogate key doesn't eliminate the need to validate real-world uniqueness, it just avoids using that value as the relational identifier.

Auto-increment integer vs UUID as a surrogate

  • Auto-increment integer: compact (4–8 bytes), sequential, fast for B-tree index inserts (append-mostly), but reveals row count/creation order and doesn't work well for merging data generated across multiple independent systems (collisions).
  • UUID: globally unique without coordination (good for distributed/offline-generated IDs, merging data from multiple sources), but larger (16 bytes), and random UUIDs (v4) cause index fragmentation because inserts land at random points in a B-tree rather than appending at the end — UUIDv7 (time-ordered) mitigates this by keeping inserts roughly sequential while remaining globally unique.

Default to a surrogate primary key for internal relational integrity, and add a UNIQUE constraint on any natural-key column that truly must be unique for business reasons. This gives you the stability of a surrogate key without giving up validation of real-world uniqueness.

CREATE TABLE products (
    id         SERIAL PRIMARY KEY,                       -- PK: unique + not null
    sku        VARCHAR(50) NOT NULL UNIQUE,               -- required, no duplicates
    price      NUMERIC(10,2) NOT NULL CHECK (price >= 0), -- arbitrary boolean rule
    category_id INT REFERENCES categories(id),            -- FK: must exist in categories
    created_at TIMESTAMP NOT NULL DEFAULT now(),           -- default value if omitted
    status     VARCHAR(20) NOT NULL DEFAULT 'active'
               CHECK (status IN ('active', 'discontinued', 'draft'))
);

NOT NULL

Rejects inserts/updates that omit a value for that column. The simplest and most impactful constraint — a huge share of real-world "why is this NULL when it shouldn't be" bugs are just missing NOT NULL constraints that should have been there from the start.

UNIQUE

Guarantees no two rows share the same value in that column (or combination of columns, for a composite UNIQUE constraint). Backed internally by a unique index. Most engines allow multiple NULLs in a UNIQUE column, since NULL <> NULL under three-valued logic (see the NULL question) — NULL isn't considered a duplicate of another NULL.

PRIMARY KEY

Shorthand for NOT NULL + UNIQUE, plus the semantic meaning of "this is the table's canonical row identifier" and (in most engines) the default target for foreign keys and often the physical clustering key.

FOREIGN KEY

Requires the column's value to match an existing value in the referenced table's primary/unique key, or be NULL if the column is nullable. Enforces relational integrity — you can't have an order pointing at a customer that doesn't exist.

CHECK

An arbitrary boolean expression evaluated per row on insert/update — the most flexible constraint, used for business rules a simple type/uniqueness check can't express (price >= 0, end_date > start_date, status IN (...)).

DEFAULT

Not a validity constraint per se, but supplies a value automatically when a column is omitted from an INSERT — commonly used for created_at timestamps, or a sensible default status/flag.

Why enforce these in the database rather than just in application code

Constraints are the last line of defense against bad data — they catch bugs from any write path (a new microservice, a one-off migration script, a manual psql fix) that an application-layer validation library only protects if every writer remembers to call it. They also let the query optimizer make stronger assumptions (e.g., a NOT NULL foreign key guarantees a matching row exists, which can simplify join planning).

Related Resources