What are best practices for schema migrations in production, especially for zero-downtime deployments?

Detailed Answer

Why naive migrations break zero-downtime deployments

A migration that renames a column, changes its type incompatibly, or drops a column still in use will break any currently-running instance of the old application code the moment it runs — and during a rolling deployment, old and new application code are typically running simultaneously against the same database for some period. A migration must be safe for both code versions to interact with the schema at the same time.

The expand/contract pattern

The standard technique for safely evolving a schema without downtime:

Expand — add the new structure alongside the old, without removing anything yet:

ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);  -- new, nullable, additive

Migrate/backfill — populate the new structure from the old (often in batches, to avoid a single enormous long-running transaction/lock):

UPDATE users SET email_normalized = LOWER(TRIM(email)) WHERE email_normalized IS NULL;

Dual-write — deploy application code that writes to both the old and new columns, while still reading from the old one, ensuring nothing falls out of sync during the transition.
Cut over reads — deploy application code that reads from the new column instead, once you're confident it's fully populated and being kept in sync.
Contract — only after the new code path has been running successfully for a safe period (and old code is fully retired from the rolling deployment), remove the old column:

ALTER TABLE users DROP COLUMN email;

This pattern trades a single risky migration for several small, individually safe, individually reversible steps — each of which leaves the schema in a state both old and new application code can tolerate.

Watch for locking behavior on large tables

Not all ALTER TABLE operations are equally cheap. Adding a nullable column is typically fast/metadata-only in most modern engines. But adding a NOT NULL column with a default, adding certain indexes, or changing a column's type can require rewriting the entire table and taking a long-held lock that blocks reads/writes for the duration — potentially minutes or hours on a very large table, which is unacceptable for a live production system.

-- Can lock the whole table for a long time on large tables, in some engines/versions
ALTER TABLE orders ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'pending';

-- Safer sequence: add nullable first, backfill in batches, THEN add NOT NULL
ALTER TABLE orders ADD COLUMN status VARCHAR(20);
-- ... batch UPDATE to backfill ...
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;  -- now a much cheaper metadata-only check in many engines, once no NULLs remain

Check your specific engine/version's documented locking behavior for each DDL operation — this has genuinely improved over time (recent PostgreSQL versions optimize several previously-locking operations to be much cheaper), so version-specific verification matters more than assuming a rule of thumb still holds.

Tooling for large-scale online schema changes

For very large tables where even a brief lock is unacceptable, dedicated online schema-change tools (gh-ost, pt-online-schema-change for MySQL; similar patterns achievable in PostgreSQL via triggers-based shadow tables, or newer engine-native online DDL support) create a new table with the desired schema, copy data over in the background while capturing ongoing changes, then perform a quick atomic swap — avoiding a long table-level lock entirely.

Recognizing that a migration must be safe for both old and new application code simultaneously (not just "safe for the new code"), and being able to describe the expand/contract pattern concretely, distinguishes real production migration experience from someone who's only run migrations against a database with no concurrent traffic.

What are best practices for schema migrations in production, especially for zero-downtime deployments?

Quick Answer

Detailed Answer

Why naive migrations break zero-downtime deployments

The expand/contract pattern

Watch for locking behavior on large tables

Tooling for large-scale online schema changes

Related Resources

Related Questions