How would you design a backup and disaster recovery strategy for a production database?
Quick Answer
A solid strategy combines full backups (periodic complete snapshots), incremental/differential backups (capturing only changes since the last backup, to reduce storage/time), and continuous write-ahead log archiving (enabling point-in-time recovery to any moment, not just the last backup's timestamp). Store backups off-site/in a different region from the primary, test restores regularly (an untested backup is not a real backup), and define explicit RTO/RPO targets that drive the actual backup frequency and architecture chosen.
Detailed Answer
The building blocks
Full backups — a complete snapshot of the entire database at a point in time. Simple to restore from (just load it), but large and slow to take/store frequently, and any changes since the last full backup are lost if you can only restore from it alone.
Incremental/differential backups — capture only what's changed since the last backup (incremental: since the last backup of any kind; differential: since the last full backup), reducing storage and backup-window time, at the cost of a more complex restore process (apply the full backup, then each subsequent incremental in order).
Write-ahead log (WAL) archiving — continuously shipping the database's write-ahead log (see that question) to durable storage, enabling point-in-time recovery (PITR): restoring to any specific moment, not just the timestamp of the last full/incremental backup. This is what lets you recover to "11:47:32am, one minute before the bad DELETE ran," rather than only to last night's backup (which would lose everything since then).
Full backup (Sunday) -> WAL continuously archived (Mon, Tue, Wed...) -> incident at Wed 2:15pm
Recovery: restore Sunday's full backup, then replay WAL up to Wed 2:14:59pm
Where backups should live
Never solely on the same server/storage as the primary database — a single disk failure, ransomware attack, or data-center-level incident (fire, regional outage) would destroy the primary and its backups together. Best practice: replicate backups to a genuinely separate storage system, ideally in a different physical region/availability zone from the primary.
Test restores — regularly, not hypothetically
An untested backup is, for practical purposes, not a real backup — corruption, incomplete backup scripts, or subtle configuration drift (e.g., a backup script silently failing for weeks without anyone noticing) are common real-world failure modes that only surface when you actually try to restore. Regularly scheduled restore drills (into an isolated environment, verifying data integrity and application functionality against the restored copy) are the only way to have real confidence the backup strategy actually works when it matters.
Let RTO/RPO targets drive the design
- RPO (Recovery Point Objective) — how much data loss (measured in time) is acceptable in the worst case? A low RPO (seconds to minutes) requires continuous WAL archiving/PITR; a higher RPO (hours) might tolerate periodic snapshot-only backups.
- RTO (Recovery Time Objective) — how quickly must the system be back online? A low RTO favors having a warm standby/replica ready to promote (see the failover question) over a from-scratch restore from cold backup storage, which can take hours for a large database.
Additional considerations
- Encryption of backups at rest (they contain the same sensitive data as the live database, and are often stored somewhere with different access controls that need equally rigorous protection).
- Retention policy — how long to keep backups, balancing storage cost, compliance/legal requirements, and the realistic window in which a "we need to recover something from 6 months ago" request might occur.
- Runbook documentation — a written, rehearsed procedure for who does what during an actual disaster, since a crisis is exactly the wrong time to be improvising or hunting for credentials/access.
A strong answer goes beyond "we take backups" to explicitly connect the backup architecture to concrete RTO/RPO numbers, and treats regular restore testing as non-negotiable — this reflects real operational maturity rather than textbook knowledge.