Vertical scaling — bigger machine
Before: 8 vCPU, 32GB RAM database server
After: 32 vCPU, 128GB RAM database server (same single server, upgraded)
Pros: requires no application-level changes — the database is still one logical instance, transactions and joins work exactly as before, no new distributed-systems concerns introduced. Simplest option to reason about.
Cons: there's a hard physical/economic ceiling — eventually you run out of bigger hardware to buy, or it becomes prohibitively expensive. It also doesn't improve availability — a single, larger server is still a single point of failure; if it goes down, the whole database is down.
Horizontal scaling — more machines
Before: 1 database server handling all reads and writes
After: 1 primary (writes) + several read replicas (reads),
or several shards, each holding a subset of the data
Pros: much higher scaling ceiling (in principle, keep adding machines), and can improve availability (a replica can be promoted if the primary fails — see the failover question).
Cons: introduces real distributed-systems complexity — replication lag, choosing a sharding key (see that question), cross-shard queries/joins becoming expensive or impossible, and generally more operational surface area (more machines to monitor, patch, and reason about failure modes for).
How they typically combine in practice
Most systems scale vertically first (it's cheap and simple, and modern hardware ceilings are quite high) and only reach for horizontal scaling once vertical scaling is exhausted or availability requirements demand redundancy regardless of raw capacity needs. Read scaling is usually the first horizontal step (read replicas — see that question), since most application workloads are read-heavy and reads are easier to distribute than writes; write scaling (sharding) is a bigger architectural commitment, usually reserved for when a single primary genuinely can't keep up with write volume.
A strong answer recognizes vertical scaling isn't "the naive option to outgrow" — it's often the right first move because of its simplicity, and premature horizontal scaling (sharding a dataset that would fit comfortably on a bigger single server) adds real complexity for no corresponding benefit.
Related Resources
The basic topology
Writes
|
v
[ Primary ]
/ | \
v v v
[Replica1][Replica2][Replica3] <- receive copies of every write
Replicas apply the same stream of changes the primary made (often via shipping the write-ahead log — see that question) so their data converges to match the primary's, with some delay.
Asynchronous replication
Client -> Primary: write X
Primary -> Client: "success" (acknowledged immediately)
Primary -> Replicas: ships the change (happens after acknowledging the client)
The primary doesn't wait for any replica to confirm receipt before telling the client the write succeeded. Pros: lowest possible write latency, since the client isn't waiting on network round-trips to remote replicas. Cons: if the primary crashes after acknowledging the client but before a replica received the change, that write is lost if a replica is promoted to primary — the replica genuinely never had it.
Synchronous replication
Client -> Primary: write X
Primary -> Replica: ships the change
Replica -> Primary: "received and applied"
Primary -> Client: "success" (only now, after replica confirmation)
Pros: zero data loss on failover — by the time the client is told "success," at least one replica genuinely has the data too, so promoting that replica loses nothing. Cons: meaningfully higher write latency (every write waits on a network round-trip to the replica, and if the replica is slow or unreachable, writes stall or fail depending on configuration) — this cost is paid on every single write, permanently, not just during a failure.
The realistic middle ground: semi-synchronous / quorum-based
Many production systems use a middle configuration — e.g., wait for confirmation from at least one of several replicas (not all), or a quorum, balancing durability guarantees against latency. PostgreSQL supports synchronous_commit tuning with options like remote_write/remote_apply and can designate specific replicas as synchronous while others remain asynchronous, letting you tune exactly how much durability guarantee you're paying latency for.
Choose based on how costly losing the most recent few writes actually is: financial transactions or anything where "we told the customer it succeeded, then it disappeared" is unacceptable strongly favors synchronous (or at least semi-synchronous/quorum) replication for the primary write path; less critical data (analytics events, non-critical logs) can usually tolerate asynchronous replication's small window of potential loss in exchange for consistently lower write latency.
Related Resources
Why shard at all
A single database server has a ceiling on write throughput and total storage, no matter how much you scale vertically. Sharding splits the dataset itself across multiple independent servers, each handling only its own subset of the data — write throughput and storage both scale roughly linearly with the number of shards, unlike read replicas (which copy the entire dataset onto each replica and only help with read scaling, not write or storage scaling).
Range-based sharding
Shard 1: orders where order_date < 2024-01-01
Shard 2: orders where 2024-01-01 <= order_date < 2025-01-01
Shard 3: orders where order_date >= 2025-01-01
Pros: range queries (WHERE order_date BETWEEN ...) can often be satisfied by a single shard or a small contiguous set of shards. Cons: prone to "hot shards" if writes cluster in a narrow range — e.g., all current writes land on the newest shard (today's date range), leaving older shards idle while the newest shard bears all the write load.
Hash-based sharding
shard_number = hash(customer_id) % number_of_shards
Pros: spreads data (and write load) evenly across shards, since a good hash function distributes keys uniformly regardless of any natural clustering in the original values. Cons: range queries become expensive — "all orders in January" no longer maps to one shard, since hashing destroys the original ordering, so satisfying that query means fanning out to every shard and merging results (a "scatter-gather" query).
Directory-based sharding
Lookup service: customer_id -> shard_3
customer_id -> shard_1
... (explicit mapping, stored and consulted per lookup)
Pros: maximum flexibility — shards can be rebalanced by simply updating the directory's mapping for affected keys, without needing to recompute a hash function or redefine ranges. Cons: the directory/lookup service itself becomes a critical dependency and potential bottleneck/single point of failure that must itself be highly available and fast.
Choosing a shard key — the most consequential decision
A shard key that's too low-cardinality, or that correlates with request "hotness" (e.g., sharding by country when 80% of traffic is from one country), creates a hot shard that bottlenecks the whole system regardless of how many shards exist. A shard key that doesn't align with your most common query patterns forces expensive cross-shard "scatter-gather" queries for routine operations. Good shard keys are high-cardinality, roughly evenly distributed in both storage and access frequency, and align with how data is most commonly queried (ideally, most queries can be satisfied by a single shard once the key is known).
What sharding costs you
- Cross-shard joins/transactions become hard or impossible in the general case — a join across two entities that happen to live on different shards either isn't supported natively or requires application-level fan-out and merging.
- Rebalancing is operationally complex — adding a new shard usually means migrating a subset of data from existing shards to the new one without downtime, a genuinely hard distributed-systems problem many databases (Vitess, Citus for PostgreSQL, MongoDB's native sharding) have built significant tooling around.
- Uneven growth over time can gradually recreate hot shards even with an initially good key choice, requiring ongoing monitoring and occasional rebalancing.
Sharding is a significant architectural commitment that should be a last resort after read replicas, caching, and vertical scaling are exhausted — not a default reached for early, given the ongoing operational complexity it introduces.
Related Resources
The basic pattern
[ Primary ] <- all WRITES go here
/ | \
v v v
[Replica1][Replica2][Replica3] <- READS distributed across these
Application code routes writes to the primary and (some or all) reads to whichever replica is available/least loaded — often via a proxy/load balancer, or explicit read/write connection strings configured in the application.
Why this helps
Most application workloads are read-heavy (often 80-95%+ of database operations are reads). Since replicas can serve reads independently and in parallel, adding more replicas increases total read capacity roughly linearly, without touching the primary's write capacity at all — a much simpler scaling lever than sharding, and doesn't introduce cross-shard query complexity.
The cost: replication lag
Replicas apply changes slightly after the primary commits them (whether via async or sync replication — see that question), so a read hitting a replica immediately after a related write might not see that write yet.
1. Client writes new profile picture URL to the primary. [committed]
2. Client immediately reads their profile from a replica.
3. Replica hasn't received the replicated change yet -- shows the OLD picture.
This is the classic "read-your-own-writes" consistency problem with read replicas — a real, common source of confusing bug reports ("I just changed X and it still shows the old value!"). Mitigations: route a user's own immediate follow-up reads to the primary for a short window after they write, use "read-after-write" consistency features some managed database services provide, or accept the staleness for read paths where it's genuinely not user-visible/critical.
What replicas don't help with
Read replicas scale read throughput, not write throughput or total storage — every replica holds a full copy of the entire dataset, and every write still has to go through (and be replicated from) the single primary. If write volume, not read volume, is the actual bottleneck, replicas don't help — that's what sharding addresses instead.
Read replicas are usually the first and easiest horizontal scaling step for a read-heavy application, since they require far less architectural change than sharding — mostly routing logic in the application/connection layer, rather than redesigning the data model around a shard key. They're a natural fit for reporting/analytics queries too, since routing expensive analytical reads to a dedicated replica isolates that load from the primary's transactional workload entirely.
Related Resources
The failover sequence
1. Primary is healthy, replicating to Replica A and Replica B.
2. Primary crashes (hardware failure, network partition, etc.).
3. A monitoring/orchestration system detects the primary is unresponsive
(via missed heartbeats over some threshold).
4. The most up-to-date, healthy replica (say, Replica A) is PROMOTED to primary.
5. Application connections / a proxy / DNS / a virtual IP are redirected
to point at the newly-promoted Replica A.
6. Replica B is reconfigured to replicate from the new primary (Replica A).
Key metrics that define "how good" a failover strategy is
- RTO (Recovery Time Objective) — how long the system is actually down/unavailable during a failover, from detection to the new primary accepting traffic. Automated failover systems can often achieve RTOs of seconds to low minutes; manual intervention can take much longer.
- RPO (Recovery Point Objective) — how much data (measured in time) could be lost in the worst case. With synchronous replication, RPO can be effectively zero; with asynchronous replication, RPO is bounded by however far behind the promoted replica was at the moment of failure (see the synchronous vs. asynchronous replication question).
Detecting failure correctly is harder than it sounds
A naive health check (a single missed heartbeat) risks false positives — briefly failing over due to a transient network blip, not an actual primary failure — which is disruptive and risky in its own right (a "split-brain" scenario, where both the old primary, which actually recovers a moment later, and the newly-promoted replica both believe they're the primary, is a serious and hard-to-clean-up failure mode). Real HA systems use consensus mechanisms or a quorum of independent observers (not a single health checker) to confirm a primary is truly down before triggering promotion, specifically to avoid this.
Components of a full HA setup
- Replication — at least one replica must be reasonably current to promote.
- Health monitoring / consensus — reliably detects genuine failure without over-triggering on transient issues.
- Automated promotion — a replica is reconfigured to accept writes as the new primary.
- Client redirection — a proxy, load balancer, virtual IP, or DNS update routes traffic to the new primary without requiring every application instance to be manually reconfigured.
- Re-establishing replication topology — surviving replicas need to start following the new primary, and (ideally) the old primary, if it recovers, needs to safely rejoin as a replica rather than as a conflicting second primary.
Managed services vs. self-managed
Cloud-managed database services (AWS RDS/Aurora, Azure SQL, Google Cloud SQL) handle most of this automatically as a built-in feature — often with RTOs in the tens of seconds. Self-managed HA (e.g., PostgreSQL with Patroni + etcd, or MySQL with Orchestrator) requires assembling these pieces explicitly, which is more work but gives more control over exact behavior and thresholds.
Knowing the terms RTO/RPO, and being able to explain the split-brain risk and why naive health-checking is dangerous, demonstrates real operational experience with HA — beyond just "you have a backup server that takes over."