What is sharding, and what are common sharding strategies?
Quick Answer
Sharding splits a dataset horizontally across multiple independent database instances (shards), each holding a subset of the rows, so that both storage and write throughput scale by adding more shards. Common strategies: **range-based** (partition by a value range, e.g., dates), **hash-based** (partition by a hash of the key, spreading data evenly), and **directory-based** (a lookup service maps each key to its shard explicitly). The choice of shard key is the single most consequential decision — a bad choice causes uneven load ("hot shards") or expensive cross-shard queries.
Detailed Answer
Why shard at all
A single database server has a ceiling on write throughput and total storage, no matter how much you scale vertically. Sharding splits the dataset itself across multiple independent servers, each handling only its own subset of the data — write throughput and storage both scale roughly linearly with the number of shards, unlike read replicas (which copy the entire dataset onto each replica and only help with read scaling, not write or storage scaling).
Range-based sharding
Shard 1: orders where order_date < 2024-01-01
Shard 2: orders where 2024-01-01 <= order_date < 2025-01-01
Shard 3: orders where order_date >= 2025-01-01
Pros: range queries (WHERE order_date BETWEEN ...) can often be satisfied by a single shard or a small contiguous set of shards. Cons: prone to "hot shards" if writes cluster in a narrow range — e.g., all current writes land on the newest shard (today's date range), leaving older shards idle while the newest shard bears all the write load.
Hash-based sharding
shard_number = hash(customer_id) % number_of_shards
Pros: spreads data (and write load) evenly across shards, since a good hash function distributes keys uniformly regardless of any natural clustering in the original values. Cons: range queries become expensive — "all orders in January" no longer maps to one shard, since hashing destroys the original ordering, so satisfying that query means fanning out to every shard and merging results (a "scatter-gather" query).
Directory-based sharding
Lookup service: customer_id -> shard_3
customer_id -> shard_1
... (explicit mapping, stored and consulted per lookup)
Pros: maximum flexibility — shards can be rebalanced by simply updating the directory's mapping for affected keys, without needing to recompute a hash function or redefine ranges. Cons: the directory/lookup service itself becomes a critical dependency and potential bottleneck/single point of failure that must itself be highly available and fast.
Choosing a shard key — the most consequential decision
A shard key that's too low-cardinality, or that correlates with request "hotness" (e.g., sharding by country when 80% of traffic is from one country), creates a hot shard that bottlenecks the whole system regardless of how many shards exist. A shard key that doesn't align with your most common query patterns forces expensive cross-shard "scatter-gather" queries for routine operations. Good shard keys are high-cardinality, roughly evenly distributed in both storage and access frequency, and align with how data is most commonly queried (ideally, most queries can be satisfied by a single shard once the key is known).
What sharding costs you
- Cross-shard joins/transactions become hard or impossible in the general case — a join across two entities that happen to live on different shards either isn't supported natively or requires application-level fan-out and merging.
- Rebalancing is operationally complex — adding a new shard usually means migrating a subset of data from existing shards to the new one without downtime, a genuinely hard distributed-systems problem many databases (Vitess, Citus for PostgreSQL, MongoDB's native sharding) have built significant tooling around.
- Uneven growth over time can gradually recreate hot shards even with an initially good key choice, requiring ongoing monitoring and occasional rebalancing.
Sharding is a significant architectural commitment that should be a last resort after read replicas, caching, and vertical scaling are exhausted — not a default reached for early, given the ongoing operational complexity it introduces.