What's the difference between SQL and NoSQL databases?

SQL (relational) databases enforce a fixed schema, store data in tables of rows and columns, express relationships via foreign keys, and typically prioritize strong consistency (ACID transactions). NoSQL databases relax one or more of these: they allow flexible or schemaless documents, favor denormalized data models, and often prioritize availability, partition tolerance, and horizontal scalability over strict consistency (per the CAP theorem). Neither is universally "better"; the choice depends on data shape, consistency needs, and scale requirements.

Explain the CAP theorem and what it means for distributed databases

The CAP theorem states that a distributed data system can deliver at most two of three properties at once, and that the limit bites when a network partition occurs: **Consistency** (every read sees the most recent write), **Availability** (every request gets a response, even if not the latest data), and **Partition tolerance** (the system keeps working despite network failures splitting it into isolated groups). Since partitions are an unavoidable reality in any real distributed system, the practical choice is between **CP** (consistent but may refuse requests during a partition) and **AP** (available but may return stale data during a partition).

What are the main NoSQL data models, with an example database for each?

**Document** stores (MongoDB, Couchbase) store semi-structured JSON/BSON-like documents. **Key-value** stores (Redis, DynamoDB) map a unique key to an opaque value, optimized for simple, extremely fast lookups. **Wide-column** stores (Cassandra, HBase, Bigtable) organize data into rows with dynamic, sparse columns grouped into column families, built for massive write scale. **Graph** databases (Neo4j, Amazon Neptune) model data as nodes and relationships, optimized for traversing connections.

When would you choose a document database like MongoDB over a relational database?

Reach for a document database when your data's shape varies significantly between records, when you naturally read/write "the whole object" together rather than joining many normalized pieces, when the schema needs to evolve quickly without coordinated migrations, or when you need to scale writes horizontally more easily than a traditional single-primary relational setup allows. Avoid it when you need strong multi-record transactional guarantees across many related entities, or when your data is genuinely relational (many-to-many, deeply normalized) and would just be fighting the document model.

What is eventual consistency, and when is it an acceptable tradeoff?

Eventual consistency means that after a write, different replicas/nodes may temporarily disagree about the current value, but they're guaranteed to converge to the same value once no further writes occur and replication has had time to propagate. It's an acceptable tradeoff when brief staleness has low real-world cost (view counts, "likes," non-critical caches) but not for correctness-critical data where any window of disagreement is dangerous (account balances, inventory counts that must never oversell, security/permission checks).

What's the difference between BASE and ACID?

ACID (Atomicity, Consistency, Isolation, Durability) prioritizes strict correctness guarantees on every operation — the traditional relational model. BASE (**B**asically **A**vailable, **S**oft state, **E**ventual consistency) is the informal counterpart describing many distributed NoSQL systems' philosophy: prioritize staying available and responsive even under failure, accept that the system's state may be temporarily inconsistent ("soft"), trusting it converges to consistency eventually rather than immediately.

How do you model relationships in a document database — embedding vs. referencing?

**Embedding** nests related data directly inside the parent document, ideal when the related data is always accessed together with the parent and doesn't need to be queried/updated independently. **Referencing** stores just an ID pointing to a document in another collection (similar to a foreign key), better when the related data is large, frequently changes independently, is shared across many parents, or is queried on its own. Most real schemas mix both, choosing per relationship based on access patterns.

What is a graph database, and what problems is it well suited for?

A graph database stores data as nodes (entities) and edges (relationships between them), with both nodes and edges able to carry properties, and is purpose-built for efficiently traversing and querying connections — especially multi-hop, variable-depth paths that are expensive to express as repeated relational joins or recursive CTEs. It excels at social networks, recommendation engines, fraud detection, and knowledge graphs.

Can you use SQL-like querying in NoSQL systems?

Yes, though the specifics vary widely by product. MongoDB has its own JSON-based query language and a powerful aggregation pipeline (conceptually similar to SQL's `WHERE`/`GROUP BY`/joins, expressed as a pipeline of stages). Cassandra has CQL (Cassandra Query Language, syntactically SQL-like but with important semantic restrictions). Some systems go further: Couchbase's N1QL is a SQL dialect for JSON documents, and engines such as Presto/Trino or BigQuery run full SQL over data held in non-relational or semi-structured storage. The native NoSQL query languages generally don't match relational SQL's join capabilities or query-optimizer sophistication, since the underlying storage model isn't relational.

NoSQL and Polyglot Persistence

When and why to reach beyond relational databases — document, key-value, wide-column, and graph stores.

Structural differences

| | SQL (relational) | NoSQL |
| --- | --- | --- |
| Schema | Fixed, defined upfront, enforced by the engine | Flexible/dynamic, often enforced (if at all) by the application |
| Data model | Tables, rows, columns, normalized via foreign keys | Documents, key-value pairs, wide columns, or graphs (varies by type) |
| Relationships | Joins across normalized tables | Usually denormalized/embedded, or handled application-side |
| Query language | SQL (largely standardized) | Varies per product (MongoDB query language, Cypher for Neo4j, etc.) |
| Consistency model | Typically strong (ACID transactions) | Varies; many default to eventual consistency for scalability |
| Scaling | Traditionally vertical (bigger server); horizontal via sharding is possible but harder to bolt on | Many designed from the ground up for horizontal scaling/sharding |

Why NoSQL emerged

Relational databases enforce a fixed schema and strong consistency, which is exactly right for data with stable structure and strict correctness needs (financial ledgers, inventory), but can be a poor fit for: data whose shape varies significantly between records (product catalogs with wildly different attributes per category), extremely high write throughput distributed across many nodes, or workloads where slightly stale reads are an acceptable tradeoff for lower latency and higher availability at scale.

What you give up with most NoSQL systems

Joins — most NoSQL databases either don't support them or support them poorly; the data model is usually designed to avoid needing them by embedding related data together (see the embedding vs. referencing question).
Strict schema enforcement — flexibility cuts both ways: it's easy to evolve the data model, but it's also easy to accumulate inconsistent documents (some missing fields, some using different types for the "same" field) without the database itself catching it.
Multi-record ACID transactions — historically weaker or entirely absent in early NoSQL systems (though many, like MongoDB since v4.0, have since added multi-document transaction support, narrowing this gap).
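
When the database does not enforce a schema, that check usually moves into application code. The following is a minimal sketch of such a check, not any driver's API: `EXPECTED_FIELDS` and `find_schema_problems` are hypothetical names invented for illustration.

```python
# Hypothetical expected shape for a "user" document: field name -> required type.
EXPECTED_FIELDS = {"name": str, "age": int, "email": str}


def find_schema_problems(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the document conforms."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            actual = type(doc[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


docs = [
    {"name": "Alice", "age": 30, "email": "alice@example.com"},
    {"name": "Bob", "age": "41"},  # age stored as a string, email missing
]

for doc in docs:
    print(doc.get("name"), find_schema_problems(doc))
```

The second document is exactly the kind of drift a schemaless store accepts silently: a missing field and a "same" field with a different type.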

The realistic modern answer: polyglot persistence

Most non-trivial systems today use both — a relational database for the core transactional data (orders, accounts, inventory) where consistency and relationships matter most, plus one or more NoSQL stores for specific workloads that fit them better (a document store for a flexible content catalog, a key-value store for session/cache data, a graph database for a recommendation engine). The interview-relevant skill isn't "SQL vs. NoSQL" as an either/or — it's recognizing which data shape and consistency requirement each part of a system actually has, and picking the right tool per use case.

The three properties

Consistency (C): every node returns the most recent write for any given read — all nodes see the same data at the same time.
Availability (A): every request receives a (non-error) response, even if some nodes can't communicate with each other.
Partition tolerance (P): the system continues operating even when network communication between nodes is disrupted (a "partition" — some nodes can't reach others).

Why it's really "pick 2 of 3" only during a partition

In a system with no network partitions, you can actually have both C and A simultaneously — the theorem's bite only applies during an actual partition event. And because partitions are a real, unavoidable fact of distributed systems (networks fail, nodes get cut off, packets get dropped), partition tolerance isn't really optional for any system that's genuinely distributed across multiple nodes — so the real-world choice collapses to CP vs. AP: when a partition happens, do you sacrifice consistency (keep serving requests, possibly with stale data) or sacrifice availability (refuse requests from the cut-off nodes until the partition heals, to guarantee consistency)?

CP systems — consistency over availability during a partition

When a partition occurs, a CP system will refuse to serve (or will block) requests on the minority/cut-off side rather than risk returning stale or conflicting data.

Examples: traditional relational databases in a synchronous-replication configuration, HBase, MongoDB (in its default configuration, favoring consistency via a single primary that must be reachable for writes), ZooKeeper/etcd (consensus-based coordination systems).

AP systems — availability over consistency during a partition

When a partition occurs, an AP system keeps accepting reads/writes on both sides of the partition, accepting that different sides may temporarily disagree — reconciling the divergence once the partition heals (see the eventual consistency question).

Examples: Cassandra, DynamoDB (in its default/eventually-consistent read mode), CouchDB.
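
The difference between the two behaviors can be sketched with a toy model. This is a deliberately simplified illustration: `ToyCluster` is a hypothetical class with only two replicas of one value, whereas real systems rely on quorums, leader election, and more nodes.

```python
class ToyCluster:
    """Two replicas ("A" and "B") of a single integer value."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"A": 100, "B": 100}
        self.partitioned = False

    def write(self, node: str, value: int) -> bool:
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: replicate to both nodes before acknowledging.
            self.values = {"A": value, "B": value}
            return True
        if self.mode == "CP":
            # Cannot reach the other replica, so refuse rather than diverge.
            return False
        # AP: accept locally and let the replicas disagree for now.
        self.values[node] = value
        return True

    def read(self, node: str) -> int:
        return self.values[node]


for mode in ("CP", "AP"):
    cluster = ToyCluster(mode)
    cluster.partitioned = True
    accepted = cluster.write("A", 101)
    print(mode, "accepted:", accepted, "| A:", cluster.read("A"), "| B:", cluster.read("B"))
```

In CP mode the write is rejected and both replicas still agree; in AP mode the write succeeds and the replicas diverge until the partition heals.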

Why this isn't really about "SQL vs NoSQL"

CAP is a property of a distributed system's design choices, not an inherent property of "relational" vs. "NoSQL" — a single-node relational database isn't meaningfully subject to CAP at all (there's nothing to partition), but a multi-region relational deployment absolutely is, and some NoSQL systems (like MongoDB by default) actually lean CP rather than AP. Many modern databases also let you tune this per-operation (e.g., Cassandra's per-query consistency levels, DynamoDB's strongly-consistent vs. eventually-consistent reads) rather than being a single fixed system-wide choice.
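
For quorum-replicated stores, per-operation tuning is often summarized by a rule of thumb: with N replicas per key, a read that consults R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. A small sketch of that arithmetic follows; the function name is hypothetical, and the rule ignores real-world complications such as concurrent writes and node failures mid-operation.

```python
def read_overlaps_latest_write(n: int, w: int, r: int) -> bool:
    """n = replicas per key, w = acks required per write, r = replicas consulted per read."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


print(read_overlaps_latest_write(n=3, w=2, r=2))  # quorum writes plus quorum reads
print(read_overlaps_latest_write(n=3, w=1, r=1))  # fastest setting, reads may be stale
```

With three replicas, quorum reads and writes (2 + 2 > 3) always share at least one replica, while single-replica reads and writes (1 + 1) may miss each other entirely.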

A strong answer doesn't just recite "Consistency, Availability, Partition tolerance" — it explains why the real tradeoff is CP vs. AP (since P is non-negotiable for a genuinely distributed system), and can name what a specific real system chooses and why that fits its use case (e.g., a banking ledger favoring CP because stale balance reads are unacceptable; a social media "like counter" favoring AP because a few seconds of staleness is an acceptable tradeoff for never showing an error to the user).

Document stores — MongoDB, Couchbase, Firestore

Store self-contained, semi-structured documents (typically JSON/BSON), where related data is often embedded directly in one document rather than normalized across tables.

{
  "_id": "user_123",
  "name": "Alice",
  "addresses": [
    {"type": "home", "city": "Austin"},
    {"type": "work", "city": "Dallas"}
  ]
}

Best fit: content with variable/nested structure per record, rapid schema iteration, read patterns that naturally want "the whole object" in one fetch.

Key-value stores — Redis, DynamoDB, Memcached

The simplest model: a unique key maps to an opaque value (string, blob, or a richer structure in Redis's case — lists, sets, hashes). No querying by value content in the basic model — you fetch by key, full stop.

SET session:abc123 '{"user_id": 42, "expires": "2026-07-05T00:00:00Z"}'
GET session:abc123

Best fit: caching, session storage, feature flags, rate limiting — anything with simple, extremely high-throughput lookups by a known key.

Wide-column stores — Cassandra, HBase, Google Bigtable

Rows can have a different, sparse set of columns, and columns are grouped into "column families" stored together on disk — optimized for very high write throughput and horizontal scale across many nodes, with each row addressable by a partition key.

Row key: user_123
  Column family "profile":   name=Alice, email=alice@example.com
  Column family "activity":  last_login=2026-07-01, login_count=57

Best fit: time-series data, IoT sensor readings, massive-scale write-heavy workloads (Cassandra was originally built at Facebook for exactly this kind of scale).

Graph databases — Neo4j, Amazon Neptune, ArangoDB

Model data explicitly as nodes (entities) and edges (relationships), with relationships as first-class citizens that can themselves carry properties — optimized for traversing and querying connections, not just individual records.

MATCH (a:Person {name: 'Alice'})-[:FOLLOWS]->(b:Person)-[:FOLLOWS]->(c:Person)
WHERE NOT (a)-[:FOLLOWS]->(c) AND c <> a
RETURN DISTINCT c AS suggested_follow

Best fit: social networks, recommendation engines, fraud detection (tracing chains of connected transactions), and any domain where "how are these things related, possibly several hops deep" is the core query pattern — a relationship a relational database would need several expensive joins (or a recursive CTE) to express.

Choosing between them

The decision should follow from your actual query patterns: "I always fetch this whole record together" points to document; "I only ever look things up by a single known key" points to key-value; "I write enormous volumes of data that rarely needs complex ad-hoc querying" points to wide-column; "my core question is about relationships/paths between entities" points to graph. Defaulting to relational and only reaching for one of these when the data shape or scale genuinely demands it is usually the right instinct.

Good fit signals

Variable, nested, or evolving schema per record. A product catalog where a "shoe" and a "laptop" have almost entirely different sets of attributes maps naturally onto flexible documents; forcing it into a fixed relational schema means either a very wide table full of mostly-NULL columns, or an entity-attribute-value pattern that's awkward to query.

{ "type": "shoe", "size": 10, "width": "medium", "color": "black" }
{ "type": "laptop", "ram_gb": 16, "screen_inches": 14, "cpu": "M3" }

Read/write "the whole object" as a unit. If your application almost always fetches an entire user profile (with nested preferences, addresses, recent activity) in one shot, storing it as one document avoids the join cost of assembling it from several normalized tables every time.

Rapid schema iteration without coordinated migrations. Adding a new optional field to a document requires no schema migration — new documents can simply include it, older documents without it are handled with a default in application code. A relational schema change (ALTER TABLE ADD COLUMN) is usually safe too in modern engines, but document stores make this iteration even more frictionless for genuinely unstructured/varying data.
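
Handling older documents "with a default in application code" can be as small as the sketch below. The `newsletter` field and the `wants_newsletter` helper are hypothetical, chosen only to illustrate the pattern.

```python
# Hypothetical documents: the older one predates the optional "newsletter" field.
old_doc = {"_id": "user_1", "name": "Alice"}
new_doc = {"_id": "user_2", "name": "Bob", "newsletter": True}


def wants_newsletter(doc: dict) -> bool:
    # Treat a missing field as the default (False) instead of migrating old documents.
    return doc.get("newsletter", False)


print(wants_newsletter(old_doc), wants_newsletter(new_doc))
```

The cost of skipping the migration is that every reader of the field has to apply the same default consistently.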

Horizontal write scaling. Document databases like MongoDB are built with sharding as a first-class, well-supported feature from the ground up, often making it more straightforward to scale write throughput horizontally than retrofitting sharding onto a relational deployment.

Poor fit signals — stick with relational

Deeply relational data with many meaningful many-to-many relationships. If your domain genuinely has complex, normalized relationships that need to be queried from many different angles (not just "fetch this one entity and its nested children"), you'll end up either duplicating data across many documents (with the sync/consistency problems that brings) or re-implementing joins in application code — usually a worse position than just using a relational database with proper foreign keys.

Strong consistency across multiple related entities. If a business operation must atomically update several distinct entities together with strict guarantees (a financial transaction touching multiple accounts), a relational database's mature multi-table transaction support is the safer, more battle-tested default — even though modern MongoDB does support multi-document ACID transactions, it's a comparatively newer feature.

Complex ad-hoc reporting/analytics across the full dataset. SQL's join and aggregation capabilities, plus the surrounding tooling ecosystem (BI tools, reporting frameworks), are generally more mature for cross-cutting analytical queries than most document databases' query languages.

The realistic answer

Most production systems benefit from evaluating this per data domain, not system-wide — a core "orders and accounts" domain often stays relational for consistency and reporting, while a "product catalog" or "user activity feed" domain might genuinely be better served by a document store, coexisting in the same overall architecture (polyglot persistence).

What it means concretely

Client writes "likes = 101" to Replica A.
Replica A immediately has the new value.
Replica B and Replica C haven't received the replication update yet.

A read hitting Replica B right now might still return "likes = 100" --
temporarily inconsistent with Replica A's already-committed value.

Given enough time (milliseconds to seconds, typically), replication
catches up, and B and C converge to "likes = 101" too.

This is the AP side of the CAP theorem in practice: rather than blocking the read (or the write) until every replica agrees (which would sacrifice availability/latency), the system accepts reads that might be momentarily stale, trusting that convergence happens shortly afterward.
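
The likes scenario above can be simulated in a few lines. This is a toy model under stated assumptions: `LaggingReplicaSet` is a hypothetical class in which replica A acknowledges writes immediately and followers B and C only catch up when `replicate()` is called.

```python
from collections import deque


class LaggingReplicaSet:
    """Replica A takes writes; followers apply them only when replicate() runs."""

    def __init__(self, followers: list[str], initial: int):
        self.primary_value = initial
        self.follower_values = {name: initial for name in followers}
        self.pending = deque()  # updates not yet delivered to followers

    def write(self, value: int) -> None:
        self.primary_value = value  # acknowledged immediately on replica A
        self.pending.append(value)  # followers will see it later

    def read(self, follower: str) -> int:
        return self.follower_values[follower]  # may be stale

    def replicate(self) -> None:
        """Deliver all pending updates, in order, to every follower."""
        while self.pending:
            value = self.pending.popleft()
            for name in self.follower_values:
                self.follower_values[name] = value


likes = LaggingReplicaSet(["B", "C"], initial=100)
likes.write(101)
print("before replication, B reads", likes.read("B"))
likes.replicate()
print("after replication, B reads", likes.read("B"))
```

The read between `write` and `replicate` is the staleness window; once replication runs with no further writes, every replica holds the same value.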

When it's a fine tradeoff

Social media counters (likes, view counts, follower counts) — a few seconds of staleness or a slightly-off count is invisible/irrelevant to users, and the alternative (blocking every like/view to synchronously update every replica) would add unacceptable latency at massive scale for negligible correctness benefit.
Non-critical caches — a cache that's occasionally a few seconds stale is, by definition, an acceptable tradeoff (that's the whole premise of caching).
Content/CDN distribution — a page update propagating to edge servers over a few seconds/minutes is standard and expected.
DNS — the textbook example of eventual consistency at internet scale: DNS record changes propagate over time (per TTL), and the world briefly seeing old vs. new records simultaneously is an accepted, designed-for tradeoff.

When it's dangerous

Financial balances/ledgers — reading a stale balance and allowing a withdrawal based on it can produce genuine financial loss (overdraft that shouldn't have been allowed).
Inventory that must never oversell — two replicas both believing "1 item left" and both allowing a purchase results in overselling.
Security/permission checks — a stale "user has access" read could grant access that was just revoked, a real security gap.
Uniqueness enforcement — two concurrent writes on different replicas, both believing a username is available, can both succeed, violating a uniqueness invariant the business actually depends on.
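
The overselling case can be reproduced with a short simulation. It is a sketch only: `try_purchase` and the two-entry `stock_view` dictionary are hypothetical stand-ins for two replicas that have not yet exchanged updates.

```python
def try_purchase(replica_stock: dict, replica: str) -> bool:
    """Sell one unit if this replica's local (possibly stale) view shows stock."""
    if replica_stock[replica] > 0:
        replica_stock[replica] -= 1
        return True
    return False


# Both replicas start from the same replicated state: one item left.
initial_stock = 1
stock_view = {"A": initial_stock, "B": initial_stock}

sold_on_a = try_purchase(stock_view, "A")  # succeeds against replica A
sold_on_b = try_purchase(stock_view, "B")  # also succeeds: B has not seen A's sale

units_sold = int(sold_on_a) + int(sold_on_b)
print("units sold:", units_sold, "with", initial_stock, "in stock")
```

Each replica made a locally valid decision, yet together they sold more than existed, which is why this kind of invariant needs a strongly consistent check.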

The skill being tested isn't "know the definition" — it's being able to reason about which specific pieces of data in a system can tolerate eventual consistency and which genuinely can't, and choosing a storage/consistency strategy per data type accordingly (e.g., storing account balances in a strongly-consistent store while storing view counts in an eventually-consistent one, even within the same overall application).

ACID — strict, transactional guarantees

Recall: Atomicity, Consistency, Isolation, Durability (see the ACID question) — a transaction is a strict, all-or-nothing unit that leaves the database in a definitively valid, immediately-consistent state once committed. This is the traditional relational database philosophy, prioritizing correctness even at some cost to availability/latency under failure or contention.

BASE — the informal AP-leaning counterpart

BASE isn't a formal specification the way ACID is a formal set of guarantees — it's a looser, descriptive term (coined specifically as a contrast to ACID) capturing the philosophy behind many distributed NoSQL systems:

Basically Available — the system guarantees a response (availability) even during a partial failure/partition, even if that response might be based on somewhat stale data.
Soft state — the system's state may change over time even without new input, purely due to eventual consistency mechanisms (replication catching up, conflict resolution reconciling divergent writes) — the state isn't a fixed, immediately-settled fact the instant a write happens.
Eventual consistency — given enough time without new writes, all replicas converge to the same value (see that question).

Side-by-side

| | ACID | BASE |
| --- | --- | --- |
| Priority | Consistency and correctness | Availability and responsiveness |
| Typical systems | Relational databases (PostgreSQL, MySQL, SQL Server) | Many distributed NoSQL systems (Cassandra, DynamoDB) |
| Consistency timing | Immediate, guaranteed at commit | Eventual, converges over time |
| CAP alignment | Leans CP | Leans AP |

Why this distinction matters practically

It's less about picking "ACID databases" vs "BASE databases" wholesale, and more about recognizing that these represent two different points on a spectrum of tradeoffs, and that some systems let you choose per-operation (e.g., DynamoDB offers both eventually-consistent reads, cheaper/faster, and strongly-consistent reads, more expensive/slower, on the same table). A candidate who can explain why a system might deliberately choose BASE-style guarantees for some data (accepting brief inconsistency for availability/scale) while insisting on ACID guarantees for other data (a financial ledger) demonstrates a more mature understanding than one who treats "ACID good, BASE bad" or vice versa as a blanket rule.

Document databases have no native joins (or only limited, often less efficient support for them, like MongoDB's $lookup), so the modeling decision of embed-vs-reference is one of the most consequential design choices in a document schema.

Embedding — nest the related data directly

{
  "_id": "order_789",
  "customer_name": "Alice",
  "items": [
    { "product": "Widget", "qty": 2, "price": 9.99 },
    { "product": "Gadget", "qty": 1, "price": 19.99 }
  ]
}

Good fit: order line items — they're always fetched together with the order, rarely if ever queried independently of it, and there's a natural "belongs to exactly one parent" (order) relationship. One read fetches the complete, usable object with zero joins.

Referencing — store just an ID, like a foreign key

// customers collection
{ "_id": "cust_123", "name": "Alice", "email": "alice@example.com" }

// orders collection
{ "_id": "order_789", "customer_id": "cust_123", "items": [...] }

Good fit: the customer record — it's large, changes independently of any given order, is referenced by many orders (embedding it would duplicate the customer's full profile into every single order document, and any customer profile update would then need to fan out and update every order that embedded it).
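
The practical cost of referencing is a second lookup performed by the application. A minimal sketch follows, using plain dictionaries as hypothetical stand-ins for the two collections; `load_order_with_customer` is an invented helper, not a driver function.

```python
# Hypothetical in-memory "collections", keyed by _id.
customers = {
    "cust_123": {"_id": "cust_123", "name": "Alice", "email": "alice@example.com"},
}
orders = {
    "order_789": {
        "_id": "order_789",
        "customer_id": "cust_123",  # reference, like a foreign key
        "items": [  # embedded line items
            {"product": "Widget", "qty": 2, "price": 9.99},
            {"product": "Gadget", "qty": 1, "price": 19.99},
        ],
    },
}


def load_order_with_customer(order_id: str) -> dict:
    """Fetch the order, then follow the reference with a second lookup."""
    order = dict(orders[order_id])  # first read (shallow copy)
    order["customer"] = customers[order["customer_id"]]  # second read
    return order


full_order = load_order_with_customer("order_789")
print(full_order["customer"]["name"], len(full_order["items"]))
```

The embedded items arrive with the first read, while the customer requires the extra round trip; in exchange, a change to the customer's email touches exactly one document.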

Decision factors

| Favor embedding when... | Favor referencing when... |
| --- | --- |
| Data is always read together with the parent | Data is often queried/updated independently |
| One-to-few relationship (a handful of items) | One-to-many with many children, or many-to-many (shared across many parents) |
| The child has no independent identity outside the parent | The child is a real standalone entity referenced from multiple places |
| Document size stays reasonable | Embedding would make documents unbounded/huge (e.g., embedding every comment ever made on a popular post) |

The unbounded-growth trap

A common modeling mistake: embedding a collection that can grow indefinitely (e.g., embedding every comment directly inside a blog post document). Most document databases impose a maximum document size (MongoDB: 16MB), and even below that limit, an ever-growing embedded array makes the document progressively more expensive to read/write/reallocate as it grows — this is the classic sign a "child" actually needs its own collection with a reference back to the parent, rather than embedding.
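
The growth is easy to see by measuring a document as its embedded array fills up. The sketch below uses the UTF-8 length of the JSON encoding as a rough proxy, which is an assumption: actual BSON sizes differ. The post and comment contents are hypothetical.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16MB per-document limit mentioned above


def approx_doc_bytes(doc: dict) -> int:
    """Rough proxy: UTF-8 length of the JSON encoding (BSON size will differ)."""
    return len(json.dumps(doc).encode("utf-8"))


post = {"_id": "post_1", "title": "Hello", "comments": []}
sizes = []
for _ in range(3):
    # Each batch appends 1,000 hypothetical comments to the embedded array.
    post["comments"].extend(
        {"author": f"user_{n}", "text": "Nice post!"} for n in range(1000)
    )
    sizes.append(approx_doc_bytes(post))

print(sizes, "limit:", MAX_DOC_BYTES)
```

Every batch makes the whole document larger, so each read or rewrite of the post pays for all comments ever made, long before the hard limit is reached.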

Model per relationship based on actual access patterns, not a blanket rule — it's completely normal (and usually correct) for a single schema to embed some relationships and reference others, mirroring exactly the same judgment call a relational schema designer makes when deciding what to denormalize (see that question) versus keep fully normalized.

The core idea

Nodes represent entities, edges represent relationships between them, and — critically — both can carry their own properties. Traversing from one node to its neighbors (and their neighbors, and so on) is the fundamental, highly optimized operation, unlike a relational database where each additional "hop" typically means another JOIN.

// Cypher (Neo4j's query language)
CREATE (alice:Person {name: 'Alice'})-[:FOLLOWS]->(bob:Person {name: 'Bob'})
CREATE (bob)-[:FOLLOWS]->(carol:Person {name: 'Carol'})

// Find everyone Alice can reach within 2 hops of "FOLLOWS"
MATCH (a:Person {name: 'Alice'})-[:FOLLOWS*1..2]->(reachable)
RETURN reachable.name;
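
To make the traversal idea concrete outside of Cypher, here is a rough Python approximation of the same bounded-hop query: a breadth-first search over an adjacency list. The `follows` data and the `reachable_within` function are hypothetical, and the sketch says nothing about how Neo4j executes the query internally.

```python
from collections import deque

# Hypothetical adjacency list: who each person follows.
follows = {
    "Alice": ["Bob"],
    "Bob": ["Carol"],
    "Carol": ["Dave"],
    "Dave": [],
}


def reachable_within(start: str, max_hops: int) -> set[str]:
    """Breadth-first traversal that stops expanding after max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in follows.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops + 1))
    seen.discard(start)  # report only other people
    return seen


print(sorted(reachable_within("Alice", 2)))
```

Each hop is a direct lookup of a node's neighbors, and the work grows with the number of edges actually visited rather than with the size of the whole dataset.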

Why relational databases struggle with this class of query

Modeling the same "who can Alice reach within N hops" question relationally requires either repeated self-joins (one per hop, so the depth must be known in advance) or a recursive CTE (see that question). The recursive CTE works, but it re-executes a join per level of recursion and can get slow as depth or fan-out grows, and it scales poorly for genuinely deep or variable-depth traversal (unknown number of hops, needed at low latency). Graph databases store adjacency information (which nodes connect to which) in a form optimized for direct traversal, often literally following in-memory pointers between connected nodes, rather than re-computing joins from scratch on every query.
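
For comparison, the relational version of the two-hop question might look like the sketch below. It assumes SQLite through Python's built-in `sqlite3` module and a hypothetical `follows(follower, followee)` table; other engines use similar but not identical recursive CTE syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (follower TEXT, followee TEXT)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave")],
)

# Each recursion level joins the rows found so far back against follows.
query = """
WITH RECURSIVE reach(name, hops) AS (
    SELECT followee, 1 FROM follows WHERE follower = ?
    UNION
    SELECT f.followee, r.hops + 1
    FROM reach AS r JOIN follows AS f ON f.follower = r.name
    WHERE r.hops < ?
)
SELECT DISTINCT name FROM reach ORDER BY name
"""
names = [row[0] for row in conn.execute(query, ("Alice", 2))]
print(names)
```

The query is correct, but every additional level of depth is another join against the `follows` table, which is the cost a graph store's adjacency layout is designed to avoid.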

Well-suited use cases

Social networks — friend-of-friend suggestions, mutual connections, degrees of separation.
Recommendation engines — "customers who bought X also bought Y," especially multi-hop recommendations ("people similar to you, who liked things similar to what you liked").
Fraud detection — tracing chains of transactions/accounts to detect rings of related fraudulent activity that would be invisible looking at any single transaction in isolation.
Knowledge graphs — modeling richly interconnected facts (e.g., "this drug interacts with this condition, which is treated by this other drug...") where the relationships themselves are as important as the entities.
Network/IT infrastructure mapping — dependency graphs between services, where "what breaks if this node goes down" is fundamentally a graph-traversal question.

Where a graph database is the wrong tool

Simple, mostly-tabular data with few or shallow relationships gains little from a graph model and loses the mature tooling, familiar query language, and broad ecosystem support relational (or even document) databases offer. Graph databases are a specialized tool for a specific shape of problem — heavily interconnected data queried primarily via traversal/path-finding — not a general-purpose replacement for relational modeling.

Being able to identify why a recursive CTE or repeated self-joins become painful at scale, and articulating that a graph database's storage/traversal model directly targets that pain point, shows a level of understanding beyond just naming Neo4j as "the graph database."

MongoDB's query language and aggregation pipeline

Simple filtering uses a JSON-based query syntax:

db.orders.find({ status: "shipped", total: { $gt: 100 } });

More complex analytical queries use the aggregation pipeline — a sequence of stages, each transforming the data, conceptually similar to chaining SQL's WHERE → GROUP BY → HAVING → SELECT, but expressed as an explicit list of operations rather than a single declarative statement:

db.orders.aggregate([
  { $match: { status: "shipped" } },                         // like WHERE
  { $group: { _id: "$customer_id", total: { $sum: "$amount" } } }, // like GROUP BY + SUM
  { $sort: { total: -1 } },                                   // like ORDER BY
  { $limit: 10 }                                              // like LIMIT
]);
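
To see what each stage does to the data, the same pipeline can be mimicked step by step in plain Python. This is an illustration of the stage semantics only, using hypothetical order documents; it is not how MongoDB executes the pipeline.

```python
from collections import defaultdict

# Hypothetical order documents, mirroring the fields used in the pipeline above.
orders = [
    {"customer_id": "c1", "status": "shipped", "amount": 50},
    {"customer_id": "c2", "status": "shipped", "amount": 120},
    {"customer_id": "c1", "status": "shipped", "amount": 30},
    {"customer_id": "c3", "status": "pending", "amount": 500},
]

# $match: keep only shipped orders.
matched = [o for o in orders if o["status"] == "shipped"]

# $group with $sum: total amount per customer_id.
totals = defaultdict(int)
for o in matched:
    totals[o["customer_id"]] += o["amount"]
grouped = [{"_id": cid, "total": total} for cid, total in totals.items()]

# $sort by total descending, then $limit to the top 10.
top = sorted(grouped, key=lambda g: g["total"], reverse=True)[:10]
print(top)
```

Each stage consumes the previous stage's output, which is why stage order matters: filtering before grouping keeps the pending order out of the totals.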

$lookup provides a join-like operation across collections, though it's generally less performant and less flexible than a relational join, and MongoDB's data modeling philosophy (see the embedding/referencing question) generally tries to minimize how often you need it in the first place.

Cassandra's CQL

Deliberately styled to look like SQL, which lowers the learning curve for SQL-familiar developers:

SELECT * FROM orders WHERE customer_id = 42;

But CQL has significant semantic restrictions compared to real SQL. Most importantly, it has no joins at all, and it can't efficiently filter on arbitrary non-key columns: you generally must query by the partition key, reflecting how the data is physically distributed across the cluster, and Cassandra rejects many other filters unless you add a secondary index or explicitly opt in with ALLOW FILTERING. A query that "looks like SQL" can still fail or perform terribly if it doesn't match Cassandra's underlying partitioning model.
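
The reason the partition key matters can be shown with a toy model of hash partitioning. Everything here is a simplifying assumption: the three-node cluster, the SHA-256-based `node_for` function, and the two query helpers are hypothetical, and Cassandra's real placement uses a token ring with replication.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical three-node cluster


def node_for(partition_key: str) -> str:
    """Map a partition key to a node via a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Each node stores only the rows whose partition key hashes to it.
storage = {node: [] for node in NODES}
rows = [
    {"customer_id": "42", "status": "shipped"},
    {"customer_id": "42", "status": "pending"},
    {"customer_id": "7", "status": "shipped"},
    {"customer_id": "99", "status": "shipped"},
]
for row in rows:
    storage[node_for(row["customer_id"])].append(row)


def query_by_partition_key(customer_id: str) -> tuple[list[dict], int]:
    """Return matching rows and the number of nodes contacted (always one)."""
    node = node_for(customer_id)
    return [r for r in storage[node] if r["customer_id"] == customer_id], 1


def query_by_status(status: str) -> tuple[list[dict], int]:
    """Filtering on a non-key column has to scan every node."""
    hits = [r for node in NODES for r in storage[node] if r["status"] == status]
    return hits, len(NODES)


print(query_by_partition_key("42")[1], "node vs", query_by_status("shipped")[1], "nodes")
```

A lookup by `customer_id` goes straight to the one node that owns the key, while a filter on `status` has no choice but to ask every node, and that gap widens as the cluster grows.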

Genuinely SQL-compatible layers over NoSQL-adjacent storage

Some systems provide a real SQL query layer on top of non-relational or semi-structured storage — Presto/Trino, AWS Athena (SQL over data in S3), Google BigQuery, or N1QL for Couchbase — letting analysts use full, familiar SQL (including real joins) against data that isn't stored in a traditional relational engine underneath.

The honest answer is "it depends heavily on the specific product" — no single universal "NoSQL SQL" exists, and syntactic similarity to SQL (as with CQL) doesn't guarantee semantic/performance similarity. A candidate who knows this nuance (and specifically that "SQL-like syntax" isn't the same as "relational query power") demonstrates real, hands-on familiarity rather than surface-level knowledge of product names.
