What is a union/layered filesystem, and how does Docker use it?

Detailed Answer

The layered structure of an image

# Layer 1: the base image itself (many layers internally)
FROM node:20-slim
# Layer 2: sets the working directory, creating /app if it does not exist
WORKDIR /app
# Layer 3: adds package.json
COPY package.json .
# Layer 4: adds node_modules (often the largest layer)
RUN npm install
# Layer 5: adds the rest of the application code
COPY . .

Each instruction that changes the filesystem (COPY, RUN, ADD, and WORKDIR when it has to create the directory) produces a new read-only layer that records only the diff from the layer beneath it, not a full copy of the filesystem at that point. Running docker history myapp:1.0 lists these layers with their individual sizes.
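To make "only the diff" concrete, here is a minimal Python sketch. It assumes a toy model in which a filesystem snapshot is a dict mapping a path to its contents; the function name layer_diff and the sample paths are hypothetical, and this is not how Docker itself computes layers.

```python
def layer_diff(before: dict, after: dict) -> dict:
    """Return only what changed between two filesystem snapshots.

    Toy model (an assumption for illustration): a snapshot maps
    path -> file contents as a string.
    """
    # Files that are new, or whose contents differ from the previous state.
    changed = {path: data for path, data in after.items()
               if before.get(path) != data}
    # Files that existed before but are gone now.
    deleted = sorted(path for path in before if path not in after)
    return {"added_or_changed": changed, "deleted": deleted}


# State after `COPY package.json .`, then after `RUN npm install`.
after_copy = {"/app/package.json": "pkg"}
after_install = {**after_copy, "/app/node_modules/express/index.js": "lib"}

diff = layer_diff(after_copy, after_install)
print(diff["added_or_changed"])  # only the node_modules file, not package.json
```

The layer for RUN npm install carries only the new node_modules file; package.json stays in the layer that introduced it.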

OverlayFS — the modern default union filesystem

Docker's default storage driver on Linux, overlay2, is built on the kernel's OverlayFS. It merges these stacked layers into what looks like a single, normal filesystem to the running container. A read checks the topmost layer first and falls back through the lower layers until the file is found. Writes made by the container at runtime go into a thin writable layer that Docker adds on top of the image's read-only layers when the container starts.

Container's view (merged):        Actual storage (layered):
/app/server.js                    Writable layer:   (container's own runtime writes)
/app/node_modules/...              Layer 5 (read-only): application code
/app/package.json                  Layer 4 (read-only): node_modules
                                    Layer 3 (read-only): package.json
                                    Layer 2 (read-only): WORKDIR metadata
                                    Layer 1 (read-only): base image (node:20-slim)
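The top-down lookup can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each layer is a dict of path to contents, and read_file is a hypothetical helper. The real merge happens inside the kernel, not in user code.

```python
from typing import Optional


def read_file(path: str, layers: list) -> Optional[str]:
    """Look up a path in a stack of layers, topmost layer first.

    `layers` is ordered bottom to top, so it is walked in reverse.
    """
    for layer in reversed(layers):
        if path in layer:
            return layer[path]
    return None  # not present in any layer


base = {"/etc/os-release": "debian"}
deps = {"/app/node_modules/express/index.js": "lib"}
code = {"/app/server.js": "v1"}
writable = {"/app/server.js": "v1-patched"}  # written by the container at runtime

stack = [base, deps, code, writable]

print(read_file("/app/server.js", stack))   # v1-patched (writable layer wins)
print(read_file("/etc/os-release", stack))  # debian (falls through to the base)
```

The version of /app/server.js in the application-code layer still exists; it is simply shadowed by the copy in the writable layer.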

Copy-on-write — why modifying a file doesn't touch the underlying image layer

If a running container "modifies" a file that exists in a read-only lower layer, OverlayFS doesn't change that lower layer at all. It can't: the layer is read-only and potentially shared with other containers and images. Instead, it first copies the file up into the container's own writable layer, and the modification happens there.

This copy-on-write behavior is what allows many containers to be started from the same image simultaneously, each with its own isolated writable layer. All of them share the same underlying read-only image layers on disk, with no risk of one container's changes affecting another container or the original image.
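A small Python sketch of the copy-up step, assuming the same toy model of layers as dicts. The Container class and its append method are hypothetical names for illustration only.

```python
class Container:
    """Toy container: shared read-only image layers plus a private writable layer."""

    def __init__(self, image_layers):
        self.image_layers = image_layers  # shared between containers, never mutated
        self.writable = {}                # private to this container

    def read(self, path):
        if path in self.writable:
            return self.writable[path]
        for layer in reversed(self.image_layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, text):
        # Copy-up: the first modification copies the file into the writable layer.
        if path not in self.writable:
            self.writable[path] = self.read(path)
        # The change itself only ever touches the private copy.
        self.writable[path] += text


image = [{"/etc/motd": "hello"}]
a = Container(image)
b = Container(image)

a.append("/etc/motd", " from a")

print(a.read("/etc/motd"))  # hello from a
print(b.read("/etc/motd"))  # hello
```

Container b and the image layer itself are unaffected by a's write, which is the isolation property described above.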

Why layers are cacheable and shareable

Because each layer is content-addressed (identified by a hash of its contents) and immutable once built, Docker can reuse an identical layer across every image that includes it. For example, two application images both built FROM node:20-slim share all of that base image's layers on disk, so the shared content is stored once rather than once per image. Immutability also underpins Docker's build cache (see the layer caching question): if an instruction and its inputs haven't changed, Docker reuses the previously built layer instead of rebuilding it.
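The deduplication that content addressing gives you can be sketched as a tiny store keyed by digest. This is a simplification: the LayerStore class is hypothetical, and it hashes a JSON dump of a dict, whereas real layers are hashed as archives.

```python
import hashlib
import json


class LayerStore:
    """Toy content-addressed store: identical layers are kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> layer contents

    def add(self, layer: dict) -> str:
        # Assumption for illustration: a canonical JSON dump stands in for the layer archive.
        payload = json.dumps(layer, sort_keys=True).encode()
        digest = "sha256:" + hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, layer)  # no-op if already stored
        return digest


store = LayerStore()
base = {"/usr/local/bin/node": "node-binary"}

# Two images that share a base layer but add different application code.
image_a = [store.add(base), store.add({"/app/a.js": "a"})]
image_b = [store.add(dict(base)), store.add({"/app/b.js": "b"})]

print(image_a[0] == image_b[0])  # True: same contents, same digest
print(len(store.blobs))          # 3: the base layer is stored once, not twice
```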

Why this matters practically

Understanding the layered, copy-on-write model explains several things. It explains why deleting files in a later instruction doesn't shrink an image: the earlier layer still contains them and the later layer only records that they are hidden, so cleanup has to happen in the same RUN that created the files. It explains why multi-stage builds (see that question) can discard entire heavyweight build-only layers from the final image. And it explains why the disk space that a host's images actually occupy is less than the sum of their individually reported sizes: layers shared between images are stored only once.
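One consequence is worth checking with numbers: because layers are immutable diffs, an image's size is the sum of what each layer added, regardless of what later layers delete. The sketch below uses a hypothetical model in which a layer is a pair of (files added with their size in MB, set of paths deleted). The paths and sizes are made up.

```python
def image_size(layers):
    """Space stored: every layer's additions count, deletions free nothing."""
    return sum(sum(added.values()) for added, _deleted in layers)


def visible_size(layers):
    """Size of the merged view that a container actually sees."""
    merged = {}
    for added, deleted in layers:  # bottom to top
        for path in deleted:
            merged.pop(path, None)
        merged.update(added)
    return sum(merged.values())


# Build in one RUN, clean up in a later RUN (two layers).
two_steps = [
    ({"/tmp/build-cache.tar": 500, "/app/bin": 20}, set()),
    ({}, {"/tmp/build-cache.tar"}),
]
# Build and clean up inside the same RUN (one layer).
one_step = [
    ({"/app/bin": 20}, set()),
]

print(image_size(two_steps), visible_size(two_steps))  # 520 20
print(image_size(one_step), visible_size(one_step))    # 20 20
```

Both variants give the container the same 20 MB view, but only the single-step variant avoids shipping the 500 MB that was deleted afterwards.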