What are the key Dockerfile instructions?

`FROM` sets the base image. `RUN` executes a command at build time, creating a new layer. `COPY`/`ADD` bring files from the build context into the image. `WORKDIR` sets the working directory for subsequent instructions. `ENV` sets persistent environment variables. `EXPOSE` documents which ports the container listens on (informational only — it doesn't actually publish them). `USER` sets which user subsequent instructions and the container's process run as. `CMD`/`ENTRYPOINT` define what runs when a container starts.

What's the difference between COPY and ADD?

`COPY` does exactly one thing: copies files/directories from the build context into the image — simple, predictable, and the generally recommended default. `ADD` does everything `COPY` does, plus two extra behaviors: it can automatically extract local `.tar` archives into the destination, and it can fetch a remote URL directly into the image. Docker's own official guidance recommends `COPY` unless you specifically need one of `ADD`'s extra behaviors, since `ADD`'s implicit magic (especially auto-extraction) has surprised people in ways that caused real bugs.

What's the difference between ENTRYPOINT and CMD?

`ENTRYPOINT` defines the fixed, primary command a container always runs, while `CMD` provides default *arguments* to that command (or, used alone with no `ENTRYPOINT`, defines the whole default command) that can be easily overridden at `docker run` time. When both are set, `CMD`'s value is passed as arguments to `ENTRYPOINT` — this combination is the standard pattern for building an image that behaves like a fixed executable with sensible, overridable default arguments.

How do Docker image layers work, and how does layer caching affect build speed?

Each Dockerfile instruction that changes the filesystem (`RUN`, `COPY`, `ADD`) produces a new, content-hashed layer; on a subsequent build, Docker checks whether an instruction's inputs (the instruction itself, plus — for `COPY`/`ADD` — the actual contents of the copied files) match a previously-built layer, and if so, **reuses that cached layer instead of re-executing the instruction**. This can make rebuilds dramatically faster, but the cache is invalidated from the first changed instruction onward — every layer after that point must be rebuilt, even if its own inputs are unchanged, which is exactly why instruction order matters so much.

How do you order Dockerfile instructions to maximize cache reuse?

Place instructions that change rarely (installing system packages, installing dependencies from a lockfile) before instructions that change frequently (copying in your actual application source code), so that a typical day-to-day code change only invalidates the cheap, fast final layers — not the expensive dependency-installation step. The general principle: order by "least likely to change" first, "most likely to change" last.

What is a multi-stage build, and what problem does it solve?

A multi-stage build uses multiple `FROM` instructions in a single Dockerfile, each starting a new, independent build stage — letting you use a full-featured image with compilers and build tools in an early stage, then copy only the final compiled artifacts into a separate, minimal final stage. This solves the problem of a production image otherwise being bloated with an entire toolchain (compilers, build dependencies, source code) that's only needed to *produce* the application, not to *run* it.

How do you choose a base image (alpine vs. slim vs. distroless vs. full)?

A **full** base image (e.g., `ubuntu`, `node`) includes a complete OS userland with many common tools — largest, but most compatible and easiest to debug. **Slim** variants strip out documentation, extra utilities, and other non-essential packages — meaningfully smaller with usually few compatibility surprises. **Alpine** uses the musl C library instead of glibc and BusyBox instead of GNU coreutils — often the smallest practical option, but occasional compatibility issues with software expecting glibc specifically. **Distroless** images strip out even the shell and package manager, containing only the application and its direct runtime dependencies — smallest attack surface, but hardest to debug interactively since there's no shell to exec into at all.

What is the build context, and what does .dockerignore do?

The build context is the set of files sent from your machine to the Docker daemon when you run `docker build` — by default, every file in the directory you specify (commonly `.`), which the daemon needs access to in order to satisfy any `COPY`/`ADD` instructions. `.dockerignore` excludes specified files/patterns from being sent as part of that context, which both speeds up the build (less data transferred and hashed) and prevents accidentally including sensitive or unnecessary files (like `.git`, `node_modules`, or `.env` files) in what's sent to the daemon or potentially copied into the image.

What's the difference between an image tag and a digest?

A tag (e.g., `myapp:1.0`, `myapp:latest`) is a mutable, human-friendly pointer that can be reassigned to point at a different underlying image at any time — pulling `myapp:latest` today and tomorrow could give you two genuinely different images. A digest (`myapp@sha256:abc123...`) is an immutable, content-addressed identifier computed from the image's actual contents — it always refers to exactly one specific image, forever, and is what you should reference when reproducibility genuinely matters (production deployments, security-sensitive pinning).

How do ARG and ENV differ in a Dockerfile?

`ARG` defines a variable that only exists **during the image build** — it's not available in the running container at all unless explicitly also set as an `ENV`. `ENV` defines a variable that's baked into the image and persists into every container started from it, available to the application at runtime. Use `ARG` for build-time configuration (a version number to install, a build-target flag) and `ENV` for anything the running application itself needs to read.

Images, Dockerfile, and Builds

Writing effective Dockerfiles, understanding layer caching, multi-stage builds, and image tagging.

A representative Dockerfile

FROM node:20-slim
WORKDIR /app
ENV NODE_ENV=production
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
USER node
CMD ["node", "server.js"]

Instruction by instruction

FROM — every Dockerfile starts with a base image to build on top of; this determines the starting filesystem layers and OS/runtime foundation everything else adds to.
RUN — executes a command at build time, and commits its filesystem changes as a new image layer (see the layer caching question) — used for installing packages, compiling code, or any other build-time setup.
COPY — copies files/directories from the build context (the directory passed to docker build, commonly .) into the image's filesystem.
WORKDIR — sets the working directory for all subsequent RUN, CMD, COPY, etc. instructions — functionally similar to running cd, but persists across instructions and creates the directory if it doesn't exist.
ENV — sets an environment variable that persists into the running container. It is visible to the application at runtime, not just during the build. This is distinct from ARG, which only exists during the build (see that question).
EXPOSE — purely documentation/metadata. It tells anyone reading the Dockerfile (and tools such as docker run -P, which publishes every exposed port to a random host port) which port(s) the containerized application listens on, but it doesn't by itself publish or open that port to the host. You still need -p (or -P) on docker run to actually map it (see the networking topic).
USER — sets which user subsequent instructions run as, and which user the final container's main process runs as by default — critical for the security practice of not running as root (see the security topic).
CMD and ENTRYPOINT — both define what actually executes when a container starts, with an important behavioral distinction covered in the next question.

Additional instructions worth knowing

# Build-time-only variable (see the ARG vs ENV question)
ARG BUILD_VERSION=dev
# Arbitrary metadata attached to the image
LABEL maintainer="team@example.com"
# Documents/declares a mount point (see the storage topic)
VOLUME /data
# See the container lifecycle topic
HEALTHCHECK CMD curl -f http://localhost/health || exit 1

Why instruction order matters beyond just readability

Each instruction that touches the filesystem (RUN, COPY, ADD) creates a new cached layer. Docker's build cache invalidates from the point of the first changed instruction onward: every instruction after that point must be re-executed, even if its own inputs didn't change. A Dockerfile is really executable documentation of how to build and run the application, and this ordering sensitivity is exactly why it deserves the same deliberate structure as any other piece of code, not whatever order felt natural while developing.

Related Resources

Dockerfile reference

COPY — simple, explicit, predictable

COPY package.json package-lock.json ./
COPY src/ ./src/

Copies files or directories from the build context straight into the image's filesystem, with no additional behavior — what you see is exactly what happens.

ADD — COPY, plus automatic extraction and remote URL fetching

# ADD automatically extracts a LOCAL tar archive into the destination
ADD myapp.tar.gz /app/

# ADD can fetch directly from a URL
ADD https://example.com/config.json /app/config.json

The first example is the behavior that most often surprises people. If the source is a recognized local archive format (.tar, .tar.gz, .tar.bz2, etc.), ADD automatically unpacks it into the destination directory. COPY would instead just copy the compressed archive file itself, unextracted. This implicit "maybe it extracts, maybe it doesn't, depending on file type" behavior is exactly what Docker's own documentation calls out as a source of confusion.

Why COPY is the recommended default

Predictability — COPY's behavior is a single, simple operation with no hidden conditional logic based on file type.
ADD's remote-URL fetching is generally discouraged — fetching a remote file directly in a Dockerfile instruction means the build result depends on the state of a URL outside your control at build time, which is worse for reproducibility. The fetched file also isn't automatically cleaned up if it's only needed transiently. A RUN curl ... && ... in the same layer, or a multi-stage build, gives more explicit control over this.
ADD's auto-extraction is only useful in one specific scenario: unpacking a local tarball as part of assembling the image. This is legitimate, but narrow enough that it's worth reaching for ADD deliberately for that one purpose, rather than defaulting to it out of habit.

Related Resources

Dockerfile reference: ADD and COPY

CMD alone — a default command, easily overridden

CMD ["node", "server.js"]

docker run myapp                  # runs: node server.js
docker run myapp node debug.js     # OVERRIDES the entire CMD -- runs: node debug.js instead

Any arguments given after the image name on docker run completely replace the CMD. This makes CMD alone appropriate when the image is meant to be flexible about what it runs — a general-purpose base image, or a development image where you might want to run a shell or a different script for debugging.

ENTRYPOINT alone — a fixed command that always runs

ENTRYPOINT ["node", "server.js"]

docker run myapp                    # runs: node server.js
docker run myapp --port=9000         # runs: node server.js --port=9000 (appended as ARGS, not a replacement)

Arguments given at docker run are appended to the ENTRYPOINT, not used to replace it. This makes ENTRYPOINT appropriate when the image should always run one specific thing; replacing it requires an explicit docker run --entrypoint override. It essentially makes the container behave like a fixed, dedicated executable.

Combining both — the standard, recommended pattern

ENTRYPOINT ["node"]
CMD ["server.js"]

docker run myapp                # runs: node server.js       (CMD's default argument used)
docker run myapp debug.js        # runs: node debug.js        (CMD's default OVERRIDDEN, but still passed to ENTRYPOINT)

This gives you the best of both: ENTRYPOINT fixes what program runs (node, always), while CMD provides a sensible default argument to it. That default argument is still easy to override for a one-off different invocation, without needing to override the entire command.
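
The three cases above (CMD alone, ENTRYPOINT alone, and both together) follow one combination rule, which can be modelled in a few lines of Python. This is only a sketch of the behaviour described here for exec-form instructions, not Docker's actual implementation, and the function name effective_argv is made up for illustration.

```python
from typing import List, Optional, Sequence


def effective_argv(
    entrypoint: Optional[Sequence[str]],
    cmd: Optional[Sequence[str]],
    run_args: Sequence[str] = (),
) -> List[str]:
    """Model the command a container starts, given exec-form ENTRYPOINT/CMD
    and any arguments placed after the image name on `docker run`."""
    # Arguments given on `docker run` replace CMD entirely; otherwise CMD is the default.
    args = list(run_args) if run_args else list(cmd or [])
    # ENTRYPOINT (if any) always comes first and is never replaced by those arguments.
    return list(entrypoint or []) + args


# CMD alone: run arguments replace the whole command.
print(effective_argv(None, ["node", "server.js"], ["node", "debug.js"]))
# ENTRYPOINT alone: run arguments are appended.
print(effective_argv(["node", "server.js"], None, ["--port=9000"]))
# Both: CMD supplies a default argument that run arguments override.
print(effective_argv(["node"], ["server.js"]))
print(effective_argv(["node"], ["server.js"], ["debug.js"]))
```

The model deliberately ignores docker run --entrypoint and shell-form instructions, both of which change the result.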

Exec form vs. shell form — a critical, easy-to-miss distinction

# Exec form (recommended): runs the command DIRECTLY, no shell involved
CMD ["node", "server.js"]

# Shell form: runs the command wrapped in "/bin/sh -c ..."
CMD node server.js

The exec form (JSON array syntax) runs the specified program directly as PID 1 inside the container. Signals like SIGTERM (sent by docker stop) go straight to it, allowing graceful shutdown handling. The shell form instead runs /bin/sh -c "node server.js". The shell itself becomes PID 1, and it's the shell's responsibility to forward signals to the actual application process underneath it — a responsibility it doesn't always fulfill correctly. This is a common, subtle cause of containers that don't shut down gracefully: they ignore SIGTERM and only stop after docker stop's timeout forces a SIGKILL.
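
As a rough illustration of the difference, the following Python sketch computes the argument vector that ends up as PID 1 for each form. The helper name pid1_argv is hypothetical, and the model assumes the default Linux shell prefix (/bin/sh -c); a SHELL instruction in the Dockerfile would change that prefix.

```python
import json
from typing import List


def pid1_argv(value: str) -> List[str]:
    """Return the argv a CMD/ENTRYPOINT value would start as PID 1.

    A value that parses as a JSON array of strings is treated as exec form;
    anything else is treated as shell form and wrapped in /bin/sh -c.
    """
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, list) and all(isinstance(item, str) for item in parsed):
        return parsed
    return ["/bin/sh", "-c", value]


print(pid1_argv('["node", "server.js"]'))  # exec form: node itself is PID 1
print(pid1_argv("node server.js"))         # shell form: /bin/sh is PID 1
print(pid1_argv("['node', 'server.js']"))  # single quotes are not valid JSON
```

The last call shows a common slip: an array written with single quotes is not valid JSON, so in this model it falls through to shell form instead of running node directly.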

| Scenario | Recommended setup |
| --- | --- |
| Fixed, purpose-built application container | `ENTRYPOINT` + `CMD` (overridable default args) |
| General-purpose or dev image, command often replaced entirely | `CMD` alone |
| Either form, always | Exec (JSON array) syntax, for correct signal handling |

Related Resources

Dockerfile reference: ENTRYPOINT and CMD

How the cache decides whether to reuse a layer

# Layer A
FROM node:20-slim
# Layer B
WORKDIR /app
# Layer C -- cache key includes package.json's actual content
COPY package.json ./
# Layer D -- cache key includes the PRECEDING layer + this instruction's text
RUN npm install
# Layer E -- cache key includes the content of every copied file
COPY . .

For each instruction, Docker computes a cache key based on the preceding layer plus that instruction's own inputs. For RUN, that's the literal command text. For COPY/ADD, it's the actual file contents being copied, not just their names. So even a single-character change inside package.json invalidates Layer C and, since caching is sequential, everything after it too.

Why this makes rebuilds fast — when structured well

# First build: everything builds from scratch
docker build -t myapp .
# ... (30 seconds, say, mostly spent on `npm install`)

# Change only application code (not package.json), rebuild:
docker build -t myapp .
# Layer A, B, C, D all CACHE HIT (package.json unchanged, so npm install's inputs are identical)
# Only Layer E (COPY . .) and anything after it actually re-executes
# ... (2 seconds)

Because npm install (often one of the slowest steps) sits before the COPY . . that brings in frequently-changing application code, changing application code alone doesn't invalidate the expensive dependency-installation layer at all. This is the single most impactful Dockerfile optimization technique, covered in more depth in the cache-ordering question.

Why cache invalidation cascades forward, never backward

FROM node:20-slim
# Layer C
COPY package.json ./
# Layer D
RUN npm install
# Layer E -- if THIS changes, only E (and anything after) rebuilds;
# C and D are unaffected, since their own inputs didn't change
COPY . .

If instead package.json changes, Layer C invalidates, and every layer from C onward (D, E) must rebuild too. This happens even though Layer E's own inputs (the application code) might not have changed at all. This "cascades forward from the first change" rule is why placing rarely-changing, expensive instructions (dependency installation) before frequently-changing ones (application code) is so consistently valuable. It maximizes how often the expensive early layers get to reuse the cache.
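
The forward cascade falls out naturally if each layer's cache key is a hash over the previous layer's key plus the instruction's own inputs. The Python sketch below is a toy model under that assumption: the real builder's cache keys are more involved, and the names layer_keys, rebuilt, and STEPS are invented for this example.

```python
import hashlib
from typing import Dict, List, Tuple

# (instruction text, files it copies from the build context); a toy representation
Step = Tuple[str, List[str]]


def layer_keys(steps: List[Step], files: Dict[str, bytes]) -> List[str]:
    """Return one chained cache key per instruction."""
    keys: List[str] = []
    parent = ""
    for text, sources in steps:
        h = hashlib.sha256()
        h.update(parent.encode())     # the preceding layer's key
        h.update(text.encode())       # the instruction's literal text
        for path in sorted(sources):  # for COPY: the copied files' contents
            h.update(path.encode())
            h.update(files[path])
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def rebuilt(old: List[str], new: List[str]) -> List[int]:
    """Indexes of the layers whose key changed, i.e. the cache misses."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]


STEPS: List[Step] = [
    ("FROM node:20-slim", []),
    ("WORKDIR /app", []),
    ("COPY package.json ./", ["package.json"]),
    ("RUN npm install", []),
    ("COPY . .", ["package.json", "server.js"]),
]

base = {"package.json": b'{"dependencies": {}}', "server.js": b"console.log('v1')"}
code_change = {**base, "server.js": b"console.log('v2')"}
deps_change = {**base, "package.json": b'{"dependencies": {"express": "4"}}'}

baseline = layer_keys(STEPS, base)
print(rebuilt(baseline, layer_keys(STEPS, code_change)))  # only the final COPY
print(rebuilt(baseline, layer_keys(STEPS, deps_change)))  # the package.json COPY and everything after it
```

Because every key folds in its parent's key, a change can only ever affect the layer where it occurs and the layers after it, never an earlier one.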

Sharing cache and layers across images, not just across builds of the same image

Layers are content-addressed and stored once on a given host. So two entirely different images that happen to share an identical layer (e.g., both FROM node:20-slim, with no differences up to some point) genuinely share that stored layer on disk — not just conceptually, but as literally the same data. This saves both disk space and pull time when a machine already has one image with a shared base layer and pulls another.

Cache-busting techniques when you deliberately want to skip the cache

docker build --no-cache -t myapp .        # ignore the cache entirely for this build

This is occasionally necessary when a RUN instruction's effects depend on something outside its literal text or copied files. For example, in RUN apt-get update && apt-get install -y curl, the actual packages fetched can change over time even though the instruction's text never does. This is a common, subtle source of "why did my rebuild not pick up the latest security patches" confusion. The cache has no way to know that an identical-looking instruction might now behave differently against a changed remote package repository.

Related Resources

Docker: Build Cache

The anti-pattern: copying everything before installing dependencies

# BAD ORDERING
FROM node:20-slim
WORKDIR /app
# Copies EVERYTHING, including source code that changes constantly
COPY . .
# This layer's cache key now depends on the ENTIRE copied tree
RUN npm install
CMD ["node", "server.js"]

With this ordering, changing any file in the project invalidates the COPY . . layer. This is true even for a single comment in an unrelated source file that has nothing to do with dependencies. Invalidating the COPY . . layer in turn invalidates the npm install layer right after it, since its cache key depends on the preceding layer. Every single build then re-runs the full dependency installation from scratch, even though the actual dependency list (package.json) hasn't changed at all. This is a slow, entirely avoidable rebuild on every code change.

The fix: copy the dependency manifest first, install, then copy the rest

# GOOD ORDERING
FROM node:20-slim
WORKDIR /app
# Only the dependency manifest -- changes rarely
COPY package.json package-lock.json ./
# Cached, as long as the manifest hasn't changed
RUN npm ci
# Application code -- changes constantly, but this is now the LAST filesystem-changing step
COPY . .
CMD ["node", "server.js"]

Now, changing application code only invalidates the final COPY . . layer. The npm ci layer, which is often much slower, stays cached as long as package.json/package-lock.json haven't changed. This is the common case for most day-to-day commits.

The general principle, stated once

Order instructions from least-likely-to-change to most-likely-to-change. System package installation and dependency installation (driven by a lockfile that changes relatively rarely) belong early; application source code (which changes on nearly every commit) belongs as late as possible.

FROM python:3.12-slim
# Rarely changes
RUN apt-get update && apt-get install -y libpq-dev
# Changes occasionally
COPY requirements.txt .
# Cached unless requirements.txt changes
RUN pip install -r requirements.txt
# Changes on every commit -- last
COPY . .
CMD ["python", "app.py"]

Combining related RUN instructions to control layer granularity

# Creates two separate layers, and (more importantly) leaves package-manager
# cache/lists behind in the FIRST layer even after the second layer "removes" them,
# since removal in a later layer doesn't shrink an earlier, already-committed layer
RUN apt-get update
RUN apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Better: combine into ONE layer so cleanup actually reduces that layer's size
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

Since each layer is immutable once committed, "deleting" a file in a later layer doesn't reclaim the space that file used in an earlier layer. It just hides that file from the merged view (recall the union filesystem question). Combining install-then-cleanup into a single RUN instruction ensures the cleanup actually shrinks that one resulting layer, rather than leaving bloat in an earlier layer that a later layer merely masks.
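
The accounting behind this can be sketched with a toy model in Python, where each layer is a mapping from path to size and a deletion is recorded as a marker rather than as reclaimed space. The byte counts are invented for illustration, and the names Layer, image_size, and merged_view are hypothetical.

```python
from typing import Dict, List, Optional

# path -> size in bytes; None marks a deletion ("whiteout") recorded by that layer
Layer = Dict[str, Optional[int]]


def image_size(layers: List[Layer]) -> int:
    """Bytes stored across all layers; deletions in later layers reclaim nothing."""
    return sum(size for layer in layers for size in layer.values() if size is not None)


def merged_view(layers: List[Layer]) -> Dict[str, int]:
    """Files visible in the final filesystem after stacking the layers in order."""
    view: Dict[str, int] = {}
    for layer in layers:
        for path, size in layer.items():
            if size is None:
                view.pop(path, None)  # hidden from the merged view, still stored below
            else:
                view[path] = size
    return view


# Two RUN instructions: the package lists live in layer 1 and are only masked by layer 2.
two_runs: List[Layer] = [
    {"/var/lib/apt/lists/index": 40_000_000},
    {"/usr/bin/curl": 250_000, "/var/lib/apt/lists/index": None},
]
# One combined RUN: the lists are gone before the layer is committed.
one_run: List[Layer] = [{"/usr/bin/curl": 250_000}]

print(merged_view(two_runs) == merged_view(one_run))  # same visible files
print(image_size(two_runs), image_size(one_run))      # very different stored size
```

Both variants expose the same files to the running container; only the combined form avoids shipping the deleted data.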

A quick self-check for any existing Dockerfile: change one line of application code, rebuild, and ask what the minimum set of layers should have needed to re-execute. If the actual rebuild touches an expensive dependency-installation step that has nothing to do with that change, the ordering has room to improve. This is often the difference between a multi-minute rebuild and one that takes a couple of seconds.

Related Resources

Docker: Build Cache Invalidation

The problem: build tools bloat the final image

# Single-stage build -- the final image includes EVERYTHING used to build it
FROM golang:1.22
WORKDIR /app
COPY . .
RUN go build -o server .
CMD ["./server"]

This works, but the resulting image includes the entire Go toolchain: the compiler, standard library source, and build caches. That adds up to hundreds of megabytes, even though the running application, once compiled, is just a single, small, statically-linked binary. None of that build tooling is needed at runtime. It is not even wanted, since it also increases the attack surface.

The multi-stage solution

# Stage 1: "builder" -- has the full toolchain, produces the compiled binary
FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
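# CGO_ENABLED=0 yields a statically linked binary, so it also runs on Alpine's musl libc (or scratch)
RUN CGO_ENABLED=0 go build -o server .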

# Stage 2: the FINAL image -- minimal, only what's needed to RUN the binary
FROM alpine:3.19
COPY --from=builder /app/server /usr/local/bin/server
CMD ["server"]

The COPY --from=builder instruction reaches back into the first stage's filesystem and copies out just the compiled server binary. None of the Go compiler, source code, or build-time dependencies from the builder stage make it into the final image at all. The final image can be a tiny base — even scratch, an entirely empty base image, for a fully static binary with no runtime dependencies. This often shrinks the final image from hundreds of megabytes down to tens of megabytes or less.

Multiple intermediate stages

FROM node:20 AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

FROM node:20 AS build
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build

FROM nginx:alpine AS final
COPY --from=build /app/dist /usr/share/nginx/html

Stages can be named (AS deps, AS build) and referenced by name in later COPY --from= instructions. This is useful for separating concerns — installing dependencies vs. building vs. the final runtime image — even when the language or runtime doesn't produce a single standalone compiled binary the way Go does. Note that this example's final stage even uses a completely different base image (nginx:alpine) than the build stages (node:20). The final stage just needs to serve the already-built static files, with no Node.js runtime required at all.

Why this matters beyond just image size

Reduced attack surface — a smaller final image with no compilers, build tools, or source code present means fewer things for a compromised container to exploit or exfiltrate (see the security topic).
Faster pulls and deployments — a smaller image transfers faster across the network to every node that needs to run it, meaningfully speeding up deployments and autoscaling events at real scale.
A single Dockerfile, still — before multi-stage builds existed, achieving this same "build in one environment, run in a minimal one" pattern required a different approach. One option was two separate Dockerfiles, with manual artifact copying between them via a shared volume or a script. Another option was building outside Docker entirely and then COPYing a pre-built artifact in. Both approaches are more awkward and error-prone than expressing the whole pipeline declaratively in one file.

Related Resources

Docker: Multi-stage builds

Full images — maximum compatibility, largest size

# Based on a full Debian userland
FROM node:20

Includes a complete set of common OS utilities, shells, and libraries. Most software "just works" without unexpected missing-dependency surprises. Debugging is also straightforward, since every common tool is available (for example, docker exec -it container bash). The cost is size — often several hundred megabytes even before your application's own dependencies are added.

Slim images — a leaner version of the same distribution

# Same Debian base, but with docs, extra utilities, etc. stripped out
FROM node:20-slim

Meaningfully smaller than the full variant, while still using the same underlying package ecosystem (glibc, apt). This means very few compatibility surprises, since it's the same distribution family, just trimmed down. A good, low-risk default for most applications wanting a size improvement without changing the underlying C library or package manager.

Alpine images — smallest practical general-purpose option, different internals

# Based on Alpine Linux -- musl libc, BusyBox, apk package manager
FROM node:20-alpine

Alpine Linux is built around musl libc (not glibc) and BusyBox (a single compact binary providing minimal versions of many standard Unix utilities), rather than the GNU toolchain. This combination is what makes Alpine-based images dramatically smaller, often 5-10x smaller than the equivalent Debian-based image. The tradeoff: some software can behave subtly differently, or fail to build or run correctly, against musl. This is particularly true for anything with native compiled dependencies, or software that makes assumptions specific to glibc's behavior. This compatibility risk is real and occasionally time-consuming, so it should be tested for rather than assumed away.

Distroless images — no shell, no package manager, minimal attack surface

FROM gcr.io/distroless/nodejs20-debian12

Google's distroless images strip out everything not strictly needed to run the application — no shell, no package manager, no text editors, not even basic Unix utilities like ls or cat. This gives the smallest possible attack surface: an attacker who compromises the running application has no shell to pivot into and no package manager to install further tools with. But it comes at a real operational cost. You cannot docker exec -it container sh into a distroless container at all, since there's no shell binary present. Debugging requires different techniques instead, such as attaching ephemeral debug containers alongside it (similar to the Kubernetes kubectl debug pattern), or relying entirely on external logging and observability rather than interactive investigation.

Comparing the tradeoffs

| | Full | Slim | Alpine | Distroless |
| --- | --- | --- | --- | --- |
| Relative size | Largest | Smaller | Smallest (general-purpose) | Smallest (no shell/tools) |
| C library | glibc | glibc | musl | Varies (often glibc-based) |
| Compatibility risk | Lowest | Low | Moderate (musl-specific issues) | Low (same libc, just missing tools) |
| Interactive debugging | Easiest | Easy | Easy | Not possible (no shell) |
| Attack surface | Largest | Smaller | Smaller | Smallest |

Slim is a safe, low-risk default for most applications. Alpine is worth it once you've actually verified (not assumed) musl compatibility — a real concern for compiled languages with native extensions. Distroless earns its debugging cost specifically once solid centralized logging already reduces how often an interactive shell would be needed anyway. Full images mostly belong in local development, where the debugging convenience outweighs the size/security cost.

Related Resources

Google: Distroless Images

What the build context actually is

docker build -t myapp .

That trailing . isn't just "look at the Dockerfile here." It specifies the build context: the entire directory tree that Docker packages up and sends to the daemon before the build even starts. This allows any COPY/ADD instruction in the Dockerfile to reference files from it. This matters even for a remote daemon (see the CLI/daemon question). The whole context genuinely gets transferred over the network to wherever the daemon is running — it is not just referenced by path.

Sending build context to Docker daemon  245.7MB

The legacy builder prints this line at the start of every docker build, and it deserves attention. (BuildKit, the default builder in current Docker releases, reports the same information as a "transferring context" step.) A surprisingly large number here is a strong signal that unnecessary files are being included and transferred needlessly, for example an entire node_modules directory, a .git history, or large data files unrelated to the application.
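
When that number looks too large, it helps to see which top-level entries account for it. The Python sketch below totals file sizes per top-level entry of a directory; it is a rough diagnostic only, it does not apply .dockerignore rules, and the names context_sizes and demo (plus the sample files) are invented for this example.

```python
import os
import tempfile
from typing import Dict


def context_sizes(root: str) -> Dict[str, int]:
    """Total bytes under each top-level entry of a would-be build context."""
    sizes: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks rather than follow them
            top = name if rel == "." else rel.split(os.sep)[0]
            sizes[top] = sizes.get(top, 0) + os.path.getsize(full)
    return sizes


def demo() -> Dict[str, int]:
    """Build a tiny throwaway project and measure it."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "node_modules", "some-pkg"))
        with open(os.path.join(root, "node_modules", "some-pkg", "index.js"), "wb") as f:
            f.write(b"x" * 5000)
        with open(os.path.join(root, "server.js"), "wb") as f:
            f.write(b"x" * 100)
        return context_sizes(root)


# Largest entries first: these are the candidates for .dockerignore.
for entry, size in sorted(demo().items(), key=lambda kv: kv[1], reverse=True):
    print(f"{size:>8} bytes  {entry}")
```

Pointing context_sizes at a real project directory instead of the throwaway one gives a quick list of what dominates the context.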

.dockerignore — excluding files from the context

# .dockerignore
.git
node_modules
npm-debug.log
.env
*.md
Dockerfile
.dockerignore

This functions similarly to .gitignore. Patterns listed here are excluded from what gets sent as the build context at all. This means they are not just "not copied into the image" — they are genuinely never transmitted to the daemon in the first place.
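
One difference from .gitignore that is easy to miss: patterns are matched relative to the context root, so *.md excludes Markdown files at the top level only, not in subdirectories. The Python sketch below models that for simple patterns like the ones listed above. It is a simplification that assumes no ** wildcards and no ! exceptions, both of which the real matcher supports, and the name is_ignored is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List

PATTERNS = [".git", "node_modules", "npm-debug.log", ".env", "*.md", "Dockerfile", ".dockerignore"]


def is_ignored(path: str, patterns: List[str]) -> bool:
    """Simplified model: compare a pattern, component by component, against the
    leading components of a path relative to the context root."""
    parts = path.split("/")
    for pattern in patterns:
        pattern_parts = pattern.strip("/").split("/")
        if len(pattern_parts) > len(parts):
            continue
        # Matching a leading directory excludes everything underneath it.
        if all(fnmatchcase(part, pat) for part, pat in zip(parts, pattern_parts)):
            return True
    return False


for candidate in [".git/config", "node_modules/express/index.js", "README.md",
                  "docs/guide.md", "src/server.js", ".env"]:
    print(candidate, is_ignored(candidate, PATTERNS))
```

In this model docs/guide.md is still sent, because *.md is only compared against the first path component.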

Why this matters for more than just speed

Performance: a smaller context transfers faster. This is especially relevant for a remote daemon or CI environment. It also reduces the daemon's own overhead from scanning and hashing the context for cache purposes.

Avoiding accidental inclusion of sensitive data: a broad COPY . . instruction copies everything in the build context that isn't excluded. Without a .dockerignore excluding .env, a local secrets file, or .git, a careless COPY . . can bake credentials directly into an image layer. This is especially risky for .git, since it can contain historical commits with sensitive data even if the current working tree doesn't. Image layers are effectively permanent once built and pushed. Removing a file in a later layer doesn't remove it from the earlier layer's stored data (recall the union filesystem question). Because of this, a leaked secret baked into an early layer is extremely difficult to fully scrub, even if a later commit or layer "deletes" it.

# Without a .dockerignore, this could copy .env, .git, and other sensitive/unnecessary files
COPY . .

Avoiding cache invalidation from irrelevant files: recall that COPY's cache key depends on the actual content of copied files (see the layer caching question). If .git or build artifacts are part of the context and happen to change on every build, even without meaningful application changes, they can cause unnecessary cache invalidation for instructions that copy broad directories. An unexpectedly large "Sending build context" number should be treated as a signal to check the .dockerignore, not ignored.

Related Resources

Docker: .dockerignore file

Tags are mutable pointers, not permanent identifiers

docker pull node:20
# some time later, after the maintainers push a new patch-level build of node:20...
docker pull node:20
# this can silently give you a DIFFERENT image than before, with the same tag

A tag is just a label that an image publisher chooses to attach. Nothing stops them from later re-pushing a different image under the exact same tag, and this happens routinely and legitimately. For example, node:20 is regularly updated to include the latest patch releases and security fixes within the Node 20 line, all under that same tag. latest is the most extreme example of this, and also the most commonly misused. It is just a conventional tag name, with no inherent guarantee of being the "most recent" or "most stable" anything. It is simply whatever the publisher most recently tagged as latest.

Digests are immutable, content-addressed identifiers

docker pull node:20
docker inspect node:20 --format='{{index .RepoDigests 0}}'
# node@sha256:a1b2c3d4e5f6...

docker pull node@sha256:a1b2c3d4e5f6...

A digest is a cryptographic hash (SHA-256 in practice) of the image's manifest, which in turn references the image's configuration and layers by their own hashes. Pulling by digest therefore always gives you the exact same bytes for as long as the registry retains that image, since any change to the image's content would produce a different hash entirely. Two different tags can point at the same digest (if they reference an identical image), but a single digest can never refer to two different images.
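The content-addressing idea can be illustrated without Docker at all. The sketch below hashes three small files with sha256sum (assumed to be available, as it is with GNU coreutils). It is only an analogy for how a digest behaves, not a reproduction of how a registry computes an image digest.

```shell
# Two files with identical bytes, and one with a single changed character.
tmp="$(mktemp -d)"
printf 'layer contents v1\n' > "$tmp/a"
printf 'layer contents v1\n' > "$tmp/b"
printf 'layer contents v2\n' > "$tmp/c"

# sha256sum prints "<hash>  <filename>"; keep only the hash.
digest_of() { sha256sum "$1" | cut -d' ' -f1; }

da="$(digest_of "$tmp/a")"
db="$(digest_of "$tmp/b")"
dc="$(digest_of "$tmp/c")"

# Identical content always hashes to the same value; any change yields a different one.
[ "$da" = "$db" ] && echo "a and b: same bytes, same digest"
[ "$da" != "$dc" ] && echo "a and c: different bytes, different digest"
```

The file names play the role of tags here: a and b are different names for the same content, just as two tags can point at one digest.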

Why this distinction matters for reproducibility

# Fragile: this could resolve to a DIFFERENT image tomorrow than it did today
image: myapp:1.0

# Fully reproducible: this ALWAYS refers to the exact same image, forever
image: myapp@sha256:a1b2c3d4e5f6...

Sometimes you need a guarantee that "this exact same image is what's running everywhere, every time, with certainty." Examples include a critical production deployment, a security audit that needs to confirm exactly what code is running, or a supply-chain-security pinning requirement. In these cases a mutable tag alone doesn't provide that guarantee, not even a specific version-looking tag like 1.0.3, because nothing technically prevents someone from re-pushing different content under that same tag later.

The practical middle ground most teams use

image: myapp:1.0.3@sha256:a1b2c3d4e5f6...

Combining both gives you readability and reproducibility simultaneously: a human-readable tag for clarity when a person is looking at the manifest, plus the digest for the actual immutable guarantee the deployment relies on. When both are present, the digest is what determines which image is pulled; the tag is effectively documentation. Many CI/CD pipelines automatically resolve and pin the digest at build or deploy time for this reason, so a human never has to hand-type a long hash, but the deployed reference is still digest-pinned underneath. Separately, latest should never appear in a production deployment manifest at all, since it actively obscures which version is running and offers zero reproducibility guarantee.
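A rough sketch of the pinning step such a pipeline might perform: given a repository, a tag, and a RepoDigests entry (the kind of value the docker inspect command shown earlier prints), build the combined tag-plus-digest reference. The function name pin_reference is hypothetical, and the digest is the same elided placeholder used above rather than a real one.

```shell
# Combine a human-readable tag with an already-resolved digest.
#   $1: repository name
#   $2: tag
#   $3: RepoDigests entry of the form "repo@sha256:<hex>"
pin_reference() {
  digest="${3##*@}"   # keep only the part after the last "@"
  printf '%s:%s@%s\n' "$1" "$2" "$digest"
}

# Placeholder digest copied from the examples above (elided, not a real hash).
pin_reference myapp 1.0.3 'myapp@sha256:a1b2c3d4e5f6...'
```

In a real pipeline the third argument would come from the registry or from docker inspect after the push, never from a hand-typed value.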

Related Resources

Docker: Image Digests

ARG — build-time only, not present in the running container

ARG NODE_VERSION=20
FROM node:${NODE_VERSION}-slim

ARG BUILD_ENV=production
RUN if [ "$BUILD_ENV" = "development" ]; then npm install; else npm ci --omit=dev; fi

docker build --build-arg NODE_VERSION=18 --build-arg BUILD_ENV=development -t myapp .

ARG values are supplied at build time (via --build-arg, or a default in the Dockerfile itself), and are only accessible during the build. An ARG declared before FROM is only usable in FROM lines; to use it in a later instruction, it has to be declared again after FROM. Once the image is built and a container starts from it, none of these ARG values are present in the running container's environment at all, unless you deliberately also expose them via ENV (see below).

ENV — persists into the running container

ENV NODE_ENV=production
ENV PORT=3000

docker run myapp env | grep NODE_ENV
# NODE_ENV=production

ENV values are baked into the image itself and are automatically present as environment variables in every container started from that image. This is what the running application actually reads via its normal environment-variable access (process.env.NODE_ENV in Node, os.environ in Python, etc.).

Bridging the two: using an ARG to set a default ENV value

ARG APP_VERSION=dev
ENV APP_VERSION=${APP_VERSION}

docker build --build-arg APP_VERSION=2.1.0 -t myapp .
docker run myapp env | grep APP_VERSION
# APP_VERSION=2.1.0

This pattern lets a build-time value (perhaps injected by CI, reflecting the actual git tag/commit being built) become available to the running application at runtime too. Without this explicit bridging, an ARG's value would be invisible to the container once it's actually running. This is the detail most likely to cause confusion.
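A minimal sketch of the CI side of this pattern. The CI_COMMIT_TAG variable is an assumption (it follows GitLab CI's naming; other systems expose the tag under different names), and build_cmd_for is a hypothetical helper. The docker build command is printed rather than executed so the snippet runs without Docker.

```shell
# Print the docker build command for a given version, falling back to "dev"
# when no version is supplied (mirrors the ARG default in the Dockerfile).
build_cmd_for() {
  printf 'docker build --build-arg APP_VERSION=%s -t myapp .\n' "${1:-dev}"
}

# CI_COMMIT_TAG is an assumed CI-provided variable; it is empty outside tagged builds.
build_cmd_for "${CI_COMMIT_TAG:-}"
```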

Why ARG values should never hold secrets

# NEVER do this
ARG API_KEY
RUN curl -H "Authorization: Bearer $API_KEY" https://example.com/fetch-something

Even though ARG values aren't automatically present in the running container's environment, they are recorded in the image's build history and are visible to anyone who can inspect the image: build arguments show up in docker history output. Passing secrets via ARG is a common, real security mistake, since the value ends up baked into the image's metadata even though it's not "in the environment" in the way ENV values are. Genuine build-time secrets (for example, a private package registry token needed only during npm install) should use Docker's dedicated build secrets mechanism instead (RUN --mount=type=secret, part of BuildKit). This mechanism is specifically designed to make a secret available only during one RUN instruction's execution, without persisting it in any layer or image metadata at all.
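As a hedged sketch, here is the "NEVER do this" example rewritten to use a build secret. The snippet only writes the Dockerfile to a scratch directory and prints it; the build itself needs Docker with BuildKit, so it is shown as a comment. The secret id api_key and the file ./api_key.txt are hypothetical names, and the node:20 base image is assumed to provide curl.

```shell
build_dir="$(mktemp -d)"

# The quoted 'EOF' keeps $(...) literal, so it is evaluated inside the build, not here.
cat > "$build_dir/Dockerfile" <<'EOF'
# syntax=docker/dockerfile:1
FROM node:20
# The secret is mounted at /run/secrets/<id> for this RUN instruction only.
RUN --mount=type=secret,id=api_key \
    curl -H "Authorization: Bearer $(cat /run/secrets/api_key)" https://example.com/fetch-something
EOF

# Build command, shown rather than run (api_key.txt is a hypothetical local file):
#   docker build --secret id=api_key,src=./api_key.txt -t myapp "$build_dir"
cat "$build_dir/Dockerfile"
```

Because the token is read from the mounted file at execution time, it never appears as a build argument or in the instruction text recorded by docker history.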

Related Resources

Dockerfile reference: ARG