Security

The default, risky behavior

FROM node:20-slim
WORKDIR /app
COPY . .
# No USER instruction is specified, so the process below runs as root by default
CMD ["node", "server.js"]

Without an explicit USER instruction, most base images default their main process to running as root (UID 0) inside the container. This isn't automatically catastrophic. The container's root is still confined by its namespace, and with default settings it doesn't have direct root access to the host. But it meaningfully raises the stakes of anything going wrong. An attacker who achieves arbitrary code execution inside a root-running container has unrestricted access to everything within that container: every file, every process, the ability to install anything. The same compromise inside a non-root container is confined to whatever that specific, limited user account can actually do.

Enforcing non-root with the USER instruction

FROM node:20-slim
WORKDIR /app
COPY --chown=node:node . .
# Many official images already include a pre-created, unprivileged user
USER node
CMD ["node", "server.js"]

Many official base images (like node) already include a pre-created, unprivileged user for this purpose, conventionally named after the runtime (node, in this case). USER node switches to it. From that point in the Dockerfile onward, later RUN instructions, the container's main process, and anything it forks run as that user instead of root.

Creating your own non-root user, for images that don't provide one

FROM alpine:3.19
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser

For base images without a suitable pre-existing user, creating a dedicated, minimal, non-privileged user explicitly (with no unnecessary permissions or shell access beyond what's needed) is the standard practice.

Enforcing this at runtime, independent of what the image itself declares

docker run --user 1000:1000 myapp:1.0

The --user flag on docker run (or the equivalent securityContext.runAsUser/runAsNonRoot in Kubernetes — see that stack's SecurityContext question) can force a specific non-root UID, even for an image that would otherwise default to root. This provides an additional, deployment-time layer of enforcement that doesn't rely solely on the image's own Dockerfile having done the right thing.
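
The same check can also be made from inside the container as a last line of defense. The sketch below is a hypothetical entrypoint guard (refuse_root is an illustrative name, not part of Docker) that fails closed when the process would start as UID 0, loosely mirroring what runAsNonRoot enforces on the Kubernetes side. It assumes the entrypoint is a POSIX shell script.

```shell
# Hypothetical entrypoint guard: succeed only for a non-zero (non-root) UID.
refuse_root() {
  if [ "$1" -eq 0 ]; then
    echo "refusing to start: running as root (UID 0)" >&2
    return 1
  fi
  return 0
}

# Typical use at the top of an entrypoint script:
#   refuse_root "$(id -u)" || exit 1
#   exec node server.js
```

With this guard in place, docker run --user 1000:1000 myapp:1.0 starts normally, while an image or deployment that silently falls back to root stops immediately with a clear message.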

Why this is worth enforcing even though container isolation exists

Namespaces and cgroups (see the fundamentals topic) provide real isolation, but they are not an absolute security boundary. Container escape vulnerabilities are not routine, but they do periodically get discovered. Running as root inside the container is precisely the condition that makes many such escapes more dangerous or more likely to succeed. Several known escape techniques specifically rely on the compromised process already having root privileges within its own namespace as a stepping stone. Running as a genuinely unprivileged, non-root user is a foundational defense-in-depth measure. It doesn't eliminate the risk of a container escape, but it substantially narrows what an attacker can do both before and during an escape attempt.

The broader principle this connects to

This is the same least-privilege principle that runs throughout this stack's other security questions (Linux capabilities, the Docker socket risk, RBAC in the Kubernetes stack). Grant only the minimum privilege actually needed, so that any single compromise's blast radius is as limited as possible. Don't assume a compromise will never happen and skip limiting its impact if it does. A quick image-review check worth making habitual: docker inspect --format='{{.Config.User}}' myapp:1.0 should show something other than empty/root.
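
The same review habit can be applied before an image is even built. The sketch below is a hypothetical helper (dockerfile_final_user is not a Docker command) that reports the last USER instruction in a Dockerfile's final build stage. It assumes one instruction per line and cannot see a user inherited from the base image, so "unset" means "check the base image", not proof that the container runs as root.

```shell
# Hypothetical helper: print the USER that applies to the final build stage.
# Each FROM starts a new stage, so any USER set in an earlier stage no longer applies.
dockerfile_final_user() {
  awk '
    toupper($1) == "FROM" { user = "" }
    toupper($1) == "USER" { user = $2 }
    END { print (user == "" ? "unset" : user) }
  ' "$1"
}

# Example:
#   dockerfile_final_user ./Dockerfile
```

A CI step could fail the build when the result is "unset", "root", or "0", which catches the problem earlier than inspecting the finished image.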

Why capabilities exist: breaking up "root" into pieces

Traditionally, Unix permission checking was binary — a process either ran as root (with unrestricted power to do essentially anything on the system) or as a regular user (subject to normal permission checks). Linux capabilities split root's traditionally monolithic power into dozens of distinct, individually grantable privileges — CAP_NET_BIND_SERVICE (bind to a port below 1024), CAP_CHOWN (change file ownership arbitrarily), CAP_SYS_ADMIN (a broad, catch-all set of administrative operations), CAP_SYS_MODULE (load kernel modules), and many others. This lets a process be granted just the specific slice of "root-like" power it actually needs, rather than all of it or none of it.

Docker's default capability set — already more restricted than full root

By Docker's own defaults, even a container explicitly running as root does not receive the full set of Linux capabilities that a genuinely privileged host root process would have. Docker grants a modest default subset (things like CAP_CHOWN, CAP_NET_BIND_SERVICE, CAP_SETUID/CAP_SETGID, and a handful of others generally needed by typical applications) while excluding more dangerous ones (CAP_SYS_ADMIN, CAP_SYS_MODULE, CAP_SYS_PTRACE, and others). This is itself a form of built-in least-privilege enforcement, independent of whether the container's process is technically "root" or not.
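
One way to see this from inside a container is the CapEff line in /proc/self/status, which holds the effective capability set as a hexadecimal bitmask. The sketch below decodes individual bits from such a mask. has_cap is a hypothetical helper; the bit numbers come from the kernel's linux/capability.h, and the example mask is an assumption about what a root process typically sees under Docker's defaults (the exact value can differ between Docker versions and configurations).

```shell
# Capability bit numbers as defined in <linux/capability.h>
CAP_CHOWN=0
CAP_NET_BIND_SERVICE=10
CAP_SYS_MODULE=16
CAP_SYS_PTRACE=19
CAP_SYS_ADMIN=21

# has_cap MASK BIT: succeed if bit number BIT is set in the hex MASK
# (MASK is the value after "CapEff:" in /proc/PID/status, without a 0x prefix).
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Example mask, assumed to be typical for root under Docker's default capability set
default_mask=00000000a80425fb
has_cap "$default_mask" "$CAP_NET_BIND_SERVICE" && echo "NET_BIND_SERVICE: granted"
has_cap "$default_mask" "$CAP_SYS_ADMIN" || echo "SYS_ADMIN: not granted"
```

Inside a running container, the live mask can be read with grep CapEff /proc/self/status and passed to has_cap in the same way.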

Dropping capabilities further — reaching for true minimalism

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp:1.0

--cap-drop=ALL removes every capability, including Docker's own default set. --cap-add then adds back only the specific ones the container genuinely needs. In this example that is just NET_BIND_SERVICE (needed if the application binds to a port below 1024). Most typical application containers can run with most or all capabilities dropped, especially ones that don't bind to a privileged port or perform any genuinely system-level operations. They were never actually using the majority of Docker's already-modest default set.

# In Kubernetes, this maps directly onto SecurityContext (see that stack's question)
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]

This is the exact same mechanism, and the exact same recommended pattern (drop everything, add back only what's genuinely needed), covered in the Kubernetes stack's SecurityContext question. Kubernetes's capability configuration is, at the implementation level, just configuring this identical Linux kernel feature.

Why this matters: limiting what a compromised container can actually do

If an attacker achieves code execution inside a container, the specific set of capabilities that container holds directly determines what kinds of system-level actions they can attempt next. A container with CAP_SYS_ADMIN retained gives an attacker access to a wide range of potentially escape-relevant administrative operations. The same compromise inside a container that's dropped every capability except the one or two it genuinely needs gives the attacker dramatically less to work with, even before considering any other layer of defense.

Identifying which capabilities an application actually needs

This requires either consulting the application or base image's documentation, or empirically testing it — running with --cap-drop=ALL and incrementally adding back capabilities one at a time until the application works correctly. There's no universal answer, since it depends entirely on what the specific application actually does, such as binding to privileged ports or manipulating file ownership. Once identified, that minimal set becomes a relatively low-effort, high-leverage hardening step for any production container, mirroring the same least-privilege discipline this stack applies elsewhere.
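
The incremental approach described above can be made less error-prone with a small helper that builds the flag list. This is a sketch only: cap_args is a hypothetical function name, and myapp:1.0 is the placeholder image used throughout this section.

```shell
# Hypothetical helper: print "--cap-drop=ALL" followed by one --cap-add per argument.
cap_args() {
  printf '%s' '--cap-drop=ALL'
  for cap in "$@"; do
    printf ' --cap-add=%s' "$cap"
  done
  printf '\n'
}

# Start with nothing, then add capabilities back one at a time until the app works:
#   docker run --rm $(cap_args) myapp:1.0
#   docker run --rm $(cap_args NET_BIND_SERVICE) myapp:1.0
#   docker run --rm $(cap_args NET_BIND_SERVICE CHOWN) myapp:1.0
```

The command substitution is deliberately left unquoted in the usage lines so that each flag becomes a separate argument to docker run. The shortest list that lets the application pass its tests is the set worth recording in the deployment configuration.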

What mounting the socket actually does

docker run -v /var/run/docker.sock:/var/run/docker.sock some-tool

This bind-mounts the host's Docker daemon's Unix socket directly into the container. Any process inside that container can now send requests to the host's real Docker daemon, exactly as if it were running the docker CLI directly on the host itself. (Recall from the fundamentals topic that the CLI is just a thin client talking to this exact socket.)

Why this is equivalent to host root access

# From INSIDE a container that has the Docker socket mounted:
docker run -it -v /:/host --privileged alpine chroot /host sh

The daemon reachable through that socket can create and start new containers with essentially arbitrary configuration. This includes mounting the host's entire root filesystem into a new container, or running with --privileged, which disables most container isolation protections entirely. Because of this, a process with access to the Docker socket can trivially use it to escape any container boundary altogether and gain full read/write access to the host's filesystem, processes, and everything else. This isn't a theoretical edge case. It's a well-known, straightforward technique. That's exactly why "Docker socket access" is treated as functionally equivalent to root on the host in serious security analysis, regardless of what privileges the container holding that socket mount otherwise appears to have.

Why this pattern exists anyway, despite the risk

# A common (risky) pattern: letting a CI runner or monitoring tool
# manage OTHER containers by talking to the host's Docker daemon
services:
  ci-runner:
    image: my-ci-runner
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

This pattern shows up frequently in CI/CD tooling (a containerized CI runner that itself needs to build and run other containers as part of its job — "Docker-in-Docker" concerns, covered in the production topic), in monitoring/management tools that need visibility into other running containers, and in various developer convenience setups. It's a genuinely common, real-world pattern — which is exactly why understanding its risk, rather than treating it as an unremarkable convenience, matters.

Mitigations, if you must use this pattern

  • A read-only, filtered proxy in front of the socket. Tools like docker-socket-proxy sit between the container and the real socket, allowing only a specific, restricted subset of Docker API operations — for example, letting a monitoring tool list and inspect containers while blocking its ability to create new ones. This meaningfully reduces, though it doesn't eliminate, the risk compared to raw, unrestricted socket access.
  • Avoid it entirely for untrusted or lower-trust workloads — never mount the Docker socket into a container running code you don't fully trust, or that's exposed to any kind of external/user-supplied input that could lead to arbitrary command execution within it.
  • Prefer purpose-built alternatives where they exist. For CI/CD specifically, a genuinely separate, isolated build environment avoids the risk entirely rather than mitigating it. This could be a dedicated build VM, or Kubernetes-native build tooling like Kaniko that can build images without needing a full Docker daemon socket at all.
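
Before choosing among these mitigations, it helps to know where the socket is actually being mounted. The following sketch is a plain text search, not a Docker feature: find_socket_mounts is a hypothetical helper that lists every line mentioning docker.sock in the Compose files or scripts passed to it.

```shell
# Hypothetical audit helper: print "line-number:line" for every mention of docker.sock.
# Exits non-zero when nothing is found, so it can gate a CI step.
find_socket_mounts() {
  grep -n 'docker\.sock' "$@"
}

# Example:
#   find_socket_mounts docker-compose.yml
```

Each hit is a place where a container is effectively being handed root on the host, and is worth reviewing against the three options above.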

The broader lesson: "just mount the socket" is never a low-stakes convenience

This is a good, concrete illustration of a recurring security theme across this entire stack. A mechanism that seems like a simple convenience — giving one container the ability to manage others — can quietly grant vastly more power than intended. This is because the underlying capability, talking to the Docker daemon, doesn't have a narrower "just manage containers, not the host" mode by default. Recognizing this specific risk, and being able to explain precisely why socket access equals host root rather than just that "it's risky," is a strong signal of genuine container security understanding. It's a widely used pattern that's frequently under-appreciated for how dangerous it actually is.

seccomp — restricting which system calls are even allowed

Every action a program takes that involves the kernel (opening a file, creating a socket, forking a process) goes through a system call (syscall). seccomp (secure computing mode) lets you define a filter specifying exactly which syscalls a process is allowed to make. Any syscall not on the allowed list is blocked outright — typically causing the calling process to receive an error, or be killed, depending on configuration — regardless of what file permissions or capabilities might otherwise seem to allow.

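# Explicitly pass a seccomp profile file (Docker applies its built-in default when the flag is omitted)
docker run --security-opt seccomp=default.json myapp:1.0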

Docker actually applies a default seccomp profile automatically, blocking around 44 of the roughly 300+ available Linux syscalls. This targets syscalls that are rarely needed by typical containerized applications but have historically been associated with container escapes or kernel-level exploits — things like kexec_load, various rarely-needed namespace/mount-manipulation syscalls, and others. Most applications never notice this default restriction at all, since they simply never call the blocked syscalls in normal operation.

docker run --security-opt seccomp=unconfined myapp:1.0    # disables seccomp filtering entirely -- generally a bad idea

Disabling seccomp entirely (unconfined) removes this layer of protection. This is occasionally necessary for specialized workloads that genuinely need a normally-blocked syscall, such as certain low-level debugging or tracing tools, or some specialized networking software. But this should be a deliberate, narrow exception, not a default reached for just to make an error message go away without understanding why it occurred.
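
Whether a filter is actually active can be checked from inside the container: the Seccomp field in /proc/self/status reports the kernel's seccomp mode for the process as a number. The sketch below maps that number to a name. seccomp_mode_name is a hypothetical helper; the numeric values are the kernel's own (0 disabled, 1 strict, 2 filter).

```shell
# Hypothetical helper: translate the "Seccomp:" value from /proc/PID/status into a name.
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# Inside a running container (Linux only):
#   seccomp_mode_name "$(awk '/^Seccomp:/ { print $2 }' /proc/self/status)"
```

Under a seccomp profile such as Docker's default, this should report "filter"; a container started with seccomp=unconfined should report "disabled".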

AppArmor / SELinux — mandatory access control beyond syscall filtering

Where seccomp restricts which syscalls can be made at all, AppArmor (common on Ubuntu/Debian-based systems) and SELinux (common on RHEL/Fedora-based systems) restrict what a process can actually do with the syscalls it's allowed to make. This includes which specific files it can read or write, what network operations it can perform, and which capabilities it can use, based on a named security profile applied to the process.

docker run --security-opt apparmor=docker-default myapp:1.0

Docker applies a default AppArmor profile automatically on systems where AppArmor is available, similarly restricting a range of higher-risk operations by default without requiring any explicit configuration from the person running the container.

How these layers relate to namespaces, cgroups, and capabilities

Namespaces:        control what a process can SEE (isolation)
cgroups:           control how much a process can USE (resource limits)
Capabilities:      control WHICH root-like privileges a process has, if any
seccomp:           controls WHICH SYSTEM CALLS a process can make at all
AppArmor/SELinux:  control WHAT a process can DO with specific files/resources/capabilities

These are genuinely complementary, layered defenses. A container can stay within its cgroup resource limits and run as a properly non-root user with capabilities already dropped (see those questions), and still benefit from an additional seccomp/AppArmor layer that blocks syscalls or file access that shouldn't be reachable at all, in case some other assumption in the chain turns out to be wrong. This layering is a textbook example of defense in depth: no single mechanism is assumed to be sufficient on its own.

Why most users never think about these layers explicitly

Docker applies sensible seccomp and AppArmor defaults automatically, without requiring explicit configuration for the common case. This is exactly why many practitioners aren't aware these protections are active at all. "Just running in a container" already provides meaningfully more restriction than "just running as a regular host process," for exactly this reason. The defaults are worth leaving in place for the overwhelming majority of workloads. Disabling either layer (unconfined) should be treated as a deliberate, narrowly-scoped exception requiring real justification — never a default troubleshooting step for a confusing error.

Why baking secrets into an image is always wrong

# NEVER do any of these
ENV DB_PASSWORD=supersecret
COPY .env /app/.env
ARG API_KEY
RUN curl -H "Authorization: Bearer $API_KEY" https://example.com

Every one of these persists the secret's value inside the image's layers or build history — recoverable by anyone with access to the image (docker history, inspecting layer contents directly, or simply docker run and reading the baked-in ENV/file). Recall from the layer-caching question that layers are effectively permanent once built — even a later layer that appears to "remove" the secret doesn't actually erase it from the earlier layer's stored data. This applies even to ARG (see that question) — build arguments can still leave traces in image metadata/history even though they're not automatically present in the running container's environment.
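
Patterns like these are easy to catch mechanically before a build. The sketch below is a rough heuristic, not a complete secret scanner: find_baked_secrets is a hypothetical helper that flags ENV or ARG names containing PASSWORD, SECRET, TOKEN, or KEY, plus any COPY of a .env file. It will miss secrets with other names and may flag harmless variables.

```shell
# Hypothetical heuristic: list Dockerfile lines that look like baked-in secrets.
# Prints "line-number:line" for each hit and exits non-zero when nothing matches.
find_baked_secrets() {
  grep -nEi '^[[:space:]]*(ENV|ARG)[[:space:]]+[A-Za-z0-9_]*(PASSWORD|SECRET|TOKEN|KEY)|^[[:space:]]*COPY[[:space:]].*\.env' "$1"
}

# Example:
#   find_baked_secrets ./Dockerfile
```

Run against the Dockerfile above, this flags the ENV, COPY, and ARG lines. The RUN line is not flagged directly, but it only becomes a problem because of the ARG it consumes.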

Runtime injection — better, but still has caveats

docker run -e DB_PASSWORD="$DB_PASSWORD" myapp:1.0

This avoids baking the secret into the image itself, but plain environment variables have their own real exposure risks. They're visible to anything that can inspect the container's configuration (docker inspect), and readable through /proc/<pid>/environ by any process with sufficient access to the container's processes. They're also inherited by every child process the application spawns, and they commonly end up accidentally logged: many applications and frameworks log their full environment at startup for debugging purposes, inadvertently capturing secrets in log output. Crash dumps and error-reporting tools that include environment context can expose them the same way.
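
Where environment variables are unavoidable, the accidental-logging risk can at least be reduced by never printing the raw environment. The sketch below is one possible approach, assuming secrets follow an uppercase naming convention containing PASSWORD, SECRET, TOKEN, or KEY. redacted_env is a hypothetical helper, and it does not handle values that span multiple lines.

```shell
# Hypothetical helper: print the environment with secret-looking values masked.
# Only variable names containing PASSWORD, SECRET, TOKEN, or KEY are redacted.
redacted_env() {
  env | sed -E 's/^([A-Za-z0-9_]*(PASSWORD|SECRET|TOKEN|KEY)[A-Za-z0-9_]*)=.*/\1=[REDACTED]/'
}

# Example (in a startup script, instead of a bare "env"):
#   redacted_env
```

This only protects log output produced by the script itself. It does nothing about docker inspect or /proc, which is why file-based secrets remain the better mechanism.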

A better runtime mechanism: mounted secret files

# Docker Swarm's native secrets mechanism
echo "supersecret" | docker secret create db_password -
docker service create --secret db_password myapp:1.0
# Inside the container, the secret is available as a FILE, not an environment variable:
cat /run/secrets/db_password
# supersecret

Docker Swarm's native secrets mechanism, and Compose's own secrets: key (which can source from Swarm secrets or, for non-Swarm local development, a plain file), deliver a secret as a file mounted into the container at a well-known path, rather than as an environment variable. This avoids several of environment variables' specific exposure risks, such as accidental logging of the full environment or visibility in some process-inspection tools, since the secret only exists as file content the application must explicitly choose to read.

# Compose secrets (non-Swarm, file-based)
services:
  api:
    secrets:
      - db_password
secrets:
  db_password:
    file: ./db_password.txt    # this file itself must never be committed to version control
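
On the application side, consuming a file-based secret is a deliberate read rather than an ambient variable. The sketch below shows one way an entrypoint script might do it. read_secret is a hypothetical helper; /run/secrets is the mount path used in the Swarm example above, and the optional second argument exists only so the function can be exercised outside a container.

```shell
# Hypothetical helper: print the contents of a mounted secret file.
# $1: secret name; $2: directory holding secret files (defaults to /run/secrets)
read_secret() {
  secret_file="${2:-/run/secrets}/$1"
  if [ -r "$secret_file" ]; then
    cat "$secret_file"
  else
    echo "secret '$1' not found at $secret_file" >&2
    return 1
  fi
}

# Example inside the container:
#   db_password=$(read_secret db_password) || exit 1
```

Keeping the value in an unexported shell variable, or passing only the file path to the application, avoids reintroducing the environment-variable exposure described earlier.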

Build-time secrets — for values only needed transiently during the build

# In the Dockerfile (requires BuildKit)
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN=$(cat /run/secrets/npm_token) npm install

# On the host, when building the image
docker build --secret id=npm_token,src=./npm_token.txt -t myapp .

BuildKit's dedicated secret-mounting syntax (--mount=type=secret) makes a secret available only during that specific RUN instruction's execution, without it persisting in any built layer or the final image's history at all. This is the correct mechanism for a private package registry token or similar credential needed transiently just to complete a build step, closing exactly the gap the ARG-for-secrets anti-pattern leaves open.

External secrets managers — the strongest option for production

For genuinely sensitive production secrets, integrate with a dedicated secrets manager — HashiCorp Vault, AWS Secrets Manager, and similar (see the SQL/Databases and Kubernetes stacks' equivalent questions). The application can fetch secrets directly at runtime, or a sidecar/init pattern can inject them. This provides stronger audit trails, rotation, and centralized access control than any Docker-native mechanism alone offers.

Secret need                                                     | Right mechanism
Build-time only (e.g. a private registry token for npm install) | BuildKit --mount=type=secret
Runtime, simple setup                                           | Swarm/Compose native secrets (file-based)
Runtime, needs audit trail/rotation/centralized control         | External secrets manager (Vault, AWS Secrets Manager)
Never                                                           | ARG, COPY, or ENV baked into the image