What are Linux capabilities, and how do you drop unneeded ones from a container?

Detailed Answer

Why capabilities exist: breaking up "root" into pieces

Traditionally, Unix permission checking was binary — a process either ran as root (with unrestricted power to do essentially anything on the system) or as a regular user (subject to normal permission checks). Linux capabilities split root's traditionally monolithic power into dozens of distinct, individually grantable privileges — CAP_NET_BIND_SERVICE (bind to a port below 1024), CAP_CHOWN (change file ownership arbitrarily), CAP_SYS_ADMIN (a broad, catch-all set of administrative operations), CAP_SYS_MODULE (load kernel modules), and many others. This lets a process be granted just the specific slice of "root-like" power it actually needs, rather than all of it or none of it.
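Under the hood, each capability is one bit in a set of per-process bitmasks, which the kernel exposes as the CapInh, CapPrm, CapEff, CapBnd, and CapAmb lines of /proc/<pid>/status. A minimal shell sketch of reading one of those masks and testing a single bit might look like the following. cap_field and has_cap are hypothetical helper names, not standard tools, and the bit number used in the example (10 for CAP_NET_BIND_SERVICE) comes from <linux/capability.h>.

```shell
#!/bin/sh
# cap_field FIELD: read /proc/<pid>/status text on stdin and print the hex
# mask stored in FIELD (one of CapInh, CapPrm, CapEff, CapBnd, CapAmb).
cap_field() {
    awk -v field="$1:" '$1 == field { print $2 }'
}

# has_cap HEXMASK BIT: succeed if capability number BIT is set in HEXMASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Example (Linux only): does the current shell hold CAP_NET_BIND_SERVICE?
#   mask=$(cap_field CapEff < /proc/$$/status)
#   has_cap "$mask" 10 && echo "can bind ports below 1024"
```

CapEff is the set the kernel actually checks when the process attempts a privileged operation; CapBnd is the upper bound on what the process can ever acquire.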

Docker's default capability set — already more restricted than full root

Even a container running as root does not receive the full capability set that a root process on the host would have. By default, Docker grants a modest subset that typical applications tend to need (CAP_CHOWN, CAP_NET_BIND_SERVICE, CAP_SETUID, CAP_SETGID, and a handful of others) and withholds more dangerous ones such as CAP_SYS_ADMIN, CAP_SYS_MODULE, and CAP_SYS_PTRACE. This is itself a form of built-in least privilege, and it applies whether or not the container's process is nominally "root".
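To see the default set concretely, the CapEff mask of a root process in a default container can be decoded bit by bit. The sketch below uses a hypothetical decode_caps helper (capsh --decode from libcap does the same job where it is installed). The mask 00000000a80425fb in the usage comment is the value commonly seen for Docker's default set; treat it as an assumption and read the real one from /proc/1/status in your own container, since defaults can differ between versions and runtimes.

```shell
#!/bin/sh
# Capability names in bit order (bit 0 first), per <linux/capability.h>.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# decode_caps HEXMASK: print one CAP_* name for every bit set in HEXMASK.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "CAP_$name"
        fi
        bit=$((bit + 1))
    done
}

# Usage:
#   decode_caps 00000000a80425fb
```

Decoding that mask yields fourteen capabilities, including CAP_CHOWN, CAP_SETUID, CAP_SETGID, and CAP_NET_BIND_SERVICE, and none of CAP_SYS_ADMIN, CAP_SYS_MODULE, or CAP_SYS_PTRACE.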

Dropping capabilities further — reaching for true minimalism

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp:1.0

--cap-drop=ALL removes every capability, including Docker's default set. --cap-add then adds back only the ones the container genuinely needs; in this example, just NET_BIND_SERVICE, which is required only if the application binds to a port below 1024. Many application containers run fine with most or all capabilities dropped, particularly those that neither bind to a privileged port nor perform system-level operations, because they were never using most of Docker's already modest default set.
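One way to confirm that the flags took effect is to compute the mask you expect and compare it with what the container reports. The following is a rough sketch under a few assumptions: expected_mask is a hypothetical helper, the image is assumed to ship grep, and names are given without the CAP_ prefix, as on the docker command line.

```shell
#!/bin/sh
# Capability names in bit order (bit 0 first), per <linux/capability.h>.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# expected_mask NAME...: print the 16-digit hex mask with exactly the named
# capabilities set, in the same format as /proc/<pid>/status.
expected_mask() {
    mask=0
    for want in "$@"; do
        bit=0
        for name in $CAP_NAMES; do
            if [ "$name" = "$want" ]; then
                mask=$((mask | (1 << bit)))
            fi
            bit=$((bit + 1))
        done
    done
    printf '%016x\n' "$mask"
}

# Usage:
#   expected_mask NET_BIND_SERVICE
#   docker run --rm --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp:1.0 \
#       grep CapBnd /proc/self/status
```

The comparison uses CapBnd (the bounding set) rather than CapEff because a container running as a non-root user normally shows an empty effective set regardless of the flags, while the bounding set still reflects what was dropped and added.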

# In Kubernetes, this maps directly onto SecurityContext (see that stack's question)
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]

This is the same mechanism and the same recommended pattern (drop everything, add back only what is needed) covered in the Kubernetes stack's SecurityContext question. At the implementation level, the Kubernetes setting simply configures this same Linux kernel feature.

Why this matters: limiting what a compromised container can actually do

If an attacker achieves code execution inside a container, the capabilities that container holds directly determine which system-level actions they can attempt next. A container that retains CAP_SYS_ADMIN gives an attacker access to a wide range of administrative operations, some of them potentially useful for escaping the container. The same compromise inside a container that has dropped every capability except the one or two it needs leaves the attacker dramatically less to work with, even before any other layer of defense comes into play.
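One way to act on this is to audit a container's capability mask for the higher-risk capabilities. The sketch below is illustrative only: risky_caps is a hypothetical helper, and the three capabilities it checks are simply the ones named in this answer, not a complete list of dangerous capabilities.

```shell
#!/bin/sh
# risky_caps HEXMASK: print each higher-risk capability set in HEXMASK.
# Bit numbers are from <linux/capability.h>; the list is illustrative only.
risky_caps() {
    mask=$((0x$1))
    for entry in 16:CAP_SYS_MODULE 19:CAP_SYS_PTRACE 21:CAP_SYS_ADMIN; do
        bit=${entry%%:*}    # text before the colon: the bit number
        name=${entry#*:}    # text after the colon: the capability name
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
    done
}

# Usage (mask taken from the CapBnd or CapEff line of /proc/<pid>/status):
#   risky_caps 0000000000200000
```

An empty result means none of the three is present. A container started with --privileged typically reports a mask with every bit set, so all three would be listed.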

Identifying which capabilities an application actually needs

Finding this out means either consulting the documentation for the application or its base image, or testing empirically: run with --cap-drop=ALL and add capabilities back one at a time until the application works correctly. There is no universal answer, since it depends entirely on what the application does, such as binding to privileged ports or changing file ownership. Once the minimal set is identified, applying it is a relatively low-effort, high-leverage hardening step for any production container, in keeping with the least-privilege discipline this stack applies elsewhere.
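The empirical search can be partly automated. Note that the sketch below uses a different search order from the manual one described above: it starts from a candidate set that is known to work and tries removing each capability in turn, which is easier to script than guessing what to add next. minimal_caps and docker_probe are hypothetical helpers, and /app/selftest is a placeholder for whatever command actually proves your application works.

```shell
#!/bin/sh
# minimal_caps PROBE CANDIDATE...: greedily drop every candidate capability
# the application can do without. PROBE is a command that receives the
# capabilities to grant as arguments and returns 0 if the application works.
minimal_caps() {
    probe=$1
    shift
    keep="$*"
    for cap in "$@"; do
        # Build the current set minus $cap.
        trial=""
        for c in $keep; do
            [ "$c" = "$cap" ] || trial="$trial $c"
        done
        # If the application still works without $cap, drop it for good.
        if "$probe" $trial >/dev/null 2>&1; then
            keep=$trial
        fi
    done
    echo $keep
}

# docker_probe CAP...: run the image with only the given capabilities.
# /app/selftest is a placeholder; substitute a real health check.
docker_probe() {
    flags="--cap-drop=ALL"
    for c in "$@"; do
        flags="$flags --cap-add=$c"
    done
    docker run --rm $flags myapp:1.0 /app/selftest
}

# Usage:
#   minimal_caps docker_probe CHOWN SETUID SETGID NET_BIND_SERVICE
```

The result is only as trustworthy as the probe: a check that does not exercise every code path can report a set that later fails in production, so the output is best treated as a starting point for review.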