Docker in Production and CI/CD

Why the default driver doesn't scale to real production needs

The default json-file driver (see the lifecycle topic) writes each container's logs to a local file on that specific host's disk, with no automatic rotation configured unless you explicitly set --log-opt max-size/max-file. Across a fleet of many hosts running many containers, this creates several problems. Logs are scattered across many machines with no unified way to search across all of them. A host being replaced (common in autoscaled or ephemeral infrastructure) takes its logs with it. Unbounded log growth can genuinely fill a host's disk if rotation isn't explicitly configured.

Setting a logging driver

docker run --log-driver=syslog --log-opt syslog-address=udp://loghost:514 myapp
docker run --log-driver=fluentd --log-opt fluentd-address=localhost:24224 myapp
docker run --log-driver=awslogs --log-opt awslogs-group=myapp --log-opt awslogs-region=us-east-1 myapp

Or, to set a default for the whole Docker daemon (rather than specifying it per-container):

// /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
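With rotation configured as above, the worst-case disk footprint of container logs on a host is roughly bounded, and the bound is simple arithmetic. The sketch below is illustrative only: the function name log_disk_budget_mb and the figure of 50 containers are assumptions, and it treats max-size as a whole number of megabytes.

```shell
#!/bin/sh
# Hypothetical helper: approximate worst-case json-file log usage on one host, in MB.
# Each container keeps at most max-file files of roughly max-size each.
log_disk_budget_mb() {
  max_size_mb=$1   # e.g. 10 for "max-size": "10m"
  max_file=$2      # e.g. 3  for "max-file": "3"
  containers=$3    # containers running on this host (assumed figure)
  echo $(( max_size_mb * max_file * containers ))
}

# 10 MB x 3 files x 50 containers
log_disk_budget_mb 10 3 50   # prints 1500
```

Two caveats apply: the daemon must be restarted for a daemon.json change to take effect, and the new defaults apply only to containers created afterwards.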

Common production-oriented drivers

  • syslog — forwards to a syslog server, a long-established standard for centralized Unix/Linux logging.
  • journald — integrates with systemd's journal on hosts using systemd, useful if the rest of the host's own logging already goes through journald.
  • fluentd / gelf — forward to Fluentd or a GELF-compatible endpoint (like Graylog). These are common choices for feeding into an Elasticsearch/Loki-based centralized logging stack, the same kind of architecture covered in the Kubernetes stack's logging question. Here, Docker's own driver mechanism feeds the stack directly, rather than a separate log-shipping DaemonSet.
  • awslogs — forwards directly to AWS CloudWatch Logs, a natural fit when already running on AWS infrastructure.

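The tradeoff: docker logs may stop working locally

docker run -d --name web --log-driver=awslogs --log-opt awslogs-group=myapp --log-opt awslogs-region=us-east-1 myapp
docker logs web
# On Docker Engine before 20.10, or with the local log cache disabled (--log-opt cache-disabled=true):
# Error response from daemon: configured logging driver does not support reading

Only the json-file, local, and journald drivers can serve docker logs themselves. With any other driver, Docker Engine releases before 20.10 cannot read the container's log output locally at all: logs are only accessible through whatever external system the chosen driver forwards to. Engine 20.10 and later soften this with "dual logging," a host-local cache that keeps docker logs working by default. That cache is size-limited and disappears with the host, so it is a convenience rather than a source of truth. Either way, this is an important, sometimes-surprising tradeoff for teams used to reaching for docker logs directly during troubleshooting. The mental model needs to shift to "check the centralized logging system," not "SSH into the host and run docker logs."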
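Whether docker logs can read directly from a driver comes down to a short list in Docker's documentation (json-file, local, and journald at the time of writing). A minimal sketch of a pre-flight check for a troubleshooting script follows; the function name and the container name "web" are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: can `docker logs` read from this driver natively,
# without depending on any host-local cache?
driver_supports_local_read() {
  case "$1" in
    json-file|local|journald) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on a real host (requires Docker; "web" is an assumed container name):
#   driver=$(docker inspect -f '{{.HostConfig.LogConfig.Type}}' web)
#   driver_supports_local_read "$driver" || echo "check the centralized logging system"
```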

Why this matters for anything beyond a single-host deployment

Centralized logging isn't optional once you're running more than a handful of containers across more than one host. Without it, diagnosing an issue that spans multiple services (a request that touches three different containers, possibly on three different hosts) requires manually checking logs on each machine individually. This becomes impractical fast. A centralized logging driver, feeding into a searchable system, is what makes cross-service, cross-host troubleshooting tractable at any real scale. Set once at the daemon level, it applies to every container created afterwards without per-container setup. Containers that already exist keep the driver they were created with until they are recreated.

A representative pipeline sequence

# Simplified CI pipeline concept
steps:
  - name: Build image
    run: docker build -t myapp:${{ github.sha }} .

  - name: Run tests inside a container
    run: docker run --rm myapp:${{ github.sha }} npm test

  - name: Scan for vulnerabilities
    run: trivy image --exit-code 1 --severity CRITICAL myapp:${{ github.sha }}

  - name: Push to registry
    run: |
      docker tag myapp:${{ github.sha }} myregistry.example.com/myapp:${{ github.sha }}
      docker push myregistry.example.com/myapp:${{ github.sha }}

Why building the image is often the very first step, before tests even run

Building the actual production image first, then running tests inside a container built from that image, ensures the tests genuinely validate the same environment that will actually run in production. This is different from running tests directly on the CI runner's own environment, separately from the image build. Doing it this way closes the exact "works on my machine/CI, breaks in production" gap that containers exist to solve in the first place (see the fundamentals topic). Running tests against a different environment than what's actually shipped defeats much of the purpose of containerizing the application at all.

Tagging with the commit SHA for traceability

docker build -t myapp:${{ github.sha }} .

Tagging each CI-built image with the specific git commit SHA that produced it (rather than only a generic version tag like latest or even 1.0) gives an unambiguous, traceable link between a specific running image and the exact source code that built it. This is essential for debugging ("which commit is actually deployed right now") and for the digest-pinning practices covered in the registries topic.
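Outside GitHub Actions, the same tagging scheme can be reproduced with plain git. The sketch below assumes a hypothetical image_ref helper and the example registry host used elsewhere in this section. The label shown in the comment is the standard OCI annotation key for a source revision.

```shell
#!/bin/sh
# Hypothetical helper: compose a fully qualified image reference from a
# registry host, repository name, and git commit SHA.
image_ref() {
  registry=$1; name=$2; sha=$3
  printf '%s/%s:%s\n' "$registry" "$name" "$sha"
}

# In a real pipeline (requires git and Docker):
#   sha=$(git rev-parse HEAD)
#   docker build \
#     --label org.opencontainers.image.revision="$sha" \
#     -t "$(image_ref myregistry.example.com myapp "$sha")" .
image_ref myregistry.example.com myapp 3f2a9c1   # prints myregistry.example.com/myapp:3f2a9c1
```

Recording the SHA as a label as well as a tag means the link to the source commit survives even if the image is later retagged.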

Leveraging build cache across CI runs

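- name: Build with registry cache
  run: |
    # The legacy builder needs the image present locally; tolerate a miss on the first run
    docker pull myregistry.example.com/myapp:latest || true
    # With BuildKit, the cache source must itself have been built with inline cache metadata
    docker build \
      --cache-from myregistry.example.com/myapp:latest \
      --build-arg BUILDKIT_INLINE_CACHE=1 \
      -t myapp:${{ github.sha }} .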

A fresh CI runner typically starts with no local build cache at all, unlike a developer's own machine, which accumulates cache across many local builds. Without addressing this, every single CI build is effectively a full, uncached rebuild, however well the Dockerfile itself is ordered for caching (see that question). Two techniques let CI builds benefit from layer caching despite starting from a clean runner environment each time: pulling a previous build's image as an explicit cache source (--cache-from), or using BuildKit's remote cache export/import capability (see that question).
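The second technique, BuildKit's remote cache, can be sketched as a pair of buildx flags pointing at a dedicated cache reference in the registry. The helper name and the :buildcache ref are assumptions. Exporting a registry cache generally also requires a builder that supports it, for example one created with docker buildx create.

```shell
#!/bin/sh
# Hypothetical helper: print the buildx flags that import and export a
# registry-backed build cache stored under one dedicated ref.
registry_cache_flags() {
  cache_ref=$1   # e.g. myregistry.example.com/myapp:buildcache (assumed name)
  printf '%s %s\n' \
    "--cache-from type=registry,ref=$cache_ref" \
    "--cache-to type=registry,ref=$cache_ref,mode=max"
}

# Usage in CI (requires docker buildx and push access to the registry):
#   docker buildx build $(registry_cache_flags myregistry.example.com/myapp:buildcache) \
#     -t myregistry.example.com/myapp:"$GIT_SHA" --push .
registry_cache_flags myregistry.example.com/myapp:buildcache
```

mode=max exports layers from intermediate stages too, which matters for multi-stage Dockerfiles where most of the expensive work happens before the final stage.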

Multi-stage builds work especially well in CI

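FROM node:20 AS test
WORKDIR /app
COPY . .
# Assumes package.json defines "test" and "build" scripts, and that "build" emits ./dist
RUN npm ci && npm test && npm run build

FROM node:20-slim AS production
WORKDIR /app
COPY --from=test /app/dist ./dist
CMD ["node", "dist/server.js"]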

A dedicated test stage can run the full test suite (with all dev dependencies, test frameworks, etc.) while the final production stage only copies out the built artifacts. This combines the "test in the real build environment" benefit with the "final shipped image stays minimal" benefit from the multi-stage builds question, in one Dockerfile.
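In a pipeline, the two stages can also be addressed separately with docker build --target. The sketch below only prints the commands, so it runs on machines without Docker. The helper name and the image tags are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print the build command for one named stage of the
# Dockerfile above. Drop the echo to execute the build for real.
build_stage_cmd() {
  stage=$1; tag=$2
  echo "docker build --target $stage -t $tag ."
}

# Fast feedback: stops once the test stage has finished.
build_stage_cmd test myapp:test
# Full build: the production stage copies from the test stage, so the
# tests still have to pass before a production image exists.
build_stage_cmd production myapp:candidate
```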

Security scanning as a CI gate

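As covered in the registries topic's vulnerability-scanning question, integrating a scan step that can fail the build on critical/high findings prevents a genuinely dangerous image from ever reaching a registry or a deployment target. The Trivy step in the pipeline above gates on CRITICAL only; widening it to --severity HIGH,CRITICAL is a policy choice. Either way, the scan runs before the push, which catches the issue as early in the pipeline as practical.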

Why this scenario comes up at all

Many CI/CD systems run each job inside its own container for isolation and reproducibility — but if that job's own work is to build a Docker image (a very common CI task), you end up needing to run Docker inside the container the CI job itself is running in. This is the scenario Docker-in-Docker addresses.

The DinD approach — a real, nested Docker daemon

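# The dind image enables TLS (port 2376) by default; DOCKER_TLS_CERTDIR="" turns it off so plain tcp/2375 works. Only do this on an isolated network.
docker run --privileged -d --name dind -e DOCKER_TLS_CERTDIR="" docker:24-dind
docker run --rm --link dind:docker -e DOCKER_HOST=tcp://docker:2375 -v "$PWD":/src -w /src docker:24 docker build .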

This runs an entirely separate Docker daemon inside a container, and a second container talks to that nested daemon to actually perform builds — genuinely running "Docker inside Docker," not just talking to the host's existing daemon.
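One practical wrinkle with a nested daemon is that it takes a moment to start, so a build fired immediately after the first docker run can fail to connect. A minimal sketch of a readiness wait follows. The helper name and the 30-attempt budget are assumptions, and the probe is whatever command is passed in.

```shell
#!/bin/sh
# Hypothetical helper: retry a probe command once per second until it
# succeeds or the attempt budget runs out. Returns 0 on success, 1 on timeout.
wait_until_ready() {
  attempts=$1; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage against the nested daemon (requires Docker and the dind container):
#   wait_until_ready 30 docker -H tcp://docker:2375 info || exit 1
```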

The real risks of this approach

  • Requires --privileged mode — running a nested Docker daemon generally requires disabling most container isolation protections for the outer container. --privileged grants nearly all capabilities and disables several security restrictions covered in the security topic. This is a significant security relaxation, not a minor detail, and it directly undermines much of the isolation benefit containers are meant to provide in the first place.
  • Storage-driver complications — running a container filesystem (an overlay/union filesystem; see the fundamentals topic) inside another container's own overlay filesystem has historically caused genuine compatibility and performance issues. This happens because it layers the same kind of filesystem trickery on top of itself.
  • Weaker isolation than the "in Docker" framing suggests — despite feeling like it should be "extra isolated" (Docker inside Docker), the --privileged requirement actually means the outer container has less isolation from the host. An ordinary, non-privileged container would have more isolation than this setup provides.

Alternative 1: mounting the host's Docker socket

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -v "$PWD":/src -w /src docker:24 docker build .

This avoids running a nested daemon at all — instead, the CI container talks directly to the host's own Docker daemon (see the security topic's Docker-socket question). This avoids DinD's storage-driver and --privileged concerns, but it introduces its own well-documented, serious risk. As covered in that question, socket access is functionally equivalent to host root. Any CI job with this mount has, in effect, host-level access, which is a serious concern for CI systems running untrusted or third-party pipeline code.
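Because the socket mount is easy to overlook in a long docker run line or job definition, it can help to audit for it. The sketch below assumes a hypothetical helper that reads mount source paths on stdin. The container name "ci-job" in the usage comment is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: read mount source paths on stdin (one per line) and
# succeed if any of them is the Docker socket (/var/run or /run variant).
has_docker_socket_mount() {
  grep -Eqx '(/var)?/run/docker\.sock'
}

# Usage against a real container (requires Docker; "ci-job" is an assumed name):
#   docker inspect -f '{{range .Mounts}}{{println .Source}}{{end}}' ci-job \
#     | has_docker_socket_mount && echo "ci-job has host-root-equivalent access"
printf '%s\n' /home/ci/src /var/run/docker.sock | has_docker_socket_mount && echo "socket mounted"
```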

Alternative 2: purpose-built daemonless image-building tools

# Kaniko, running inside a Kubernetes Pod with no special privileges,
# building an image without ever needing a Docker daemon (nested or host) at all

Tools like Kaniko (built by Google, commonly used in Kubernetes-based CI) and Buildah can build OCI-compliant images without requiring a Docker daemon at all. They implement the image-building logic directly in user space, without needing privileged access or a socket to any daemon. This genuinely avoids both of the above risks, rather than merely mitigating them. This is increasingly the preferred approach specifically for CI systems (especially Kubernetes-based ones) that need to build images as an ordinary, unprivileged step in an otherwise-sandboxed pipeline.
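For a concrete sense of what a daemonless build looks like, the sketch below prints a Kaniko executor invocation. It is printed rather than executed because /kaniko/executor exists only inside the Kaniko image. The helper name, the /workspace path, and the destination tag are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print a Kaniko executor command for a build context
# directory and a destination image reference.
kaniko_cmd() {
  context_dir=$1; destination=$2
  echo "/kaniko/executor --context dir://$context_dir --dockerfile $context_dir/Dockerfile --destination $destination"
}

kaniko_cmd /workspace myregistry.example.com/myapp:3f2a9c1
```

No Docker socket and no --privileged flag appear anywhere in the command. The executor reads the Dockerfile, builds the layers in user space, and pushes the result to the destination registry itself.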

Weighing the tradeoffs

| Approach | Privilege required | Risk profile |
|---|---|---|
| Docker-in-Docker (nested daemon) | --privileged | Significant isolation weakening; storage-driver quirks |
| Mounted host socket | None on the container itself, but socket access = host root | Serious, well-documented escalation risk |
| Kaniko / Buildah (daemonless) | None | Avoids both risks above entirely |

Where DinD or socket-mounting genuinely can't be avoided (some legacy pipeline setups, specific tooling requirements), the right mental model is to treat that CI runner as a fully trusted, high-privilege environment. Restrict what pipelines are allowed to run in it accordingly. Do not treat it as just another routine, low-stakes CI job.

Docker Swarm — Docker's own, simpler built-in orchestrator

docker swarm init                                    # initialize a Swarm on this node
docker service create --name web --replicas 3 -p 80:80 nginx   # deploy a replicated service across the Swarm

Swarm mode turns a group of Docker hosts into a cluster, using concepts (services, tasks, overlay networks; see the networking topic's overlay question) that closely mirror plain Docker's own CLI and mental model. This closeness is Swarm's biggest advantage. Someone already comfortable with plain docker run/docker-compose concepts can pick up Swarm with relatively little additional learning. This is a much smaller learning curve compared to Kubernetes's much larger and more distinct set of concepts (Pods, Deployments, Services, ConfigMaps, RBAC, and dozens more; see that stack).

Kubernetes — the dominant, far more feature-rich orchestrator

Covered extensively in its own dedicated stack, Kubernetes provides sophisticated scheduling (affinity, taints/tolerations, priority/preemption), rich networking options (Ingress, NetworkPolicies, multiple CNI choices), and a vast ecosystem of extensions (CRDs, Operators, Helm charts for essentially any popular software). It is also what every major cloud provider offers a managed service for.

The key practical tradeoffs

| | Docker Swarm | Kubernetes |
|---|---|---|
| Learning curve | Gentle (builds directly on Docker concepts) | Steep (many distinct concepts) |
| Feature richness | Basic (replicas, overlay networking, rolling updates) | Extensive (see that stack's many topics) |
| Ecosystem/tooling | Small, and has been shrinking | Enormous, still growing |
| Managed cloud offerings | Minimal | Extensive (EKS, GKE, AKS, and more) |
| Current industry momentum | Declining | Dominant |

Why this comparison matters for an interview, even though the answer leans clearly toward Kubernetes today

Recommending Swarm for a brand-new production system today would be an unusual choice given the industry's clear consolidation around Kubernetes, and it needs strong, specific justification: a very small team, a very simple deployment need, and a strong existing preference for staying within familiar plain-Docker concepts rather than adopting Kubernetes's larger surface area. A candidate should be able to describe this landscape honestly. That means acknowledging Swarm's real simplicity advantage while recognizing that the ecosystem has broadly moved on. It also means neither dismissing Swarm as having no merit nor recommending it without weighing the tradeoff against Kubernetes's now-dominant position.

When Swarm might still be a reasonable, deliberate choice

  • A small team wanting basic multi-host orchestration (replicas, rolling updates, service discovery) without taking on Kubernetes's much larger learning curve and operational surface area.
  • An organization already deeply invested in plain Docker/Compose workflows, looking for the smallest possible step up to multi-host capability, rather than a much bigger architectural leap to Kubernetes.

The anti-pattern: building a separate image per environment

# BAD: separate builds per environment, baking in environment-specific config
docker build -t myapp:staging --build-arg API_URL=https://staging-api.example.com .
docker build -t myapp:production --build-arg API_URL=https://api.example.com .

This means myapp:staging and myapp:production are, strictly speaking, different artifacts. Even if the only intended difference is a configuration value, nothing guarantees the build process itself produced byte-for-byte identical images apart from that one value. A subtle build-time issue (a flaky dependency resolution, a build tool behaving slightly differently) could introduce an unintended difference between what was tested in staging and what actually ships to production. This directly undermines the confidence that "what we tested is exactly what we're deploying."

The correct pattern: build once, configure at runtime

docker build -t myapp:1.0 .          # ONE build, used everywhere

docker run -e API_URL=https://staging-api.example.com myapp:1.0        # staging
docker run -e API_URL=https://api.example.com myapp:1.0                 # production

The exact same image, byte-for-byte, is what runs in every environment. Only the runtime configuration (environment variables, mounted config files, secrets) differs. This is precisely the "build once, deploy many times, unchanged" principle covered throughout this stack (see the fundamentals topic and the tags/digests question). It means that if something works correctly in staging, you have real, direct confidence that the identical artifact will behave the same way in production, since nothing about the image itself changed between the two.

The twelve-factor app's "config" principle

This directly reflects Factor III (Config) of the twelve-factor app methodology. It calls for strict separation between an application's code (which should be identical across environments) and its configuration (which legitimately varies by environment). Configuration belongs in the environment (environment variables, mounted files), never hardcoded into the build artifact itself.
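One way this shows up in practice is an entrypoint that refuses to start when required configuration is missing, rather than falling back to a value baked into the image. A minimal sketch follows, in which the require_env helper is hypothetical and API_URL is the variable from the examples above.

```shell
#!/bin/sh
# Hypothetical entrypoint fragment: fail fast when a required setting is
# absent from the environment. Only pass trusted variable names, since the
# lookup uses eval.
require_env() {
  name=$1
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "missing required environment variable: $name" >&2
    return 1
  fi
}

# The same image starts in any environment, as long as that environment
# supplies API_URL (for instance via docker run -e API_URL=...).
API_URL=https://staging-api.example.com
require_env API_URL && echo "config ok: $API_URL"
```

Failing at startup turns a misconfigured deployment into an immediate, obvious error instead of a container that silently talks to the wrong backend.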

# Kubernetes ConfigMaps/Secrets (see that stack), or Compose environment/.env
# files, or a cloud platform's own environment-variable configuration --
# all apply this same principle at whatever layer is actually deploying the container

This principle is exactly why Kubernetes ConfigMaps/Secrets (see that stack) and Compose's environment/env_file mechanisms (see that topic) both exist as first-class concepts. They are the standard, orchestrator-level tools for injecting environment-specific configuration into an unchanged, promoted image, rather than requiring separate builds.

What this means for CI/CD pipeline design

1. Build the image ONCE, from a specific commit, tagged with that commit's SHA
2. Run tests against THAT SAME image
3. Push it to a registry
4. Deploy that SAME image (by digest, ideally -- see that question) to staging,
   with staging-specific configuration injected at deploy time
5. After validation, promote the SAME image (same digest) to production,
   with production-specific configuration injected at deploy time

This "build once, promote the same artifact through environments" pattern is a core CI/CD design principle. It eliminates an entire class of "it worked in staging but broke in production" bugs. Those bugs stem from staging and production having actually run subtly different artifacts, rather than the same one with different configuration.
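A promotion step can enforce the "same digest" rule mechanically by rejecting any reference that is not pinned by digest. The sketch below assumes a hypothetical is_digest_ref helper. The docker inspect line in the comment shows one way to resolve a pushed tag to its digest.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for references pinned by digest
# (repo@sha256:<64 lowercase hex chars>), which cannot be silently
# re-pointed the way a mutable tag can.
is_digest_ref() {
  printf '%s\n' "$1" | grep -Eq '^[^@]+@sha256:[0-9a-f]{64}$'
}

# Resolving a pushed image to its digest (requires Docker):
#   docker inspect --format '{{index .RepoDigests 0}}' myregistry.example.com/myapp:"$GIT_SHA"
# Guarding a deploy script:
#   is_digest_ref "$IMAGE" || { echo "refusing to deploy a tag: $IMAGE" >&2; exit 1; }
```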