Pods and Workload Controllers

What a Pod actually is

apiVersion: v1
kind: Pod
metadata:
  name: web-with-logger
spec:
  containers:
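    - name: web
      image: myapp:1.0
      ports:
        - containerPort: 8080
      volumeMounts:
        - name: shared-logs          # the app writes its logs here for the shipper to read
          mountPath: /var/log/app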
    - name: log-shipper
      image: fluentd:latest
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/app
  volumes:
    - name: shared-logs
      emptyDir: {}

Both containers in this Pod share one network namespace: they can reach each other via localhost, and from the outside both are reached through the Pod's single IP address. They also share any volume that each of them explicitly mounts (shared-logs in this example), so both can read and write the same files. Kubernetes schedules, starts, stops, and monitors the Pod as a unit: both containers always land on the same node together, and if the Pod is deleted, both containers go with it.

Why not just schedule individual containers directly

Some containers are only useful together and are tightly coupled by design. The sidecar pattern (see the sidecar question) is the clearest example: a log-shipping or service-mesh proxy container that needs to share the main application container's network namespace and/or filesystem to do its job. If Kubernetes scheduled and placed containers entirely independently, there'd be no way to guarantee that two specific containers always land on the same node, or to let them share localhost networking and volumes; you'd need an entirely separate mechanism just to express "these two things belong together." The Pod is that mechanism, built into the core scheduling unit from the start.

Most Pods have exactly one container

Despite the multi-container capability, the overwhelming majority of Pods in practice contain a single container. The multi-container case is specifically for the sidecar and init-container patterns, where genuine tight coupling is needed; it is not a general-purpose way to bundle unrelated services together. Two unrelated services (e.g., a web frontend and a completely separate backend API) should almost always be separate Pods, typically each managed by its own Deployment, not containers within the same Pod. Sharing a Pod would incorrectly couple their scaling (you can't scale one container in a Pod independently of the others) and their lifecycle (the Pod is Ready only when every container in it is ready, so one crash-looping container takes the healthy one out of Service endpoints as well).
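
As a rough sketch of the "separate Pods, separate Deployments" advice, the two manifests below give each service its own replica count. The names, images, and replica counts are hypothetical and chosen only for illustration.

```yaml
# Hypothetical names and images, for illustration only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 6                # scaled for user-facing traffic
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: frontend:1.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  replicas: 2                # scaled independently of the frontend
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: backend-api
          image: backend-api:1.0
```

Because each Deployment owns its own Pods, changing one replicas value or rolling out one image never touches the other service.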

Pods are ephemeral and not directly managed in production

You essentially never create bare Pods directly in production — they have no self-healing behavior on their own (a Pod that's deleted or whose node dies is simply gone, with nothing replacing it) and no rollout/rollback mechanism. Instead, you create a higher-level controller — a Deployment, StatefulSet, DaemonSet, or Job (see the following questions) — which manages a template for creating Pods and handles replacing, scaling, and updating them for you. Understanding that Pods are the unit of scheduling and execution, while Deployments/StatefulSets/etc. are the unit of desired state management, is the key distinction this topic is really testing.

ReplicaSet — maintains a stable count of identical Pods

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0

The ReplicaSet controller's reconciliation loop (see the control loop question) continuously ensures exactly 3 Pods matching the app: web label selector exist — if one is deleted or its node dies, a replacement is created; if you manually create a 4th matching Pod, it will be deleted to bring the count back down to 3.

What a ReplicaSet cannot do: change the Pod template's image version in a controlled, gradual way. If you edit a running ReplicaSet's image field, nothing happens to existing Pods — the new template only applies to future Pods it creates, so you'd have to manually delete old Pods one by one to see them replaced with the new version, with no coordination, health checking, or rollback built in.

Deployment — adds rollout management on top

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0

A Deployment creates and owns a ReplicaSet with this spec. When you update the Deployment's image (kubectl set image deployment/web web=myapp:2.0), the Deployment controller creates a new ReplicaSet with the updated template, and gradually scales the new one up while scaling the old one down — according to the configured strategy — rather than mutating Pods in place (see the rolling update question for the mechanics).

The layering, visualized

Deployment (rollout/rollback logic, revision history)
   └── ReplicaSet (maintains N identical replicas of one Pod template)
         └── Pod, Pod, Pod, ...

A Deployment update creates a new ReplicaSet rather than editing the existing one — this is precisely what enables rollback: the old ReplicaSet (scaled down to 0, but not deleted) still exists with its original template, so rolling back is just re-scaling it back up while scaling the current one down (kubectl rollout undo deployment/web).

You should essentially never create a bare ReplicaSet directly in production — always create a Deployment, and let it manage the underlying ReplicaSet(s) for you. Direct ReplicaSet management is mostly useful to understand conceptually, since it's the mechanism Deployments build on, and occasionally shows up when debugging why kubectl get replicasets shows old, scaled-to-zero ReplicaSets lingering after several rollouts (this is expected — it's the Deployment's revision history, bounded by the revisionHistoryLimit field).
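
A minimal sketch of bounding that history on the web Deployment from above. The value 3 is an arbitrary choice for illustration, not a recommendation, and the fragment omits the fields that stay unchanged.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  revisionHistoryLimit: 3   # keep at most 3 old ReplicaSets for rollback (default: 10)
  replicas: 3
  # selector, strategy, and template are unchanged from the Deployment above
```

Setting revisionHistoryLimit to 0 discards all old ReplicaSets, which also removes the ability to roll back.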

The default strategy: RollingUpdate

spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 fewer than desired can be unavailable during the rollout
      maxSurge: 1          # at most 1 MORE than desired can exist during the rollout

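With replicas: 4, maxUnavailable: 1, maxSurge: 1, the Deployment controller keeps at least 3 (4 - 1) Pods available and at most 5 (4 + 1) Pods in total (old + new combined) at every moment of the rollout. In the simplified, surge-first view sketched here, it creates a new Pod (from the new ReplicaSet), waits for it to pass its readiness probe (see the probes question) before considering it available, then terminates one old Pod, repeating until all old Pods are replaced. Because maxUnavailable is 1, the real controller may also take one old Pod down without waiting for the first new Pod to become ready. If you omit both fields, each defaults to 25%.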

Start:    [old][old][old][old]                    (4 old, 0 new)
Step 1:   [old][old][old][old][new]                (surge: 5 total, new not ready yet)
Step 2:   [old][old][old][new✓]                    (new became ready, one old terminated: 4 total)
Step 3:   [old][old][old][new✓][new]                (surge again: 5 total)
Step 4:   [old][old][new✓][new✓]                    (another old terminated: 4 total)
...continues until all 4 are the new version
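
Both fields also accept percentages, which is the form the defaults use. The fragment below is a sketch with an assumed replica count of 10, chosen to make the rounding visible: the maxSurge percentage is rounded up and the maxUnavailable percentage is rounded down.

```yaml
spec:
  replicas: 10               # assumed value, for illustration
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # 2.5 rounds up   -> at most 13 Pods in total
      maxUnavailable: 25%    # 2.5 rounds down -> at least 8 Pods available
```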

Why readiness probes are essential to a safe rollout

The rollout only proceeds to terminate an old Pod once a new Pod is marked Ready by its readiness probe (see the probes question). If the new version has a bug that causes it to fail its readiness check (e.g., it crashes on startup, or can't connect to a dependency), the rollout stalls rather than continuing to replace healthy old Pods with broken new ones. This is a critical safety property: without a readiness probe, a Pod counts as Ready as soon as its containers are running, so a rolling update cannot tell a working new version from one that starts but can't serve traffic, and will happily replace every healthy Pod with broken ones.
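
A minimal sketch of readiness gating in a Deployment's Pod template. The /healthz path and the timing values are assumptions for illustration; the port matches the containerPort used in the earlier Pod example. minReadySeconds and progressDeadlineSeconds are optional Deployment fields that tighten the same safety net.

```yaml
spec:
  minReadySeconds: 10            # a new Pod must stay Ready this long to count as available
  progressDeadlineSeconds: 600   # report the rollout as failed after 10 minutes without progress (the default)
  template:
    spec:
      containers:
        - name: web
          image: myapp:2.0
          readinessProbe:
            httpGet:
              path: /healthz     # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3  # mark the Pod not Ready after 3 consecutive failures
```

Exceeding progressDeadlineSeconds only marks the rollout as failed in the Deployment's status; it does not roll back automatically, so kubectl rollout undo remains a manual step.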

Rolling back

kubectl rollout status deployment/web        # watch a rollout's progress
kubectl rollout history deployment/web        # see past revisions
kubectl rollout undo deployment/web           # roll back to the previous revision
kubectl rollout undo deployment/web --to-revision=3   # roll back to a specific revision

rollout undo works by re-pointing the Deployment at a previous ReplicaSet's Pod template (retained, scaled to zero, from an earlier rollout — up to spec.revisionHistoryLimit, default 10) and performing the same gradual rolling process, just in the opposite direction — scaling the old (soon-to-be-current-again) ReplicaSet up while scaling the currently-bad one down. This means rollback gets the exact same safety properties (readiness-gated, gradual) as a forward rollout.

Pausing a rollout mid-flight

kubectl rollout pause deployment/web    # freeze the rollout at its current state
kubectl rollout resume deployment/web   # continue it

Useful for making several related changes to a Deployment's spec (e.g., updating both the image and a resource limit) without triggering a rollout after each individual edit — pause, make all your changes, then resume to trigger exactly one coordinated rollout.
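
kubectl rollout pause works by setting spec.paused on the Deployment, so the same workflow can also be expressed declaratively. A sketch, assuming the web Deployment from earlier and a hypothetical memory limit:

```yaml
spec:
  paused: true                   # template changes are recorded, but no rollout starts
  template:
    spec:
      containers:
        - name: web
          image: myapp:2.0       # change 1: new image
          resources:
            limits:
              memory: 256Mi      # change 2: hypothetical limit
```

Applying the manifest again with paused set to false (or with the field removed) then starts a single rollout that covers both changes.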

What rolling updates don't protect against

A rolling update only protects against a new version that fails its readiness probe — a bug that passes readiness checks but causes incorrect behavior under real production traffic (a subtle logic error, a slow memory leak) won't be caught by the rollout mechanism itself. This is exactly the gap that canary deployments and more sophisticated progressive-delivery tooling (see the production operations topic) are designed to close.

Why Deployments are wrong for stateful applications

A Deployment's Pods are interchangeable: they get randomly suffixed names (web-7d8f9c-x2k4p) and no guaranteed stable identity. If a Deployment's Pod template references a PersistentVolumeClaim, every replica references that same PVC, which a ReadWriteOnce volume typically cannot serve once replicas land on different nodes. If the template uses an ephemeral volume such as emptyDir instead, each replica starts with a fresh, empty volume. Either way, there's no built-in way to give each replica its own dedicated, durable, individually tracked storage that follows that specific replica across restarts.

What a StatefulSet provides instead

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "web"     # must reference a headless Service (see the networking topic)
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

A StatefulSet gives each replica four guarantees that a Deployment does not:

  • Stable, predictable Pod names: web-0, web-1, web-2, not random suffixes. If web-1 is deleted, its replacement is created with the exact same name web-1, not a new random one.
  • Stable network identity: combined with a headless Service, each Pod gets a predictable, individually addressable DNS name (web-0.web.default.svc.cluster.local). This is essential for applications where peers need to address a specific other instance by name (e.g., a database replica connecting to a specific primary).
  • Per-replica persistent storage (volumeClaimTemplates): each replica gets its own PVC (data-web-0, data-web-1, data-web-2). Critically, if web-1's Pod is deleted and recreated (even on a different node), it's reattached to the same data-web-1 PVC, so its data survives, tied to its identity and not to whichever node happened to run it.
  • Ordered, sequential deployment and scaling: by default, StatefulSet Pods are created one at a time in ordinal order (web-0, then web-1, then web-2) and are terminated and updated in reverse order (web-2 first). This matters for applications with ordering dependencies (e.g., a database's designated primary must come up before replicas that need to connect to it). Setting podManagementPolicy: Parallel relaxes the ordering for creation and scale-down when the application doesn't need it.
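
The serviceName field and the per-Pod DNS names both depend on a headless Service existing. A minimal sketch of one follows; the port is an assumption (the StatefulSet manifest doesn't declare one, so 8080 is borrowed from the earlier Pod example).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                # must match the StatefulSet's spec.serviceName
spec:
  clusterIP: None          # "None" is what makes the Service headless
  selector:
    app: web               # same label as the StatefulSet's Pod template
  ports:
    - name: http
      port: 8080           # assumed port, for illustration
```

With clusterIP set to None, DNS returns the individual Pod addresses instead of a single virtual IP, which is what makes names like web-0.web resolvable.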

When you actually need this

  • Databases and distributed data stores run directly on Kubernetes (PostgreSQL, MongoDB, Cassandra, Elasticsearch) — each replica typically holds a distinct portion of data and needs stable identity to know its role and reconnect to its own data after a restart.
  • Distributed coordination systems (ZooKeeper, etcd itself, when run on Kubernetes) where each member needs a stable identity to participate correctly in a consensus protocol.
  • Any application where "which replica am I" is meaningful to the application's own logic, not just an interchangeable unit of horizontal scale.

When you don't

Stateless web servers, API services, or workers that don't care which specific instance handles a given request, and don't need to persist state tied to a specific replica's identity — these are the common case, and a Deployment (simpler, with more flexible rollout behavior) is the right default. Reaching for a StatefulSet when a Deployment would do adds real operational complexity (slower, ordered rollouts; PVC lifecycle management) for no corresponding benefit.

An important caveat

Running genuinely stateful, data-critical systems like production databases directly on Kubernetes (rather than using a managed cloud database service) is itself a significant operational commitment — StatefulSets solve the scheduling and identity problem, but backup, failover, and data consistency logic for the actual stateful application usually still needs to be handled by an Operator (see the extensibility topic) or the application's own clustering logic, not by the StatefulSet primitive alone.

What makes a DaemonSet different from a Deployment

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: fluentd
          image: fluentd:latest

Notice there's no replicas field. A DaemonSet doesn't have a fixed count you specify; instead, it runs exactly one Pod per eligible node, automatically. Add a new node to the cluster, and the DaemonSet controller creates a Pod for it with no manual action needed; remove a node (or drain and delete it), and its DaemonSet Pod goes with it. This is fundamentally different from a Deployment, whose replica count is a fixed number that has no inherent relationship to the number of nodes in the cluster.

Common use cases

  • Log collection agents (Fluentd, Filebeat, Fluent Bit) — need to run on every node to read and forward container logs written to that node's local disk.
  • Monitoring/metrics agents (Prometheus Node Exporter, Datadog Agent) — need node-level visibility (CPU, memory, disk) that can only be gathered by something running directly on each machine.
  • CNI network plugins (Calico, Cilium, Flannel) — network configuration that must be set up identically on every node for pod-to-pod networking to work cluster-wide (see the networking topic).
  • Storage daemons (Ceph, GlusterFS node agents) — some distributed storage systems need an agent on every node that might mount their volumes.

Restricting a DaemonSet to a subset of nodes

spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd     # only run on nodes labeled disktype=ssd

Despite the name suggesting "every node," a DaemonSet can be scoped with a nodeSelector or node affinity rules (see the scheduling topic) to only run on a labeled subset — useful for something like a specialized storage agent that only needs to run on nodes actually equipped with the relevant hardware/disk type.
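
For comparison, here is a sketch of the node affinity form, which can express conditions a plain nodeSelector cannot, such as "any of these values". The nvme value is a hypothetical second label value added for illustration.

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: disktype
                    operator: In
                    values: ["ssd", "nvme"]   # run on nodes labeled with either value
```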

Interaction with taints and tolerations

By default, most Pods won't be scheduled onto a control-plane/master node (which is typically tainted to repel ordinary workloads — see the scheduling topic). DaemonSets commonly include a toleration for these taints specifically because infrastructure agents like log collectors and monitoring tools usually do need to run even on control-plane nodes, unlike ordinary application workloads.

spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists      # match the taint regardless of its value
          effect: NoSchedule

Rolling updates for DaemonSets

DaemonSets support two update strategies. RollingUpdate, the default, is conceptually similar to a Deployment's rolling update: it replaces the Pod on one node at a time by default rather than all at once, which is important for infrastructure-critical DaemonSets (like a CNI plugin) where updating every node's networking agent simultaneously could cause a cluster-wide networking outage. OnDelete only replaces a Pod with the new template after you delete the old one manually.
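
A minimal sketch of the relevant fragment on the log-collector DaemonSet. The values shown are the defaults, written out explicitly so the behavior is visible in the manifest.

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1    # replace the Pod on one node at a time (the default)
```

Note that the field is named updateStrategy on a DaemonSet, whereas a Deployment uses strategy.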
