What is a Pod, and why is it the smallest deployable unit rather than a container?

A Pod is a group of one or more containers that share the same network namespace (one IP address, one port space) and can share storage volumes — it's the smallest unit Kubernetes schedules and manages, not the individual container. This exists because some containers genuinely need to be co-located and tightly coupled (e.g., a main application container plus a logging sidecar that reads its logs from a shared volume), and Kubernetes needed a unit that could represent "these containers must run together, on the same node, sharing network and storage" — a single container can't express that relationship on its own.

What's the difference between a ReplicaSet and a Deployment?

A ReplicaSet's only job is ensuring a specified number of identical Pod replicas are running at all times, replacing any that die. A Deployment sits one layer above a ReplicaSet and adds rollout management — it creates and manages ReplicaSets on your behalf, and knows how to perform a rolling update (creating a new ReplicaSet for the new version, scaling it up while scaling the old one down) and how to roll back to a previous version. In practice, you almost always create Deployments directly and let them manage ReplicaSets for you, rather than creating a ReplicaSet by hand.

How does a rolling update work, and how do you roll back a bad deployment?

A rolling update gradually replaces Pods running the old version with Pods running the new version, controlled by `maxUnavailable` (how many Pods below the desired count may be unavailable during the transition) and `maxSurge` (how many extra Pods can be created above the desired count during the transition). New Pods only start receiving traffic once they pass their readiness probe. Rolling back is `kubectl rollout undo deployment/<name>`, which re-applies a previous revision's Pod template and performs the same gradual rolling process in reverse.

What is a StatefulSet, and when do you need one instead of a Deployment?

A StatefulSet manages Pods that need a stable, unique identity and stable storage across restarts — each Pod gets a predictable, persistent name (`web-0`, `web-1`, ...) and, if configured, its own dedicated PersistentVolumeClaim that follows it even if the Pod is rescheduled. Use a StatefulSet for stateful applications where individual instances have distinct identity or data — databases, distributed message queues, anything where "which specific instance am I" and "my data must survive my restart" both matter — and a Deployment for stateless, interchangeable replicas.

What is a DaemonSet, and what's a common use case?

A DaemonSet ensures that exactly one copy of a Pod runs on every node in the cluster (or every node matching a selector) — as nodes are added, a Pod is automatically scheduled onto them; as nodes are removed, their Pod goes with them. It's the standard way to run node-level infrastructure agents that genuinely need to exist on every machine: log collectors, monitoring/metrics agents, and CNI network plugins.

What's the difference between a Job and a CronJob?

A Job runs one or more Pods to completion for a single, one-off task, and tracks successful completions — it's for work that runs once and finishes, unlike a Deployment's Pods, which are expected to run indefinitely and get restarted if they exit. A CronJob is a Job template that gets triggered automatically on a recurring schedule (using standard cron syntax), creating a new Job instance each time it fires — for tasks like nightly backups or scheduled report generation.

What are init containers, and how do they differ from regular containers in a Pod?

Init containers run and must complete successfully, one at a time in order, *before* any of a Pod's regular (main) containers start. They're used for setup work that must finish before the application starts — waiting for a dependency to become available, running a one-time setup script, or populating a shared volume with data the main container needs — and unlike regular containers, they aren't expected to run for the Pod's whole lifetime.

What is the sidecar container pattern, and what problems does it solve?

A sidecar is a helper container that runs alongside a Pod's main application container, sharing its network namespace and (often) its volumes, to extend or support the main container's functionality without modifying the application's own code or image — common examples are a log-shipping agent, a service mesh proxy (like Envoy in Istio), or a configuration-reloading helper. It solves the problem of adding cross-cutting infrastructure concerns (observability, networking, security) to an application uniformly, without every application team needing to build that logic into their own container image.

What are the phases of a Pod's lifecycle?

A Pod's `status.phase` is one of: **Pending** (accepted by the cluster, but one or more containers not yet running — commonly waiting to be scheduled or waiting on an image pull), **Running** (bound to a node, at least one container running), **Succeeded** (all containers terminated successfully, won't restart — the normal end state for a Job's Pod), **Failed** (all containers terminated, at least one with failure and won't restart), or **Unknown** (the Pod's state can't be determined, typically because the node it's on is unreachable). This top-level phase is a coarse summary — the real diagnostic detail lives in each container's individual status and the Pod's conditions.

What is a PodDisruptionBudget, and why does it matter during voluntary disruptions?

A PodDisruptionBudget (PDB) limits how many Pods of a given application can be voluntarily evicted at once, for example during a node drain or a cluster upgrade, by specifying either a minimum number/percentage that must remain available (`minAvailable`) or a maximum that may be unavailable (`maxUnavailable`). It only governs voluntary disruptions that go through the Eviction API: it doesn't protect against involuntary disruptions (a node crashing unexpectedly), and deleting Pods or their controller directly with `kubectl delete` bypasses it. Its purpose is to ensure routine cluster maintenance doesn't accidentally take down too many replicas of a service at the same time and cause an outage.
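
As a rough illustration of the answer above, a PDB for a three-replica web application might look like the sketch below. The name `web-pdb`, the `app: web` label, and the value `2` are assumptions chosen for this example, not values from any particular cluster.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # an eviction is refused if it would leave fewer than 2 healthy Pods
  selector:
    matchLabels:
      app: web             # assumed label; must match the Pods you want to protect
```

With three replicas behind this budget, a node drain can evict one Pod at a time and has to wait for a replacement to become Ready before evicting the next. `maxUnavailable: 1` would express the same intent relative to the desired count instead.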

Pods and Workload Controllers

Pods, Deployments, StatefulSets, DaemonSets, Jobs, and the patterns for running and updating containerized workloads.

What a Pod actually is

apiVersion: v1
kind: Pod
metadata:
  name: web-with-logger
spec:
  containers:
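    - name: web
      image: myapp:1.0
      ports:
        - containerPort: 8080
      volumeMounts:            # the app writes its log files into the shared volume
        - name: shared-logs
          mountPath: /var/log/app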
    - name: log-shipper
      image: fluentd:latest
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/app
  volumes:
    - name: shared-logs
      emptyDir: {}

Both containers in this Pod share: one network namespace (they can reach each other via localhost, and both see the Pod's single IP address from the outside), and, if explicitly configured, shared volumes (both can read/write the same shared-logs volume). Kubernetes schedules, starts, stops, and monitors the Pod as a unit — both containers always land on the same node together, and if the Pod is deleted, both containers go with it.

Why not just schedule individual containers directly

Some containers are only useful together, tightly coupled by design — the sidecar pattern (see that question) is the clearest example: a log-shipping or service-mesh proxy container that needs to share the main application container's network namespace and/or filesystem to do its job. If Kubernetes scheduled and placed containers entirely independently, there'd be no way to guarantee two specific containers always land on the same node, or to let them share localhost networking and volumes — you'd need an entirely separate mechanism just to express "these two things belong together." The Pod is that mechanism, built into the core scheduling unit from the start.

Most Pods have exactly one container

Despite the multi-container capability, the overwhelming majority of Pods in practice contain a single container — the multi-container case is specifically for the sidecar/init-container patterns where genuine tight coupling is needed, not a general-purpose way to bundle unrelated services together. Two unrelated services (e.g., a web frontend and a completely separate backend API) should almost always be separate Pods (typically each managed by its own Deployment), not stuffed into containers within the same Pod — that would incorrectly couple their scaling (you can't scale one container in a Pod independently of the others) and their lifecycle (a crash in one container can affect Pod-level restart behavior for the whole Pod).

Pods are ephemeral and not directly managed in production

You essentially never create bare Pods directly in production — they have no self-healing behavior on their own (a Pod that's deleted or whose node dies is simply gone, with nothing replacing it) and no rollout/rollback mechanism. Instead, you create a higher-level controller — a Deployment, StatefulSet, DaemonSet, or Job (see the following questions) — which manages a template for creating Pods and handles replacing, scaling, and updating them for you. Understanding that Pods are the unit of scheduling and execution, while Deployments/StatefulSets/etc. are the unit of desired state management, is the key distinction this topic is really testing.

Related Resources

Kubernetes: Pods

ReplicaSet — maintains a stable count of identical Pods

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0

The ReplicaSet controller's reconciliation loop (see the control loop question) continuously ensures exactly 3 Pods matching the app: web label selector exist — if one is deleted or its node dies, a replacement is created; if you manually create a 4th matching Pod, it will be deleted to bring the count back down to 3.

What a ReplicaSet cannot do: change the Pod template's image version in a controlled, gradual way. If you edit a running ReplicaSet's image field, nothing happens to existing Pods — the new template only applies to future Pods it creates, so you'd have to manually delete old Pods one by one to see them replaced with the new version, with no coordination, health checking, or rollback built in.

Deployment — adds rollout management on top

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0

A Deployment creates and owns a ReplicaSet with this spec. When you update the Deployment's image (kubectl set image deployment/web web=myapp:2.0), the Deployment controller creates a new ReplicaSet with the updated template, and gradually scales the new one up while scaling the old one down — according to the configured strategy — rather than mutating Pods in place (see the rolling update question for the mechanics).

The layering, visualized

Deployment (rollout/rollback logic, revision history)
   └── ReplicaSet (maintains N identical replicas of one Pod template)
         └── Pod, Pod, Pod, ...

A Deployment update creates a new ReplicaSet rather than editing the existing one — this is precisely what enables rollback: the old ReplicaSet (scaled down to 0, but not deleted) still exists with its original template, so rolling back is just re-scaling it back up while scaling the current one down (kubectl rollout undo deployment/web).

You should essentially never create a bare ReplicaSet directly in production — always create a Deployment, and let it manage the underlying ReplicaSet(s) for you. Direct ReplicaSet management is mostly useful to understand conceptually, since it's the mechanism Deployments build on, and occasionally shows up when debugging why kubectl get replicasets shows old, scaled-to-zero ReplicaSets lingering after several rollouts (this is expected — it's the Deployment's revision history, bounded by the revisionHistoryLimit field).
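
To make the revision-history bound concrete, here is a minimal fragment of a Deployment spec. The value `5` is only an illustrative choice; the Kubernetes default is 10.

```yaml
spec:
  revisionHistoryLimit: 5   # keep at most 5 old, scaled-to-zero ReplicaSets for rollback
```

Setting this to 0 removes all old ReplicaSets as soon as a rollout finishes, which also means there is nothing left to roll back to.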

Related Resources

Kubernetes: Deployments

The default strategy: RollingUpdate

spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 Pod below the desired count may be unavailable during the rollout
      maxSurge: 1         # at most 1 Pod above the desired count may exist during the rollout

With replicas: 4, maxUnavailable: 1, maxSurge: 1, the Deployment controller keeps at least 3 (4 - 1) Pods available and at most 5 (4 + 1) Pods in total (old + new combined) at any moment during the rollout. It scales the new ReplicaSet up and the old one down in steps that respect both bounds, and a new Pod only counts as available once it passes its readiness probe (see the probes question). The trace below is a simplified one-Pod-at-a-time view: because maxUnavailable: 1 permits one unavailable Pod, the real controller may terminate an old Pod before the first new Pod is Ready.

Start:    [old][old][old][old]                    (4 old, 0 new)
Step 1:   [old][old][old][old][new]                (surge: 5 total, new not ready yet)
Step 2:   [old][old][old][new✓]                    (new became ready, one old terminated: 4 total)
Step 3:   [old][old][old][new✓][new]                (surge again: 5 total)
Step 4:   [old][old][new✓][new✓]                    (another old terminated: 4 total)
...continues until all 4 are the new version
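
If you want the strict create-then-terminate sequence, where an old Pod is only removed after a new Pod is Ready, one common choice is to set maxUnavailable to 0. A minimal sketch, reusing the replicas: 4 example:

```yaml
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below 4 available Pods
      maxSurge: 1         # add one new Pod at a time (at most 5 in total)
```

Both fields also accept percentages (each defaults to 25%), and they cannot both be 0, since the rollout would then have no room to make progress.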

Why readiness probes are essential to a safe rollout

The rollout only proceeds to terminate an old Pod once a new Pod is marked Ready by its readiness probe (see the probes question) — if the new version has a bug that causes it to fail its readiness check (e.g., it crashes on startup, or can't connect to a dependency), the rollout stalls rather than continuing to replace healthy old Pods with broken new ones. This is a critical safety property: without correctly configured readiness probes, a rolling update has no way to detect a bad new version and will happily replace every healthy Pod with broken ones.
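
For reference, a readiness probe on the Deployment's container could be sketched as follows. The `/healthz` path and the timing values are assumptions for illustration; use whatever endpoint your application actually exposes.

```yaml
spec:
  template:
    spec:
      containers:
        - name: web
          image: myapp:2.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz        # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5  # wait 5 seconds before the first check
            periodSeconds: 10       # then check every 10 seconds
            failureThreshold: 3     # mark the Pod not Ready after 3 consecutive failures
```

A Pod that never passes this check is never counted as available, so the rollout stops before it removes more old Pods than maxUnavailable allows.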

Rolling back

kubectl rollout status deployment/web        # watch a rollout's progress
kubectl rollout history deployment/web        # see past revisions
kubectl rollout undo deployment/web           # roll back to the previous revision
kubectl rollout undo deployment/web --to-revision=3   # roll back to a specific revision

rollout undo works by re-pointing the Deployment at a previous ReplicaSet's Pod template (retained, scaled to zero, from an earlier rollout — up to spec.revisionHistoryLimit, default 10) and performing the same gradual rolling process, just in the opposite direction — scaling the old (soon-to-be-current-again) ReplicaSet up while scaling the currently-bad one down. This means rollback gets the exact same safety properties (readiness-gated, gradual) as a forward rollout.

Pausing a rollout mid-flight

kubectl rollout pause deployment/web    # freeze the rollout at its current state
kubectl rollout resume deployment/web   # continue it

Useful for making several related changes to a Deployment's spec (e.g., updating both the image and a resource limit) without triggering a rollout after each individual edit — pause, make all your changes, then resume to trigger exactly one coordinated rollout.

What rolling updates don't protect against

A rolling update only protects against a new version that fails its readiness probe — a bug that passes readiness checks but causes incorrect behavior under real production traffic (a subtle logic error, a slow memory leak) won't be caught by the rollout mechanism itself. This is exactly the gap that canary deployments and more sophisticated progressive-delivery tooling (see the production operations topic) are designed to close.

Related Resources

Kubernetes: Performing a Rolling Update

Why Deployments are wrong for stateful applications

A Deployment's Pods are interchangeable: they get randomly-suffixed names (web-7d8f9c-x2k4p) and no guaranteed stable identity. Their storage is either shared or throwaway: if the Pod template references a PersistentVolumeClaim, every replica mounts that same PVC (which requires an access mode such as ReadWriteMany once replicas land on different nodes), and if it uses an emptyDir, each replica gets a fresh empty volume that is lost with the Pod. There's no built-in way to give each replica its own dedicated, durable, individually-tracked storage that follows that specific replica across restarts.

What a StatefulSet provides instead

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "web"     # must reference a headless Service (see the networking topic)
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

Stable, predictable Pod names: web-0, web-1, web-2 — not random suffixes. If web-1 is deleted, its replacement is created with the exact same name web-1, not a new random one.
Stable network identity: combined with a headless Service, each Pod gets a predictable, individually-addressable DNS name (web-0.web.default.svc.cluster.local) — essential for applications where peers need to address a specific other instance by name (e.g., a database replica connecting to a specific primary).
Per-replica persistent storage (volumeClaimTemplates): each replica gets its own PVC (data-web-0, data-web-1, data-web-2), and critically, if web-1's Pod is deleted and recreated (even on a different node), it's reattached to the same data-web-1 PVC — its data survives, tied to its identity, not to whichever node happened to run it.
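Ordered, sequential deployment and scaling: by default (podManagementPolicy: OrderedReady), StatefulSet Pods are created one at a time in ordinal order (web-0 before web-1 before web-2), each waiting for its predecessor to be Running and Ready, while scale-down and rolling updates proceed in reverse order (web-2 first, web-0 last). This matters for applications with ordering dependencies (e.g., a database's designated primary must come up before replicas that need to connect to it).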
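
The StatefulSet above refers to a headless Service through serviceName. A minimal sketch of that Service is shown below; the port number and port name are assumptions, since the example does not say which port the application listens on.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                # must equal the StatefulSet's serviceName
spec:
  clusterIP: None          # "None" is what makes the Service headless
  selector:
    app: web               # same label as the StatefulSet's Pod template
  ports:
    - name: http
      port: 8080           # assumed application port
```

Because the Service has no cluster IP, DNS returns the individual Pod addresses, which is what makes per-Pod names such as web-0.web.default.svc.cluster.local resolvable.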

When you actually need this

Databases and distributed data stores run directly on Kubernetes (PostgreSQL, MongoDB, Cassandra, Elasticsearch) — each replica typically holds a distinct portion of data and needs stable identity to know its role and reconnect to its own data after a restart.
Distributed coordination systems (ZooKeeper, etcd itself, when run on Kubernetes) where each member needs a stable identity to participate correctly in a consensus protocol.
Any application where "which replica am I" is meaningful to the application's own logic, not just an interchangeable unit of horizontal scale.

When you don't

Stateless web servers, API services, or workers that don't care which specific instance handles a given request, and don't need to persist state tied to a specific replica's identity — these are the common case, and a Deployment (simpler, with more flexible rollout behavior) is the right default. Reaching for a StatefulSet when a Deployment would do adds real operational complexity (slower, ordered rollouts; PVC lifecycle management) for no corresponding benefit.

An important caveat

Running genuinely stateful, data-critical systems like production databases directly on Kubernetes (rather than using a managed cloud database service) is itself a significant operational commitment — StatefulSets solve the scheduling and identity problem, but backup, failover, and data consistency logic for the actual stateful application usually still needs to be handled by an Operator (see the extensibility topic) or the application's own clustering logic, not by the StatefulSet primitive alone.

Related Resources

Kubernetes: StatefulSets

What makes a DaemonSet different from a Deployment

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: fluentd
          image: fluentd:latest

Notice there's no replicas field — a DaemonSet doesn't have a fixed count you specify; instead, it runs exactly one Pod per eligible node, automatically. Add a new node to the cluster, and the DaemonSet controller schedules a Pod onto it immediately, with no manual action needed; remove a node (or drain and delete it), and its DaemonSet Pod goes with it. This is fundamentally different from a Deployment, whose replica count is a fixed number that has no inherent relationship to the number of nodes in the cluster.

Common use cases

Log collection agents (Fluentd, Filebeat, Fluent Bit) — need to run on every node to read and forward container logs written to that node's local disk.
Monitoring/metrics agents (Prometheus Node Exporter, Datadog Agent) — need node-level visibility (CPU, memory, disk) that can only be gathered by something running directly on each machine.
CNI network plugins (Calico, Cilium, Flannel) — network configuration that must be set up identically on every node for pod-to-pod networking to work cluster-wide (see the networking topic).
Storage daemons (Ceph, GlusterFS node agents) — some distributed storage systems need an agent on every node that might mount their volumes.

Restricting a DaemonSet to a subset of nodes

spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd     # only run on nodes labeled disktype=ssd

Despite the name suggesting "every node," a DaemonSet can be scoped with a nodeSelector or node affinity rules (see the scheduling topic) to only run on a labeled subset — useful for something like a specialized storage agent that only needs to run on nodes actually equipped with the relevant hardware/disk type.
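
The same restriction can be written with node affinity instead of nodeSelector. The fragment below is a sketch that is equivalent to the disktype: ssd selector above; affinity becomes worthwhile once you need operators such as In, NotIn, or Exists over several values.

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: disktype
                    operator: In
                    values:
                      - ssd        # only nodes labeled disktype=ssd are eligible
```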

Interaction with taints and tolerations

By default, most Pods won't be scheduled onto a control-plane/master node (which is typically tainted to repel ordinary workloads — see the scheduling topic). DaemonSets commonly include a toleration for these taints specifically because infrastructure agents like log collectors and monitoring tools usually do need to run even on control-plane nodes, unlike ordinary application workloads.

spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists     # tolerate the taint regardless of its value
          effect: NoSchedule

Rolling updates for DaemonSets

DaemonSets support two update strategies: RollingUpdate (the default), which replaces Pods node by node (one at a time by default, governed by maxUnavailable), and OnDelete, which only creates the new Pod on a node after you manually delete the old one. Updating gradually rather than all at once is important for infrastructure-critical DaemonSets (like a CNI plugin), where updating every node's networking agent simultaneously could cause a cluster-wide networking outage.
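
As a sketch, the update behavior is configured under updateStrategy in the DaemonSet spec. The values shown here are the defaults, written out explicitly for illustration.

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # replace the Pod on one node at a time
```

Switching type to OnDelete leaves existing Pods untouched until you delete them yourself, which some teams prefer for agents that must be upgraded node by node under manual control.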

Related Resources

Kubernetes: DaemonSet

Job — run-to-completion, not run-forever

apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  completions: 1
  backoffLimit: 3       # retry up to 3 times on failure before giving up
  template:
    spec:
      containers:
        - name: migrate
          image: myapp-migrator:1.0
      restartPolicy: Never

The key behavioral difference from a Deployment: a Deployment expects its Pods to run indefinitely, and treats a container exiting as a failure to be restarted; a Job expects its Pod(s) to eventually exit successfully (exit code 0), and considers that success, not failure — the Job is then marked Complete and no new Pods are created. A Job's Pod template must specify restartPolicy: Never or OnFailure (never Always, which would conflict with the run-to-completion model).

Parallel and repeated Jobs

spec:
  completions: 5    # need 5 total successful Pod completions
  parallelism: 2    # run at most 2 Pods concurrently

Jobs can run a single Pod once, run multiple Pods in parallel (for a parallelizable batch task, like processing a fixed batch of work items), or use a work-queue pattern where Pods pull tasks from an external queue until the queue is empty. backoffLimit controls how many times a failed Pod is retried before the whole Job is marked as failed.

CronJob — a Job on a recurring schedule

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"    # standard cron syntax: 2am daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: backup-tool:1.0
          restartPolicy: OnFailure

At each scheduled trigger time, the CronJob controller creates a new Job object (using jobTemplate as the spec) — each firing is an entirely independent Job, tracked and retried according to that Job's own backoffLimit, completely separate from any previous or future firing.

Handling overlapping/missed runs

spec:
  concurrencyPolicy: Forbid   # don't start a new run if the previous one is still going
  startingDeadlineSeconds: 200 # if a scheduled run is missed by more than this, skip it

concurrencyPolicy controls what happens if a scheduled run's Job is still active when the next scheduled time arrives: Allow (default — run concurrently), Forbid (skip the new run), or Replace (cancel the still-running one and start the new one). This matters for tasks where overlapping runs would cause real problems (e.g., two concurrent database migration jobs stepping on each other) versus tasks where it's harmless.

Common use cases

Jobs: one-off data migrations, batch processing of a fixed dataset, running a database schema migration as part of a deployment pipeline.
CronJobs: nightly backups, scheduled report generation, periodic cleanup tasks (purging old data, rotating logs), health-check/synthetic-monitoring pings on a schedule.

A common gotcha

Completed Jobs and their Pods aren't deleted automatically: a standalone Job stays in the cluster until you delete it or until its spec.ttlSecondsAfterFinished (if set) expires. A CronJob does prune the Jobs it creates, keeping only the most recent ones as bounded by spec.successfulJobsHistoryLimit (default 3) and spec.failedJobsHistoryLimit (default 1). Jobs created directly or by pipelines without a TTL, and CronJobs whose history limits have been raised, can therefore accumulate a large number of completed Job and Pod objects over time, which is a common, easy-to-overlook source of cluster object clutter (and, at large enough scale, real etcd/API server load).
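
One way to keep finished objects under control is sketched below, reusing the data-migration Job and nightly-backup CronJob from the earlier examples. The one-hour TTL is an arbitrary illustrative value; the two history limits shown are the Kubernetes defaults, written out explicitly.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  ttlSecondsAfterFinished: 3600   # delete the Job and its Pods 1 hour after it finishes
  template:
    spec:
      containers:
        - name: migrate
          image: myapp-migrator:1.0
      restartPolicy: Never
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 3   # keep the 3 most recent successful Jobs
  failedJobsHistoryLimit: 1       # keep the most recent failed Job
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: backup-tool:1.0
          restartPolicy: OnFailure
```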

Related Resources

Kubernetes: Jobs

Anatomy

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  initContainers:
    - name: wait-for-db
      image: busybox
      command: ['sh', '-c', 'until nc -z db-service 5432; do sleep 2; done']
    - name: run-migrations
      image: myapp-migrator:1.0
      command: ['./migrate.sh']
  containers:
    - name: app
      image: myapp:1.0

Execution order: wait-for-db runs to completion first, then run-migrations runs to completion, and only after both succeed does the main app container start. If an init container fails, the kubelet restarts it until it succeeds (with restartPolicy: Never, the whole Pod is marked Failed instead). Either way, the main containers never start until every init container has exited successfully.

Key differences from regular containers

Aspect	Init containers	Regular (main) containers
Execution order	Sequential, one at a time, in the order listed	Started together, run concurrently
Expected to finish	Yes — must exit successfully to proceed	No — expected to keep running
Run before main containers start	Yes, always	N/A
Restart on crash	Retried per Pod's `restartPolicy`, blocking Pod startup until success	Restarted per `restartPolicy`, but the Pod is already considered started
Can use different images/tools than the app	Yes — commonly a minimal utility image	Usually the application's own image

Common use cases

Waiting for a dependency — blocking until a database or another service is reachable before starting the application, avoiding a crash-loop of the main container repeatedly failing to connect during startup.
One-time setup — running database migrations, downloading/generating a configuration file, or performing a one-time registration step.
Populating a shared volume — an init container can write files (e.g., cloning a git repo, or fetching static assets) into a volume that the main container then mounts and serves, keeping tooling needed only for setup (like git) out of the main application's image entirely.

initContainers:
  - name: fetch-content
    image: alpine/git
    command: ['git', 'clone', 'https://github.com/example/content.git', '/content']
    volumeMounts:
      - name: content-volume
        mountPath: /content
containers:
  - name: web
    image: nginx
    volumeMounts:
      - name: content-volume
        mountPath: /usr/share/nginx/html
volumes:
  - name: content-volume
    emptyDir: {}

Why not just put this logic in the main container's entrypoint script

You could — but separating it into an init container has real advantages: it keeps the main application image free of setup-only tooling (smaller image, smaller attack surface), gives setup failures a distinct, separately-visible status in kubectl get pods (an init container failure shows as Init:Error or Init:CrashLoopBackOff, immediately telling you the problem is in setup, not the application itself), and cleanly separates "must happen once, in order, before anything else" logic from the application's own ongoing run loop.

Related Resources

Kubernetes: Init Containers

The pattern

apiVersion: v1
kind: Pod
metadata:
  name: app-with-proxy
spec:
  containers:
    - name: app
      image: myapp:1.0
      ports:
        - containerPort: 8080
    - name: envoy-proxy      # sidecar
      image: envoyproxy/envoy:v1.28
      ports:
        - containerPort: 9901

Both containers share the Pod's single network namespace, so the sidecar can transparently intercept, inspect, or modify traffic to/from the main container (e.g., a service mesh proxy handling mTLS encryption and traffic routing) without the application itself needing any awareness that a proxy is involved — from the application's point of view, it just talks to localhost or receives connections normally.

Problems this pattern solves

Cross-cutting infrastructure concerns, applied uniformly, without touching application code. A service mesh injects a proxy sidecar into every Pod (Envoy for Istio, linkerd2-proxy for Linkerd) to handle mTLS, retries, circuit breaking, and traffic metrics. None of this needs to be implemented in application code, and all of it can be upgraded/reconfigured centrally without changing each application's code or image.
Log/metrics shipping — a sidecar that tails the main container's log files (via a shared volume) and forwards them to a centralized logging system, decoupling "how do I ship my logs" from the application's own code.
Dynamic configuration reloading — a sidecar that watches a ConfigMap-mounted file for changes and signals the main container to reload, without the main application needing to implement file-watching logic itself.
Ambassador/adapter pattern (a close relative) — a sidecar that simplifies how the main container talks to an external service, e.g., proxying a simple local connection to a complex external API with its own authentication.

Why this is better than baking the same logic into every application's image

Without sidecars, every team building an application would need to implement its own logging shipment, mTLS handling, and metrics export — duplicated effort, inconsistent implementations across teams, and a much larger blast radius when that cross-cutting logic needs to be updated (every application's image needs a rebuild, instead of just redeploying a shared sidecar image). The sidecar pattern lets a platform/infrastructure team own and evolve this logic centrally, injected uniformly across every application.

The formalization of "sidecar" as a first-class Kubernetes concept

Historically, "sidecar" was purely a convention: just a second container in a Pod, with no special Kubernetes-level distinction from the main container. Kubernetes 1.28 introduced native sidecar support (alpha in 1.28, enabled by default from 1.29), which lets you explicitly mark a container as a sidecar by setting restartPolicy: Always on an entry under initContainers. This gives it defined startup-ordering semantics (it starts before the main containers, like an init container, but keeps running for the Pod's whole lifetime) and proper shutdown ordering (a native sidecar is terminated after the main containers during Pod shutdown, so it can keep shipping logs/metrics during the main container's graceful shutdown). Before this, achieving correct shutdown ordering for sidecars required careful manual workarounds.
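
A native sidecar declaration might look like the sketch below. The Pod name, images, and paths are illustrative assumptions, modeled on the log-shipping example from the first question.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-native-sidecar   # hypothetical name
spec:
  initContainers:
    - name: log-shipper
      image: fluentd:latest
      restartPolicy: Always       # this field is what marks the init container as a sidecar
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/app
  volumes:
    - name: shared-logs
      emptyDir: {}
```

Here log-shipper starts before app, keeps running alongside it, and is stopped only after app has terminated.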

The cost of the pattern

Every sidecar adds resource overhead (CPU/memory requests for a container running in every single Pod, multiplied across your whole fleet) and a bit of additional complexity to reason about (two containers' worth of logs and failure modes per Pod instead of one) — worth it for genuinely cross-cutting concerns that benefit from centralized ownership, but not a pattern to reach for casually for logic that's simple enough to just belong in the application itself.

Related Resources

Kubernetes: Sidecar Containers

The five phases

Phase	Meaning
Pending	Pod accepted by the API server, but not all containers are running yet — could be waiting to be scheduled onto a node, waiting on an image pull, or waiting on a volume to attach
Running	The Pod has been bound to a node, and at least one container is running (others might be starting, restarting, or have already completed, for multi-container Pods)
Succeeded	All containers terminated successfully (exit code 0), and won't be restarted — the expected end state for a Job's Pod, not something you'd normally see for a Deployment's Pod
Failed	All containers have terminated, and at least one terminated in failure (it exited with a non-zero code or was terminated by the system) and won't be restarted
Unknown	The Pod's state couldn't be determined, typically because the node hosting it stopped communicating with the control plane

Why "phase" alone is often not enough to diagnose a problem

A Pod stuck in Pending could mean several very different things: no node has enough free resources to satisfy the Pod's requests, no node matches its affinity/taint requirements, the image is still being pulled, or a required volume hasn't attached yet. The phase alone doesn't distinguish these — you need to look at the Pod's conditions and events for the actual reason:

kubectl describe pod <pod-name>
# Look at the "Conditions" section and "Events" at the bottom --
# these contain the actual human-readable reason, e.g.:
# "0/3 nodes are available: 3 Insufficient memory."

Pod conditions — more granular than phase

Alongside phase, a Pod has several boolean conditions, each with its own status and reason: PodScheduled (has it been assigned to a node), Initialized (have all init containers completed), ContainersReady (are all containers passing their readiness probes), and Ready (is the overall Pod ready to receive traffic — this is what a Service uses to decide whether to route to this Pod). A Pod can be in phase Running while its Ready condition is False — e.g., the containers are executing, but a readiness probe is failing, so the Pod isn't yet added as a Service endpoint. This distinction — running but not ready — is one of the most common sources of "why isn't my Pod receiving traffic even though it shows Running" confusion.
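
For illustration, the "running but not ready" case might appear roughly like this in the status section of kubectl get pod -o yaml. This is a hand-written, abridged excerpt rather than captured output; the container name app is a hypothetical placeholder, and real output carries additional fields such as timestamps.

```yaml
# Abridged, illustrative status excerpt (not real captured output).
status:
  phase: Running                # the containers are executing...
  conditions:
    - type: PodScheduled
      status: "True"
    - type: Initialized
      status: "True"
    - type: ContainersReady
      status: "False"           # ...but a readiness probe is failing
      reason: ContainersNotReady
      message: "containers with unready status: [app]"
    - type: Ready
      status: "False"           # so the Pod is not used as a Service endpoint
      reason: ContainersNotReady
      message: "containers with unready status: [app]"
```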

Container-level states, within a Running Pod

Each individual container within a Pod also has its own state: Waiting (not yet running — e.g., still pulling its image, or blocked on a CrashLoopBackOff backoff timer), Running, or Terminated (exited, with a reason and exit code). kubectl describe pod shows each container's individual state separately, which is essential for multi-container Pods where one container might be healthy while another is crash-looping.
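
As a sketch of what per-container state can look like for a two-container Pod, the abridged excerpt below shows one healthy container next to one that is crash-looping. It is hand-written for illustration; the container names, restart count, exit code, and timestamp are made-up values.

```yaml
# Abridged, illustrative containerStatuses excerpt (values are made up).
status:
  containerStatuses:
    - name: app
      ready: true
      restartCount: 0
      state:
        running:
          startedAt: "2024-01-01T00:00:00Z"
    - name: log-shipper
      ready: false
      restartCount: 7
      state:
        waiting:
          reason: CrashLoopBackOff   # waiting out the backoff timer before the next restart
      lastState:
        terminated:
          exitCode: 1                # how the previous attempt ended
          reason: Error
```

The lastState block is usually the more useful half: it records why the previous attempt exited, while state only says that the container is waiting to be restarted.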

When diagnosing a Pod problem, always look past the top-level phase — kubectl describe pod (conditions + events) and kubectl get pod -o yaml (full status detail, including per-container state) give the actual specific reason, and are the correct starting point for any of the common failure states covered in the observability/troubleshooting topic (CrashLoopBackOff, ImagePullBackOff, OOMKilled).

Related Resources

Kubernetes: Pod Lifecycle

The problem it solves

Deployment "web" has 3 replicas, spread across 3 nodes.
An administrator needs to drain (empty and take offline) 2 of those 3 nodes
for maintenance, one after another.

Without a PDB: nothing stops the drain from evicting Pods on both nodes in
quick succession, potentially leaving only 1 (or even 0, if timed unluckily
with a rolling restart) of the 3 replicas available at once -- a real,
avoidable capacity/availability hit during routine maintenance.

Defining a PDB

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # at least 2 of the "web" Pods must remain available
  # or: maxUnavailable: 1  # at most 1 may be unavailable at a time
  selector:
    matchLabels:
      app: web

With minAvailable: 2 on a Deployment with 3 replicas, a node drain (which uses the Eviction API to voluntarily remove Pods) will only be allowed to evict one web Pod at a time — attempting to evict a second concurrently, while the first's replacement isn't yet up and Ready, is blocked until enough Pods are available again.

Voluntary vs. involuntary disruptions — the key distinction

A PDB only governs voluntary disruptions — actions requested through the Eviction API, which respects PDBs: kubectl drain, a cluster autoscaler scaling down a node, a manual eviction. It has no effect on involuntary disruptions — a node crashing unexpectedly, a kernel panic, a hardware failure, or the node simply becoming unreachable. There's no way to "budget" for a sudden, unplanned failure; PDBs are specifically about giving routine, planned maintenance operations a safety constraint to respect.

Why this matters for cluster upgrades and autoscaling

Cluster upgrades typically work by draining and replacing nodes one at a time (or in small batches) — a properly configured PDB is what allows this process to proceed automatically and safely without an administrator needing to manually watch and time each node's drain to avoid taking down too much of any one application's capacity simultaneously. Similarly, a Cluster Autoscaler (see the scheduling topic) scaling down underutilized nodes respects PDBs when deciding which nodes it's safe to drain and remove.

A common misconfiguration

Setting minAvailable equal to the total replica count, or setting maxUnavailable: 0, blocks all voluntary disruptions: a node can never be drained if doing so would evict any Pod of that application, since evicting even one would violate the budget. This can silently prevent cluster upgrades or node maintenance from ever completing for that application, which is usually not the intended outcome. A PDB should leave room for some disruption while still preserving real availability, e.g., "keep at least 2 of 3 available," not "keep all 3 available at all times."
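
One possible way to avoid this trap, sketched under the assumption of the same 3-replica "web" Deployment, is to express the budget as maxUnavailable instead of a fixed minAvailable. The name web-pdb and the app: web label are the illustrative ones used earlier in this topic.

```yaml
# The blocking form, for contrast (with 3 replicas this forbids every eviction):
#   spec:
#     minAvailable: 3
#
# A sketch of a budget that still lets drains make progress:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1        # one Pod at a time may be evicted
  selector:
    matchLabels:
      app: web
```

Because maxUnavailable is relative to the current replica count, this budget keeps allowing one eviction at a time if the Deployment is later scaled up or down, whereas a fixed minAvailable can turn into the blocking case when the replica count drops to match it.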

Define a PDB for any application where losing more than a small number of replicas at once would cause a real availability problem — this is a cheap, low-effort safeguard that pays for itself the first time a node drain or upgrade would otherwise have accidentally taken down too many replicas of a critical service simultaneously.

Related Resources

Kubernetes: Disruptions