How does a rolling update work, and how do you roll back a bad deployment?
Quick Answer
A rolling update gradually replaces Pods running the old version with Pods running the new version. Two settings control the pace: `maxUnavailable` (how many Pods below the desired count may be unavailable during the rollout, i.e. how many old Pods can be taken down before their replacements are ready) and `maxSurge` (how many extra Pods may exist above the desired count during the transition). New Pods only start receiving traffic once they pass their readiness probe. To roll back, run `kubectl rollout undo deployment/<name>`, which re-applies a previous revision's Pod template and performs the same gradual rolling process in reverse.
Detailed Answer
The default strategy: RollingUpdate
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 Pod below the desired count may be unavailable during the rollout
      maxSurge: 1         # at most 1 Pod above the desired count may exist during the rollout
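With `replicas: 4`, `maxUnavailable: 1`, and `maxSurge: 1`, the Deployment controller keeps at least 3 (4 - 1) Pods available and at most 5 (4 + 1) Pods in total (old + new combined) at any moment during the rollout. In the simplified sequence traced below, it creates a new Pod (from the new ReplicaSet), waits for it to pass its readiness probe (see the probes question) before counting it as available, then terminates one old Pod, repeating until all old Pods are replaced. Because `maxUnavailable: 1` permits one Pod to be unavailable, the real controller may also terminate one old Pod right away instead of waiting; setting `maxUnavailable: 0` forces the strict surge-first order shown in the trace.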
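The bounds arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the controller's code: `resolve` and `rollout_bounds` are hypothetical helper names. It assumes the documented rounding for percentage values, where `maxSurge` rounds up and `maxUnavailable` rounds down.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an absolute count (int) or a percentage string such as "25%" into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas, max_unavailable, max_surge):
    """Return (minimum available Pods, maximum total Pods) during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    surge = resolve(max_surge, replicas, round_up=True)               # rounds up
    return replicas - unavailable, replicas + surge


print(rollout_bounds(4, 1, 1))            # (3, 5)
print(rollout_bounds(10, "25%", "25%"))   # (8, 13)
```

The first call reproduces the 3-to-5 range from the example above; the second shows how the same settings behave when written as percentages.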
Start: [old][old][old][old] (4 old, 0 new)
Step 1: [old][old][old][old][new] (surge: 5 total, new not ready yet)
Step 2: [old][old][old][new✓] (new became ready, one old terminated: 4 total)
Step 3: [old][old][old][new✓][new] (surge again: 5 total)
Step 4: [old][old][new✓][new✓] (another old terminated: 4 total)
...continues until all 4 are the new version
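The trace above can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not the real controller: `simulate_rollout` is a hypothetical name, old Pods are assumed to be always available, and every new Pod is assumed to pass readiness one step after it is created.

```python
def simulate_rollout(replicas, max_unavailable, max_surge):
    """Return the (old, new_ready, new_unready) Pod counts seen during a toy rollout."""
    if max_unavailable < 0 or max_surge < 0 or max_unavailable + max_surge == 0:
        raise ValueError("maxUnavailable and maxSurge must be >= 0 and not both 0")
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    old, new_ready, new_unready = replicas, 0, 0
    states = [(old, new_ready, new_unready)]

    def record():
        # Append the current state only if something changed.
        state = (old, new_ready, new_unready)
        if state != states[-1]:
            states.append(state)

    while old > 0 or new_ready < replicas:
        # 1. Create new Pods as far as the total-Pod ceiling allows.
        total = old + new_ready + new_unready
        new_unready += min(max_total - total, replicas - new_ready - new_unready)
        # 2. Terminate old Pods as far as the availability floor allows.
        available = old + new_ready
        old -= min(old, available - min_available)
        record()
        # 3. Assumption: every pending new Pod now passes its readiness probe.
        new_ready += new_unready
        new_unready = 0
        record()
    return states


for old, ready, unready in simulate_rollout(4, 0, 1):
    print("[old]" * old + "[new✓]" * ready + "[new]" * unready)
```

With `maxUnavailable` set to 0, as in the call above, the printed states follow the surge-first pattern of the trace. Calling `simulate_rollout(4, 1, 1)` instead shows the model terminating an old Pod in the same step that it creates a new one, because one unavailable Pod is allowed.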
Why readiness probes are essential to a safe rollout
Beyond the `maxUnavailable` allowance, the rollout terminates an old Pod only once a new Pod is marked Ready by its readiness probe (see the probes question). If the new version has a bug that causes it to fail its readiness check (e.g., it crashes on startup, or can't connect to a dependency), the rollout stalls rather than continuing to replace healthy old Pods with broken new ones. This is a critical safety property: without a correctly configured readiness probe, a Pod counts as Ready as soon as its containers are running, so a rolling update cannot detect a new version that starts but cannot serve traffic, and will happily replace every healthy Pod with a broken one.
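To make the stall concrete, here is a small sketch of where a rollout comes to rest when no new Pod ever becomes Ready. `stalled_state` is a hypothetical helper, and the model assumes the old Pods stay healthy throughout.

```python
def stalled_state(replicas, max_unavailable, max_surge):
    """Return (old Pods still serving, new Pods stuck unready) for a rollout that never turns Ready."""
    if not 0 <= max_unavailable <= replicas or max_surge < 0:
        raise ValueError("expected 0 <= maxUnavailable <= replicas and maxSurge >= 0")
    # Only old Pods are available, so they can shrink no further than the availability floor.
    old = replicas - max_unavailable
    # New Pods fill the room left under the total-Pod ceiling, capped at the desired count.
    new_unready = min(replicas + max_surge - old, replicas)
    return old, new_unready


print(stalled_state(4, 1, 1))  # (3, 2)
print(stalled_state(4, 0, 1))  # (4, 1)
```

Under these assumptions, the example Deployment with `maxUnavailable: 1` and `maxSurge: 1` halts with 3 old Pods still serving traffic and 2 new Pods waiting on readiness; with `maxUnavailable: 0` all 4 old Pods keep serving.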
Rolling back
kubectl rollout status deployment/web # watch a rollout's progress
kubectl rollout history deployment/web # see past revisions
kubectl rollout undo deployment/web # roll back to the previous revision
kubectl rollout undo deployment/web --to-revision=3 # roll back to a specific revision
`rollout undo` works by copying the Pod template of a previous ReplicaSet back into the Deployment. Old ReplicaSets are retained, scaled to zero, after each rollout, up to `spec.revisionHistoryLimit` (default 10). The controller then performs the same gradual rolling process in the opposite direction: it scales the old (soon-to-be-current-again) ReplicaSet up while scaling the currently bad one down. This means a rollback gets exactly the same safety properties (readiness-gated, gradual) as a forward rollout.
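The revision bookkeeping behind a rollback can be modelled with a small class. This is a toy model for illustration only: `RevisionHistory` is a hypothetical name, Pod templates are reduced to plain strings, and the gradual scaling itself is left out. It reflects one detail worth knowing: a rolled-back template is re-registered under a new, higher revision number rather than keeping its old one.

```python
class RevisionHistory:
    """Toy model of a Deployment's retained revisions (revision number -> Pod template)."""

    def __init__(self, template, history_limit=10):
        self.history_limit = history_limit  # stands in for spec.revisionHistoryLimit
        self.revisions = {1: template}
        self.current = 1

    def rollout(self, template):
        """Record a new revision and prune old ones beyond the history limit."""
        self.current = max(self.revisions) + 1
        self.revisions[self.current] = template
        old = sorted(r for r in self.revisions if r != self.current)
        for revision in old[:max(0, len(old) - self.history_limit)]:
            del self.revisions[revision]

    def undo(self, to_revision=None):
        """Re-apply an earlier template; by default the most recent earlier revision."""
        old = [r for r in self.revisions if r != self.current]
        if not old:
            raise ValueError("no earlier revision is retained")
        target = max(old) if to_revision is None else to_revision
        if target not in old:
            raise ValueError(f"revision {target} is not available for rollback")
        template = self.revisions.pop(target)
        # The restored template becomes the newest revision.
        self.current = max(self.revisions) + 1
        self.revisions[self.current] = template
        return template


web = RevisionHistory("web:v1")
web.rollout("web:v2")
web.rollout("web:v3")
web.undo()
print(web.current, web.revisions[web.current])  # 4 web:v2
print(sorted(web.revisions))                    # [1, 3, 4]
```

The model also shows why a small history limit restricts how far back a rollback can reach: once a revision has been pruned, `undo` has nothing to restore.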
Pausing a rollout mid-flight
kubectl rollout pause deployment/web # freeze the rollout at its current state
kubectl rollout resume deployment/web # continue it
Pausing is useful for making several related changes to a Deployment's spec (e.g., updating both the image and a resource limit) without triggering a rollout after each individual edit: pause, make all your changes, then resume to trigger exactly one coordinated rollout.
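The effect of pausing can be illustrated with a toy model that counts how many rollouts a series of edits triggers. `ToyDeployment` and its fields are hypothetical and only mimic the behaviour described above; nothing here talks to a real cluster.

```python
class ToyDeployment:
    """Counts rollouts: each Pod template change triggers one, unless the Deployment is paused."""

    def __init__(self, **template):
        self.template = dict(template)
        self._rolled_out = dict(template)  # the template the running Pods were created from
        self.paused = False
        self.rollouts = 0

    def _reconcile(self):
        # A rollout starts only when unpaused and the template differs from what is running.
        if not self.paused and self.template != self._rolled_out:
            self.rollouts += 1
            self._rolled_out = dict(self.template)

    def patch(self, **changes):
        self.template.update(changes)
        self._reconcile()

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        self._reconcile()


unpaused = ToyDeployment(image="web:v1", memory_limit="256Mi")
unpaused.patch(image="web:v2")
unpaused.patch(memory_limit="512Mi")
print(unpaused.rollouts)  # 2

batched = ToyDeployment(image="web:v1", memory_limit="256Mi")
batched.pause()
batched.patch(image="web:v2")
batched.patch(memory_limit="512Mi")
batched.resume()
print(batched.rollouts)  # 1
```

The same two edits cost two rollouts when applied one by one and a single rollout when wrapped in pause and resume.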
What rolling updates don't protect against
A rolling update only protects against a new version that fails its readiness probe. A bug that passes readiness checks but causes incorrect behavior under real production traffic (a subtle logic error, a slow memory leak) won't be caught by the rollout mechanism itself. This is exactly the gap that canary deployments and more sophisticated progressive-delivery tooling (see the production operations topic) are designed to close.