The general order: control plane, then nodes
Kubernetes supports a control plane that runs a newer minor version than the kubelets on worker nodes, within the limits of the official version skew policy: kubelets may be up to three minor versions older than the API server as of Kubernetes 1.28 (two in earlier releases), and never newer. This is precisely what allows upgrading the control plane first without simultaneously upgrading every node, so the upgrade is spread out safely rather than forced into one enormous, all-at-once cutover.
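As a rough illustration of the skew rule, the sketch below compares two version strings and reports whether a kubelet is within the allowed distance of the control plane. The function names minor_of and skew_ok are hypothetical, and the default limit of 3 is an assumption that fits Kubernetes 1.28 and later; check the skew policy for the release you actually run.

```shell
# minor_of VERSION: print the minor number of "v1.29.3", "1.29", or "v1.29.3-gke.100".
minor_of() {
  local v
  v="${1#v}"              # strip an optional leading "v"
  v="${v#*.}"             # drop the major component and its dot
  echo "${v%%[!0-9]*}"    # keep only the leading digits, i.e. the minor number
}

# skew_ok CONTROL_PLANE KUBELET [MAX_SKEW]
# Succeeds when the kubelet is not newer than the control plane and is at
# most MAX_SKEW minor versions behind it (default 3, an assumption).
skew_ok() {
  local cp kubelet max diff
  cp="$(minor_of "$1")"
  kubelet="$(minor_of "$2")"
  max="${3:-3}"
  diff=$(( cp - kubelet ))
  [ "$diff" -ge 0 ] && [ "$diff" -le "$max" ]
}

# Kubelet versions of a live cluster can be listed with:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion
if skew_ok v1.29.3 v1.27.9; then echo "skew is within policy"; fi
```

Running a check like this before the control-plane upgrade shows whether any node would fall outside the supported skew once the control plane moves forward.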
Step 1: check for deprecated/removed APIs before upgrading
Kubernetes deprecates and eventually removes old API versions on a predictable schedule (a notable historical example: many extensions/v1beta1 and apps/v1beta1 resources were removed in 1.16). A manifest or a controller still using a removed API version will simply fail outright on the new version, not just print a warning. Tools like pluto or kube-no-trouble scan a cluster's actual running manifests (and Helm charts) for use of soon-to-be-removed or already-removed API versions, and kubectl-convert rewrites a manifest file to a newer API version once a problem has been found. Together they let you fix these proactively before the upgrade rather than discovering the breakage during or after it.
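To make the idea concrete, here is a minimal sketch that searches a directory of manifest files for a handful of API versions that are known to have been removed. It is not a substitute for the dedicated scanners: the list is deliberately partial, the function name find_removed_apis is hypothetical, it only inspects files on disk (not live cluster objects), and it assumes a grep that supports -r and --include (GNU and BSD grep do).

```shell
# Illustrative, NOT exhaustive: a few API versions removed in past releases.
REMOVED_API_VERSIONS='extensions/v1beta1|apps/v1beta1|apps/v1beta2|batch/v1beta1|policy/v1beta1'

# find_removed_apis DIR: print file:line:match for every manifest under DIR
# that declares one of the versions above. Exits non-zero when none are found.
find_removed_apis() {
  grep -rnE --include='*.yaml' --include='*.yml' \
    "^[[:space:]-]*apiVersion:[[:space:]]*(${REMOVED_API_VERSIONS})[[:space:]]*(#.*)?\$" "$1"
}

# Usage: find_removed_apis ./manifests
```

A hit here only says the file needs attention; which release removed a given version, and what replaces it, still has to be checked against the deprecation guide.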
Step 2: upgrade the control plane
On a managed service (EKS, GKE, AKS), this is typically a single action the cloud provider handles largely automatically and with minimal disruption, since the control plane's own components (API server, etcd, scheduler) are separate from where application workloads actually run. On a self-managed cluster (kubeadm), this means upgrading each control-plane node's components in sequence, ensuring etcd quorum is maintained throughout (never upgrading so many control-plane nodes simultaneously that you lose quorum — see the etcd question).
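The quorum constraint is simple arithmetic, sketched below: an etcd cluster needs a majority of its members available, so the number that can be offline at once is whatever is left over. The function names are hypothetical.

```shell
# quorum MEMBERS: votes needed for an etcd cluster of MEMBERS nodes (a majority).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# max_offline MEMBERS: members that can be down at once while quorum still holds.
max_offline() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum "$n") can_be_offline=$(max_offline "$n")"
done
# prints:
# members=1 quorum=1 can_be_offline=0
# members=3 quorum=2 can_be_offline=1
# members=5 quorum=3 can_be_offline=2
```

For the common three-node control plane this means exactly one node may be mid-upgrade at any moment, which is why the nodes are upgraded in sequence.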
Step 3: upgrade worker nodes, one at a time (or in small batches)
```shell
kubectl cordon node-1     # mark node-1 as unschedulable -- no NEW pods land here
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# ... drain evicts existing Pods, respecting PodDisruptionBudgets (see that question) ...
# ... upgrade node-1's kubelet/OS, or replace it entirely with a new node ...
kubectl uncordon node-1   # allow new pods to be scheduled here again
```
- cordon marks a node unschedulable for new Pods, without touching Pods already running there. It is a preparatory step.
- drain additionally evicts existing Pods from the node (via the Eviction API, respecting PodDisruptionBudgets), letting their controllers (Deployments, StatefulSets) reschedule them onto other, still-available nodes before this one is taken offline for the upgrade.
- Repeating this one node (or a small batch) at a time, rather than draining every node simultaneously, is what keeps the application actually available throughout the process, assuming applications have enough replicas spread across enough nodes, and correctly configured PodDisruptionBudgets, to tolerate losing one node's worth of capacity at a time.
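The one-node-at-a-time sequence can be sketched as a small loop. This is an illustration under assumptions, not a production tool: upgrade_node is a hypothetical placeholder for whatever actually upgrades or replaces the machine, the timeouts are arbitrary, and kubectl is assumed to be configured for the target cluster.

```shell
# upgrade_node NODE: hypothetical placeholder for the real upgrade step
# (package upgrade plus kubelet restart, or replacing the machine image).
upgrade_node() {
  echo "upgrading $1 (placeholder)"
}

# rolling_node_upgrade NODE...: process nodes strictly one at a time and
# stop at the first failure so a stuck drain never spreads to other nodes.
rolling_node_upgrade() {
  local node
  for node in "$@"; do
    kubectl cordon "$node" || return 1
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m || return 1
    upgrade_node "$node" || return 1
    kubectl wait --for=condition=Ready "node/$node" --timeout=5m || return 1
    kubectl uncordon "$node" || return 1
  done
}

# Usage: rolling_node_upgrade node-1 node-2 node-3
```

Stopping on the first error is a deliberate choice: a drain that times out usually means a PodDisruptionBudget is blocking eviction, and continuing to the next node would only reduce capacity further. Note that the failed node is left cordoned for an operator to inspect.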
Why this is often easier on managed cloud clusters
Managed Kubernetes offerings frequently provide a largely automated node-pool upgrade process that performs exactly this cordon/drain/replace sequence across a node pool with configurable batching/surge settings, substantially reducing the manual orchestration burden compared to a fully self-managed kubeadm cluster, where an operator (or custom automation) is responsible for sequencing this correctly.
Rollback and canary-style caution for major version jumps
Kubernetes officially supports upgrading one minor version at a time (e.g., 1.27 → 1.28 → 1.29, not skipping straight from 1.27 to 1.29) — skipping versions isn't supported and can produce unpredictable results. For especially risk-sensitive environments, some teams first validate an upgrade against a staging cluster running representative workloads before applying it to production, treating a cluster upgrade with the same caution as a major application deployment, not a routine background task.
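The one-minor-version-at-a-time rule turns a large jump into a sequence of separate upgrades. The hypothetical helper below lists the intermediate targets, assuming both arguments are plain MAJOR.MINOR strings with the same major version.

```shell
# upgrade_path FROM TO: print each upgrade target in order, one per line.
upgrade_path() {
  local major from to
  major="${1%%.*}"   # text before the first dot
  from="${1#*.}"     # text after the first dot
  to="${2#*.}"
  while [ "$from" -lt "$to" ]; do
    from=$(( from + 1 ))
    echo "${major}.${from}"
  done
}

upgrade_path 1.27 1.29
# prints:
# 1.28
# 1.29
```

Each line is a full upgrade cycle of its own: control plane first, then nodes, with validation in between.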
Mentioning cordon/drain specifically (not just "upgrade the nodes"), the API-deprecation-checking step, and the one-minor-version-at-a-time constraint together demonstrate real hands-on cluster operations experience, rather than only conceptual familiarity with the idea that clusters need periodic upgrades.
This builds directly on the earlier fundamentals-topic etcd question's core point: etcd is the only genuinely stateful control-plane component, and losing it without a backup is catastrophic.
Taking a snapshot
```shell
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
This produces a single, consistent point-in-time snapshot of etcd's entire key space — every Kubernetes object's current state, at that moment. Snapshots should be taken on a regular schedule (commonly via a CronJob or an external scheduler), and — critically — copied off to storage separate from the etcd nodes themselves (object storage, a separate backup system) so that a failure affecting the etcd nodes (disk failure, an entire node/VM being destroyed) doesn't also destroy the backup sitting right next to it.
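A scheduled backup job might look roughly like the sketch below. It reuses the snapshot command and certificate paths shown above; copy_offsite is a hypothetical stand-in for an upload to object storage or a backup system, and here it only copies to another directory so the example stays self-contained.

```shell
# copy_offsite FILE DEST_DIR: hypothetical stand-in for a real upload step.
copy_offsite() {
  cp "$1" "$2/"
}

# backup_etcd LOCAL_DIR OFFSITE_DIR: take a snapshot, refuse to ship a
# missing or empty file, then copy it away from the etcd node.
backup_etcd() {
  local snap
  snap="$1/etcd-snapshot-$(date +%Y%m%d%H%M).db"
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1
  [ -s "$snap" ] || { echo "snapshot missing or empty: $snap" >&2; return 1; }
  copy_offsite "$snap" "$2" || return 1
  echo "$snap"
}

# Usage: backup_etcd /backup /mnt/offsite
```

The explicit size check matters because a backup script that exits quietly after producing nothing is one of the failure modes that otherwise goes unnoticed until a restore is needed.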
Restoring from a snapshot
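```shell
# File name follows the %Y%m%d%H%M pattern used when the snapshot was saved.
# On etcd 3.5+, "etcdutl snapshot restore" is the preferred form of this command.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-202401150300.db \
  --data-dir=/var/lib/etcd-restored
```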
Restoring isn't simply "load the file back into a running etcd" — it typically involves stopping the etcd process, restoring the snapshot into a fresh data directory, and reconfiguring the control plane to use it (details vary by whether you're restoring a single-node etcd or reconstructing a multi-node etcd cluster's membership) — this is genuinely one of the more operationally delicate procedures in running Kubernetes, which is precisely why testing it matters so much.
Why testing the restore procedure is non-negotiable
An untested backup is not, for practical purposes, a real backup — corruption, an incomplete or silently-failing backup script, or unfamiliarity with the actual restore steps under real incident pressure are all common, realistic failure modes that only surface when you actually attempt a restore. Regularly scheduled restore drills — restoring a snapshot into an isolated test environment and confirming the resulting cluster state is actually correct and complete — are the only way to have genuine confidence the backup strategy works when it's actually needed, not just when it's assumed to.
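The first step of a restore drill can be automated with something like the sketch below, which restores a snapshot into a throwaway directory and checks that a database file was produced. It assumes etcdutl from etcd 3.5 or later and assumes the restored layout contains member/snap/db; the function name restore_drill is hypothetical. A real drill goes further and starts a cluster from the restored data to confirm the objects are all there.

```shell
# restore_drill SNAPSHOT: restore into a scratch directory and confirm that
# etcd's database file was materialized there. Cleans up after itself.
restore_drill() {
  local scratch
  scratch="$(mktemp -d)" || return 1
  # The target data directory must not exist yet, hence the "etcd" subdirectory.
  if etcdutl snapshot restore "$1" --data-dir "$scratch/etcd" \
     && [ -s "$scratch/etcd/member/snap/db" ]; then
    echo "restore OK: $1"
    rm -rf "$scratch"
    return 0
  fi
  echo "restore FAILED: $1" >&2
  rm -rf "$scratch"
  return 1
}

# Usage: restore_drill /backup/etcd-snapshot-202401150300.db
```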
What losing etcd without a backup actually means
Every other control plane component can be restarted or rebuilt and will simply resume operating once it can talk to etcd again — but if etcd's data itself is gone, there is no other copy of the cluster's state anywhere inside the cluster. Recovery becomes entirely dependent on whatever configuration exists outside the cluster: version-controlled YAML manifests, Helm chart values, a GitOps repository (see that question) that a tool like ArgoCD could use to reconstruct the cluster's desired state from scratch. This is a strong practical argument for GitOps and infrastructure-as-code more broadly — a cluster whose full desired state is captured in git can be substantially rebuilt even from a total etcd loss, while a cluster whose configuration only ever existed as ad-hoc kubectl commands run manually over time has no such recovery path at all.
Managed vs. self-hosted responsibility
Managed Kubernetes services (EKS, GKE, AKS) handle etcd backup and control-plane resilience as part of the managed offering — this is one of the more significant operational burdens teams take on themselves when choosing to self-host a cluster via kubeadm or similar, and worth weighing explicitly (see the managed-vs-self-hosted question) as part of that decision.
Being able to explain not just how to run the snapshot/restore commands, but why off-cluster storage and regular restore testing specifically matter, and connecting this to GitOps as a complementary recovery strategy, demonstrates real operational maturity beyond memorized etcdctl syntax.
The traditional CI/CD "push" model
```text
CI pipeline (on a code merge):
  → builds image
  → runs kubectl apply / helm upgrade DIRECTLY against the cluster
```
In this model, the CI system needs direct write credentials to the production cluster, and the cluster's actual state can drift from what's in git if someone runs an ad-hoc kubectl edit or kubectl apply outside the pipeline — there's no ongoing verification that the cluster still matches what's declared in source control after the initial deploy.
The GitOps "pull" model
```text
Git repository (source of truth for desired state)
        ▲
        │  (continuously watched/pulled)
        │
ArgoCD/Flux controller, running INSIDE the cluster
        │
        ▼
Cluster's actual state, continuously reconciled to match git
```
Instead of CI pushing changes into the cluster, a controller running inside the cluster (or with cluster access) continuously watches a designated git repository and applies whatever it finds there. Crucially, it keeps re-applying it on an ongoing basis, not just once at deploy time. This means the GitOps controller detects drift and, when automated sync or self-healing is enabled, corrects it: if someone manually edits a Deployment directly with kubectl, the next reconciliation cycle reverts it back to match git, since git (not whatever's currently running) is the authoritative source of truth.
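The pull-and-reconcile idea can be imitated in a few lines of shell. This toy loop is only meant to show the shape of the pattern and is not how ArgoCD or Flux are implemented; the function name reconcile_once and the manifests/ directory layout are assumptions, and git and kubectl are assumed to be installed and authorized.

```shell
# reconcile_once REPO_DIR: one pass of a pull-based loop. REPO_DIR is a local
# clone whose manifests/ directory (a hypothetical layout) holds the desired state.
reconcile_once() {
  local rc
  git -C "$1" pull --ff-only --quiet || return 2
  # kubectl diff exits 0 when live state matches, 1 when it differs,
  # and greater than 1 when the command itself fails.
  if kubectl diff -R -f "$1/manifests" >/dev/null 2>&1; then
    rc=0
  else
    rc=$?
  fi
  case "$rc" in
    0) echo "in sync" ;;
    1) echo "drift detected, re-applying git state"
       kubectl apply -R -f "$1/manifests" >/dev/null || return 2 ;;
    *) echo "diff failed" >&2; return 2 ;;
  esac
}

# A controller repeats this forever, for example:
#   while true; do reconcile_once /srv/cluster-config; sleep 180; done
```

The same pass handles both cases the paragraph describes: a new commit in git and a manual change in the cluster each show up as a difference, and both are resolved in favor of git.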
Why this maps naturally onto Kubernetes's own reconciliation model
This is precisely the same control-loop pattern that runs through the rest of Kubernetes (see the fundamentals topic's reconciliation question) — ArgoCD/Flux are, at their core, another reconciliation controller, just operating one level up: instead of reconciling "desired replica count" against "actual running Pods" the way a Deployment controller does, they reconcile "desired cluster state, as declared in git" against "actual cluster state," continuously.
Security and operational benefits
- No CI system needs direct cluster write credentials: the GitOps controller, running inside the cluster, pulls changes rather than the CI pipeline pushing them in, meaningfully shrinking the set of external systems with standing write access to production.
- Git history is the audit trail and rollback mechanism: reverting a bad change is a normal git revert, and the GitOps controller picks up and applies the reverted state automatically, without needing separate rollback tooling.
- Drift detection and correction: manual, undocumented changes made directly against the cluster are automatically reverted (or at least flagged, depending on configuration) rather than silently persisting and diverging from what's documented in git.
- A single, reviewable place to see a cluster's entire intended configuration: anyone can read the git repository to understand exactly what should be running, without needing direct cluster access at all.
ArgoCD vs. Flux, briefly
Both accomplish the same core GitOps loop; ArgoCD is commonly noted for its rich built-in web UI (visualizing application sync status, diffs, and health directly), while Flux is often used more as a set of composable, script/CLI-friendly controllers, frequently favored in setups leaning more heavily on command-line/automation-first workflows. Both are CNCF projects with broad adoption, and the choice between them is often driven more by team preference and ecosystem fit than a sharp functional gap.
Explaining GitOps as "continuous reconciliation against git as the source of truth," and explicitly connecting it back to the same control-loop pattern that underlies the rest of Kubernetes, demonstrates a deeper conceptual grasp than simply describing it as "using git for deployments."
Blue-green — two full environments, instant cutover
```yaml
# "Blue" (current, live) Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
      version: blue
  template:
    metadata:
      labels:
        app: web
        version: blue
    spec:
      containers:
        - name: web          # a container name is required
          image: myapp:1.0
---
# "Green" (new version), deployed ALONGSIDE blue, not yet receiving traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
      version: green
  template:
    metadata:
      labels:
        app: web
        version: green
    spec:
      containers:
        - name: web
          image: myapp:2.0
---
# Service currently points at "blue" -- cutover happens by changing this selector
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: blue   # <-- change this to "green" to cut over ALL traffic at once
```
Both full versions run simultaneously (doubling resource usage temporarily), and the switch is a single, near-instant change to the Service's selector — every request after the switch goes to the new version, and rollback is equally instant (just switch the selector back). There's no gradual traffic-splitting phase — it's fully off or fully on for each version.
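The cutover itself can be scripted as a single selector patch. The sketch below uses the web Service and the web-blue and web-green Deployments from the manifests above; the function name switch_traffic and the five-minute timeout are assumptions.

```shell
# switch_traffic COLOR: point the "web" Service at web-blue or web-green,
# but only after that Deployment reports a completed rollout.
switch_traffic() {
  kubectl rollout status "deployment/web-$1" --timeout=5m || return 1
  kubectl patch service web --type=merge \
    -p "{\"spec\":{\"selector\":{\"app\":\"web\",\"version\":\"$1\"}}}"
}

# Cut over:  switch_traffic green
# Roll back: switch_traffic blue
```

Checking rollout status first guards against the main blue-green hazard, which is flipping the selector to a version whose Pods are not ready yet.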
Canary — gradual, partial traffic shift
```yaml
# 90% of traffic to the stable version, 10% to the canary --
# this typically requires something beyond a plain Kubernetes Service,
# since Services load-balance evenly across all matching endpoints,
# not by a configurable percentage split.
```
Plain Kubernetes Services have no native concept of "send 10% of traffic here, 90% there" — achieving genuine percentage-based traffic splitting requires additional tooling:
- Ingress controllers with canary annotations (e.g., NGINX Ingress's nginx.ingress.kubernetes.io/canary-weight annotation): a simpler, Ingress-layer approach.
- Service mesh traffic splitting (Istio's VirtualService weighted routing, Linkerd's traffic split): more powerful, supporting fine-grained routing rules (by header, by percentage, gradually shifting over time) at the service-mesh layer.
- Progressive delivery controllers (Argo Rollouts, Flagger): purpose-built tools that automate the entire canary process by gradually shifting traffic percentages on a schedule, automatically querying metrics (error rate, latency) at each step, and automatically rolling back if the canary's metrics look worse than the stable version's, without needing to manually watch and adjust percentages by hand.
```yaml
# Simplified Argo Rollouts canary strategy concept
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10             # send 10% of traffic to the new version
        - pause: {duration: 10m}    # wait, observe metrics
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100            # fully cut over
```
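For the simpler Ingress-annotation approach, the weight is adjusted by hand rather than by a controller. The sketch below assumes ingress-nginx and a second Ingress (hypothetically named web-canary) that shares the host of the main Ingress but routes to the canary Service; set_canary_weight is a hypothetical helper.

```shell
# set_canary_weight INGRESS PERCENT: mark INGRESS as a canary and route
# roughly PERCENT of requests to its backend using ingress-nginx annotations.
set_canary_weight() {
  case "$2" in
    ''|*[!0-9]*) echo "weight must be an integer" >&2; return 1 ;;
  esac
  [ "$2" -le 100 ] || { echo "weight must be between 0 and 100" >&2; return 1; }
  kubectl annotate ingress "$1" --overwrite \
    "nginx.ingress.kubernetes.io/canary=true" \
    "nginx.ingress.kubernetes.io/canary-weight=$2"
}

# Usage: set_canary_weight web-canary 10   # then watch metrics before raising it
```

Compared with the Rollout steps, every pause and every metric check here is a human decision, which is exactly the manual work that progressive delivery controllers automate.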
Comparing the two
| | Blue-green | Canary |
|---|---|---|
| Resource cost during rollout | Double (both full versions running) | Only slightly more (canary is usually a small extra replica count) |
| Rollback speed | Instant (flip the selector back) | Also fast, but may involve gradually shifting traffic back |
| Real production traffic validation before full commit | No — it's all-or-nothing at cutover | Yes — the whole point is validating under a real (if limited) slice of production traffic |
| Native Kubernetes support | Achievable with plain Services + relabeling | Requires additional tooling (Ingress annotations, service mesh, or Argo Rollouts/Flagger) |
| Best for | Simpler risk profiles, or when partial-traffic validation isn't feasible/needed | Higher-risk changes where gradually validating real user impact before full rollout matters |
Why rolling updates alone (the Deployment default) aren't the same as either
A standard rolling update (see the workload controllers topic) gradually replaces old Pods with new ones, but every Pod — old or new — receives an equal, undifferentiated share of traffic throughout the process; there's no deliberate "send only a controlled fraction of traffic to the new version and watch its metrics before proceeding" logic the way a true canary strategy has, and no "two full parallel environments, instant atomic switch" behavior the way blue-green has. Rolling updates are a reasonable default for many workloads, but blue-green and canary are deliberate strategies reached for when the additional safety/control they provide is specifically worth the added complexity.
Why multi-tenancy is a spectrum, not a single feature
"Multiple teams share a cluster" can mean anything from "trusted internal teams, mostly worried about accidentally stepping on each other's resources" to "mutually untrusted third-party customers running arbitrary code, where a security breach of one tenant must never affect another" — Kubernetes provides building blocks at several different strength levels, and choosing the right combination depends entirely on how much you actually trust the tenants sharing the cluster.
Layer 1: Namespaces — logical separation
Covered in the fundamentals topic — namespaces scope object names and are the boundary RBAC and quotas attach to. This alone provides organizational separation (Team A's web Deployment doesn't collide with Team B's web Deployment) but, critically, no network or security isolation by default — Pods in different namespaces can freely reach each other unless NetworkPolicies say otherwise.
Layer 2: ResourceQuotas and LimitRanges — preventing resource monopolization
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    pods: "50"
```
A ResourceQuota caps the total resources (or object counts) a namespace can consume — without this, one tenant's namespace could, in principle, consume so much of the shared cluster's capacity that other tenants' workloads can't be scheduled at all (a "noisy neighbor" problem). A LimitRange complements this by setting default and min/max per-container resource requests/limits within a namespace, so individual Pods can't be created with unreasonably large (or entirely unset) resource requests even before the aggregate quota is hit.
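A LimitRange to pair with that quota could be generated per tenant namespace as sketched below. The object name container-defaults and every number in it are illustrative assumptions, and limitrange_manifest is a hypothetical helper that only prints the manifest.

```shell
# limitrange_manifest NAMESPACE: print a LimitRange with per-container
# defaults and ceilings for the given namespace.
limitrange_manifest() {
  cat <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: $1
spec:
  limits:
    - type: Container
      defaultRequest:   # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:          # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      max:              # hard ceiling for any single container
        cpu: "2"
        memory: 2Gi
EOF
}

# Usage (needs cluster access): limitrange_manifest team-a | kubectl apply -f -
```

The defaults also make the quota usable in practice: once a namespace has a quota on requests.cpu or requests.memory, Pods that omit those requests are rejected unless a LimitRange supplies them.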
Layer 3: NetworkPolicies — restricting cross-tenant traffic
As covered in the networking topic, a default-deny NetworkPolicy posture per tenant namespace, with explicit allow rules only for legitimate cross-namespace traffic, prevents one tenant's compromised or misbehaving workload from freely reaching another tenant's Pods over the network — closing the gap that plain namespace separation alone leaves wide open.
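A default-deny baseline for a tenant namespace is a short manifest, printed here by a hypothetical helper. The policy name default-deny-all is an assumption, and the policy only has an effect when the cluster's network plugin enforces NetworkPolicies.

```shell
# default_deny_manifest NAMESPACE: print a policy that selects every Pod in
# the namespace and allows no ingress or egress traffic.
default_deny_manifest() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: $1
spec:
  podSelector: {}      # empty selector: every Pod in the namespace
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Usage (needs cluster access): default_deny_manifest team-a | kubectl apply -f -
```

Because egress is denied as well, DNS lookups stop working until an explicit allow rule for the cluster DNS service is added, so this policy is normally applied together with a small set of allow rules.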
Layer 4: RBAC — restricting cross-tenant API access
Namespace-scoped Roles and RoleBindings (see the security topic) ensure Team A's credentials/ServiceAccounts have no permission to read or modify Team B's objects at the API level, even though both share the same cluster and API server.
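Whether that separation actually holds can be spot-checked with kubectl auth can-i and impersonation. The sketch below probes a few verbs on Pods using the default ServiceAccount of one tenant against another tenant's namespace; the function names are hypothetical, and the caller is assumed to have permission to impersonate ServiceAccounts.

```shell
# can_reach TENANT_NS TARGET_NS VERB RESOURCE: succeeds when the default
# ServiceAccount of TENANT_NS may perform VERB on RESOURCE in TARGET_NS.
can_reach() {
  kubectl auth can-i "$3" "$4" --namespace "$2" \
    --as "system:serviceaccount:$1:default" >/dev/null 2>&1
}

# assert_isolated TENANT_NS TARGET_NS: fail loudly if any probed verb is allowed.
assert_isolated() {
  local verb
  for verb in get list create delete; do
    if can_reach "$1" "$2" "$verb" pods; then
      echo "LEAK: $1 can $verb pods in $2" >&2
      return 1
    fi
  done
  echo "$1 has no pod access in $2"
}

# Usage: assert_isolated team-a team-b
```

One caveat of this sketch: any kubectl error (for example, no cluster connection) is also read as "not allowed", so a passing result is only meaningful when the cluster is reachable.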
Layer 5: stronger isolation for genuinely untrusted tenants
For scenarios where tenants are truly mutually distrustful (a SaaS platform running arbitrary customer workloads, or strict regulatory/compliance separation requirements), namespace-level isolation alone is generally considered insufficient — the Linux kernel underlying every container on a node is still shared, and container escape vulnerabilities, while not routine, do exist. Stronger options include:
- Dedicated node pools per tenant (via taints/tolerations and node affinity — see that question), ensuring no two tenants' Pods ever share the same physical/virtual machine.
- Sandboxed container runtimes (gVisor, Kata Containers) that provide a stronger isolation boundary than a standard container runtime, at some performance cost.
- Separate clusters per tenant — the strongest isolation available, at the cost of significantly higher operational overhead (managing many clusters instead of one) and losing the resource-pooling efficiency of a shared cluster.
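Of these options, dedicated nodes are the quickest to sketch. The hypothetical helper below labels and taints a node for one tenant, using tenant as an assumed label and taint key; the tenant's Pods then need both a matching toleration and a node selector (or node affinity) for the same label.

```shell
# dedicate_node NODE TENANT: label and taint NODE so that only Pods which
# both select the label and tolerate the taint are scheduled there.
dedicate_node() {
  kubectl label nodes "$1" "tenant=$2" --overwrite || return 1
  kubectl taint nodes "$1" "tenant=$2:NoSchedule" --overwrite
}

# Usage: dedicate_node node-7 team-a
```

The taint keeps other tenants' Pods off the node, while the label plus selector keeps this tenant's Pods from landing anywhere else; both halves are needed for the nodes to be truly dedicated.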
Match the isolation strategy to the actual trust level between tenants: internal teams within one organization, generally trusting each other but wanting organizational and resource separation, are usually well served by namespaces + quotas + RBAC + NetworkPolicies on a shared cluster. Genuinely untrusted external tenants, or workloads with strict compliance/regulatory separation requirements, usually warrant node-level or cluster-level isolation instead — treating namespace-only separation as sufficient for that case is a common, real security misjudgment.