How do you safely upgrade a Kubernetes cluster with minimal downtime?

Upgrade the control plane first (typically non-disruptive to running workloads on managed services, since the control plane is separate from where application Pods run), then upgrade worker nodes one at a time (or in small batches) by cordoning and draining each node — moving its Pods elsewhere safely, respecting PodDisruptionBudgets — before upgrading or replacing it, then uncordoning it to rejoin the pool. Always check the specific version's changelog for deprecated/removed APIs before upgrading, since a cluster using a removed API version will break, not just warn.

How do you back up and restore etcd, and why does this matter?

Take regular, automated point-in-time snapshots of etcd's data using `etcdctl snapshot save`, store those snapshots somewhere durable and separate from the etcd nodes themselves, and periodically test actually restoring from a snapshot (an untested backup isn't a real backup). Since etcd holds the entire cluster's state — every object, every Secret — losing it without a working backup means effectively losing the cluster's whole configuration, recoverable only from whatever manifests/Helm charts/GitOps repositories happen to exist outside the cluster.

What is GitOps, and how do tools like ArgoCD/Flux fit into a Kubernetes deployment pipeline?

GitOps is an operating model where a git repository is the single source of truth for a cluster's desired state, and a controller running inside (or alongside) the cluster continuously reconciles the cluster's actual state to match whatever is committed in that repository — rather than a CI pipeline pushing changes into the cluster directly. ArgoCD and Flux are the two most widely used GitOps controllers, both watching a git repo and applying (and, importantly, continuously re-applying/correcting) whatever manifests it contains.

What's the difference between a blue-green and a canary deployment, and how do you implement each in Kubernetes?

A blue-green deployment runs the new version fully alongside the old version, then switches all traffic over in one atomic cutover (typically by updating a Service's label selector) — rollback is instant, but requires double the resources temporarily and offers no gradual, partial-traffic validation. A canary deployment gradually shifts a small, increasing percentage of traffic to the new version while most traffic still goes to the old one, letting you validate the new version under a limited slice of real production traffic before fully committing — typically requiring a service mesh or Ingress controller with traffic-splitting capability, since Kubernetes's core Deployment rolling update mechanism doesn't natively support percentage-based traffic splitting on its own.

What is multi-tenancy in Kubernetes, and what mechanisms support it?

Multi-tenancy means multiple distinct teams, applications, or customers safely share the same underlying cluster infrastructure. Kubernetes supports this at increasing levels of isolation: Namespaces (logical separation of objects and RBAC scope), ResourceQuotas/LimitRanges (preventing one tenant from consuming all shared capacity), NetworkPolicies (restricting cross-tenant network traffic), and, for genuinely untrusted or compliance-sensitive tenants, stronger isolation like separate node pools, dedicated clusters per tenant, or sandboxed container runtimes — since namespaces alone provide only logical, not hard security, separation.

What are ResourceQuotas and LimitRanges, and how do they differ from pod-level requests/limits?

A ResourceQuota caps the aggregate resource consumption (and/or object counts) allowed within an entire namespace — the total across every Pod, not any single one. A LimitRange sets default, minimum, and maximum resource request/limit values *per container* within a namespace, filling in sensible defaults for Pods that don't specify their own and rejecting ones that fall outside allowed bounds. Individual Pod-level requests/limits (set in each Pod's own spec) are what these two namespace-level mechanisms constrain and default — they operate at a different scope, not as a replacement for per-Pod configuration.

How do you drain a node safely for maintenance?

First `kubectl cordon <node-name>` to mark it unschedulable (preventing new Pods from landing there while you work), then `kubectl drain <node-name>` to evict its existing Pods so their controllers reschedule them elsewhere. Drain respects PodDisruptionBudgets by default and requires explicit flags (`--ignore-daemonsets`, `--delete-emptydir-data`) to handle Pods that would otherwise block it. Once maintenance is complete, `kubectl uncordon <node-name>` allows new Pods to be scheduled there again.

What are common cost-optimization strategies for a Kubernetes cluster?

Right-size resource requests/limits based on actual observed usage (over-requesting wastes reserved, paid-for capacity that sits idle), use the Cluster Autoscaler to scale node count down during low-demand periods rather than running peak capacity constantly, use cheaper compute options where appropriate (spot/preemptible instances for fault-tolerant workloads), consolidate underutilized nodes via bin-packing-aware scheduling, and monitor per-namespace/per-team resource usage to make cost visible and attributable rather than a single opaque cluster-wide bill.

Kubernetes in Production and Operations

Cluster upgrades, etcd backups, GitOps, progressive delivery, multi-tenancy, and cost management.

The general order: control plane, then nodes

Kubernetes supports the control plane running a newer minor version than the kubelets on worker nodes, within the skew documented in the official version skew policy (kubelets may be up to three minor versions older than the API server as of Kubernetes 1.28, two in earlier releases, and never newer). This is precisely what allows upgrading the control plane first without needing to simultaneously upgrade every node, spreading the upgrade out safely rather than requiring one enormous, all-at-once cutover.
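The skew rule can be expressed as a small pre-upgrade check. This is a minimal sketch, not an official tool: the function names are hypothetical, it assumes major version 1, and the default allowance of three minor versions is an assumption that should be checked against the version skew policy for your release.

```python
def parse_minor(version: str) -> int:
    """Return the minor number from a version such as 'v1.29.3' or '1.29'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def kubelet_skew_ok(apiserver: str, kubelet: str, max_skew: int = 3) -> bool:
    """True if the kubelet is not newer than the API server and is at most
    max_skew minor versions older (max_skew is an assumed policy value)."""
    diff = parse_minor(apiserver) - parse_minor(kubelet)
    return 0 <= diff <= max_skew


# Control plane already upgraded, nodes still two minors behind: acceptable.
print(kubelet_skew_ok("v1.30.1", "v1.28.5"))
# A kubelet newer than the API server is never acceptable.
print(kubelet_skew_ok("v1.28.0", "v1.29.0"))
```

Running a check like this against every node before starting the control-plane upgrade shows whether the nodes can safely lag behind for the duration of the rollout.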

Step 1: check for deprecated/removed APIs before upgrading

Kubernetes deprecates and eventually removes old API versions on a predictable schedule (a notable historical example: many extensions/v1beta1 and apps/v1beta1 resources were removed in 1.16). A manifest or a controller still using a removed API version will simply fail outright on the new version, not just print a warning. Tools like pluto or kube-no-trouble scan manifests, live cluster objects, and Helm releases for use of soon-to-be-removed or already-removed API versions, and the kubectl convert plugin can then rewrite a manifest to a newer API version, letting you fix these proactively before the upgrade rather than discovering the breakage during or after it.
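To make the idea concrete, here is a rough sketch of what such a scanner does. It is not how pluto or kube-no-trouble are implemented; the table below is a small hand-picked sample of removals, and manifests are represented as already-parsed dictionaries to keep the example free of third-party YAML libraries.

```python
# Sample of removed APIs: (apiVersion, kind) -> the 1.x minor release in
# which that version stopped being served. Deliberately incomplete.
REMOVED_IN = {
    ("extensions/v1beta1", "Deployment"): 16,
    ("apps/v1beta1", "Deployment"): 16,
    ("extensions/v1beta1", "Ingress"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
}


def find_removed(manifests, target_minor):
    """Return a message for each manifest whose apiVersion/kind is no longer
    served at the target minor version."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and target_minor >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name} uses {key[0]}, removed in 1.{removed}")
    return findings


manifests = [
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
     "metadata": {"name": "web"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]
for line in find_removed(manifests, target_minor=25):
    print(line)
```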

Step 2: upgrade the control plane

On a managed service (EKS, GKE, AKS), this is typically a single action the cloud provider handles largely automatically and with minimal disruption, since the control plane's own components (API server, etcd, scheduler) are separate from where application workloads actually run. On a self-managed cluster (kubeadm), this means upgrading each control-plane node's components in sequence, ensuring etcd quorum is maintained throughout (never upgrading so many control-plane nodes simultaneously that you lose quorum — see the etcd question).

Step 3: upgrade worker nodes, one at a time (or in small batches)

kubectl cordon node-1              # mark node-1 as unschedulable -- no NEW pods land here
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# ... drain evicts existing Pods, respecting PodDisruptionBudgets (see that question) ...
# ... upgrade node-1's kubelet/OS, or replace it entirely with a new node ...
kubectl uncordon node-1             # allow new pods to be scheduled here again

cordon marks a node unschedulable for new Pods, without touching Pods already running there — a preparatory step.
drain additionally evicts existing Pods from the node (via the Eviction API, respecting PodDisruptionBudgets), letting their controllers (Deployments, StatefulSets) reschedule them onto other, still-available nodes before this one is taken offline for the upgrade.
Repeating this one node (or a small batch) at a time, rather than draining every node simultaneously, is what keeps the application actually available throughout the process — assuming applications have enough replicas spread across enough nodes, and correctly configured PodDisruptionBudgets, to tolerate losing one node's worth of capacity at a time.
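The batching described above can be sketched as a small planner. This is an illustration only: the helper names are hypothetical, nothing is executed against a cluster, and a real rollout would also wait for each drain to finish and for replacement Pods to become ready before moving on.

```python
def batches(nodes, batch_size=1):
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def node_commands(node):
    """kubectl invocations for one node, in order (argv lists, not executed here)."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # The kubelet/OS upgrade or node replacement happens here, outside this sketch.
        ["kubectl", "uncordon", node],
    ]


for batch in batches(["node-1", "node-2", "node-3"], batch_size=2):
    for node in batch:
        for argv in node_commands(node):
            print(" ".join(argv))
```

Keeping the batch size small relative to the node count is what bounds how much capacity is missing at any moment.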

Why this is often easier on managed cloud clusters

Managed Kubernetes offerings frequently provide a largely automated node-pool upgrade process that performs exactly this cordon/drain/replace sequence across a node pool with configurable batching/surge settings, substantially reducing the manual orchestration burden compared to a fully self-managed kubeadm cluster, where an operator (or custom automation) is responsible for sequencing this correctly.

Rollback and canary-style caution for minor version jumps

Kubernetes officially supports upgrading one minor version at a time (e.g., 1.27 → 1.28 → 1.29, not skipping straight from 1.27 to 1.29) — skipping versions isn't supported and can produce unpredictable results. For especially risk-sensitive environments, some teams first validate an upgrade against a staging cluster running representative workloads before applying it to production, treating a cluster upgrade with the same caution as a major application deployment, not a routine background task.
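The one-minor-version-at-a-time rule turns a large version gap into a sequence of upgrades. A minimal sketch of computing that sequence follows; the function name is hypothetical, and it only models forward upgrades within a single major version.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List every minor version to step through, from current to target."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are modelled")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.27", "1.29"))  # ['1.28', '1.29']
```

Each entry in the result is a full upgrade cycle of its own: control plane first, then nodes.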

Mentioning cordon/drain specifically (not just "upgrade the nodes"), the API-deprecation-checking step, and the one-minor-version-at-a-time constraint together demonstrate real hands-on cluster operations experience, rather than only conceptual familiarity with the idea that clusters need periodic upgrades.

Related Resources

Kubernetes: Upgrading kubeadm clusters

This builds directly on the earlier fundamentals-topic etcd question's core point: etcd is the only genuinely stateful control-plane component, and losing it without a backup is catastrophic.

Taking a snapshot

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

This produces a single, consistent point-in-time snapshot of etcd's entire key space — every Kubernetes object's current state, at that moment. Snapshots should be taken on a regular schedule (commonly via a CronJob or an external scheduler), and — critically — copied off to storage separate from the etcd nodes themselves (object storage, a separate backup system) so that a failure affecting the etcd nodes (disk failure, an entire node/VM being destroyed) doesn't also destroy the backup sitting right next to it.
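One way to guard against a silently truncated or corrupted off-host copy is to compare checksums after copying. The sketch below assumes the separate storage is reachable as a mounted directory, which is a simplification; with object storage the same idea applies, but the copy and verification calls would come from that provider's own client.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large snapshots are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(snapshot: Path, backup_dir: Path) -> Path:
    """Copy a snapshot into backup_dir and fail loudly if the copy differs."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / snapshot.name
    shutil.copy2(snapshot, target)
    if sha256_of(snapshot) != sha256_of(target):
        target.unlink()
        raise IOError(f"checksum mismatch copying {snapshot} to {target}")
    return target


if __name__ == "__main__":
    # Demo with a stand-in file; a real run would point at the .db snapshot.
    with tempfile.TemporaryDirectory() as tmp:
        fake_snapshot = Path(tmp) / "etcd-snapshot.db"
        fake_snapshot.write_bytes(b"not a real snapshot")
        print(copy_and_verify(fake_snapshot, Path(tmp) / "offsite"))
```

A matching checksum only proves the copy is intact; it says nothing about whether the snapshot itself restores correctly, which still has to be tested separately.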

Restoring from a snapshot

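# On etcd 3.5 and later, `etcdctl snapshot restore` is deprecated in favor of
# `etcdutl snapshot restore`, which takes the same --data-dir flag.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20240115.db \
  --data-dir=/var/lib/etcd-restored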

Restoring isn't simply "load the file back into a running etcd" — it typically involves stopping the etcd process, restoring the snapshot into a fresh data directory, and reconfiguring the control plane to use it (details vary by whether you're restoring a single-node etcd or reconstructing a multi-node etcd cluster's membership) — this is genuinely one of the more operationally delicate procedures in running Kubernetes, which is precisely why testing it matters so much.

Why testing the restore procedure is non-negotiable

An untested backup is not, for practical purposes, a real backup — corruption, an incomplete or silently-failing backup script, or unfamiliarity with the actual restore steps under real incident pressure are all common, realistic failure modes that only surface when you actually attempt a restore. Regularly scheduled restore drills — restoring a snapshot into an isolated test environment and confirming the resulting cluster state is actually correct and complete — are the only way to have genuine confidence the backup strategy works when it's actually needed, not just when it's assumed to.

What losing etcd without a backup actually means

Every other control plane component can be restarted or rebuilt and will simply resume operating once it can talk to etcd again — but if etcd's data itself is gone, there is no other copy of the cluster's state anywhere inside the cluster. Recovery becomes entirely dependent on whatever configuration exists outside the cluster: version-controlled YAML manifests, Helm chart values, a GitOps repository (see that question) that a tool like ArgoCD could use to reconstruct the cluster's desired state from scratch. This is a strong practical argument for GitOps and infrastructure-as-code more broadly — a cluster whose full desired state is captured in git can be substantially rebuilt even from a total etcd loss, while a cluster whose configuration only ever existed as ad-hoc kubectl commands run manually over time has no such recovery path at all.

Managed vs. self-hosted responsibility

Managed Kubernetes services (EKS, GKE, AKS) handle etcd backup and control-plane resilience as part of the managed offering. On a cluster self-hosted via kubeadm or similar, that work falls to your own team; it is one of the more significant operational burdens of self-hosting, and worth weighing explicitly (see the managed-vs-self-hosted question) as part of that decision.

Being able to explain not just how to run the snapshot/restore commands, but why off-cluster storage and regular restore testing specifically matter, and connecting this to GitOps as a complementary recovery strategy, demonstrates real operational maturity beyond memorized etcdctl syntax.

Related Resources

Kubernetes: Backing up an etcd cluster

The traditional CI/CD "push" model

CI pipeline (on a code merge):
   → builds image
   → runs kubectl apply / helm upgrade DIRECTLY against the cluster

In this model, the CI system needs direct write credentials to the production cluster, and the cluster's actual state can drift from what's in git if someone runs an ad-hoc kubectl edit or kubectl apply outside the pipeline — there's no ongoing verification that the cluster still matches what's declared in source control after the initial deploy.

The GitOps "pull" model

Git repository (source of truth for desired state)
        ▲
        │ (continuously watched/pulled)
        │
ArgoCD/Flux controller, running INSIDE the cluster
        │
        ▼
Cluster's actual state, continuously reconciled to match git

Instead of CI pushing changes into the cluster, a controller running inside the cluster (or with cluster access) continuously watches a designated git repository and applies whatever it finds there — and, crucially, keeps re-applying it on an ongoing basis, not just once at deploy time. This means the GitOps controller will detect and automatically correct any drift — if someone manually edits a Deployment directly with kubectl, the next reconciliation cycle reverts it back to match git, since git (not whatever's currently running) is the authoritative source of truth.
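The reconciliation step at the heart of that loop can be sketched in a few lines. This is a conceptual model only, not ArgoCD's or Flux's actual implementation: objects are plain dictionaries keyed by name, and the "prune" action assumes pruning is enabled, which in real controllers is a configuration choice.

```python
def reconcile(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Compare desired state (from git) with actual state (from the cluster)
    and return the actions needed to make actual match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift: revert to what git says
    for name in actual:
        if name not in desired:
            actions.append(("prune", name))   # exists in cluster, not in git
    return actions


git_state = {"web": {"replicas": 3}}
cluster_state = {"web": {"replicas": 5}, "debug": {"replicas": 1}}
print(reconcile(git_state, cluster_state))
```

A controller runs this comparison repeatedly, so a manual change such as scaling web to 5 replicas shows up as an update on the next cycle and is reverted.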

Why this maps naturally onto Kubernetes's own reconciliation model

This is precisely the same control-loop pattern that runs through the rest of Kubernetes (see the fundamentals topic's reconciliation question) — ArgoCD/Flux are, at their core, another reconciliation controller, just operating one level up: instead of reconciling "desired replica count" against "actual running Pods" the way a Deployment controller does, they reconcile "desired cluster state, as declared in git" against "actual cluster state," continuously.

Security and operational benefits

No CI system needs direct cluster write credentials — the GitOps controller, running inside the cluster, pulls changes rather than the CI pipeline pushing them in, meaningfully shrinking the set of external systems with standing write access to production.
Git history is the audit trail and rollback mechanism — reverting a bad change is a normal git revert, and the GitOps controller picks up and applies the reverted state automatically, without needing separate rollback tooling.
Drift detection and correction — manual, undocumented changes made directly against the cluster are automatically reverted (or at least flagged, depending on configuration) rather than silently persisting and diverging from what's documented in git.
A single, reviewable place to see a cluster's entire intended configuration — anyone can read the git repository to understand exactly what should be running, without needing direct cluster access at all.

ArgoCD vs. Flux, briefly

Both accomplish the same core GitOps loop; ArgoCD is commonly noted for its rich built-in web UI (visualizing application sync status, diffs, and health directly), while Flux is often used more as a set of composable, script/CLI-friendly controllers, frequently favored in setups leaning more heavily on command-line/automation-first workflows. Both are CNCF projects with broad adoption, and the choice between them is often driven more by team preference and ecosystem fit than a sharp functional gap.

Explaining GitOps as "continuous reconciliation against git as the source of truth," and explicitly connecting it back to the same control-loop pattern that underlies the rest of Kubernetes, demonstrates a deeper conceptual grasp than simply describing it as "using git for deployments."

Related Resources

ArgoCD Documentation

Blue-green — two full environments, instant cutover

# "Blue" (current, live) Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
      version: blue
  template:
    metadata:
      labels:
        app: web
        version: blue
    spec:
      containers:
        - name: web          # a container name is required; "web" is illustrative
          image: myapp:1.0

---
# "Green" (new version), deployed ALONGSIDE blue, not yet receiving traffic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
      version: green
  template:
    metadata:
      labels:
        app: web
        version: green
    spec:
      containers:
        - name: web          # a container name is required; "web" is illustrative
          image: myapp:2.0

---
# Service currently points at "blue" -- cutover happens by changing this selector
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
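  selector:
    app: web
    version: blue    # <-- change this to "green" to cut over ALL traffic at once
  ports:
    - port: 80         # illustrative; a Service needs at least one port
      targetPort: 8080 # illustrative container port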

Both full versions run simultaneously (doubling resource usage temporarily), and the switch is a single, near-instant change to the Service's selector — every request after the switch goes to the new version, and rollback is equally instant (just switch the selector back). There's no gradual traffic-splitting phase — it's fully off or fully on for each version.
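The cutover itself is a one-field change, which can be scripted as a merge patch. The sketch below only builds the patch and the kubectl command line; the helper names are hypothetical, and the Service name and label values follow the example manifests above.

```python
import json


def selector_patch(version: str) -> str:
    """JSON merge patch that repoints the Service at the given version label.
    Other selector keys (such as app: web) are left untouched by a merge patch."""
    return json.dumps({"spec": {"selector": {"version": version}}})


def cutover_command(service: str, version: str) -> list[str]:
    """argv for the kubectl call; returned rather than executed."""
    return ["kubectl", "patch", "service", service,
            "--type", "merge", "-p", selector_patch(version)]


print(" ".join(cutover_command("web", "green")))
```

Rolling back is the same command with "blue" as the version, which is why blue-green rollback is as fast as the cutover.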

Canary — gradual, partial traffic shift

# 90% of traffic to the stable version, 10% to the canary --
# this typically requires something beyond a plain Kubernetes Service,
# since Services load-balance evenly across all matching endpoints,
# not by a configurable percentage split.

Plain Kubernetes Services have no native concept of "send 10% of traffic here, 90% there" — achieving genuine percentage-based traffic splitting requires additional tooling:

Ingress controllers with canary annotations (e.g., NGINX Ingress's nginx.ingress.kubernetes.io/canary-weight annotation) — a simpler, Ingress-layer approach.
Service mesh traffic splitting (Istio's VirtualService weighted routing, Linkerd's traffic split) — more powerful, supporting fine-grained routing rules (by header, by percentage, gradually shifting over time) at the service-mesh layer.
Progressive delivery controllers (Argo Rollouts, Flagger) — purpose-built tools that automate the entire canary process: gradually shifting traffic percentages on a schedule, automatically querying metrics (error rate, latency) at each step, and automatically rolling back if the canary's metrics look worse than the stable version's — without needing to manually watch and adjust percentages by hand.

# Simplified Argo Rollouts canary strategy concept
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10      # send 10% of traffic to the new version
        - pause: {duration: 10m}   # wait, observe metrics
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100     # fully cut over
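What a progressive delivery controller automates on top of those steps is the decision at each pause: continue or roll back. A simplified model of that decision loop follows. It is not Argo Rollouts' or Flagger's API; the error-rate callback and the 1% threshold are assumptions made for illustration.

```python
def run_canary(weights, error_rate_at, max_error_rate=0.01):
    """Step through traffic weights; roll back to 0% as soon as the observed
    canary error rate exceeds max_error_rate.

    error_rate_at is a callable taking the current weight and returning the
    canary's error rate as a fraction (a stand-in for a metrics query)."""
    history = []
    for weight in weights:
        rate = error_rate_at(weight)
        history.append((weight, rate))
        if rate > max_error_rate:
            return {"status": "rolled_back", "final_weight": 0, "history": history}
    final = weights[-1] if weights else 0
    return {"status": "promoted", "final_weight": final, "history": history}


# A canary that starts failing once it receives half of the traffic.
result = run_canary([10, 50, 100], lambda w: 0.05 if w >= 50 else 0.001)
print(result["status"], result["history"])
```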

Comparing the two

	Blue-green	Canary
Resource cost during rollout	Double (both full versions running)	Only slightly more (canary is usually a small extra replica count)
Rollback speed	Instant (flip the selector back)	Also fast, but may involve gradually shifting traffic back
Real production traffic validation before full commit	No — it's all-or-nothing at cutover	Yes — the whole point is validating under a real (if limited) slice of production traffic
Native Kubernetes support	Achievable with plain Services + relabeling	Requires additional tooling (Ingress annotations, service mesh, or Argo Rollouts/Flagger)
Best for	Simpler risk profiles, or when partial-traffic validation isn't feasible/needed	Higher-risk changes where gradually validating real user impact before full rollout matters

Why rolling updates alone (the Deployment default) aren't the same as either

A standard rolling update (see the workload controllers topic) gradually replaces old Pods with new ones, but every Pod — old or new — receives an equal, undifferentiated share of traffic throughout the process; there's no deliberate "send only a controlled fraction of traffic to the new version and watch its metrics before proceeding" logic the way a true canary strategy has, and no "two full parallel environments, instant atomic switch" behavior the way blue-green has. Rolling updates are a reasonable default for many workloads, but blue-green and canary are deliberate strategies reached for when the additional safety/control they provide is specifically worth the added complexity.
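The lack of control is easy to see with a little arithmetic: under even load balancing, the new version's share of traffic is simply its share of ready Pods. A small sketch, assuming every ready Pod receives an equal share of requests:

```python
from fractions import Fraction


def new_version_share(old_ready: int, new_ready: int) -> Fraction:
    """Fraction of traffic reaching new Pods when a Service balances evenly
    across all ready Pods, old and new alike."""
    total = old_ready + new_ready
    if total == 0:
        raise ValueError("no ready Pods behind the Service")
    return Fraction(new_ready, total)


# One new Pod ready out of five: a fifth of traffic, whether you wanted that or not.
print(new_version_share(old_ready=4, new_ready=1))
```

The only way to change that share during a rolling update is to change Pod counts, which is exactly the limitation canary tooling removes.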

Related Resources

Kubernetes: Blue/Green and Canary Deployments

Why multi-tenancy is a spectrum, not a single feature

"Multiple teams share a cluster" can mean anything from "trusted internal teams, mostly worried about accidentally stepping on each other's resources" to "mutually untrusted third-party customers running arbitrary code, where a security breach of one tenant must never affect another" — Kubernetes provides building blocks at several different strength levels, and choosing the right combination depends entirely on how much you actually trust the tenants sharing the cluster.

Layer 1: Namespaces — logical separation

Covered in the fundamentals topic — namespaces scope object names and are the boundary RBAC and quotas attach to. This alone provides organizational separation (Team A's web Deployment doesn't collide with Team B's web Deployment) but, critically, no network or security isolation by default — Pods in different namespaces can freely reach each other unless NetworkPolicies say otherwise.

Layer 2: ResourceQuotas and LimitRanges — preventing resource monopolization

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    pods: "50"

A ResourceQuota caps the total resources (or object counts) a namespace can consume — without this, one tenant's namespace could, in principle, consume so much of the shared cluster's capacity that other tenants' workloads can't be scheduled at all (a "noisy neighbor" problem). A LimitRange complements this by setting default and min/max per-container resource requests/limits within a namespace, so individual Pods can't be created with unreasonably large (or entirely unset) resource requests even before the aggregate quota is hit.

Layer 3: NetworkPolicies — restricting cross-tenant traffic

As covered in the networking topic, a default-deny NetworkPolicy posture per tenant namespace, with explicit allow rules only for legitimate cross-namespace traffic, prevents one tenant's compromised or misbehaving workload from freely reaching another tenant's Pods over the network — closing the gap that plain namespace separation alone leaves wide open.

Layer 4: RBAC — restricting cross-tenant API access

Namespace-scoped Roles and RoleBindings (see the security topic) ensure Team A's credentials/ServiceAccounts have no permission to read or modify Team B's objects at the API level, even though both share the same cluster and API server.

Layer 5: stronger isolation for genuinely untrusted tenants

For scenarios where tenants are truly mutually distrustful (a SaaS platform running arbitrary customer workloads, or strict regulatory/compliance separation requirements), namespace-level isolation alone is generally considered insufficient — the Linux kernel underlying every container on a node is still shared, and container escape vulnerabilities, while not routine, do exist. Stronger options include:

Dedicated node pools per tenant (via taints/tolerations and node affinity, see that question), ensuring no two tenants' Pods ever share the same node, and therefore never share a kernel.
Sandboxed container runtimes (gVisor, Kata Containers) that provide a stronger isolation boundary than a standard container runtime, at some performance cost.
Separate clusters per tenant — the strongest isolation available, at the cost of significantly higher operational overhead (managing many clusters instead of one) and losing the resource-pooling efficiency of a shared cluster.

Match the isolation strategy to the actual trust level between tenants: internal teams within one organization, generally trusting each other but wanting organizational and resource separation, are usually well served by namespaces + quotas + RBAC + NetworkPolicies on a shared cluster. Genuinely untrusted external tenants, or workloads with strict compliance/regulatory separation requirements, usually warrant node-level or cluster-level isolation instead — treating namespace-only separation as sufficient for that case is a common, real security misjudgment.

Related Resources

Kubernetes: Multi-tenancy

ResourceQuota — a namespace-wide ceiling

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
    persistentvolumeclaims: "10"

This caps the sum total across every Pod in the team-a namespace: combined, they can request at most 20 CPU cores and 40Gi memory, have limits summing to at most 40 CPU cores and 80Gi memory, and the namespace can contain at most 50 Pods and 10 PVCs total. Once a quota covers a compute resource in a namespace, every Pod created there must carry a request/limit for that resource, whether set explicitly or filled in by a LimitRange default; a Pod with nothing set would have no defined consumption to check against the quota, so Kubernetes rejects it.
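The admission decision behind a quota is a running-total comparison. The sketch below models it for the request-side keys only; it is an illustration rather than the API server's logic, the function names are hypothetical, and the quantity parsers handle just the forms used in this example (millicores, whole or decimal cores, and Ki/Mi/Gi/Ti suffixes).

```python
_MEM_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_millis(quantity: str) -> int:
    """'500m' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def mem_bytes(quantity: str) -> int:
    """'256Mi' -> 268435456; a bare number is taken as bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)


def exceeded_quota_keys(hard, used, pod_requests):
    """Quota keys that admitting this Pod would push past the hard cap.
    An empty list means the Pod fits."""
    parsers = {"requests.cpu": cpu_millis, "requests.memory": mem_bytes, "pods": int}
    adding = {"requests.cpu": pod_requests["cpu"],
              "requests.memory": pod_requests["memory"],
              "pods": "1"}
    exceeded = []
    for key, parse in parsers.items():
        if key in hard and parse(used.get(key, "0")) + parse(adding[key]) > parse(hard[key]):
            exceeded.append(key)
    return exceeded


hard = {"requests.cpu": "20", "requests.memory": "40Gi", "pods": "50"}
used = {"requests.cpu": "19500m", "requests.memory": "10Gi", "pods": "12"}
print(exceeded_quota_keys(hard, used, {"cpu": "1", "memory": "256Mi"}))
```

In this example the namespace has only 500m of CPU requests left, so a Pod asking for a full core is refused even though memory and Pod count are well within bounds.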

LimitRange — per-container defaults and bounds

apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"          # applied if a container doesn't specify a limit
        memory: "512Mi"
      defaultRequest:
        cpu: "250m"           # applied if a container doesn't specify a request
        memory: "256Mi"
      min:
        cpu: "100m"            # reject any container whose CPU request is below this
      max:
        cpu: "2"                # reject any container whose CPU limit is above this

A LimitRange operates at the individual container level, within a namespace. It fills in default requests/limits for any container that doesn't specify its own (so a developer who forgets to set them doesn't end up with a container that has no resource bounds and that the scheduler cannot place predictably; see the requests/limits question on why this matters), and enforces min/max bounds so no single container can be created with an unreasonably tiny or unreasonably huge resource request, independent of what the namespace's aggregate ResourceQuota allows overall.
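Defaulting and validation interact in ways that are easier to see in code. The sketch below models CPU only, using the values from the LimitRange above; it is a simplified illustration of the admission behavior, with hypothetical function names, not the actual admission plugin.

```python
LIMIT_RANGE = {"default": "500m", "defaultRequest": "250m", "min": "100m", "max": "2"}


def cpu_millis(quantity: str) -> int:
    """'500m' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def apply_limit_range(cpu: dict, lr: dict = LIMIT_RANGE) -> dict:
    """Fill in a missing CPU request/limit, then enforce the bounds.
    Raises ValueError where the container would be rejected."""
    limit = cpu.get("limit")
    request = cpu.get("request")
    if request is None:
        # An explicit limit with no request: the request follows the limit.
        request = limit if limit is not None else lr["defaultRequest"]
    if limit is None:
        limit = lr["default"]
    if cpu_millis(request) < cpu_millis(lr["min"]):
        raise ValueError(f"request {request} is below min {lr['min']}")
    if cpu_millis(limit) > cpu_millis(lr["max"]):
        raise ValueError(f"limit {limit} is above max {lr['max']}")
    if cpu_millis(request) > cpu_millis(limit):
        raise ValueError(f"request {request} exceeds limit {limit}")
    return {"request": request, "limit": limit}


print(apply_limit_range({}))               # both values come from the defaults
print(apply_limit_range({"limit": "1"}))   # request follows the explicit limit
```

One consequence worth knowing: a container that sets only a request of 700m receives the default limit of 500m and is then rejected, because its request would exceed its limit.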

How the three levels relate

Pod's own spec.resources.requests/limits   <- what an individual container declares (or, if
                                                omitted, what LimitRange's default fills in)
        ↓ constrained by
LimitRange (per-namespace)                  <- bounds and defaults for each individual container
        ↓ their sum constrained by
ResourceQuota (per-namespace)               <- an aggregate ceiling across the WHOLE namespace

A LimitRange operates on each container individually ("no single container may request more than 2 CPU"); a ResourceQuota operates on the namespace in aggregate ("all containers in this namespace, combined, may request at most 20 CPU total"). Both exist specifically because pod-level requests/limits alone — while essential for scheduling and per-container runtime enforcement (see that question) — provide no mechanism on their own to prevent either a single misconfigured container from requesting an unreasonable amount, or many individually-reasonable Pods from collectively consuming an entire shared cluster's capacity.

Why both matter for multi-tenancy specifically

Without a ResourceQuota, one tenant namespace could accumulate enough Pods (each individually reasonable) to starve every other tenant sharing the cluster of capacity — a "noisy neighbor" problem at the aggregate level. Without a LimitRange, a single developer forgetting to set requests/limits (or setting them absurdly high "just in case") in one Pod spec could itself cause scheduling problems or resource starvation, even within an otherwise well-quota'd namespace. Both mechanisms are typically deployed together as part of a real multi-tenancy strategy (see that question) — quotas capping the aggregate, LimitRanges keeping individual containers within sane, well-defaulted bounds.

Related Resources

Kubernetes: Resource Quotas

Step 1: cordon — stop new Pods from landing here

kubectl cordon node-1

Marks the node as SchedulingDisabled — the scheduler will no longer consider it a candidate for new Pods, but existing Pods already running on it are completely unaffected by cordon alone. This is a good first step even before you're ready to actually drain, since it prevents the situation from getting worse (more Pods landing on a node you're about to take offline) while you prepare.

Step 2: drain — evict existing Pods safely

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

drain uses the Kubernetes Eviction API to gracefully remove the node's Pods (DaemonSet-managed Pods are skipped when --ignore-daemonsets is set, and static Pods are left alone). Eviction respects each Pod's terminationGracePeriodSeconds (allowing a clean shutdown) and, importantly, respects PodDisruptionBudgets (see that question): drain refuses to evict a Pod if doing so would violate its PDB, rather than forcing it through regardless.

Two flags are commonly required to avoid the drain command refusing to proceed:

--ignore-daemonsets — DaemonSet-managed Pods (see the workload controllers topic) are, by design, tied to running on every node, so they can't be meaningfully "evicted and rescheduled elsewhere" the way a Deployment's Pod can; without this flag, drain will refuse to proceed past a DaemonSet Pod.
--delete-emptydir-data — Pods using emptyDir volumes (see the storage topic) will lose that data when evicted (since emptyDir is node-local and ephemeral by design); without explicitly acknowledging this with the flag, drain refuses to proceed, forcing you to consciously accept the (usually expected and fine) data loss rather than it happening silently.

If a Pod's PodDisruptionBudget would be violated by evicting it, drain will wait and retry, rather than proceeding — this can cause a drain to appear "stuck," which is often a signal that a PDB is configured too strictly for the current situation (e.g., minAvailable equal to total replica count, blocking any eviction at all — see that question's common misconfiguration).
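The arithmetic behind a "stuck" drain is short. A minimal sketch for a PDB expressed as an absolute minAvailable count (percentage values and maxUnavailable are left out, and the function names are hypothetical):

```python
def disruptions_allowed(healthy: int, min_available: int) -> int:
    """How many Pods may be evicted right now without violating the PDB."""
    return max(0, healthy - min_available)


def can_evict(healthy: int, min_available: int) -> bool:
    return disruptions_allowed(healthy, min_available) > 0


print(can_evict(healthy=5, min_available=4))  # True: one Pod may go
print(can_evict(healthy=3, min_available=3))  # False: drain will wait indefinitely
```

With minAvailable equal to the replica count the result is always zero, so no voluntary eviction can ever succeed until the PDB or the replica count is changed.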

Step 3: perform the actual maintenance

With the node cordoned and drained (no Pods running on it, no new ones landing there), it's now safe to perform whatever maintenance is needed — an OS patch, a kubelet upgrade, hardware maintenance, or simply decommissioning the node entirely.

Step 4: uncordon — return it to service

kubectl uncordon node-1

Marks the node schedulable again — the scheduler will now consider it a normal candidate for new Pods. If the node is being decommissioned entirely rather than returning to service, you'd instead remove it from the cluster (kubectl delete node node-1, alongside actually decommissioning the underlying machine/VM) rather than uncordoning it.

What makes this "safe" in the first place

The entire process depends on the workloads being drained tolerating the loss of one node's worth of capacity: sufficient replica counts spread appropriately across nodes (ideally reinforced with pod anti-affinity; see that question), and PodDisruptionBudgets configured to prevent too many replicas being evicted simultaneously. Draining a node running the only replica of a critical, non-redundant application will cause a real outage for that application regardless of how carefully you follow the cordon/drain/uncordon sequence. The safety of node maintenance is ultimately a property of how the workloads themselves are architected, not just of the drain command's mechanics.
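
As an illustration of the PodDisruptionBudget half of that picture, the sketch below applies a budget that lets drain evict at most one replica at a time. The name web-pdb and the label app: web are assumptions for the example, and the budget only helps if the matching workload runs more than one replica.

```shell
# Apply a PodDisruptionBudget allowing at most one matching Pod to be
# unavailable due to voluntary disruptions such as a drain.
# web-pdb and the app=web label are hypothetical names for this sketch.
apply_web_pdb() {
  kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: web
EOF
}

# Usage: apply_web_pdb
```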

Automation at scale

For clusters with many nodes needing regular maintenance/upgrades (see the cluster-upgrade question), this cordon/drain/uncordon sequence is typically automated rather than run manually node-by-node — cloud-managed node pool upgrade features, or cluster-lifecycle tools, commonly implement exactly this sequence with configurable batching and safety checks built in.
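
The core of such automation can be sketched as a loop that handles one node at a time and halts on the first failure. This is an illustrative outline rather than how any particular managed service implements it: do_maintenance is a hypothetical placeholder, and the 10-minute drain timeout is an arbitrary assumption.

```shell
# Hypothetical maintenance step; replace with the real patch or upgrade command.
do_maintenance() { echo "maintaining $1"; }

# Cordon, drain, maintain, and uncordon a single node. Each step runs only if
# the previous one succeeded, so a failed or timed-out drain never leads to
# maintenance on a node that is still running Pods.
maintain_node() {
  node=$1
  kubectl cordon "$node" &&
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m &&
    do_maintenance "$node" &&
    kubectl uncordon "$node"
}

# Process the given nodes strictly one at a time, stopping at the first error.
maintain_nodes() {
  for n in "$@"; do
    maintain_node "$n" || { echo "stopped at $n" >&2; return 1; }
  done
}

# Usage: maintain_nodes node-1 node-2 node-3
```

Stopping at the first failure leaves the failing node cordoned for inspection instead of continuing to remove capacity from a cluster that is already struggling to reschedule Pods.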

Related Resources

Kubernetes: Safely Drain a Node

Right-size resource requests (the highest-leverage, most commonly neglected lever)

Since the scheduler reserves capacity based on requests, not actual usage (see that question), a Pod requesting far more CPU/memory than it actually uses effectively reserves — and you pay for — capacity that then sits idle. This is extremely common in practice: developers often set generous, "safe-feeling" round-number requests once, early on, and never revisit them as real usage data becomes available.

A Pod requesting 2 CPU / 4Gi, but actually averaging 0.3 CPU / 800Mi usage,
wastes the difference across every node it's scheduled on, at whatever
scale that Pod is replicated to.
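
To put numbers on that example, the idle capacity is simply (requested minus used) multiplied by the replica count. The sketch below assumes 10 replicas, which is an illustrative figure not taken from the example above, and treats 4Gi as 4096Mi.

```shell
# Idle capacity = (requested - used) * replicas.
# Arguments: requested, used, replicas (requested and used in the same unit).
idle_capacity() {
  awk -v req="$1" -v used="$2" -v n="$3" 'BEGIN { printf "%.1f\n", (req - used) * n }'
}

idle_capacity 2 0.3 10      # CPU cores reserved but idle: prints 17.0
idle_capacity 4096 800 10   # MiB of memory reserved but idle: prints 32960.0
```

At that assumed scale, roughly 17 cores and 32 GiB are reserved by the scheduler without doing any work.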

A Vertical Pod Autoscaler run in recommendation-only mode (see the scheduling topic) is a good, low-risk way to surface exactly how over- or under-provisioned real workloads actually are, based on measured historical usage, rather than guessing.
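
A recommendation-only VPA might look like the following sketch. It assumes the Vertical Pod Autoscaler components and CRDs are installed in the cluster (they are not part of core Kubernetes), and the names web-vpa and web are hypothetical.

```shell
# Create a VPA that only computes recommendations for the "web" Deployment.
# updateMode "Off" means it never evicts or resizes Pods itself.
# web-vpa and web are hypothetical names chosen for this sketch.
apply_recommendation_vpa() {
  kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  updatePolicy:
    updateMode: "Off"
EOF
}

# Usage: apply_recommendation_vpa
```

Once the recommender has observed some usage history, kubectl describe vpa web-vpa shows the suggested requests, which can then be compared against what the Deployment currently asks for.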

Scale nodes to match actual demand, not a fixed peak

The Cluster Autoscaler (see that question) removes nodes during lower-demand periods (nights, weekends, off-peak hours for a workload with a predictable daily/weekly pattern) instead of permanently running enough nodes to handle peak load. A cluster sized purely for its worst-case peak, all day every day, pays for capacity that sits largely unused outside peak hours.

Use cheaper compute where the workload tolerates it

Cloud providers offer spot/preemptible instances at a significant discount (often 60-90% cheaper) compared to on-demand pricing, in exchange for the provider being able to reclaim that capacity with limited notice. This is an excellent fit for genuinely fault-tolerant, interruptible workloads (batch jobs, CI runners, stateless workers that can simply retry/reschedule elsewhere) — but a poor fit for workloads that can't tolerate abrupt termination (a database's only replica, a long-running job with no checkpointing). Many clusters run a mixed node pool strategy — a smaller baseline of stable on-demand nodes for critical/stateful workloads, plus a larger, more elastic spot-instance pool for tolerant workloads.

Bin-packing and consolidation

Spreading Pods thinly across many partially-utilized nodes (rather than densely packing them onto fewer, more fully-utilized nodes) means paying for more total node capacity than is actually being used. Some cluster autoscaling tools actively consolidate workloads onto fewer nodes where safely possible. The scheduler can help too, but only if configured for it: its default resource scoring favors less-allocated nodes, and bin-packing requires switching to a scoring strategy that favors fuller nodes (see that question). Periodically identifying and removing (or resizing) chronically underutilized nodes is a standard cost-review practice.

Make cost visible and attributable

Without per-namespace or per-team cost visibility, cluster spend is often just one large, opaque line item with no clear ownership or accountability for waste. Tools like Kubecost or OpenCost attribute actual cloud spend down to individual namespaces, workloads, or teams (based on their actual resource requests/usage), making it possible to have a real conversation about which specific team's over-provisioned Pods are driving cost. In practice, cost visibility is often what motivates the right-sizing work described above; right-sizing rarely happens proactively without a concrete cost signal driving it.

Turning off genuinely unnecessary workloads

Non-production environments (dev, staging, ephemeral preview/PR environments) are a common, often-overlooked source of ongoing cost if they're left running 24/7 despite only being needed during working hours or actively-in-use periods — scheduled scale-down (or full teardown) of non-production environments outside their actual usage window is a simple, high-leverage cost lever many teams don't bother implementing.
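
The simplest form of scheduled scale-down is a command run from a scheduler such as cron or a CI job. The sketch below is one minimal way to do it; the namespace name staging and the helper scale_namespace are assumptions for illustration.

```shell
# Scale every Deployment in a namespace to the given replica count.
# Arguments: namespace, replicas. scale_namespace is a hypothetical helper.
scale_namespace() {
  kubectl scale deployment --all --replicas="$2" --namespace "$1"
}

# Usage (for example, triggered in the evening): scale_namespace staging 0
```

This sketch does not remember the original replica counts, so scaling back up needs them from elsewhere, for example by re-applying the manifests from the repository. It also covers only Deployments; StatefulSets and workloads managed by an autoscaler would need separate handling.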

A strong answer recognizes that cost optimization in Kubernetes is mostly about making real resource usage match what's actually requested/provisioned — the scheduler and autoscalers can only work efficiently with accurate signals, so the underlying discipline (measuring, right-sizing, making cost visible) matters more than any single specific tool or setting.

Related Resources

Kubernetes: Cluster Autoscaler