How do you back up and restore etcd, and why does this matter?

Detailed Answer

This builds on the core point of the earlier etcd fundamentals question: etcd is the only genuinely stateful control-plane component, and losing it without a backup is catastrophic.

Taking a snapshot

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d%H%M).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

This produces a single, consistent point-in-time snapshot of etcd's entire key space: the state of every Kubernetes object at that moment. Take snapshots on a regular schedule (commonly via a CronJob or an external scheduler) and, critically, copy them to storage separate from the etcd nodes (object storage or a dedicated backup system). Otherwise a failure that takes out the etcd nodes, such as a disk failure or a destroyed node or VM, also takes out the backup sitting next to them.
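A scheduled backup usually wraps the snapshot command in a small script that also prunes old local copies. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function names, the /backup directory, and the retention count are illustrative, and the off-host copy is left as a placeholder because the right tool depends on your storage.

```shell
#!/bin/sh
# Sketch of a scheduled etcd backup: snapshot, then prune old local copies.
# Function names, the backup directory, and the retention count are assumptions.
set -eu

# Build a timestamped snapshot path in the same style as the command above.
snapshot_path() {
  printf '%s/etcd-snapshot-%s.db\n' "$1" "$(date +%Y%m%d%H%M)"
}

# Keep only the newest $2 snapshots in directory $1.
# The timestamp in the name sorts lexically, so reverse order is newest first.
prune_snapshots() {
  prune_dir=$1
  prune_keep=$2
  ls -1 "$prune_dir" | grep '^etcd-snapshot-.*\.db$' | sort -r |
    tail -n +"$((prune_keep + 1))" |
    while IFS= read -r prune_name; do
      rm -f -- "$prune_dir/$prune_name"
    done
}

# Take one snapshot into directory $1, then keep the newest $2 local copies.
take_snapshot() {
  snap_target=$(snapshot_path "$1")
  ETCDCTL_API=3 etcdctl snapshot save "$snap_target" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
  # Copy "$snap_target" to off-host storage here, before pruning.
  prune_snapshots "$1" "$2"
}
```

A cron entry or CronJob would then call something like take_snapshot /backup 7. Local pruning is only about disk space on the etcd node; retention of the off-host copies is a separate policy.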

Restoring from a snapshot

ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-202401150300.db \
  --data-dir=/var/lib/etcd-restored

Restoring is not simply "load the file back into a running etcd". It typically involves stopping the etcd process, restoring the snapshot into a fresh data directory, and reconfiguring the control plane to use that directory. The details depend on whether you are restoring a single-node etcd or rebuilding the membership of a multi-node cluster; in the multi-node case, every member is restored from the same snapshot. (On etcd 3.5 and later, etcdctl snapshot restore is deprecated in favor of the equivalent etcdutl snapshot restore.) This is one of the more operationally delicate procedures in running Kubernetes, which is precisely why testing it matters so much.
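To make the ordering of those steps concrete, here is a sketch that only prints a restore plan for a single-node, kubeadm-style control plane; it does not execute anything. It assumes etcd and the API server run as static pods with manifests under /etc/kubernetes/manifests, and the function name and the manifests-parked directory are hypothetical. Adapt the steps to your own layout before running any of them.

```shell
#!/bin/sh
# Prints (does not run) a restore plan for a single-node kubeadm-style setup.
# Assumptions: static pod manifests live in /etc/kubernetes/manifests, and
# /etc/kubernetes/manifests-parked is a hypothetical holding directory.
set -eu

print_restore_plan() {
  plan_snapshot=$1
  plan_data_dir=$2
  cat <<EOF
# 1. Stop etcd and the API server: the kubelet stops static pods whose manifests disappear.
mkdir -p /etc/kubernetes/manifests-parked
mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests-parked/
# 2. Restore the snapshot into a fresh, empty data directory.
ETCDCTL_API=3 etcdctl snapshot restore $plan_snapshot --data-dir=$plan_data_dir
# 3. Edit the parked etcd.yaml so the etcd-data hostPath points at $plan_data_dir.
# 4. Move both manifests back; the kubelet then restarts etcd and the API server.
mv /etc/kubernetes/manifests-parked/etcd.yaml /etc/kubernetes/manifests-parked/kube-apiserver.yaml /etc/kubernetes/manifests/
EOF
}
```

For example, print_restore_plan /backup/etcd-snapshot-202401150300.db /var/lib/etcd-restored emits the checklist with those paths filled in. Printing rather than executing is deliberate: step 3 is a manual edit, and a restore is worth walking through one step at a time.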

Why testing the restore procedure is non-negotiable

An untested backup is not, for practical purposes, a real backup. Corruption, a backup script that fails silently or writes incomplete files, and unfamiliarity with the restore steps under incident pressure are all common failure modes, and they only surface when you actually attempt a restore. Regularly scheduled restore drills, in which you restore a snapshot into an isolated test environment and confirm the resulting cluster state is correct and complete, are the only way to know the backup strategy works when it is needed rather than merely assumed to.
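A cheap first gate before any drill is confirming that the snapshot file is non-empty and unchanged since it was written. The sketch below assumes a checksum file is recorded next to each snapshot at backup time and that GNU sha256sum is available; the function names and the .sha256 naming are illustrative. It catches truncated or corrupted files only and is no substitute for an actual restore.

```shell
#!/bin/sh
# Integrity pre-check for a snapshot file. Assumes GNU sha256sum and a
# "<snapshot>.sha256" file written at backup time (naming is an assumption).
set -eu

# Record the snapshot's SHA-256 next to it; run this right after the backup.
record_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeed only if the snapshot is non-empty and matches its recorded checksum.
verify_snapshot() {
  verify_file=$1
  [ -s "$verify_file" ] || { echo "empty or missing: $verify_file" >&2; return 1; }
  ( cd "$(dirname "$verify_file")" &&
    sha256sum -c "$(basename "$verify_file").sha256" >/dev/null 2>&1 ) ||
    { echo "checksum mismatch: $verify_file" >&2; return 1; }
  echo "ok: $verify_file"
}
```

A drill would run verify_snapshot on the off-host copy, then restore it into the isolated environment and compare the resulting objects against what is expected.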

What losing etcd without a backup actually means

Every other control-plane component can be restarted or rebuilt and will resume operating once it can talk to etcd again. If etcd's data itself is gone, however, there is no other copy of the cluster's state anywhere inside the cluster. Recovery then depends entirely on whatever configuration exists outside the cluster: version-controlled YAML manifests, Helm chart values, or a GitOps repository (see the GitOps question) from which a tool like Argo CD can reconstruct the cluster's desired state from scratch. This is a strong practical argument for GitOps and infrastructure-as-code more broadly. A cluster whose full desired state is captured in Git can be substantially rebuilt even after a total etcd loss, while a cluster whose configuration only ever existed as ad-hoc kubectl commands run by hand has no such recovery path.

Managed vs. self-hosted responsibility

Managed Kubernetes services (EKS, GKE, AKS) handle etcd backup and control-plane resilience as part of the managed offering. Teams that self-host a cluster via kubeadm or similar take on this significant operational burden themselves, which is worth weighing explicitly as part of that decision (see the managed-vs-self-hosted question).

Being able to explain not just how to run the snapshot and restore commands but why off-cluster storage and regular restore testing matter, and to connect this to GitOps as a complementary recovery strategy, demonstrates operational maturity beyond memorized etcdctl syntax.