Kubernetes Fundamentals and Architecture

The problem before orchestration

Running a single container on a single machine is easy — docker run and you're done. The problem appears at scale: an application with a dozen services, each needing multiple replicas for availability, spread across many machines, needing to survive machine failures, needing to find and talk to each other, and needing to be updated without downtime. Doing this by hand — SSHing into machines, manually restarting crashed containers, manually editing load balancer configs when an instance moves — doesn't scale past a handful of containers, and is fragile and slow even then.

What Kubernetes actually automates

  • Scheduling: deciding which machine (node) in the cluster should run each container, based on available resources and constraints.
  • Self-healing: if a container crashes, it is restarted in place; if a node dies, Kubernetes notices and starts replacement containers on other nodes, without a human intervening.
  • Service discovery and load balancing: containers get a stable way to find and talk to each other, even as individual instances are created and destroyed and move between nodes.
  • Rolling updates and rollbacks: deploying a new version of an application gradually, replacing old instances with new ones and stalling the rollout if the new instances never become ready. Reverting to a previous revision is a single command (kubectl rollout undo); a Deployment does not roll itself back automatically.
  • Scaling: increasing or decreasing the number of running instances of an application, manually or automatically based on load.
  • Configuration and secret management: injecting configuration and sensitive values into applications without baking them into container images.

The core idea: declarative desired state

Rather than issuing imperative commands ("start this container on that machine"), you describe the desired state of your system in configuration ("I want 3 replicas of this application running") and Kubernetes continuously works to make the actual state match it — this reconciliation-loop model (covered in depth in a later question) is the central idea that everything else in Kubernetes builds on.

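apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3          # desired state: always keep 3 running
  selector:
    matchLabels:
      app: web         # required in apps/v1; must match the Pod template's labels
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.4.0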

If a node hosting one of these pods dies, Kubernetes notices the actual count has dropped below 3 and schedules a replacement — without anyone needing to notice the failure or take manual action.
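
The same behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Kubernetes is implemented: `ClusterState`, `reconcile`, and the `web-N` Pod names are hypothetical, and the "cluster" is just an in-memory list.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Toy stand-in for the observed Pods of one Deployment (hypothetical)."""
    running_pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile(desired_replicas, state):
    """Compare desired vs. actual and return the actions taken to close the gap."""
    actions = []
    # Too few Pods: create replacements until the count matches.
    while len(state.running_pods) < desired_replicas:
        name = f"web-{state.next_id}"
        state.next_id += 1
        state.running_pods.append(name)
        actions.append(f"create {name}")
    # Too many Pods: remove the surplus.
    while len(state.running_pods) > desired_replicas:
        name = state.running_pods.pop()
        actions.append(f"delete {name}")
    return actions


state = ClusterState()
print(reconcile(3, state))          # ['create web-0', 'create web-1', 'create web-2']
state.running_pods.remove("web-1")  # simulate a node failure taking out one Pod
print(reconcile(3, state))          # ['create web-3']
print(reconcile(3, state))          # [] (already converged, nothing to do)
```

The point of the sketch is that nobody tells the loop "a Pod died"; it only ever compares the two counts, so any cause of drift is repaired the same way.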

Why this matters for interviews

A strong answer doesn't just define Kubernetes as "a container orchestrator" — it connects that definition to the concrete operational pain (manual scheduling, no self-healing, fragile networking, risky deploys) that existed before it, and names the reconciliation/desired-state model as the mechanism that makes the automation possible. This framing sets up nearly every other Kubernetes topic, since almost every object type in the system (Deployments, Services, PVCs) is just a different application of the same "describe what you want, a controller makes it true" pattern.

The four control plane components

                     ┌────────────────────────────────────┐
                     │          Control Plane             │
                     │                                    │
  kubectl ────────▶  │   API Server ──▶ etcd              │
                     │       ▲                            │
                     │       │                            │
                     │   Scheduler   Controller Manager   │
                     └────────────────────────────────────┘
                              │ (schedules pods onto, and
                              │  monitors, worker nodes)
                              ▼
                     Worker Nodes (kubelet, kube-proxy, runtime)

API server (kube-apiserver)

The single front door to the cluster — every read and write, whether from kubectl, a controller, or another component, goes through it as a REST API. It validates requests, enforces authentication/authorization (see the RBAC question), and is the only component that talks directly to etcd. Because it's stateless itself (all real state lives in etcd), it can be horizontally scaled behind a load balancer for high availability.

etcd

A distributed, consistent key-value store that holds the entire cluster's state — every object definition, every current status, all of it. It's built on the Raft consensus algorithm, which is what gives it strong consistency guarantees even when run as a multi-node cluster for high availability. Losing etcd (without a backup) means losing the cluster's entire state — which is why etcd backup/restore is a critical, non-optional production practice (see the operations topic).

Scheduler (kube-scheduler)

Watches the API server for newly created Pods that don't yet have a node assigned, and decides which node each should run on — based on resource requests, affinity/anti-affinity rules, taints/tolerations, and other constraints (see the scheduling topic). The scheduler only decides placement; it's the kubelet on the chosen node that actually starts the container.
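
As a rough illustration of "filter the nodes that can run the Pod, then pick the best one", here is a small Python sketch. It is an assumption-laden toy: the `Node` and `PodRequest` types, the single boolean taint, and the "most CPU left over" scoring rule are all hypothetical simplifications, not the real scheduler's data model or scoring logic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_memory_mib: int
    tainted: bool = False  # simplified: a tainted node rejects Pods lacking a toleration


@dataclass
class PodRequest:
    name: str
    cpu_millis: int
    memory_mib: int
    tolerates_taint: bool = False


def pick_node(pod, nodes):
    """Return the chosen node's name, or None if the Pod would stay Pending."""
    # Filter: drop nodes that cannot run the Pod at all.
    feasible = [
        n for n in nodes
        if n.free_cpu_millis >= pod.cpu_millis
        and n.free_memory_mib >= pod.memory_mib
        and (not n.tainted or pod.tolerates_taint)
    ]
    if not feasible:
        return None
    # Score: hypothetical rule preferring the node with the most CPU left over.
    best = max(feasible, key=lambda n: n.free_cpu_millis - pod.cpu_millis)
    return best.name


nodes = [
    Node("node-a", free_cpu_millis=500, free_memory_mib=1024),
    Node("node-b", free_cpu_millis=2000, free_memory_mib=4096),
    Node("node-c", free_cpu_millis=4000, free_memory_mib=8192, tainted=True),
]
print(pick_node(PodRequest("web", cpu_millis=1000, memory_mib=512), nodes))
```

Note that the function only returns a name. Like the real scheduler, it records a placement decision and starts nothing itself.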

Controller manager (kube-controller-manager)

Runs a collection of controllers, each responsible for a reconciliation control loop for one type of object: the Deployment controller manages ReplicaSets, the ReplicaSet controller ensures the right number of Pods exist, the Node controller notices when a node stops responding, and so on. Conceptually, these are many independent loops, each continuously comparing desired state (from etcd, via the API server) against observed actual state, and taking action to close any gap.

Why this separation matters

Each component has one narrow job, and they only communicate through the API server (never directly with each other or with etcd, except the API server itself) — this decoupling is what lets each component be replaced, scaled, or restarted independently without the others needing to know or care, and is a large part of why Kubernetes itself is resilient to individual component failures.

Managed vs. self-hosted

Cloud-managed Kubernetes (EKS, GKE, AKS) runs and maintains the entire control plane for you: you never see or manage etcd, the API server, or the scheduler directly, and interact only with the API server's endpoint. Self-hosting a cluster (via kubeadm or similar) means you're responsible for standing up, securing, scaling, and backing up all four of these components yourself.

The three node-level components

   Worker Node
   ┌──────────────────────────────────────────────┐
   │  kubelet ───────▶ Container Runtime          │
   │     ▲             (containerd / CRI-O)       │
   │     │             → actually runs Pods       │
   │     │ (talks to API server)                  │
   │  kube-proxy                                  │
   │     → maintains network rules for Services   │
   └──────────────────────────────────────────────┘

kubelet

The primary agent on every node — it watches the API server for Pods assigned to its node, and ensures the containers described in each Pod's spec are actually running and healthy (starting them via the container runtime, restarting them if they crash, running liveness/readiness probes). It also reports the node's and its pods' status back to the API server, which is how kubectl get pods and kubectl get nodes show current state. The kubelet does not manage containers that weren't created through Kubernetes — it only manages what's described in the Pod specs assigned to it.

Container runtime

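The software that actually pulls container images and runs containers; containerd and CRI-O are the two most common choices today. The kubelet talks to the runtime through a standard interface called the Container Runtime Interface (CRI), rather than being hardcoded to any one runtime. This is also why Docker Engine stopped being usable as a direct Kubernetes runtime: it does not implement CRI, and the built-in adapter for it (dockershim) was deprecated in v1.20 and removed in v1.24. Images built with Docker still run fine, because they are standard OCI images that any CRI runtime can pull and run (see the question on the Docker deprecation for details).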

kube-proxy

Maintains the network rules on each node that implement the Service abstraction (see the networking topic). Traditionally it does this with iptables rules; kube-proxy also offers an IPVS mode, and some clusters replace kube-proxy altogether with an eBPF-based data plane such as Cilium for better performance at scale. When a Service is created or its backing Pods change, kube-proxy updates the node's networking rules so traffic sent to the Service's virtual IP gets routed to one of the actual healthy backing Pods.
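
To make the "virtual IP to healthy backend" idea concrete, here is a small Python model. It is only a sketch under assumptions: `ServiceTable` and its methods are hypothetical names, the IP addresses are made up, and a dictionary stands in for the kernel-level rules a real node would use.

```python
import random


class ServiceTable:
    """Toy model of per-node rules: Service virtual IP -> ready Pod addresses."""

    def __init__(self):
        self.rules = {}

    def sync(self, service_ip, endpoints):
        """Rebuild the rule for one Service, keeping only ready backends.

        `endpoints` is a list of (pod_ip, is_ready) pairs.
        """
        self.rules[service_ip] = [ip for ip, ready in endpoints if ready]

    def route(self, service_ip):
        """Pick a backend for a new connection to the Service's virtual IP."""
        backends = self.rules.get(service_ip, [])
        if not backends:
            raise ConnectionError(f"no ready backends for {service_ip}")
        return random.choice(backends)


table = ServiceTable()
table.sync("10.96.0.10", [("10.244.1.5", True), ("10.244.2.7", False), ("10.244.3.9", True)])
print(table.route("10.96.0.10"))  # one of the two ready Pod IPs

# A Pod is replaced: the rules are resynced and callers keep using the same virtual IP.
table.sync("10.96.0.10", [("10.244.1.5", True), ("10.244.4.2", True)])
print(table.route("10.96.0.10"))
```

Clients only ever see the stable virtual IP; the set of Pod addresses behind it can change on every sync.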

Why nodes need all three, and the control plane doesn't run them

The control plane decides what should happen (desired state, scheduling decisions); worker nodes are where things actually run. The kubelet and container runtime are what turn a scheduling decision into an actual running container; kube-proxy is what turns a Service definition into actual working network routing on that node. Every node needs all three because every node needs to both run containers and participate correctly in cluster networking. The control plane components, by contrast, are not part of that per-node machinery and, in most production setups, their machines don't run application workloads at all. (In kubeadm-style clusters the control plane machines do still run a kubelet and a container runtime, but only to host the control plane components themselves as static Pods.)

What happens if a node's kubelet stops reporting

The control plane's node controller notices the node has stopped sending heartbeats within a configured threshold, marks the node as NotReady, and — after a further grace period — Pods that were running on it are considered for rescheduling onto healthy nodes (assuming they're managed by a controller like a Deployment that maintains a desired replica count; a bare unmanaged Pod would simply be lost).
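
The two-stage timing described above can be sketched as follows. The threshold values, the `PodInfo` type, and both function names are hypothetical and chosen only for illustration; real clusters configure their own grace periods.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; real values are cluster configuration.
HEARTBEAT_GRACE_SECONDS = 40
EVICTION_GRACE_SECONDS = 300


@dataclass
class PodInfo:
    name: str
    managed_by_controller: bool  # e.g. owned by a Deployment's ReplicaSet


def node_status(last_heartbeat, now):
    """A node that has been silent longer than the grace period is NotReady."""
    silent_for = now - last_heartbeat
    return "Ready" if silent_for <= HEARTBEAT_GRACE_SECONDS else "NotReady"


def pods_to_replace(last_heartbeat, now, pods):
    """Pods that would be recreated elsewhere once the further grace period passes."""
    silent_for = now - last_heartbeat
    if silent_for <= HEARTBEAT_GRACE_SECONDS + EVICTION_GRACE_SECONDS:
        return []
    # Only controller-managed Pods come back; a bare Pod has nothing to recreate it.
    return [p.name for p in pods if p.managed_by_controller]


pods = [PodInfo("web-1", managed_by_controller=True), PodInfo("debug-shell", managed_by_controller=False)]
print(node_status(last_heartbeat=0, now=30))            # Ready
print(node_status(last_heartbeat=0, now=60))            # NotReady
print(pods_to_replace(last_heartbeat=0, now=60, pods=pods))   # []
print(pods_to_replace(last_heartbeat=0, now=400, pods=pods))  # ['web-1']
```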

What etcd actually stores

Every Kubernetes object you create — every Deployment, Service, ConfigMap, Secret, Pod status — is ultimately stored as a key in etcd. The API server is the only component that talks to etcd directly; everything else (kubectl, the scheduler, controllers, kubelets) goes through the API server, which reads and writes to etcd on their behalf.

kubectl apply -f deployment.yaml
   → API server validates & authorizes
   → API server writes the Deployment object to etcd
   → Controller manager's Deployment controller, watching the API server,
     notices the new/changed object and creates matching ReplicaSets/Pods

Why Raft consensus matters

etcd is typically run as a cluster of an odd number of nodes (commonly 3 or 5) using the Raft consensus algorithm to agree on writes — a write is only considered committed once a majority (quorum) of etcd nodes have durably persisted it. This gives etcd strong consistency (every read reflects the most recently committed write) and tolerance of node failures (a 5-node etcd cluster can lose 2 nodes and keep operating, since 3 still form a majority) — but it also means etcd write latency is bounded by the slowest node needed to reach quorum, and etcd performance is quite sensitive to disk I/O latency and network latency between its nodes.
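
The majority arithmetic is simple enough to check directly. The helper names below are hypothetical; the formula is just "strictly more than half".

```python
def quorum(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority can still be formed."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 1 members: quorum=1, tolerates 0 failure(s)
# 2 members: quorum=2, tolerates 0 failure(s)
# 3 members: quorum=2, tolerates 1 failure(s)
# 4 members: quorum=3, tolerates 1 failure(s)
# 5 members: quorum=3, tolerates 2 failure(s)
```

This also shows why odd sizes are preferred: going from 3 to 4 members raises the quorum without letting the cluster survive any additional failure.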

Why losing etcd is catastrophic

Every other control plane component is effectively stateless or easily reconstructible — the API server holds no state of its own, the scheduler and controllers can be restarted and will simply re-read current state from etcd (via the API server) and resume operating. But if etcd's data is lost or corrupted without a backup, there is no other copy of the cluster's state anywhere — every Deployment, Service, Secret, and their current status is simply gone, and the cluster must effectively be rebuilt from whatever configuration (YAML manifests, Helm charts, GitOps repositories) exists outside the cluster.

Backup and disaster recovery

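# Take a point-in-time snapshot of etcd's data
# (a real cluster usually also needs --endpoints and TLS flags such as --cacert, --cert, --key)
etcdctl snapshot save backup.db

# Restore from a snapshot (typically as part of rebuilding a control plane node).
# Note: from etcd v3.5 onward, restore is moving to the separate etcdutl tool
# (etcdutl snapshot restore); check which one your etcd version expects.
etcdctl snapshot restore backup.db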

Regular, automated etcd snapshots — stored somewhere other than the etcd nodes themselves — combined with periodic restore testing (an untested backup isn't a real backup) is standard practice for any self-managed production cluster. Managed Kubernetes services (EKS, GKE, AKS) handle etcd backup and the entire control plane's resilience for you, which is one of the most significant operational burdens they take off a team's plate compared to self-hosting.

Security note

Because etcd holds every Secret's data (by default, unencrypted unless encryption-at-rest is explicitly configured — see the security topic), direct network access to etcd must be tightly restricted to the control plane components that need it, and encryption at rest should be enabled for any cluster storing genuinely sensitive Secret data.

The API server as the single front door

Every interaction with a Kubernetes cluster — whether a human running kubectl get pods, a controller watching for changes, or the scheduler assigning a Pod to a node — happens through the API server's REST endpoints. Nothing in the cluster (other than the API server itself) talks directly to etcd.

kubectl get pods -n default
   → kubectl sends: GET https://<api-server>/api/v1/namespaces/default/pods
   → API server: authenticates the request, checks RBAC authorization,
     reads matching Pod objects from etcd, returns JSON
   → kubectl formats the JSON response as the human-readable table you see

The request pipeline

Every request passes through several stages: authentication (who are you, established by a client certificate, bearer token, etc.), authorization (are you allowed to do this, typically decided by RBAC), admission control (mutating and validating webhooks that can modify or reject the request; see the security topic), and finally the actual read/write against etcd. Admission control applies to requests that create, modify, or delete objects, not to plain reads. Any stage can reject the request, which is why a well-formed kubectl apply can still fail with a permissions error or an admission webhook rejection even though the YAML itself is syntactically valid.
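
The ordering of those stages for a write can be sketched as a plain function. This is a toy model, not the API server's code: `handle_write`, `RequestRejected`, the two admission steps, and the token and permission tables are all hypothetical.

```python
class RequestRejected(Exception):
    """Raised by whichever stage refuses the request."""

    def __init__(self, stage, reason):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage


def default_labels(obj):
    """Mutating admission (hypothetical policy): add a label if it is missing."""
    labels = dict(obj.get("labels", {}))
    labels.setdefault("team", "unassigned")
    return {**obj, "labels": labels}


def require_pinned_image(obj):
    """Validating admission (hypothetical policy): reject ':latest' images."""
    if obj["image"].endswith(":latest"):
        raise RequestRejected("admission", "image tag must be pinned")
    return obj


def handle_write(request, tokens, permissions, admission_chain, store):
    # 1. Authentication: map the presented credential to an identity.
    user = tokens.get(request["token"])
    if user is None:
        raise RequestRejected("authentication", "unknown credential")
    # 2. Authorization: may this identity use this verb on this resource?
    if (user, request["verb"], request["resource"]) not in permissions:
        raise RequestRejected("authorization", f"{user} may not {request['verb']} {request['resource']}")
    # 3. Admission: each step may modify the object or reject it.
    obj = request["object"]
    for step in admission_chain:
        obj = step(obj)
    # 4. Persistence: only now is the object written to the store.
    store[(request["resource"], obj["name"])] = obj
    return obj


tokens = {"token-alice": "alice"}
permissions = {("alice", "create", "deployments")}
store = {}
request = {
    "token": "token-alice",
    "verb": "create",
    "resource": "deployments",
    "object": {"name": "web", "image": "myapp:1.4.0"},
}
print(handle_write(request, tokens, permissions, [default_labels, require_pinned_image], store))
```

A syntactically valid object can fail at any of the first three steps, and in every such case nothing reaches the store.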

What kubectl actually is

kubectl is a client binary with no special privileged access of its own. It authenticates using whatever credentials are configured in your kubeconfig file, and everything it does is one or more calls to the same public API server endpoints that any other client (a CI pipeline, a custom controller, a monitoring tool) could call directly. This is why kubectl apply -f deployment.yaml and a Python script using the Kubernetes client library to create or patch the same object look the same from the API server's point of view: both are just authenticated HTTP requests (POST to create, PATCH or PUT to update) against the same endpoints.

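# For a Deployment that doesn't exist yet, these achieve the same result via different means
# (if it already existed, apply would update it, while this POST would be rejected with 409 Conflict):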
kubectl apply -f deployment.yaml

curl -X POST https://<api-server>/apis/apps/v1/namespaces/default/deployments \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/yaml" \
  --data-binary @deployment.yaml

Watches, not polling

A key API server feature many clients (including kubectl get pods --watch, and every controller) rely on is the watch mechanism — instead of repeatedly polling for changes, a client can open a long-lived connection and receive a stream of change events as they happen. This is the foundation of the reconciliation model: controllers watch for changes to the objects they care about and react immediately, rather than polling on a fixed interval.
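
The difference from polling can be illustrated with a toy publisher. This is a simulation under assumptions, not the real watch protocol: `ToyApiServer` and its methods are hypothetical, and an in-process queue stands in for the long-lived HTTP connection. The event type names ADDED, MODIFIED, and DELETED are the ones Kubernetes watch events use.

```python
import queue


class ToyApiServer:
    """Hypothetical stand-in that pushes change events to every open watch."""

    def __init__(self):
        self._watchers = []

    def watch(self):
        """Open a 'connection': the returned queue receives every later change."""
        events = queue.Queue()
        self._watchers.append(events)
        return events

    def change(self, event_type, name):
        """Record a change and push it to all watchers as it happens."""
        for events in self._watchers:
            events.put((event_type, name))


server = ToyApiServer()
events = server.watch()          # the controller subscribes once

server.change("ADDED", "web-0")
server.change("MODIFIED", "web-0")
server.change("DELETED", "web-0")

# The controller reacts to each event in order, without ever re-listing all objects.
while not events.empty():
    event_type, name = events.get()
    print(event_type, name)
```

A polling client would instead have to list every object on a timer and diff the results, which is slower to react and costs more the larger the cluster gets.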

Why this design matters

Because everything goes through one well-defined API, Kubernetes's entire ecosystem of tools (Helm, ArgoCD, custom controllers, monitoring dashboards) can all interact with a cluster the same consistent way, and the API itself can be extended (via Custom Resource Definitions — see that topic) without needing to change how any existing client talks to the cluster.