How does the Kubernetes scheduler decide which node to place a pod on?

6 min · advanced · kube-scheduler · scheduling-algorithm

Quick Answer

The scheduler processes each unscheduled Pod in two phases. **Filtering** eliminates every node that can't satisfy the Pod's hard requirements (insufficient resources, taints without matching tolerations, failing required node/pod affinity rules), leaving a list of feasible nodes. **Scoring** then ranks those feasible nodes using a set of weighted priority functions (spreading Pods evenly, honoring preferred affinity, minimizing resource fragmentation, and more) and picks the single best one.

Detailed Answer

Phase 1: Filtering — eliminate infeasible nodes

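The scheduler starts with the nodes in the cluster and filters out any that can't satisfy the Pod's hard requirements (in very large clusters it can stop searching once it has found enough feasible nodes, a threshold tuned with percentageOfNodesToScore):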

  • Insufficient resources: does the node have enough unreserved CPU/memory to satisfy the Pod's requests (see the requests/limits question)?
  • Taints without a matching toleration: is the node tainted in a way this Pod doesn't tolerate (see the taints and tolerations question)?
  • Required node affinity: does the node have the labels this Pod's requiredDuringSchedulingIgnoredDuringExecution node affinity demands?
  • Required pod affinity/anti-affinity: would placing this Pod here satisfy its hard pod affinity/anti-affinity rules relative to Pods already running in the same topology domain (the node itself, its zone, or whatever the rule's topologyKey names)?
  • Volume/topology constraints: can the required storage actually be attached to this node (relevant for volumes with zone/node topology restrictions; see the StorageClass question)?
  • Port conflicts, node selectors, and several other basic feasibility checks.

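After filtering, what remains is the set of feasible nodes: any node that could technically host this Pod. If this set is empty, the scheduler may attempt preemption (evicting lower-priority Pods to make room) when the Pod's priority allows it. If that doesn't help, the Pod stays Pending, which can in turn prompt the Cluster Autoscaler, if one is running, to add a node (see those questions).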
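The filtering phase can be sketched as a list of yes/no predicates that a node must all pass. This is a toy model, not the scheduler's real code: the `Node` and `Pod` classes, their field names, and the three predicates are hypothetical simplifications (for example, taints are reduced to bare keys).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                               # unreserved CPU, millicores
    free_mem_mi: int                              # unreserved memory, MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)      # simplified: taint keys only


@dataclass
class Pod:
    name: str
    cpu_request_m: int
    mem_request_mi: int
    required_labels: dict = field(default_factory=dict)   # stand-in for required node affinity
    tolerations: set = field(default_factory=set)         # simplified: tolerated taint keys


def fits_resources(pod, node):
    return (pod.cpu_request_m <= node.free_cpu_m
            and pod.mem_request_mi <= node.free_mem_mi)


def tolerates_taints(pod, node):
    # Every taint on the node must be tolerated by the Pod.
    return node.taints <= pod.tolerations


def matches_required_labels(pod, node):
    return all(node.labels.get(k) == v for k, v in pod.required_labels.items())


FILTERS = [fits_resources, tolerates_taints, matches_required_labels]


def feasible_nodes(pod, nodes):
    # A node is feasible only if it passes every filter (binary, no partial credit).
    return [n for n in nodes if all(f(pod, n) for f in FILTERS)]


nodes = [
    Node("node-a", 500, 1024, labels={"disk": "ssd"}),                        # too little CPU
    Node("node-b", 4000, 8192, labels={"disk": "hdd"}),                       # wrong label
    Node("node-c", 4000, 8192, labels={"disk": "ssd"}, taints={"dedicated"}), # untolerated taint
    Node("node-d", 2000, 4096, labels={"disk": "ssd"}),
]
pod = Pod("web", cpu_request_m=1000, mem_request_mi=2048,
          required_labels={"disk": "ssd"})

print([n.name for n in feasible_nodes(pod, nodes)])   # ['node-d']
```

An empty result from `feasible_nodes` corresponds to the Pending case described above.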

Phase 2: Scoring — rank the feasible nodes

Every feasible node is then scored using a set of weighted priority functions, and the highest-scoring node is chosen. Common scoring factors include:

  • Resource balance — preferring nodes that would end up with a more balanced ratio of CPU-to-memory usage after placement, avoiding one resource being nearly exhausted while another is idle.
  • Spreading — preferring to spread Pods of the same Deployment/Service across different nodes (related to, but distinct from, explicit pod anti-affinity — this is a softer, built-in default tendency).
  • Preferred affinity/anti-affinity — honoring preferredDuringSchedulingIgnoredDuringExecution rules, weighted by their configured weight.
  • Image locality — mildly preferring a node that already has the Pod's container image cached locally, avoiding a fresh image pull.

The scheduler sums these weighted scores and picks the node with the highest total — ties are broken pseudo-randomly, to avoid the scheduler always favoring the exact same node in ambiguous cases.
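As a rough illustration of the weighted sum and the random tie-break, the sketch below assumes each feasible node already has a per-factor score on a 0 to 100 scale. The factor names, the weights, and the `pick_node` helper are made up for this example and do not reflect the default scheduler's actual configuration.

```python
import random


def total_score(node, weights):
    # Weighted sum over all scoring factors for one node.
    return sum(weights[factor] * node["scores"][factor] for factor in weights)


def pick_node(candidates, weights, rng=random):
    totals = {c["name"]: total_score(c, weights) for c in candidates}
    best = max(totals.values())
    # Several nodes may share the top score; choose among them at random.
    tied = [name for name, score in totals.items() if score == best]
    return rng.choice(tied), totals


candidates = [
    {"name": "node-a", "scores": {"balance": 80, "spread": 50, "image_locality": 0}},
    {"name": "node-b", "scores": {"balance": 60, "spread": 100, "image_locality": 100}},
    {"name": "node-c", "scores": {"balance": 90, "spread": 100, "image_locality": 0}},
]
weights = {"balance": 1, "spread": 2, "image_locality": 1}   # hypothetical weights

winner, totals = pick_node(candidates, weights)
print(totals)   # {'node-a': 180, 'node-b': 360, 'node-c': 290}
print(winner)   # node-b
```

Note that node-c has the best resource balance but still loses: scoring is about the weighted total, not any single factor.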

The scheduler only decides — it doesn't execute

Once the scheduler picks a node, it writes that decision back to the API server by binding the Pod to the node, which sets the Pod's spec.nodeName. The kubelet on that specific node (see the fundamentals topic) then notices the assignment and starts the containers via the container runtime. The scheduler's job ends at the decision; it has no further involvement in actually running anything.
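To make "writes that decision back" concrete: the write is typically a small Binding object posted to the Pod's binding subresource, after which the API server records the node in spec.nodeName. The sketch below only builds that request body and path as plain Python data; it does not talk to a cluster, and the helper names `build_binding` and `binding_path` are invented for illustration.

```python
import json


def build_binding(pod_name, namespace, node_name):
    # Body of a core/v1 Binding: "bind this Pod to that Node".
    return {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name, "namespace": namespace},
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }


def binding_path(pod_name, namespace):
    # REST path of the Pod's binding subresource (the body is POSTed here).
    return f"/api/v1/namespaces/{namespace}/pods/{pod_name}/binding"


binding = build_binding("web", "default", "node-d")
print(binding_path("web", "default"))
print(json.dumps(binding, indent=2))
```

Nothing in this payload says how to run the Pod, which matches the division of labor above: the scheduler records where, and the kubelet handles the rest.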

Extensibility: scheduler plugins and custom schedulers

The Kubernetes scheduler is built on a pluggable framework (the Scheduling Framework), which lets organizations customize or extend filtering and scoring behavior with custom plugins. It is also possible to run an entirely separate custom scheduler alongside the default one; a Pod opts into it by setting schedulerName in its spec. This is useful for needs the default scheduler's algorithm doesn't natively address, such as batch/HPC-style gang scheduling, where a whole group of Pods must be scheduled together or not at all.
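The opt-in mechanism can be sketched as a simple routing rule: each scheduler only picks up unscheduled Pods whose schedulerName matches its own, and a Pod that sets no name falls to the default scheduler (named `default-scheduler`). The Pod dictionaries below are trimmed-down manifests, and `gang-scheduler` is a hypothetical custom scheduler name.

```python
DEFAULT_SCHEDULER = "default-scheduler"


def responsible_scheduler(pod):
    # Pods that don't set spec.schedulerName are handled by the default scheduler.
    return pod.get("spec", {}).get("schedulerName", DEFAULT_SCHEDULER)


def pending_pods_for(scheduler_name, pods):
    # A scheduler cares only about Pods that are unbound (no nodeName) and addressed to it.
    return [
        p["metadata"]["name"]
        for p in pods
        if not p.get("spec", {}).get("nodeName")
        and responsible_scheduler(p) == scheduler_name
    ]


pods = [
    {"metadata": {"name": "web"}, "spec": {}},
    {"metadata": {"name": "batch-0"}, "spec": {"schedulerName": "gang-scheduler"}},
    {"metadata": {"name": "db"}, "spec": {"nodeName": "node-a"}},   # already bound
]

print(pending_pods_for("default-scheduler", pods))   # ['web']
print(pending_pods_for("gang-scheduler", pods))      # ['batch-0']
```

One consequence worth remembering: if a Pod names a scheduler that isn't running, nothing claims it and it simply stays Pending.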

Describing the two distinct phases — filter (hard, binary feasibility) then score (soft, weighted ranking) — rather than treating scheduling as one single opaque step, shows real understanding of how the scheduler actually reasons about placement, and explains why a Pod can be "feasible" on many nodes but still consistently land on a particular one.