How does Kubernetes expose metrics, and what's the role of metrics-server vs. Prometheus?
Quick Answer
**metrics-server** collects lightweight, real-time CPU/memory resource metrics from every node's kubelet and exposes them through the Kubernetes Metrics API — used specifically to power `kubectl top` and the Horizontal Pod Autoscaler's resource-based scaling, but it stores no history at all (only the current snapshot). **Prometheus** is a full-featured, general-purpose monitoring system that scrapes and stores detailed, historical time-series metrics from many sources (application-level custom metrics, node-level metrics, Kubernetes object states), enabling dashboards, alerting, and long-term trend analysis that metrics-server was never designed to provide.
Detailed Answer
metrics-server — minimal, real-time, no history
```shell
kubectl top pod
kubectl top node
```
metrics-server collects basic CPU and memory usage from every node's kubelet (which itself gets this data from cAdvisor, embedded in the kubelet) at a regular interval and exposes it through the standard Kubernetes Metrics API (metrics.k8s.io). It is minimal by design: it holds only the most recent snapshot in memory, with no historical data, no long-term storage, and no querying capability beyond "what's the current usage." Its purpose is to serve the consumers of that API, chiefly kubectl top (for quick, ad-hoc human inspection) and the Horizontal Pod Autoscaler's resource-based scaling decisions (see that question), both of which need only the current value, not history.
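To make the "current snapshot" concrete, here is a minimal sketch of reading one Pod's usage from a Metrics API response. The API is served under /apis/metrics.k8s.io/v1beta1; the JSON below is a hand-written sample shaped like a PodMetrics object (names and values are illustrative, not captured from a real cluster), and the helper functions are hypothetical. It handles only the quantity suffixes listed in the two tables.

```python
import json
from decimal import Decimal

# Hand-written sample shaped like a PodMetrics object; values are illustrative.
SAMPLE = """
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {"name": "web-0", "namespace": "default"},
  "containers": [
    {"name": "app", "usage": {"cpu": "12500000n", "memory": "65536Ki"}},
    {"name": "sidecar", "usage": {"cpu": "3m", "memory": "16Mi"}}
  ]
}
"""

# Nanocores per unit for the CPU suffixes handled here ("" means whole cores).
CPU_NANOCORES = {"n": 1, "u": 1_000, "m": 1_000_000, "": 1_000_000_000}
# Bytes per unit for the binary memory suffixes handled here ("" means bytes).
MEMORY_BYTES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "": 1}


def _split(quantity, suffixes):
    """Split '250m' into ('250', 'm'). Suffixes outside the table are not handled."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and quantity.endswith(suffix):
            return quantity[: -len(suffix)], suffix
    return quantity, ""


def cpu_to_nanocores(quantity):
    number, suffix = _split(quantity, CPU_NANOCORES)
    return int(Decimal(number) * CPU_NANOCORES[suffix])


def memory_to_bytes(quantity):
    number, suffix = _split(quantity, MEMORY_BYTES)
    return int(Decimal(number) * MEMORY_BYTES[suffix])


def pod_usage(pod_metrics):
    """Sum container usage into (millicores, bytes) for one Pod."""
    containers = pod_metrics["containers"]
    nanocores = sum(cpu_to_nanocores(c["usage"]["cpu"]) for c in containers)
    total_bytes = sum(memory_to_bytes(c["usage"]["memory"]) for c in containers)
    return nanocores / 1_000_000, total_bytes


millicores, total_bytes = pod_usage(json.loads(SAMPLE))
print(f"{millicores} millicores, {total_bytes / 2**20} MiB")
```

Note that the response carries a single usage value per container and nothing about earlier values, which is exactly the "no history" property described above.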
Prometheus — full-featured monitoring and alerting
Prometheus is a general-purpose time-series monitoring system. It is not Kubernetes-specific, but it has excellent native Kubernetes integration: its service discovery queries the Kubernetes API to find Pods and Services to scrape, commonly filtered by annotations. Applications and cluster components expose metrics in Prometheus's text format at an HTTP endpoint, typically /metrics. Prometheus scrapes those endpoints at a configured interval and stores the resulting time series on local disk for a configurable retention period, supporting rich querying (via PromQL), dashboards (commonly via Grafana), and alerting (via Alertmanager) based on arbitrary conditions over time (e.g., "alert if error rate exceeds 5% for more than 5 minutes").
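The "for more than 5 minutes" part of that alert example is what requires stored history: the condition has to hold across consecutive evaluations, not just right now. The sketch below is a simplified model of that behavior, not Prometheus's own implementation; the function name, the sample layout, and the state names returned are assumptions chosen for illustration.

```python
def alert_state(samples, threshold=0.05, hold_seconds=300):
    """Classify an alert as of the last sample.

    samples: list of (unix_timestamp, error_ratio) tuples in time order.
    Returns "inactive", "pending" (condition true, but not for long enough),
    or "firing" (condition true continuously for at least hold_seconds).
    """
    active_since = None
    for timestamp, ratio in samples:
        if ratio > threshold:
            # Remember when the current unbroken run of violations began.
            if active_since is None:
                active_since = timestamp
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None

    if active_since is None:
        return "inactive"
    last_timestamp = samples[-1][0]
    if last_timestamp - active_since >= hold_seconds:
        return "firing"
    return "pending"


# One evaluation per minute with an 8% error ratio throughout.
sustained = [(t, 0.08) for t in range(0, 301, 60)]
# Same series, but the ratio dips to 1% at t=120, restarting the timer.
interrupted = [(t, 0.01 if t == 120 else 0.08) for t in range(0, 301, 60)]

print(alert_state(sustained))
print(alert_state(interrupted))
```

A snapshot-only source such as metrics-server cannot answer this kind of question, because it keeps no record of when the condition first became true.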
Prometheus scrapes:
- kube-state-metrics (Kubernetes object states: Deployment replica counts, Pod phases, etc.)
- node-exporter (node-level OS metrics: disk, network, detailed CPU/memory)
- application /metrics endpoints (custom, application-specific metrics)
- cAdvisor (container-level resource usage, more detailed than metrics-server's snapshot)
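All of these targets speak the same text format over HTTP. As a rough illustration of what an application's /metrics endpoint returns, the sketch below renders one counter in that format using only the standard library. The metric name, help text, and label values are hypothetical, and a real application would normally use an official Prometheus client library rather than hand-rolled formatting.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter metric family in the Prometheus text format.

    samples: list of (labels_dict, value) pairs.
    Assumes help_text contains no backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format is line-oriented, so the output ends with a newline.
    return "\n".join(lines) + "\n"


text = render_counter(
    "app_requests_total",  # hypothetical metric name
    "Total requests handled.",
    [
        ({"method": "get", "code": "200"}, 1027),
        ({"method": "get", "code": "500"}, 3),
    ],
)
print(text, end="")
```

Each labeled line becomes its own time series once Prometheus scrapes and stores it.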
Why you typically need both, not one or the other
metrics-server and Prometheus solve genuinely different problems and commonly coexist in the same cluster. metrics-server is the lightweight, always-on dependency that kubectl top and the HPA's resource-metric mode rely on; Prometheus cannot fill that role on its own, because it does not serve the Metrics API without an additional adapter. Prometheus is the tool for everything else: dashboards, alerting, capacity planning, debugging a specific incident by looking at historical trends, and scaling on custom application-level metrics. That last case also requires a Prometheus adapter, which feeds the Custom Metrics API for the HPA to consume (see that question).
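Whichever API supplies the number, the HPA needs only the current metric value because its documented scaling rule is a simple ratio: desired replicas = ceil(current replicas × current value / target value), skipped when the ratio is within a tolerance of 1 (0.1 by default). A minimal sketch of that rule follows; the function name is hypothetical, and the real controller adds further handling (unready Pods, missing metrics, stabilization windows) that is omitted here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the HPA ratio rule to one metric.

    current_value and target_value must be in the same unit, for example
    average CPU utilization in percent.
    """
    ratio = current_value / target_value
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(4, 90, 60))  # usage well above target: scale up
print(desired_replicas(4, 63, 60))  # within tolerance: no change
print(desired_replicas(4, 30, 60))  # usage well below target: scale down
```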
kube-state-metrics — a commonly paired component
Distinct from both of the above, kube-state-metrics exposes the state of Kubernetes objects themselves as Prometheus-scrapeable metrics: how many replicas a Deployment has versus how many it wants, how many Pods are in each phase, and node conditions. This is object-state information, not resource-usage information, and it is what lets Prometheus/Grafana dashboards show things like "how many Deployments currently have fewer ready replicas than desired" across a whole cluster.
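The "fewer ready replicas than desired" question can be sketched directly against kube-state-metrics output. The sample below is hand-written in the text format using the kube_deployment_spec_replicas and kube_deployment_status_replicas_ready metric names; the Deployments and values are made up, and the small parser is an illustrative assumption that handles only simple label values.

```python
import re

# Hand-written sample in the Prometheus text format; values are illustrative.
SAMPLE = """
kube_deployment_spec_replicas{namespace="default",deployment="web"} 3
kube_deployment_status_replicas_ready{namespace="default",deployment="web"} 3
kube_deployment_spec_replicas{namespace="default",deployment="api"} 5
kube_deployment_status_replicas_ready{namespace="default",deployment="api"} 2
kube_deployment_spec_replicas{namespace="batch",deployment="worker"} 2
kube_deployment_status_replicas_ready{namespace="batch",deployment="worker"} 0
"""

LINE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return {metric_name: {(namespace, deployment): value}}."""
    series = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue  # unlabeled or unsupported lines are ignored in this sketch
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace"), labels.get("deployment"))
        series.setdefault(match.group("name"), {})[key] = float(match.group("value"))
    return series


def under_replicated(text):
    """List (namespace, deployment) pairs with fewer ready replicas than desired."""
    series = parse(text)
    desired = series.get("kube_deployment_spec_replicas", {})
    ready = series.get("kube_deployment_status_replicas_ready", {})
    return sorted(key for key, want in desired.items() if ready.get(key, 0.0) < want)


print(under_replicated(SAMPLE))
```

In practice this comparison would be written as a PromQL query over the same two metrics rather than in application code; the sketch only shows what information kube-state-metrics contributes.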
Every cluster running an HPA needs metrics-server (or, more rarely, an equivalent custom metrics pipeline) as a baseline dependency. Any cluster that cares about historical trends, alerting, or custom application metrics needs a full monitoring stack like Prometheus layered on top, often installed as the "kube-prometheus-stack" Helm chart, which bundles Prometheus, Grafana, Alertmanager, and kube-state-metrics. Treating metrics-server as a substitute for real monitoring is a common early-stage mistake, since it was never designed to serve that broader purpose.