What is the Operator pattern, and how does it build on CRDs and controllers?
Quick Answer
An Operator is a custom controller, paired with one or more CRDs, that encodes the operational knowledge needed to run a specific piece of software. It goes beyond creating and deleting resources: it handles the ongoing lifecycle tasks a human operator would otherwise perform by hand, such as backups, failover, version upgrades, and scaling decisions specific to that software. It extends the basic reconciliation-loop pattern (see that question) with domain-specific logic, so that complex, stateful applications can be managed declaratively ("I want a 3-node PostgreSQL cluster, version 15") the same way a Deployment manages simple stateless replicas.
Detailed Answer
The problem Operators solve: encoding operational expertise as code
Running a complex, stateful piece of software well (a PostgreSQL cluster, Kafka, Elasticsearch) typically requires ongoing human operational knowledge: how to perform a failover safely, how to execute a version upgrade without data loss, how to resize storage without downtime, and what a healthy versus unhealthy cluster actually looks like for this specific software. StatefulSets (see the workload controllers topic) solve the scheduling and identity problem for stateful workloads, but they know nothing about this software's operational rules. An Operator is where that specialized knowledge gets encoded as automated, running code.
Anatomy: a CRD plus a controller with domain-specific logic
apiVersion: postgresql.example.com/v1
kind: PostgresCluster
metadata:
  name: my-app-db
spec:
  version: "15"
  replicas: 3
  storageSize: "100Gi"
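As a rough illustration of how a controller might turn the manifest above into typed values before acting on it, here is a minimal Python sketch. The `PostgresClusterSpec` class and `parse_spec` function are hypothetical names invented for this example, and the manifest is assumed to be already loaded into a dict. In a real cluster, the CRD's schema would typically reject malformed objects at the API server before the controller ever sees them, so this only shows the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostgresClusterSpec:
    """Typed view of the `spec` block (hypothetical helper type)."""
    version: str
    replicas: int
    storage_size: str


def parse_spec(manifest: dict) -> PostgresClusterSpec:
    """Validate a PostgresCluster manifest that was already loaded as a dict."""
    if manifest.get("kind") != "PostgresCluster":
        raise ValueError(f"unexpected kind: {manifest.get('kind')!r}")
    spec = manifest.get("spec") or {}

    replicas = spec.get("replicas")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas < 1:
        raise ValueError("spec.replicas must be a positive integer")

    version = spec.get("version")
    if not isinstance(version, str) or not version:
        raise ValueError("spec.version must be a non-empty string")

    storage = spec.get("storageSize")
    if not isinstance(storage, str) or not storage:
        raise ValueError("spec.storageSize must be a non-empty string")

    return PostgresClusterSpec(version=version, replicas=replicas, storage_size=storage)


manifest = {
    "apiVersion": "postgresql.example.com/v1",
    "kind": "PostgresCluster",
    "metadata": {"name": "my-app-db"},
    "spec": {"version": "15", "replicas": 3, "storageSize": "100Gi"},
}
spec = parse_spec(manifest)
print(spec.replicas, spec.version, spec.storage_size)
```

Keeping `version` as a string mirrors the quoted `"15"` in the manifest, which avoids YAML silently turning a value like `15.10` into a number.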
Behind the scenes, the Operator's controller watches PostgresCluster objects (custom resources, i.e. instances of the kind that the CRD defines) and reconciles the actual cluster state toward the desired spec. Unlike a generic controller that manages simple replica counts, though, the Operator's reconciliation logic understands PostgreSQL-specific concerns:
Operator's reconciliation loop, for a PostgresCluster object:
  → does a StatefulSet with the right replica count and version exist? create/update if not.
  → is exactly one replica currently the primary, and are the others properly
    configured as streaming replicas? fix the replication topology if not.
  → if the current primary becomes unhealthy, orchestrate a safe failover
    to promote a healthy replica -- following Postgres's own specific
    failover procedure, not a generic "just restart it" approach.
  → if spec.version changes, perform a safe, ordered version upgrade
    across replicas, following Postgres's documented upgrade procedure.
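The loop above can be sketched as a single idempotent function. The following is a minimal in-memory simulation in Python, not a real controller: `Member`, `Cluster`, and `reconcile` are hypothetical names, no Kubernetes API is called, and the failover rule (promote the first healthy member) is a deliberate simplification. Version upgrades and scale-down are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    role: str = "replica"   # "primary" or "replica"
    healthy: bool = True


@dataclass
class Cluster:
    desired_replicas: int
    members: list = field(default_factory=list)


def reconcile(cluster: Cluster) -> list:
    """Run one reconciliation pass and return the actions it took."""
    actions = []

    # Generic step: bring the member count up to the desired count.
    while len(cluster.members) < cluster.desired_replicas:
        name = f"pg-{len(cluster.members)}"
        cluster.members.append(Member(name=name))
        actions.append(f"create {name}")

    # Domain-specific step 1: an unhealthy member must not stay primary.
    for m in cluster.members:
        if m.role == "primary" and not m.healthy:
            m.role = "replica"
            actions.append(f"demote {m.name}")

    # Domain-specific step 2: exactly one primary must exist.
    primaries = [m for m in cluster.members if m.role == "primary"]
    if not primaries:
        candidates = [m for m in cluster.members if m.healthy]
        if candidates:
            candidates[0].role = "primary"
            actions.append(f"promote {candidates[0].name}")
    else:
        for extra in primaries[1:]:
            extra.role = "replica"
            actions.append(f"demote {extra.name}")

    return actions


cluster = Cluster(desired_replicas=3)
print(reconcile(cluster))  # ['create pg-0', 'create pg-1', 'create pg-2', 'promote pg-0']
print(reconcile(cluster))  # [] (already converged, nothing to do)

cluster.members[0].healthy = False   # simulate the primary failing
print(reconcile(cluster))  # ['demote pg-0', 'promote pg-1']
```

The second call returning an empty list is the important property: a reconcile pass compares observed state with desired state and acts only on the difference, so running it repeatedly is safe.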
Why this is more than "just a controller"
Every controller (including the built-in ones for Deployments, ReplicaSets, and so on) implements the same reconciliation-loop pattern — what makes something specifically an Operator is that the reconciliation logic encodes deep, software-specific operational knowledge, going well beyond simple "keep N replicas running." A well-built Operator effectively automates tasks a skilled human database administrator (or Kafka administrator, or whatever the target software is) would otherwise perform manually and carefully, making that expertise repeatable, consistent, and available on-demand via a simple declarative spec.
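One small, concrete instance of such knowledge is restart ordering. A StatefulSet's default rolling update walks Pods from the highest ordinal to the lowest, regardless of what each Pod is doing, whereas an Operator can order restarts by database role. The Python sketch below assumes a replicas-first, primary-last ordering, which is a common choice for minor-version restarts; the function name and the tuple layout are hypothetical, and a real Operator would usually also switch the primary role over before restarting that member.

```python
def rolling_update_order(members):
    """Return member names in restart order: replicas first, primary last.

    `members` is a list of (name, role) tuples, where role is
    "primary" or "replica" (an assumed shape for this sketch).
    """
    replicas = sorted(name for name, role in members if role == "replica")
    primaries = sorted(name for name, role in members if role == "primary")
    return replicas + primaries


order = rolling_update_order(
    [("pg-0", "primary"), ("pg-1", "replica"), ("pg-2", "replica")]
)
print(order)  # ['pg-1', 'pg-2', 'pg-0']
```

Restarting the primary last keeps writes available for as long as possible and means the replicas are already on the new version when the primary goes down.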
The Operator maturity model
Not every Operator does everything described above. The Operator Framework's capability levels (often called the Operator maturity model) run from Level 1 (Basic Install: automated installation and configuration) through Level 2 (Seamless Upgrades), Level 3 (Full Lifecycle: backups and failure recovery), and Level 4 (Deep Insights: metrics, alerts, log processing) to Level 5 (Auto Pilot: horizontal/vertical auto-scaling, automatic tuning, and abnormality detection, all handled without human intervention). Many real-world Operators sit somewhere in the middle: they automate the tedious, error-prone parts (initial setup, routine scaling, backups) while deliberately leaving judgment-heavy decisions (a risky major version upgrade, a disaster-recovery scenario) to a human.
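The split between automated and human-approved work can be made explicit in the reconcile logic. The Python sketch below is one hypothetical policy, not taken from any particular Operator: minor version bumps are applied automatically, while major bumps wait until a human sets an approval flag (in a real Operator this might be read from an annotation on the custom resource). The function name, the return values, and the "MAJOR" or "MAJOR.MINOR" version format are all assumptions.

```python
def upgrade_decision(current: str, desired: str, approved: bool = False) -> str:
    """Classify a requested version change as noop, auto, needs-approval, or reject."""

    def parse(version: str):
        # "15" -> (15, 0); "15.4" -> (15, 4)
        parts = version.split(".")
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
        return major, minor

    cur, want = parse(current), parse(desired)
    if want == cur:
        return "noop"
    if want < cur:
        return "reject"  # downgrades are never automated in this sketch
    if want[0] == cur[0]:
        return "auto"    # minor bump: treated as routine
    # Major bump: judgment-heavy, so require explicit human approval.
    return "auto" if approved else "needs-approval"


print(upgrade_decision("15.3", "15.4"))              # auto
print(upgrade_decision("15", "16"))                  # needs-approval
print(upgrade_decision("15", "16", approved=True))   # auto
```

A controller using a gate like this would typically surface the "needs-approval" state in the resource's status so that the pending decision is visible to whoever has to make it.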
Where Operators come from
You can build a custom Operator yourself; frameworks such as the Operator SDK and Kubebuilder scaffold much of the boilerplate (CRD generation, controller wiring, test setup). For popular software, though, it is far more common to install an existing, published Operator maintained by the software's vendor or community (e.g., the Postgres Operator, Elasticsearch Operator, Prometheus Operator) via OperatorHub or a Helm chart than to build one from scratch.
Distinguishing an Operator from "just any controller" by pointing specifically to the domain-specific operational knowledge it encodes (failover procedures, upgrade sequencing, backup orchestration) — rather than just defining it as "a controller for custom resources" — demonstrates a real grasp of why the pattern exists and what problem it's actually solving.