Security

Role — namespace-scoped permissions

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]

Defines a set of permitted actions (verbs) on specific resource types. A Role exists only within its own namespace (here, production), so this Role has no effect on Pods in any other namespace.
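
The apiGroups field is easy to misread: the empty string refers to the core API group (where Pods live), while resources such as Deployments belong to a named group. A minimal sketch, assuming a hypothetical Role called deployment-reader that is not part of the original example:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: deployment-reader      # hypothetical name, for illustration only
rules:
  - apiGroups: ["apps"]        # Deployments live in the "apps" group, not the core ("") group
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]            # core group
    resources: ["pods/log"]    # subresources are written as <resource>/<subresource>
    verbs: ["get"]
```

Each rule is additive: RBAC has no "deny" rules, so a subject's effective permissions are the union of every rule bound to it.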

RoleBinding — granting a Role to an identity, within a namespace

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods-binding
  namespace: production
subjects:
  - kind: User
    name: alice
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Grants the pod-reader Role's permissions specifically to the user alice, and specifically within the production namespace. A RoleBinding can also reference a ClusterRole (not just a namespace-scoped Role) — in that case, it grants that ClusterRole's permissions, but still only within the binding's own namespace, which is a useful pattern for reusing one common set of permissions (defined once as a ClusterRole) across many different namespace-scoped bindings.
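
As an illustration of the ClusterRole-via-RoleBinding pattern, the sketch below binds the built-in read-only view ClusterRole to alice inside production only. The binding name alice-view is a hypothetical choice:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-view             # hypothetical name
  namespace: production        # the grant applies in this namespace only
subjects:
  - kind: User
    name: alice
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole            # a ClusterRole referenced from a namespaced binding
  name: view                   # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

Note that roleRef is immutable once the binding exists. Pointing a binding at a different role means deleting and recreating it.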

ClusterRole — cluster-wide (or cluster-scoped-resource) permissions

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-reader
rules:
  - apiGroups: [""]
    resources: ["nodes"]     # Nodes are a cluster-scoped resource -- no namespace applies
    verbs: ["get", "list", "watch"]

Needed for any permission concerning cluster-scoped resources (Nodes, PersistentVolumes, ClusterRoles/ClusterRoleBindings themselves) since these don't belong to any namespace at all — a namespaced Role simply has no way to grant access to them.

ClusterRoleBinding — granting a ClusterRole cluster-wide

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: read-nodes-global
subjects:
  - kind: Group
    name: platform-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-reader
  apiGroup: rbac.authorization.k8s.io

Grants the node-reader ClusterRole's permissions across the entire cluster, to everyone in the platform-team group — the broadest possible scope of grant.

The four-way combination, summarized

| Combination | Permissions defined | Grant scoped to |
| --- | --- | --- |
| Role + RoleBinding | One namespace | That same namespace |
| ClusterRole + RoleBinding | Cluster-wide definition, reused | That RoleBinding's namespace only |
| ClusterRole + ClusterRoleBinding | Cluster-wide definition | The whole cluster |

Why the separation between "permission definition" and "grant" exists

Defining permissions (Role/ClusterRole) separately from granting them (RoleBinding/ClusterRoleBinding) to a specific subject lets one reusable permission set be bound to many different users/teams/namespaces without redefining the underlying rules each time — e.g., a single pod-reader ClusterRole can be bound via separate RoleBindings in team-a's namespace and team-b's namespace, each granting the same read-only pod access, scoped independently to each team's own namespace.
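
The team-a / team-b scenario can be sketched as one ClusterRole plus two RoleBindings. The group names team-a-devs and team-b-devs and the binding name read-pods are assumptions made for this example:

```yaml
# Defined once, cluster-wide. On its own this grants nothing to anyone.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# Grant 1: read-only Pod access, limited to the team-a namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods              # hypothetical name
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-devs          # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
# Grant 2: the same rules, limited to the team-b namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: team-b
subjects:
  - kind: Group
    name: team-b-devs          # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Changing the rules in the ClusterRole updates both teams' access at once, while neither team can read Pods in the other's namespace.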

Default to the most narrowly-scoped combination that satisfies the real need — a namespaced Role + RoleBinding for anything that can be namespace-scoped, reserving ClusterRole + ClusterRoleBinding for genuinely cluster-wide needs (platform/infrastructure teams, cluster-scoped resources) — this is a direct application of the least-privilege principle to Kubernetes's own access model.

Why Pods need their own identity

Some applications running inside a Pod need to talk to the Kubernetes API server themselves — a custom controller watching for changes to a Custom Resource, a CI/CD tool creating new Deployments, or simply an application that needs to look up its own Pod's metadata. This requires an identity to authenticate as, distinct from any human user's own credentials — that's exactly what a ServiceAccount provides.

Creating and using a ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-manager
  namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: my-controller
  namespace: production             # must match the ServiceAccount's namespace
spec:
  serviceAccountName: pod-manager   # explicitly assign this ServiceAccount
  containers:
    - name: controller
      image: my-controller:1.0

If serviceAccountName isn't specified, the Pod automatically uses that namespace's default ServiceAccount — a detail worth knowing, since it means every Pod always authenticates as some identity, even if you never explicitly thought about which one.

How the credential actually gets into the Pod

By default, Kubernetes mounts a projected volume into every Pod at /var/run/secrets/kubernetes.io/serviceaccount/, containing a short-lived, auto-rotating bound service account token (a JWT), the cluster's CA certificate, and the current namespace. Any code inside the container can read this token and present it as a bearer token when calling the API server directly.

# Inside a Pod, this is how application code (or a Kubernetes client library)
# authenticates to the API server as the Pod's ServiceAccount:
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -H "Authorization: Bearer $TOKEN" \
     --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
     https://kubernetes.default.svc/api/v1/namespaces/production/pods

Most Kubernetes client libraries (used when building custom controllers/Operators — see the extensibility topic) handle this automatically via "in-cluster config" detection, so application code rarely constructs these requests by hand.

Binding permissions to a ServiceAccount

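A newly created ServiceAccount has no meaningful permissions by default (only the minimal API discovery access that every authenticated identity receives). It must be granted permissions the same way any other RBAC subject is, via a RoleBinding or ClusterRoleBinding: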

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-manager-binding
  namespace: production
subjects:
  - kind: ServiceAccount
    name: pod-manager
    namespace: production
roleRef:
  kind: Role
  name: pod-editor
  apiGroup: rbac.authorization.k8s.io
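
The binding above references a Role named pod-editor that is not defined anywhere in this section. A plausible definition might look like the following; the exact verb list is an assumption, and a real controller should be given only the verbs it actually uses:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-editor
  namespace: production        # must match the RoleBinding's namespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    # Assumed verb set for an "editor"; trim to what the workload really needs.
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

If the referenced Role does not exist, the RoleBinding is still accepted by the API server but grants nothing, which can make a missing Role look like a permissions bug.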

Why the default ServiceAccount should almost never be granted broad permissions

Because every Pod that doesn't explicitly specify a ServiceAccount silently uses default, granting broad permissions to a namespace's default ServiceAccount effectively grants those permissions to every Pod in that namespace, including ones that never intended to need API access at all — a common, easy-to-introduce security misconfiguration. Best practice is to leave default unprivileged, and create dedicated, narrowly-scoped ServiceAccounts (with correspondingly narrow RoleBindings) for the specific Pods that genuinely need API access.

Disabling auto-mounting when not needed

spec:
  automountServiceAccountToken: false

For Pods that don't need to talk to the API server at all (the majority of ordinary application workloads), explicitly disabling the automatic token mount removes an unnecessary credential from the Pod's filesystem entirely — a small but meaningful hardening step, reducing what an attacker could exfiltrate from a compromised container that had no legitimate need for API access in the first place.
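
The same field can also be set on the ServiceAccount itself, so that every Pod using it opts out by default. A sketch, assuming a hypothetical ServiceAccount named no-api-access and a Pod named web:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: no-api-access          # hypothetical name
  namespace: production
automountServiceAccountToken: false   # default for every Pod using this ServiceAccount
---
apiVersion: v1
kind: Pod
metadata:
  name: web                    # hypothetical name
  namespace: production
spec:
  serviceAccountName: no-api-access
  # Optional here: when both are set, the Pod-level value takes precedence.
  automountServiceAccountToken: false
  containers:
    - name: web
      image: myapp:1.0
```

Setting it on the ServiceAccount protects Pods whose authors forgot the field, while the Pod-level setting allows a per-workload override in either direction.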

A hardened example

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:              # Pod-level: applies to all containers by default
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
  containers:
    - name: app
      image: myapp:1.0
      securityContext:          # container-level: can override the Pod-level settings
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE   # only add back the specific capability actually needed

Key settings and what each one hardens against

  • runAsNonRoot / runAsUser: many container images default to running as root (UID 0) inside the container unless told otherwise. runAsUser sets the UID explicitly, and runAsNonRoot makes the kubelet refuse to start a container that would run as UID 0. Even if an attacker achieves code execution inside the container, they don't automatically hold root-level privileges there, which limits what they can tamper with. (Container root is still not equivalent to host root given proper isolation, so this is defense in depth, not the only layer.)
  • allowPrivilegeEscalation: false prevents a process from gaining more privileges than its parent process had, which blocks, among other things, setuid binaries from escalating privilege inside the container. This closes off a specific class of privilege-escalation and container-escape techniques.
  • readOnlyRootFilesystem: true makes the container's root filesystem immutable at runtime. An attacker who achieves code execution can't write a persistent backdoor or modify application binaries on disk. The application must then explicitly mount a writable volume (such as emptyDir) for any directory it legitimately needs to write to (temp files, caches).
  • capabilities (drop ALL, then selectively add): Linux capabilities are fine-grained permissions that break up what used to be the monolithic "root" privilege (e.g., NET_BIND_SERVICE for binding to ports below 1024, SYS_ADMIN for a wide range of administrative operations). Dropping all capabilities and adding back only the ones a container genuinely needs applies least privilege at the kernel-capability level. Most containers need none, or very few, of the capabilities the runtime grants by default.
  • fsGroup sets the group ownership of mounted volumes (for volume types that support it), letting a non-root user still have appropriate write access to volume-backed storage without needing to run as root.
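
Two of the settings above have practical follow-ons: a read-only root filesystem usually needs a writable scratch volume, and the restricted Pod Security Standard additionally expects a seccomp profile. A sketch combining both, where the Pod name app-readonly and the /tmp mount path are assumptions about what the application needs:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-readonly           # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault     # use the container runtime's default seccomp filter
  containers:
    - name: app
      image: myapp:1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp      # assumed: the only path this app writes to
  volumes:
    - name: tmp
      emptyDir: {}             # writable scratch space, discarded with the Pod
```

Anything written to the emptyDir disappears when the Pod is removed, so an attacker cannot use it to persist across restarts of the workload.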

Why this matters: limiting the blast radius of a compromised container

Container isolation (namespaces, cgroups) already provides real separation from the host, but it's not an absolute security boundary — container escape vulnerabilities do periodically get discovered, and a poorly-hardened container (running as root, with unnecessary Linux capabilities, a writable root filesystem, and unrestricted privilege escalation) gives an attacker who achieves code execution inside it a much larger set of tools to work with than a properly hardened one. SecurityContext settings are exactly the mechanism for closing off unnecessary privilege a container was never going to legitimately need.

Enforcing this cluster-wide, not just per-Pod

Rather than relying on every team to remember to set these fields correctly in every Pod spec, most clusters enforce baseline SecurityContext requirements cluster-wide (or per-namespace) via Pod Security Admission (see that question) — rejecting or flagging Pods that don't meet a minimum security bar (e.g., the restricted Pod Security Standard requires most of the settings shown above) rather than trusting every individual manifest to have gotten it right voluntarily.

Why PodSecurityPolicy was deprecated and removed

PSP let you define arbitrarily customizable security policies (allowed capabilities, allowed volume types, required user IDs, and much more) — but applying a PSP to a Pod worked through an unusually indirect mechanism: a PSP had to be granted via RBAC to a user or ServiceAccount, and which PSP actually applied to a given Pod depended on RBAC evaluation order in ways that were widely regarded as confusing and error-prone in practice. This complexity was cited as the primary reason for deprecating it in favor of something simpler.

Pod Security Admission — the replacement

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Rather than custom policy objects, Pod Security Admission enforces one of three predefined, standardized levels (the Pod Security Standards), simply by labeling a namespace:

| Level | Behavior |
| --- | --- |
| privileged | Unrestricted: no security requirements enforced at all |
| baseline | Blocks known privilege-escalation paths (e.g., disallows privileged containers, host namespaces) while remaining broadly compatible with common workloads |
| restricted | The most hardened standard: requires non-root, disallows privilege escalation, requires dropping ALL Linux capabilities (only NET_BIND_SERVICE may be added back), requires a seccompProfile (RuntimeDefault or Localhost), and more |

Three separate label keys control three independent modes: enforce (reject non-compliant Pods), audit (allow them, but record the violation in the audit log), and warn (allow them, but return a warning to the user submitting the Pod). A common pattern is to set warn and audit to a stricter level than enforce, giving teams visibility into what would be rejected under the stricter policy before enforcement is actually switched on.
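
A staged rollout might be labeled as in the sketch below, where the namespace name staging is a hypothetical example. Pods violating baseline are rejected, while Pods that pass baseline but would fail restricted are admitted with a warning and an audit record:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging                # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline     # hard floor: violations are rejected
    pod-security.kubernetes.io/audit: restricted     # recorded in the audit log only
    pod-security.kubernetes.io/warn: restricted      # shown to the submitting user only
    # Optional: pin the policy version so upgrades don't change behavior silently.
    pod-security.kubernetes.io/enforce-version: latest
```

Once the warnings stop appearing for a namespace's workloads, switching the enforce label to restricted should be a low-risk change.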

Why this is a meaningful simplification

Pod Security Admission is less flexible than PSP was: you choose from three fixed levels rather than defining arbitrary custom rules. That tradeoff was deliberate, since most of PSP's complexity came from unlimited customizability that few teams actually needed and many implemented incorrectly. The three-level model covers the vast majority of real-world security posture needs with drastically simpler, namespace-label-based configuration that's much easier to reason about and audit.

What if you need more customization than the three standard levels offer

For genuinely custom policy needs beyond the three built-in levels, the ecosystem has moved toward general-purpose policy engines — OPA Gatekeeper and Kyverno are the two most widely adopted — which use admission webhooks (see that question) to enforce arbitrary custom policies, not limited to Pod security specifically (they can validate/mutate any Kubernetes object type against custom rules). Many clusters run Pod Security Admission for the baseline hardening it provides simply and natively, layered with a policy engine like Kyverno or Gatekeeper for anything requiring finer-grained or organization-specific rules.

Knowing that PSP was removed (not just deprecated) as of 1.25, and that Pod Security Admission trades PSP's flexibility for three simple, standardized levels specifically to fix PSP's usability/complexity problems, demonstrates awareness of a real, relatively recent, and commonly-tested Kubernetes ecosystem shift — not just familiarity with security concepts in the abstract.

Where admission control fits in the request pipeline

Recall the API server's request pipeline (see the API server question): authentication (who are you) → authorization (are you allowed to do this action) → admission control (should this specific request actually be allowed/modified, given business/policy rules) → persist to etcd. Even a fully authenticated and authorized request can still be rejected or altered at the admission stage — this is where cluster-specific policy enforcement lives, distinct from the more general "is this identity allowed to do this kind of thing at all" question RBAC answers.

Built-in admission controllers

Kubernetes ships with several built-in admission controllers compiled into the API server (toggled via the kube-apiserver --enable-admission-plugins and --disable-admission-plugins flags). Examples include NamespaceLifecycle (prevents creating objects in a namespace that's being deleted), LimitRanger (enforces LimitRange defaults/constraints), ResourceQuota (enforces namespace resource quotas), and, notably, PodSecurity (implementing Pod Security Admission; see that question).
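
LimitRanger and ResourceQuota act on ordinary namespaced objects, so their behavior is easiest to see from the objects they read. A sketch with hypothetical names and illustrative values:

```yaml
# Read by LimitRanger: containers that omit requests/limits get these defaults.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults     # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 256Mi
---
# Read by ResourceQuota: requests that would exceed these totals are rejected.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota          # hypothetical name
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    pods: "20"
```

The two work together: the LimitRange fills in missing requests at admission time, and the quota can then count every Pod against the namespace totals.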

Custom admission via webhooks

Beyond the built-in set, the API server can call out to external webhook services for custom admission logic — this is how tools like OPA Gatekeeper, Kyverno, Istio's sidecar injector, and cert-manager all plug into the cluster's request pipeline without needing to be built into Kubernetes itself.

MutatingAdmissionWebhook — can modify the object

Incoming request: create a Pod
   → Mutating webhook (e.g., Istio's injector) intercepts it
   → Modifies the Pod spec to add an Envoy sidecar container
   → The MODIFIED Pod spec continues through the pipeline

A mutating webhook receives the incoming object and can return a modified version of it (typically expressed as a JSON patch) — this is exactly the mechanism a service mesh uses to automatically inject a sidecar proxy into every Pod without the Pod's original author needing to include it themselves, and how tools might automatically inject default resource requests/limits, labels, or annotations onto objects that don't specify them.
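
Registering such a webhook is done with a MutatingWebhookConfiguration object. The sketch below is not any real injector's configuration; the service name, namespace, path, selector label, and CA placeholder are all assumptions for illustration:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector                   # hypothetical
webhooks:
  - name: sidecar-injector.example.com     # should be a fully qualified, domain-style name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:                             # in-cluster Service hosting the webhook
        namespace: injector-system         # hypothetical
        name: sidecar-injector             # hypothetical
        path: /mutate                      # hypothetical
      caBundle: "<base64-encoded CA bundle>"   # placeholder, not a real value
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    namespaceSelector:                     # only namespaces that opt in are intercepted
      matchLabels:
        sidecar-injection: enabled         # hypothetical label
```

Scoping with namespaceSelector keeps the webhook out of namespaces that never asked for injection, which also limits the damage if the webhook service misbehaves.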

ValidatingAdmissionWebhook — can only accept or reject

Incoming request: create a Pod (now possibly already mutated above)
   → Validating webhook (e.g., a policy engine) checks it against custom rules
   → Either allows it through unchanged, or rejects it with an error

A validating webhook cannot modify the object at all — it can only inspect the (already fully mutated) object and return an allow/deny decision, optionally with an explanatory message shown back to whoever submitted the request. This is how organization-specific policies get enforced — e.g., "every Deployment must have resource limits set," "container images must come from our approved internal registry," "no Pod may run as root" (for organizations wanting rules beyond what the built-in Pod Security Standards cover).
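
A validating webhook is registered the same way, using a ValidatingWebhookConfiguration. This sketch routes Deployment creates and updates to a policy service; the service name, namespace, path, and CA placeholder are hypothetical, and the policy logic itself lives in that service rather than in this object:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: deployment-policy                  # hypothetical
webhooks:
  - name: deployment-policy.example.com    # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        namespace: policy-system           # hypothetical
        name: policy-server                # hypothetical
        path: /validate                    # hypothetical
      caBundle: "<base64-encoded CA bundle>"   # placeholder, not a real value
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]   # validate changes as well as new objects
        resources: ["deployments"]
```

Including UPDATE matters: a policy checked only on CREATE can be bypassed by creating a compliant object and editing it afterwards.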

Why mutating webhooks run before validating ones

This ordering is deliberate and important: mutation happens first, so that by the time validation runs, it's evaluating the final state of the object — including anything automatically added by mutating webhooks — rather than validating an intermediate, incomplete version that's about to change. If this order were reversed, a validating webhook might approve an object that a subsequent mutation then changes into something that would have failed validation, silently undermining the policy enforcement's whole purpose.

Why this matters operationally

Admission webhooks are themselves a critical-path dependency for the cluster's ability to create or modify objects. If a webhook service is down, misconfigured, or slow, the outcome depends on its configured failurePolicy: Ignore fails open (requests are allowed through, defeating the webhook's purpose), while Fail fails closed (all matching requests are blocked, including entirely legitimate ones, until the webhook service recovers). Deploying webhook services with high availability and sensible timeouts is a genuine, non-trivial operational responsibility for any cluster relying on them for critical policy enforcement.
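
The operational knobs involved can be sketched as follows. All names and the CA placeholder are hypothetical, and the specific values are illustrative choices rather than recommendations from the original text:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy                         # hypothetical
webhooks:
  - name: pod-policy.example.com           # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                    # fail closed; "Ignore" would fail open
    timeoutSeconds: 5                      # below the 10s default, to bound added latency
    clientConfig:
      service:
        namespace: policy-system           # hypothetical
        name: policy-server                # hypothetical
        path: /validate                    # hypothetical
      caBundle: "<base64-encoded CA bundle>"   # placeholder, not a real value
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    namespaceSelector:                     # keep system namespaces out of the blast radius
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "policy-system"]
```

Excluding the webhook's own namespace avoids a deadlock in which the policy server's Pods cannot be recreated because the (currently down) policy server must approve them.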