22.7 OPA Gatekeeper: Policy-as-Code

Right, let’s talk about OPA Gatekeeper. You’ve probably heard the term “Policy-as-Code” thrown around like confetti at a DevOps conference. It sounds great, but what does it actually mean? It means you stop treating your cluster’s security and governance rules as a set of sticky notes on a monitor and start treating them like, well, code. Version-controlled, testable, reviewable code. That’s what Gatekeeper does: it takes the general-purpose policy engine Open Policy Agent (OPA) and bolts it directly into Kubernetes as a validating admission webhook, giving you a firm veto over anything trying to enter your cluster.

Think of it as your cluster’s very own bouncer. This bouncer doesn’t care about shoe style or how famous you say you are; it cares about a strict, predefined list of rules. “Does this Pod have a memory limit? No? You can’t come in.” It’s the enforcement mechanism for all those best practices you keep meaning to implement but never quite get around to.

The Core Architecture: Constraint Templates and Constraints

Gatekeeper’s power—and its initial complexity—comes from its two-layer architecture. You don’t just write a rule; you first define a kind of rule (a Template), and then you create specific instances of that rule (Constraints). This is actually brilliant because it separates the logic (written in Rego, OPA’s policy language) from the configuration.

First, you define a ConstraintTemplate. This is where the Rego lives. It’s the generic blueprint for a type of policy.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels # This becomes the kind of your actual Constraint
      validation:
        openAPIV3Schema:
          type: object # templates.gatekeeper.sh/v1 expects a structural schema
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

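        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          # Label keys present on the object under review
          provided := {label | input.review.object.metadata.labels[label]}
          # Label keys demanded by the Constraint's parameters
          required := {label | label := input.parameters.labels[_]}
          # Set difference: required keys that the object lacks
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide these labels: %v", [missing])
        }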

This template creates a new custom resource definition (CRD) for a K8sRequiredLabels kind. The Rego code inside is the brains. It takes a list of labels as a parameter, checks the incoming object, and creates a “violation” if any are missing.

Now, the actual Constraint is much simpler. It’s just an instance of that blueprint, saying what labels are required and where to apply the rule.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels # This matches the kind from the Template
metadata:
  name: must-have-team-label
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces:
      - "production"
  parameters:
    labels: ["team"]

This constraint uses our template to enforce that every Pod in the production namespace must have a team label. See how clean that separation is? You write the complex Rego once, and your cluster admins can create specific constraints by just filling in the parameters.
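
To make the effect concrete, here is a minimal sketch of two Pods submitted to the production namespace while the constraint above is active. The Pod names, the app label, and the image are hypothetical and chosen only for illustration. Assuming the constraint is enforcing (the default), the first Pod should be rejected with the msg built by the template, and the second should be admitted.

```yaml
# Hypothetical Pod with no "team" label: the violation rule produces an entry,
# so admission is denied.
apiVersion: v1
kind: Pod
metadata:
  name: web-unlabeled
  namespace: production
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx
---
# Same Pod with the required label: the set of missing labels is empty,
# so no violation is produced and the Pod is admitted.
apiVersion: v1
kind: Pod
metadata:
  name: web-labeled
  namespace: production
  labels:
    app: web
    team: payments
spec:
  containers:
    - name: web
      image: nginx
```

The same two manifests applied to any other namespace would both pass, because the constraint's match block limits it to production.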

The Rego You Need to Know (Without the Headache)

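Rego is a powerful language, but it can feel weird if you’re used to imperative programming. The snippet above is a classic pattern. The violation rule is what Gatekeeper looks for. It is not a boolean; it builds a set. Every time the rule body holds for an incoming object, an entry with a msg is added to that set. If the set ends up non-empty, the request is denied (under the default deny enforcement action), and the msg is sent back to the user.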

The key thing to understand is the input object. Gatekeeper automatically provides a rich input structure containing the admission request under input.review, including the entire object under review (input.review.object) and the operation type (input.review.operation, e.g. CREATE or UPDATE), plus the constraint’s own parameters under input.parameters. Your job is to write Rego that queries this input to find a violation.
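
As a rough picture of what that input looks like, here is an abbreviated sketch for a Pod creation checked against the must-have-team-label constraint. Gatekeeper passes this document to Rego as JSON; it is rendered as YAML here for readability. All values are hypothetical, and many fields of the real admission request are left out.

```yaml
# Abbreviated, illustrative shape of `input` (values are hypothetical).
parameters:              # copied from spec.parameters of the Constraint
  labels: ["team"]
review:                  # the Kubernetes admission request
  operation: CREATE      # the operation type, e.g. CREATE or UPDATE
  kind:
    group: ""
    version: v1
    kind: Pod
  namespace: production
  name: web-unlabeled
  userInfo:
    username: jane@example.com
  object:                # the full object as submitted
    apiVersion: v1
    kind: Pod
    metadata:
      name: web-unlabeled
      namespace: production
      labels:
        app: web
    # spec omitted for brevity
  oldObject: null        # carries the previous version on UPDATE
```

Reading the template's Rego against this sketch: provided would be the set containing "app", required the set containing "team", and missing the set containing "team", so the rule yields one violation.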

Common Pitfalls and How to Avoid Them

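Match Blocks Are ANDs, Not ORs: The fields of a constraint’s match block are combined with AND. If you specify kinds: ["Pod"] and namespaces: ["production"], the constraint applies only to Pods IN the production namespace. Within a single field, the listed values are alternatives, so kinds: ["Pod", "Service"] matches either kind. To get an OR across fields (e.g., Pods in production OR Deployments in staging), you need to create multiple constraints. This trips everyone up.
Dry Run is Your Best Friend: Before you let a constraint run with enforcementAction: deny (which is also the default when the field is omitted), set it to enforcementAction: dryrun. This lets you see what would be denied, in the constraint’s status and in Gatekeeper’s audit logs, without actually breaking anything. It’s the equivalent of measuring twice before cutting.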
The Rego Itself is a Denial-of-Service Risk: Your policy rules execute synchronously during admission, on every matching request. If you write pathologically inefficient Rego (like a rule that walks every Pod in the cluster for every new request), you add that latency to every write going through the API server, and a slow enough webhook will start timing out. Keep your rules focused and avoid operations that scale with cluster size.
Mutations Are a Different Beast: Notice that our example only validates. It says yes/no. Gatekeeper can also mutate objects, changing them on the fly to add labels, set defaults, etc. Mutation is not written in Rego; it is configured through separate mutator resources such as Assign, AssignMetadata, and ModifySet. This is incredibly powerful but also incredibly dangerous. The mutation system is more complex and harder to debug. Tread carefully and have a rollback plan.
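
The first two pitfalls can be sketched together. The pair of constraints below reuses the K8sRequiredLabels template to cover Pods in production OR Deployments in a staging namespace, with both set to dry run. The constraint names and the staging namespace are assumptions made for this example.

```yaml
# Constraint 1: Pods AND namespace production (fields inside match are ANDed).
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pods-in-production-need-team        # hypothetical name
spec:
  enforcementAction: dryrun                 # record violations, do not block
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["production"]
  parameters:
    labels: ["team"]
---
# Constraint 2: Deployments AND namespace staging.
# Having both constraints gives the OR across the two combinations.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-in-staging-need-team    # hypothetical name
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
    namespaces: ["staging"]                 # hypothetical namespace
  parameters:
    labels: ["team"]
```

While these run in dry run, the violations found by Gatekeeper's audit should show up under status.violations on each constraint, for example via kubectl get k8srequiredlabels pods-in-production-need-team -o yaml. Once the list looks like what you expect, switch enforcementAction to deny.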

The designers made a questionable choice by not having a simpler, YAML-only mode for the most common policies, so writing your own rule means learning Rego even for the simplest tasks (the community-maintained Gatekeeper policy library of prebuilt templates softens this for common cases). But once you get over that hump, you realize that Rego’s expressiveness is what makes the whole thing work. It’s the price of admission for having a bouncer this smart.