33.5 Istio mTLS: Automatic Certificate Rotation and Peer Authentication

Right, let’s talk about one of Istio’s killer features: automatic mTLS. This is the stuff that makes security teams weep with joy and operators sleep soundly at night. You’ve deployed Istio, and suddenly, without you lifting a finger, all the traffic between your pods is encrypted and mutually authenticated. It feels like magic. But you and I are engineers; we don’t trust magic. We trust systems that are explicit, understandable, and occasionally, a bit of a pain to debug. So let’s pop the hood.

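The core of this magic is a per-pod sidecar and a central component called istiod. istiod is the brain, the certificate authority (CA), and the anxious parent all rolled into one. The sidecar container runs two things: the Envoy proxy and the istio-agent that looks after it. When a pod starts up, the agent has no certificate yet, only the pod’s Kubernetes service account token. It generates a private key locally, then calls home to istiod’s secure gRPC service with a certificate signing request (CSR) and that token as proof of who it is. istiod says, “Ah, a new citizen! Welcome, here is your SPIFFE identity document,” and hands back a signed X.509 certificate. The private key never leaves the pod. This certificate isn’t just any cert; it’s beautifully specific, following the SPIFFE standard with a URI like spiffe://cluster.local/ns/default/sa/my-app, that is, spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>, where cluster.local is the default trust domain. This means it’s not just proving that the workload belongs to the mesh, but which namespace it lives in and which service account it runs as. This granularity is what makes the zero-trust model actually workable.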
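To make the identity mapping concrete, here is a minimal sketch of where the pieces of that SPIFFE URI come from. The names (my-app, the default namespace, the image) are placeholders for illustration, and the trust domain is assumed to be the default cluster.local:

```yaml
# The namespace and service account below are what end up in the cert's URI.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Identity: spiffe://<trust-domain>/ns/default/sa/my-app
      serviceAccountName: my-app
      containers:
      - name: app
        image: registry.example.com/my-app:1.0   # hypothetical image
```

Two Deployments that share a service account share an identity, so if you want policy to tell workloads apart, give each its own service account.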

The Beautiful, Automatic Certificate Rotation

Here’s the best part: this certificate has a brutally short lifespan. We’re talking a default of 24 hours. This isn’t Istio being sadistic; it’s a core security best practice called short-lived certificates. If an attacker manages to exfiltrate a certificate, its value is extremely limited. It’ll be useless in a day.

But wait, you think, “I’m not restarting my pods every day!” Exactly. You don’t have to. The istio-agent inside the sidecar is constantly monitoring the expiration of its own cert. It doesn’t wait until the last minute: by default it renews at roughly the halfway point of the cert’s life (about 12 hours in for a 24-hour cert; the agent’s SECRET_GRACE_PERIOD_RATIO setting, default 0.5, controls this). It goes back to istiod and says, “My thing is about to expire, can I get a new one?” istiod happily issues a new certificate, and the agent rotates it seamlessly. The Envoy proxy gets the new cert hot-swapped over its Secret Discovery Service (SDS) without dropping a single connection. This all happens in the background, completely transparently to your application. To see it in action, you can peek into a pod:

# Envoy's admin endpoint is plain HTTP on localhost; no client cert is needed.
# If the proxy image has no curl, try: istioctl proxy-config secret <your-pod>
kubectl exec <your-pod> -c istio-proxy -- \
  curl -s http://localhost:15000/certs | jq

You’ll see the cert chain, its serial number, and its expiration. Run it again after the next rotation, and you’ll see the serial number and expiry have changed. This is operational security on autopilot.
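If the default lifetime and rotation point don’t suit you, both are agent settings. What follows is a sketch, not a recommendation: it assumes your Istio release supports the SECRET_TTL and SECRET_GRACE_PERIOD_RATIO agent environment variables and the proxy.istio.io/config annotation, and the 12h value is made up for illustration. Check the reference docs for your version before copying it.

```yaml
# Fragment of a Deployment's pod template (not a complete manifest).
spec:
  template:
    metadata:
      annotations:
        # proxyMetadata entries become environment variables on the sidecar.
        proxy.istio.io/config: |
          proxyMetadata:
            SECRET_TTL: "12h"                  # requested cert lifetime (illustrative)
            SECRET_GRACE_PERIOD_RATIO: "0.5"   # renew when this fraction of life remains
```

Shorter lifetimes mean more signing requests to istiod and less slack if the control plane is unavailable, so shrink them with care.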

Telling Services Who to Trust: PeerAuthentication

Okay, so every pod has a cool ID card. But when does a pod actually insist on seeing one? This is where PeerAuthentication comes in. This resource is your bouncer: it tells the receiving sidecar proxies exactly how strict to be at the door for inbound traffic.

The most common pitfall here is misunderstanding the hierarchy of these policies. They can be set mesh-wide (in the root namespace, istio-system by default), per-namespace, or per-workload, with more specific policies overriding broader ones. A rookie move is to slam a STRICT policy on the entire mesh without first ensuring all your legacy services are in the mesh or properly exempted. You will break everything. Don’t ask me how I know.
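To make the scopes concrete: where a policy lives, and whether it carries a selector, decides how far it reaches. Here is a sketch of the mesh-wide form, assuming the root namespace is the default istio-system. It is deliberately set to PERMISSIVE so that applying it doesn’t cause the outage just described:

```yaml
# Mesh-wide: lives in the root namespace and has no selector.
# The same policy in any other namespace is namespace-wide;
# add spec.selector and it narrows to the matching workloads only.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # assumption: default root namespace
spec:
  mtls:
    mode: PERMISSIVE
```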

The sane approach is to start per-namespace. Let’s say you have a payments namespace where you need the highest security.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT

This policy tells every sidecar in the payments namespace: “Only accept connections from clients that present a valid certificate issued by our CA (istiod).” No cert? No entry.

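But what about that legacy service that can’t have a sidecar injected? It has no certificate to present, so every STRICT workload it calls will turn it away. PeerAuthentication only governs what a sidecar accepts on the way in, so you relax the rules on the receiving side: a more specific policy for the one workload it needs to reach, or better yet, a PERMISSIVE mode policy for a namespace that is still mid-migration, such as the sidecar-injected workloads in legacy that plaintext clients still call. PERMISSIVE is Istio’s “come as you are” mode: it will accept both encrypted mTLS traffic and plaintext traffic. This is your migration tool.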

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy
spec:
  mtls:
    mode: PERMISSIVE
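
The more specific route is a workload-level policy. Here is a sketch, assuming a hypothetical ledger-exporter workload in payments whose metrics port (9090, also made up) is scraped by a client that speaks plaintext. Everything else on the workload stays STRICT:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: ledger-exporter          # hypothetical workload
  namespace: payments
spec:
  selector:
    matchLabels:
      app: ledger-exporter       # narrows the policy to this workload
  mtls:
    mode: STRICT
  portLevelMtls:
    9090:                        # the workload's port, not the Service port
      mode: PERMISSIVE
```

Because it has a selector, this policy overrides the namespace-wide default for the matching pods only, which keeps the exception as small as the problem.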

The Rough Edges and How to Not Cut Yourself

It’s not all roses. The certificate rotation is robust, but it depends on istiod being available. If your control plane is down, new pods won’t get certs at all, and existing pods can’t renew. Each pod keeps working until its current cert expires (up to 24 hours away with the defaults, less if it was already due for rotation), and then its mTLS connections start failing. Leave it long enough and your mesh will grind to a halt. This is why high availability for istiod isn’t a nice-to-have; it’s an absolute requirement for any production cluster.
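What high availability looks like depends on how you installed Istio. As one hedged sketch, assuming an install driven by an IstioOperator file (Helm installs expose equivalent settings under different names), you can keep at least two istiod replicas running. The replica counts here are illustrative:

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:                  # "pilot" is the component name for istiod
      k8s:
        hpaSpec:
          minReplicas: 2    # never scale below two istiod pods
          maxReplicas: 5
```

Two replicas only help if they don’t share a fate, so it is worth checking that they land on different nodes or zones.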

Another edge case is jobs. A short-lived Job or CronJob pod might start and finish before the istio-agent even gets around to its first certificate renewal. For these, the initial certificate request is the only one that matters, so just ensure your istiod is responsive during job execution windows.

The most common “oh crap” moment is when you see connection resets, RBAC: access denied, or TLS error: 268435703:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER in your logs. This almost always means a traffic mismatch. A client outside the mesh is trying to talk to a service with a STRICT policy (or is hitting an AuthorizationPolicy that expects an mTLS identity it can’t present, which is where the RBAC denial comes from), or a service inside the mesh is trying to talk to an external service using mTLS (which it obviously won’t understand). For the latter, you need a well-crafted ServiceEntry, and no DestinationRule forcing ISTIO_MUTUAL onto that host, to manage that egress traffic properly.
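For the egress case, a minimal ServiceEntry might look like the sketch below. The hostname and resource name are invented, and it assumes the application originates its own TLS on port 443, so the sidecar just passes the connection through by SNI rather than wrapping it in Istio mTLS:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-billing-api     # hypothetical name
  namespace: payments
spec:
  hosts:
  - api.billing.example.com      # hypothetical external host
  location: MESH_EXTERNAL        # tells Istio this is not a mesh workload
  ports:
  - number: 443
    name: tls
    protocol: TLS                # app-originated TLS, passed through untouched
  resolution: DNS
```

If the application speaks plain HTTP and you want the sidecar to originate TLS instead, that is a different setup involving a DestinationRule with a TLS mode of SIMPLE, and it is worth reading the egress docs before attempting it.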

The best practice? Use PERMISSIVE mode in namespaces as you onboard, and watch the istio_requests_total metric as reported by the destination sidecar: the connection_security_policy label tells you which requests arrived over mutual_tls and which are still plaintext (none), while response_flags surfaces failures such as "NR" (no route configured). Gradually tighten policy to STRICT once you’re confident all intended traffic is encrypted. Let the metrics tell you the story, not your screaming phone at 3 a.m.
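One way to let the metrics do the talking is an alert on leftover plaintext traffic. This is a sketch of a Prometheus rule file, assuming Istio’s standard istio_requests_total metric is being scraped; the group name, alert name, namespace, and thresholds are all placeholders:

```yaml
groups:
- name: mtls-migration                     # hypothetical group name
  rules:
  - alert: PlaintextTrafficInPayments      # hypothetical alert name
    # connection_security_policy is only meaningful on destination-side
    # reports, hence reporter="destination".
    expr: |
      sum by (source_workload, destination_workload) (
        rate(istio_requests_total{
          reporter="destination",
          destination_workload_namespace="payments",
          connection_security_policy!="mutual_tls"
        }[5m])
      ) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Non-mTLS traffic is still reaching the payments namespace"
```

When this stays quiet for a comfortable stretch under PERMISSIVE, that is your signal that flipping the namespace to STRICT won’t surprise anyone.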