41.4 Upgrading Managed Clusters: EKS, GKE, AKS Strategies

Right, so you’ve got a production cluster humming along nicely. It’s serving traffic, the metrics look good, and you’re feeling pretty smart. Then you remember: the Kubernetes version is about to fall out of support. Panic? No. We don’t do that. We do planned, methodical, slightly nerve-wracking upgrades. With managed services like EKS, GKE, and AKS, the cloud providers have done a lot of the heavy lifting, but they’ve also handed you a box of very powerful, very sharp tools. It’s your job not to accidentally amputate a production workload.

The first rule of Upgrade Club is to read the provider’s release notes. No, really. I’m not just saying that. They will tell you about deprecated APIs that are being removed, which is the single biggest cause of upgrade-induced heartbreak. Your old networking.k8s.io/v1beta1 Ingress? It’s dead, Jim. You need to migrate to networking.k8s.io/v1 before you upgrade the control plane. Finding this out mid-upgrade is a terrible, no-good, very bad day.
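For a sense of what that migration actually involves, here is a minimal sketch that writes a networking.k8s.io/v1 Ingress to a file. Every name in it (my-legacy-ingress, my-service, app.example.com, the nginx ingress class) is a placeholder, not something from a real cluster. The changes that usually bite are the now-required pathType and the restructured backend block.

```shell
# Write a v1 Ingress manifest; all names here are hypothetical placeholders.
cat > ingress-v1.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-legacy-ingress
  namespace: production
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix        # required in v1
            backend:
              service:              # was serviceName/servicePort in v1beta1
                name: my-service
                port:
                  number: 80
EOF

# Validate against the live API server without changing anything:
#   kubectl apply --dry-run=server -f ingress-v1.yaml
```

The server-side dry run is the useful part: it asks the API server whether it would accept the object, which is exactly the question the upgrade is about to answer for you.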

Pre-Flight Checklist: Your Get-Out-of-Jail-Free Card

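Never, ever upgrade without a recent backup of your critical state (etcd snapshots are managed for you, but your database living in the cluster isn’t). Then go hunting for anything that still uses deprecated APIs. Don’t rely on kubectl get all --all-namespaces for this: despite the name, it skips whole resource types (Ingresses among them), and the API server hands objects back in whichever version you ask for, so it won’t tell you what your manifests and Helm charts actually declare. Fantastic tools like pluto or kube-no-trouble (kubent) do this job properly.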

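# Let's find those deprecated APIs before they find us.
# Pin a real kubent release; 0.7.0 is only an example, check the releases page for the current one.
KUBENT_VERSION=0.7.0
curl -sSL "https://github.com/doitintl/kube-no-trouble/releases/download/${KUBENT_VERSION}/kubent-${KUBENT_VERSION}-linux-amd64.tar.gz" | tar -xz
./kubent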
# Example output: 
# >>> Deprecated APIs found!
# ---
# KIND        NAMESPACE  NAME                    API_VERSION
# Ingress     production my-legacy-ingress       extensions/v1beta1

Fix those. Now. Also, check your Pod Disruption Budgets (PDBs). Are they too restrictive? A PDB that demands 100% availability for a StatefulSet (minAvailable: 100% or maxUnavailable: 0) means the drain will politely refuse to evict your pod, bringing the entire node drain process to a screeching halt. It’s the equivalent of putting a “Do Not Disturb” sign on a hotel room door that’s actively on fire.
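One way to spot the troublemakers ahead of time is to list every PDB that currently allows zero disruptions. The sketch below assumes you feed it three whitespace-separated columns (namespace, name, allowed disruptions); blocking_pdbs is a made-up helper name, and the custom-columns query in the comment is how I would expect to produce that input from a live cluster.

```shell
# Reads lines of "NAMESPACE NAME ALLOWED_DISRUPTIONS" on stdin and prints
# the PDBs that currently allow zero evictions.
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Intended usage against a live cluster:
#   kubectl get pdb --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```

Anything this prints will stall a drain until the budget is relaxed or more replicas become healthy, so deal with it before you start.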

The Control Plane Dance: Let the Provider Do the Work

This is the easy part. For a managed service, you click a button or run a CLI command and the provider upgrades the API server, etcd, scheduler, etc. The genius of it is that they typically do this in a rolling fashion, so you shouldn’t experience an API outage. The catch? Once the control plane is upgraded, it starts enforcing the rules of the new version. If you still have manifests trying to use that deprecated Ingress API, your kubectl apply commands will start failing. See why we did the pre-flight check?
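The cluster scan only sees what is already deployed, so it is worth grepping the manifests in your repository as well. A rough sketch, assuming plain YAML files on disk and a grep that supports --include (GNU and BSD grep both do); find_removed_v1beta1 is a hypothetical helper name, and the pattern only covers the two API groups mentioned in this section.

```shell
# Prints file:line for every manifest under a directory that still pins
# extensions/v1beta1 or networking.k8s.io/v1beta1.
# Exits non-zero when nothing matches (standard grep behaviour).
find_removed_v1beta1() {
  grep -rnE --include='*.yaml' --include='*.yml' \
    'apiVersion:[[:space:]]*(extensions|networking\.k8s\.io)/v1beta1' "$1"
}

# Usage (./manifests is a hypothetical path to your repo's YAML):
#   find_removed_v1beta1 ./manifests
```

This will not catch templated Helm charts or manifests generated at deploy time, so treat a clean result as encouraging rather than conclusive.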

Here’s how you kick it off for EKS, GKE, and AKS. Notice how they all have different, and frankly, kinda silly names for the same concept.

# AWS EKS (Because 'upgrade' was too obvious)
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.27

# Google GKE (It's just 'upgrade', bless them)
gcloud container clusters upgrade my-cluster --cluster-version 1.27 --master

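# Microsoft AKS (Powershell-friendly, of course)
# --control-plane-only matters: without it, az aks upgrade also upgrades every node pool.
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.27 --control-plane-only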

This typically takes ten to twenty minutes, though it can run longer. Go get a coffee. Stare at the console. Watch the metrics like a hawk. It’s fine. Everything is fine.
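If staring at the console gets old, a small polling loop can do the staring for you. This is a sketch, not a provider-blessed tool: wait_for_status is a name I made up, it runs whatever status command you hand it, and it gives up after a fixed number of attempts. The EKS example in the comment relies on the cluster status reading UPDATING during the upgrade and ACTIVE once it finishes.

```shell
# Polls a status command until it prints the wanted value or attempts run out.
# Usage: wait_for_status WANTED MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_for_status() {
  wanted=$1; attempts=$2; pause=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    status=$("$@" 2>/dev/null)
    echo "attempt $i: ${status:-unknown}"
    [ "$status" = "$wanted" ] && return 0
    i=$((i + 1))
    sleep "$pause"
  done
  return 1
}

# EKS example: check every 30 seconds, for up to 30 minutes.
#   wait_for_status ACTIVE 60 30 \
#     aws eks describe-cluster --name my-cluster --query cluster.status --output text
```

A non-zero exit means the loop timed out, not that the upgrade failed, so check the provider console before drawing conclusions.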

Node Groups: The Real Heart of the Operation

The control plane is new and shiny, but your workloads are still running on old, crusty nodes. Now we have to upgrade the worker nodes. The strategy here is crucial. The “cattle, not pets” philosophy gets its real test during a node upgrade.

The managed service way is to create a new node group (or node pool) with the desired AMI/OS image that matches the new control plane version. Then, you cordon and drain the old nodes, letting the workloads reschedule onto the new, upgraded nodes. GKE and AKS have nice automated options for this (as do EKS managed node groups), but I’m a control freak and prefer a blue-green style, manual process. It’s slower but gives me maximum visibility and a rollback path for the nodes: just uncordon the old ones and terminate the new ones. The control plane itself can’t be rolled back to the previous minor version, which is one more reason to take the node side slowly.

# First, label the new node pool so we can target it
kubectl label nodes -l cloud.google.com/gke-nodepool=new-pool node-role.kubernetes.io/upgraded=

# Cordon all nodes in the old pool to prevent new schedules
kubectl cordon -l cloud.google.com/gke-nodepool=old-pool

# Now, evict the pods gracefully, respecting PDBs
kubectl drain -l cloud.google.com/gke-nodepool=old-pool --ignore-daemonsets --delete-emptydir-data
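Draining by label selector hands the whole pool to a single command. If you would rather go node by node and stop at the first sign of trouble, a loop along these lines is one option. It is only a sketch: drain_one_by_one and the KUBECTL override are my own inventions, and the ten-minute timeout is an arbitrary starting point.

```shell
# KUBECTL can be overridden (e.g. KUBECTL="echo kubectl") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Reads node names on stdin and drains them one at a time, stopping at the
# first failure so a blocked eviction doesn't go unnoticed.
drain_one_by_one() {
  while read -r node; do
    [ -n "$node" ] || continue
    echo "draining $node"
    $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data \
      --timeout=10m </dev/null || return 1
  done
}

# Usage against the old pool:
#   kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool --no-headers \
#     -o custom-columns=NAME:.metadata.name | drain_one_by_one
```

Stopping on the first failed drain is deliberate: a node that will not drain is usually a PDB or a stuck pod telling you something, and ploughing on to the next node just spreads the problem around.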

Watch the pods reschedule. Some will come up instantly. Others, like stateful applications, will be grumpy. This is where you discover that one application that had a local hostPath volume it shouldn’t have been using and now its data is gone. Sorry. You learned a valuable lesson about persistent volumes.
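That particular lesson is cheaper to learn before the drain than after it. The sketch below filters a pod listing down to the ones that mount a hostPath volume. pods_with_hostpath is a hypothetical helper, and it assumes three columns on stdin (namespace, pod name, hostPath paths), with kubectl printing <none> for pods that have no hostPath volumes.

```shell
# Reads lines of "NAMESPACE NAME HOSTPATHS" on stdin and prints pods that
# mount at least one hostPath volume.
pods_with_hostpath() {
  awk '$3 != "<none>" && $3 != "" { print $1 "/" $2 " -> " $3 }'
}

# Intended usage:
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,HOSTPATH:.spec.volumes[*].hostPath.path' \
#     | pods_with_hostpath
```

Expect some legitimate hits in kube-system (CNI plugins and log collectors mount host paths on purpose). The ones to worry about are application pods writing data they expect to find again.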

The Aftermath: Trust, but Verify

The upgrade is “done” when the last old node is terminated. But you’re not done. Your job now is to be paranoid.

# Did everything come back? (Completed Job pods are fine, so filter those out too)
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'

# Are the nodes all on the new version?
kubectl get nodes

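# Is the cluster even functional? Run a simple test deployment, then clean up.
kubectl create deployment upgrade-test --image=nginx --replicas=2
kubectl rollout status deployment/upgrade-test --timeout=120s
kubectl get pods -l app=upgrade-test -o wide
kubectl delete deployment upgrade-test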
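Eyeballing the VERSION column works for five nodes and fails quietly at fifty. A small filter can flag the stragglers instead. This is a sketch with a made-up helper name (nodes_not_on); it assumes two columns on stdin, node name and kubelet version, and treats the target as a minor-version prefix such as v1.27.

```shell
# Reads lines of "NAME KUBELET_VERSION" on stdin and prints nodes whose
# version is not on the given minor release (e.g. v1.27).
nodes_not_on() {
  awk -v want="$1." 'index($2, want) != 1 { print $1 " is on " $2 }'
}

# Intended usage:
#   kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
#     | nodes_not_on v1.27
```

Empty output is the result you want. Anything listed is a node that missed the upgrade, most likely one that was never drained and replaced.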

Run your full battery of integration tests. Check the logs for any new, weird errors. Often the upgrade didn’t cause them so much as reveal them. The cluster is now a higher-versioned, more secure, and supported environment. You’ve survived. Until the next release, of course. See you in four months or so.