41.6 Testing Upgrades in a Staging Cluster First

Look, I know you’re busy. The business is breathing down your neck for that new feature, and the idea of taking a perfectly good, running cluster and spending hours meticulously testing an upgrade feels like a luxury you can’t afford. I’m here to tell you it’s not a luxury; it’s your only life raft. Skipping this step is like skydiving and then checking if your parachute is in the bag. You will, at some point, have a catastrophic failure in production. The only question is whether you’ve practiced your emergency procedures in a safe environment first. A staging cluster is that safe environment. It’s where we break things on purpose so we don’t break them by accident later.

Your Staging Cluster is a Photocopy, Not a Sketch

Your staging environment must be a near-perfect replica of production. I’m not talking about the same number of nodes (though that’s ideal). I’m talking about the same configuration, the same versions of underlying operating systems, the same container runtime, the same network policies, and (this is the big one) the same data. If production has 50GB of data, staging needs a recent anonymized copy of it. Why? Because an upgrade isn’t just about the control plane; it’s about how the new API server talks to etcd, how the new kube-proxy handles your thousands of services, and how the new scheduler reacts to your actual, messy, complicated workloads. A bug might only surface under the specific pressure of your data and your traffic patterns.
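A quick way to check the "photocopy" claim before you start is to compare what the nodes in each cluster report about themselves. The sketch below assumes you have kubeconfig contexts for both clusters (the context names in the usage comment are hypothetical), and it only compares node-level facts: kubelet version, OS image, kernel, and container runtime. It says nothing about network policies or data.

```shell
# Print one line per distinct node profile in the given kubeconfig context:
# kubelet version, OS image, kernel version, container runtime version.
node_inventory() {
  kubectl --context "$1" get nodes \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\t"}{.status.nodeInfo.osImage}{"\t"}{.status.nodeInfo.kernelVersion}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' \
    | sort -u
}

# Return 0 when both contexts report the same set of node profiles.
# On a mismatch, the diff output shows which profiles differ.
compare_clusters() {
  prod_file=$(mktemp) && staging_file=$(mktemp) || return 2
  node_inventory "$1" > "$prod_file"
  node_inventory "$2" > "$staging_file"
  diff "$prod_file" "$staging_file"
  status=$?
  rm -f "$prod_file" "$staging_file"
  return "$status"
}

# Usage (context names are placeholders):
#   compare_clusters prod-context staging-context || echo "staging has drifted"
```

If the two clusters legitimately differ in some node pools, filter those lines out before the diff rather than ignoring the result.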

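You can use Velero to take a backup in production and restore it in staging. For this to work, both clusters need Velero installed and pointed at the same backup storage location (ideally read-only on the staging side), because staging can only restore backups it can see in object storage. Persistent volume contents come along only if you have volume snapshots or file-system backup configured, and Velero copies what is there: anonymizing the data is still a separate step. Even so, it’s a few commands and it’s worth its weight in gold.

# On your production cluster, take a backup of the critical namespace and wait for it to finish
velero backup create prod-backup-2023-10 --include-namespaces my-critical-app --wait

# On your staging cluster, confirm the backup is visible, then restore it
velero backup get prod-backup-2023-10
velero restore create --from-backup prod-backup-2023-10 --wait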

The Test Plan: More Than Just “Does It Boot?”

“Does it turn on?” is a pathetic test plan. You need a rigorous, automated checklist that proves everything that matters to your business still works. This isn’t just a k8s test; it’s an application test. Your plan should include:

Service Connectivity: Can all services still talk to each other as expected? Test with curl from a test pod.
Data Integrity: After the upgrade, do your applications still read and write correctly? Run a known query and check the result.
Performance: Did latency or throughput regress? Fire up a load test with something like hey or wrk.
Existing Operations: Can you still roll out a new Deployment? Scale a StatefulSet? A great test is to run your entire CI/CD pipeline against the upgraded staging cluster.
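To turn that checklist into something you can run the same way before and after the upgrade, a tiny runner is often enough. This is a sketch, not a framework: check and latency_ok are hypothetical helper names, and the resource names and numbers in the usage comment are placeholders you would replace with your own checks and with figures taken from your load-test tool.

```shell
FAILED=0

# check NAME COMMAND...: run one check, print PASS or FAIL, count failures.
check() {
  name="$1"
  shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    FAILED=$((FAILED + 1))
  fi
}

# latency_ok BASELINE_MS CURRENT_MS MAX_REGRESSION_PCT
# Succeeds when CURRENT_MS is no more than MAX_REGRESSION_PCT percent above BASELINE_MS.
latency_ok() {
  awk -v b="$1" -v c="$2" -v p="$3" 'BEGIN { exit !(c <= b * (1 + p / 100)) }'
}

# Usage (names and numbers are placeholders):
#   check "api rollout completes" kubectl rollout status deployment/my-api -n my-namespace --timeout=120s
#   check "p95 latency within 10% of baseline" latency_ok 180 "$CURRENT_P95_MS" 10
#   [ "$FAILED" -eq 0 ] || echo "$FAILED check(s) failed"
```

Record the baseline numbers on the staging cluster before the upgrade, so you are comparing like with like.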

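Here’s a simple script to test basic service connectivity post-upgrade. Save it as test_connectivity.sh and run it from a pod. It is written for plain sh, because small images like curlimages/curl don’t ship bash, and it checks the database with a bare TCP connect, because curl speaks HTTP and would report a failure against a healthy Postgres port.

#!/bin/sh
# test_connectivity.sh
# Runs inside a cluster pod that has curl and nc (BusyBox provides nc on Alpine-based images)

# HTTP endpoints: any HTTP response counts as "connected"
HTTP_SERVICES="my-api-svc.my-namespace.svc.cluster.local:8080"
# Non-HTTP endpoints: only check that the TCP port accepts a connection
TCP_SERVICES="my-db-svc.my-namespace.svc.cluster.local:5432"

for SERVICE in $HTTP_SERVICES; do
  echo "Testing HTTP connectivity to $SERVICE..."
  if curl -s --connect-timeout 5 --max-time 10 "http://$SERVICE/" > /dev/null; then
    echo "✅ Successfully connected to $SERVICE"
  else
    echo "❌ FAILED to connect to $SERVICE"
    exit 1
  fi
done

for SERVICE in $TCP_SERVICES; do
  echo "Testing TCP connectivity to $SERVICE..."
  # ${SERVICE%:*} is the host, ${SERVICE##*:} is the port; -z opens the connection without sending data
  if nc -z -w 5 "${SERVICE%:*}" "${SERVICE##*:}"; then
    echo "✅ Successfully connected to $SERVICE"
  else
    echo "❌ FAILED to connect to $SERVICE"
    exit 1
  fi
done

echo "All connectivity tests passed!"

# A simple Job manifest to run that test
apiVersion: batch/v1
kind: Job
metadata:
  name: post-upgrade-connectivity-test
spec:
  template:
    spec:
      containers:
      - name: tester
        image: curlimages/curl:latest # A tiny image with curl and sh (no bash); pin a tag for repeatable runs
        command: ["/bin/sh"]
        # Work in /tmp because the image runs as a non-root user; -f makes a failed download fail the Job
        args: ["-c", "cd /tmp && curl -fsSLO https://your-repo/test_connectivity.sh && sh ./test_connectivity.sh"]
      restartPolicy: Never
  backoffLimit: 2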
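Submitting the Job is only half of it; you also want a pass/fail answer your pipeline can act on. The wrapper below is a minimal sketch that assumes the manifest above is saved to a local file (the file name in the usage comment is hypothetical) and that your current kubectl context and namespace point at the staging cluster.

```shell
# run_connectivity_job MANIFEST: apply the Job, wait for it, print its logs.
# Returns 0 only if the Job reaches the Complete condition within the timeout.
run_connectivity_job() {
  manifest="$1"
  job="job/post-upgrade-connectivity-test"

  # A Job's pod template is immutable, so remove any earlier run before re-applying.
  kubectl delete "$job" --ignore-not-found || return 1
  kubectl apply -f "$manifest" || return 1

  # Note: if the Job fails outright, this waits out the full timeout before giving up.
  if kubectl wait --for=condition=complete "$job" --timeout=180s; then
    result=0
  else
    result=1
  fi

  # Show the script's output either way, so a failure is easy to diagnose.
  kubectl logs "$job"
  return "$result"
}

# Usage:
#   run_connectivity_job connectivity-job.yaml || echo "connectivity test failed"
```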

The Rollback Drill: Your Get-Out-of-Jail-Free Card

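Before you even start the upgrade, you must know, and have tested, exactly how you will roll back. The moment things go sideways in production is not the time to be reading the manual. Your rollback strategy depends on your upgrade method. If you’re using kubeadm, be aware that it has no one-shot rollback or downgrade command: when kubeadm upgrade apply fails partway it tries to roll back on its own, and beyond that your recovery path is restoring etcd and the static Pod manifests from backup. kubeadm keeps its own copies under /etc/kubernetes/tmp, but take an etcd snapshot yourself before you touch anything. If you’re using a managed service, find out whether it supports rollbacks at all; many will not let you downgrade a control plane, which makes the staging rehearsal your only real test. The key is to have the commands pre-written and the process rehearsed in staging. Did the upgrade bork the control plane? Your recovery should be muscle memory.

# kubeadm has no "upgrade undo"; snapshot etcd before upgrading (example output path, kubeadm default cert paths)
sudo env ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-pre-upgrade.db \
  --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key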
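Whatever recovery path you rehearse, it usually leans on a recent backup such as an etcd snapshot, so it helps to make "a fresh backup exists" a hard precondition of the drill. The guard below is a sketch: snapshot_is_fresh is a hypothetical helper, the path in the usage comment is only an example, and it relies on find supporting -mmin, which GNU and BSD find both do.

```shell
# snapshot_is_fresh FILE MAX_AGE_MINUTES
# Succeeds only if FILE exists, is non-empty, and was modified within the last
# MAX_AGE_MINUTES minutes. It does not validate the snapshot's contents.
snapshot_is_fresh() {
  file="$1"
  max_age="$2"

  if [ ! -s "$file" ]; then
    echo "snapshot missing or empty: $file" >&2
    return 1
  fi

  # find prints the path only when the file is newer than max_age minutes.
  if [ -z "$(find "$file" -mmin "-$max_age" 2>/dev/null)" ]; then
    echo "snapshot is older than $max_age minutes: $file" >&2
    return 1
  fi
  return 0
}

# Usage (example path):
#   snapshot_is_fresh /var/backups/etcd-pre-upgrade.db 30 || exit 1
```

A freshness check is deliberately cheap. Actually restoring the snapshot in staging is the part that proves the backup is usable.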

The Canary in the Coal Mine: Gradual Staging Upgrades

Don’t upgrade all your staging nodes at once. Be smart. Follow the supported order and start with the control plane, because a kubelet must never run a newer version than the API server it talks to. Upgrade one control plane node and watch for any weird logs (journalctl -u kubelet is your friend). Then pick one worker node: cordon and drain it, upgrade it, uncordon it, and let a few non-critical workloads schedule onto it. See how they behave. This gradual approach often reveals subtle, node-specific issues, like a container runtime or OS-level incompatibility, that a full-bore upgrade might mask until it’s too late. This is your chance to catch the weirdness before it catches you.
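For a single canary worker, the sequence is drain, upgrade on the node, uncordon, then wait for Ready. The sketch below assumes you can reach the node over ssh by its node name and that you pass in the on-node upgrade command yourself, since the package steps differ by distribution. The function name and the remote command in the usage comment are illustrative, not a prescribed procedure.

```shell
# upgrade_canary_worker NODE REMOTE_UPGRADE_COMMAND
# Drains NODE, runs REMOTE_UPGRADE_COMMAND on it over ssh, then uncordons it
# and waits for it to report Ready. If the on-node step fails, the node is
# left cordoned so nothing schedules onto it while you investigate.
upgrade_canary_worker() {
  node="$1"
  remote_cmd="$2"

  # Evict workloads first; DaemonSet pods stay and emptyDir data is discarded.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1

  if ! ssh "$node" "$remote_cmd"; then
    echo "upgrade failed on $node; leaving it cordoned" >&2
    return 1
  fi

  kubectl uncordon "$node" || return 1
  kubectl wait --for=condition=Ready "node/$node" --timeout=300s
}

# Usage (upgrade the kubeadm and kubelet packages as part of the remote command):
#   upgrade_canary_worker worker-1 \
#     'sudo kubeadm upgrade node && sudo systemctl daemon-reload && sudo systemctl restart kubelet'
```

Once the node is back, point a few non-critical workloads at it with a node selector or a temporary label and watch them for a while before touching the next node.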