32.6 Deployment Strategies: Blue/Green and Canary with Argo Rollouts

Right, so you’ve got your app containerized, your YAML files are in order, and you’re happily deploying to Kubernetes with kubectl apply. It works. But let’s be honest, it’s a bit like performing open-heart surgery with a sledgehammer. One apply and the default rolling update starts marching the new version across every replica, with nothing but readiness probes standing between your new code and your users. If you’ve ever felt a cold sweat at that moment, congratulations, you’re not a psychopath. You’ve just outgrown basic deployments.

We need a strategy. We need control. We need to stop praying to the deployment gods and start getting real data and safety mechanisms between our new code and our users. This is where Blue/Green and Canary deployments come in, and where Argo Rollouts absolutely shines by making this complexity not just manageable, but elegant.

Why Not Roll Your Own? Enter Argo Rollouts

You could try to orchestrate a canary deployment with a bunch of vanilla Kubernetes Deployment objects, tweaking replica counts manually and writing custom scripts to check metrics. I’ve seen it done. It’s ugly, it’s brittle, and it will fail at 3 AM. Argo Rollouts is a Kubernetes controller that introduces a new custom resource, the Rollout, which is a drop-in replacement for your standard Deployment resource but with all the fancy strategies built in.

It handles the tedious bits: gradually shifting traffic, running automated analysis before promotion, and rolling back automatically if things look shaky. It’s the difference between manually steering a sailboat and putting it on autopilot.

Your First Blue/Green Rollout

The Blue/Green strategy keeps two identical, fully scaled environments side by side: your live “Green” environment and your new “Blue” environment. You switch all traffic from one to the other in one go. It’s simple, fast, and gives you an instant rollback path (just switch back to the old color).

Here’s what a Rollout manifest for a Blue/Green strategy looks like. Notice it uses the Rollout kind, not Deployment.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-bluegreen
spec:
  replicas: 3
  strategy:
    blueGreen:
      # The active service selector will update to match the new ReplicaSet
      activeService: my-app-active
      # The preview service will always point to the new (blue) ReplicaSet before promotion
      previewService: my-app-preview
      # Auto-promote the new ReplicaSet after this many seconds.
      # To require manual promotion instead, set autoPromotionEnabled: false.
      autoPromotionSeconds: 30
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:v2.0.0
        ports:
        - containerPort: 8080

You’ll need two Services. The activeService is what your users hit. The previewService is for you to manually test the new version before cutting over.

# This service will switch its label selector to point to the active ReplicaSet
apiVersion: v1
kind: Service
metadata:
  name: my-app-active
spec:
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

---
# This service will always point to the new, "preview" ReplicaSet
apiVersion: v1
kind: Service
metadata:
  name: my-app-preview
spec:
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

The magic is in the selectors. Both Services start out with the same app: my-app selector, and Argo narrows each one by adding a rollouts-pod-template-hash label that matches exactly one ReplicaSet. When you update the Rollout’s image, Argo creates a new ReplicaSet (v2.0.0) and points the previewService to it. The activeService keeps pointing to the old, stable one. You can test the preview service to your heart’s content. After autoPromotionSeconds (or when you manually promote), Argo updates the activeService’s selector to point to the new pods. Traffic instantly switches. Beautiful.
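
If a 30-second timer feels a little too trusting, you can make promotion a deliberate act. The fragment below is a minimal sketch of the same blueGreen strategy with auto-promotion switched off and a longer scale-down delay. The 300-second value is an illustrative assumption, not a recommendation.

```yaml
# Fragment of spec.strategy for the my-app-bluegreen Rollout
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview
    # Wait for an explicit promote instead of a timer
    autoPromotionEnabled: false
    # Keep the old ReplicaSet running for a while after cutover,
    # so switching back does not have to wait for pods to start.
    # 300 is an assumed value; tune it to your own rollback window.
    scaleDownDelaySeconds: 300
```

With this in place the Rollout sits paused with the preview ReplicaSet running until someone runs kubectl argo rollouts promote my-app-bluegreen.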

The More Nuanced Canary Approach

Canary deployments are for when you want to be more cautious, or when you have a ton of traffic and can learn from a small percentage of it. Instead of a full switch, you slowly send increasing amounts of traffic to the new version while monitoring key metrics. If something goes wrong, you automatically roll back.

This is where Argo Rollouts gets powerful. You can define a steps array that dictates the pace of the rollout and, crucially, an analysis that runs at each step to decide if it’s safe to proceed.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app-canary
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: my-app-canary-svc # Service pointing to canary pods
      stableService: my-app-stable-svc # Service pointing to stable pods
      trafficRouting:
        istio: # Also supports nginx, ALB, SMI
          virtualService:
            name: my-app-vs
            routes:
            - primary
      steps:
      - setWeight: 10
      - pause: {duration: 2m} # Wait 2 minutes; a pause only waits, so watch your metrics here
      - setWeight: 25
      - pause: {duration: 2m}
      - setWeight: 50
      - pause: {duration: 5m} # A longer pause at a major milestone
      - setWeight: 100
      - pause: {duration: 10m} # A final pause before automatic completion
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:v2.1.0
        ports:
        - containerPort: 8080
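
The trafficRouting block above points at a VirtualService named my-app-vs with a route called primary, which has to exist before the Rollout can shift any traffic. Here is a minimal sketch of what it might look like. The host my-app.example.com is a placeholder assumption, the apiVersion should match whatever your Istio install serves, and your mesh may also need a gateways entry.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - my-app.example.com # Placeholder host (assumption)
  http:
  - name: primary # Must match the route name listed in the Rollout
    route:
    - destination:
        host: my-app-stable-svc
      weight: 100 # Argo Rollouts rewrites these weights at each setWeight step
    - destination:
        host: my-app-canary-svc
      weight: 0
```

You only set the starting weights. From then on the controller owns them, so don't let your GitOps tooling fight it over those two numbers.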

The Killer Feature: Automated Analysis

Defining steps is good. Letting Argo decide based on metrics is brilliant. You can define an AnalysisTemplate that Argo will run before promoting your canary. Is error rate spiking? Is latency too high? If so, it automatically aborts the rollout.

# AnalysisTemplate.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  # Arguments must be declared here before the query can reference them
  args:
  - name: workload-name
  metrics:
  - name: request-success-rate
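    interval: 1m
    # Take 5 measurements, then finish; without a count, an interval-based
    # metric keeps measuring and an inline analysis step never completes
    count: 5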
    # Where to fetch the metric from
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(rate(istio_requests_total{reporter="destination", destination_workload=~"{{args.workload-name}}", response_code!~"5.*"}[1m]))
          /
          sum(rate(istio_requests_total{reporter="destination", destination_workload=~"{{args.workload-name}}"}[1m]))
    # Tolerate one failed measurement; a second failure fails the analysis
    failureLimit: 1
    failureCondition: result[0] < 0.95 # A measurement fails if success rate drops below 95%

You then reference this template from a step in your Rollout strategy, typically right after a pause:

strategy:
  canary:
    ...
    steps:
    - setWeight: 20
    - pause: {} # No duration: wait here until someone promotes manually
    - analysis: # Run this analysis before proceeding
        templates:
        - templateName: success-rate
        args:
        - name: workload-name
          value: my-app-canary # The name of the canary workload
    - setWeight: 40
    ...
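
An inline analysis step blocks the rollout until the analysis finishes. Argo Rollouts can also run the analysis in the background while the steps keep advancing, and abort the rollout if it fails. A hedged sketch, reusing the success-rate template and the my-app-canary argument from above; the weights, pause durations, and startingStep choice are illustrative assumptions.

```yaml
# Fragment of spec.strategy: background analysis alongside the steps
strategy:
  canary:
    analysis:
      templates:
      - templateName: success-rate
      # 0-based index into steps: hold off until the rollout reaches setWeight: 40
      startingStep: 2
      args:
      - name: workload-name
        value: my-app-canary
    steps:
    - setWeight: 20
    - pause: {duration: 5m}
    - setWeight: 40
    - pause: {duration: 5m}
```

The trade-off is simple: inline analysis gives you a hard gate at one point, background analysis gives you a tripwire that stays armed while traffic ramps up.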

The Rough Edges and Pitfalls

This isn’t all rainbows and unicorns. The traffic routing (Istio, Linkerd, etc.) setup is a non-trivial piece of extra complexity. You’re now managing a Rollout resource, multiple Services, and potentially a VirtualService. Your CI/CD pipeline needs to understand kubectl argo rollouts commands to promote or abort.

The most common pitfall? Misconfigured service selectors. The labels on your pods and the selectors in your stableService and canaryService must align perfectly with the Rollout’s pod template. If they don’t, traffic won’t flow correctly, and you’ll be left staring at a confused rollout. Always test your preview environment in a staging cluster first.
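
For the canary Rollout earlier in this section, a matching pair of Services might look like the sketch below. The names come from the Rollout’s stableService and canaryService fields; the port numbers mirror the Blue/Green example and are assumptions. Both selectors start out identical and must match the pod template’s labels. Argo Rollouts then narrows each one to the right ReplicaSet at runtime.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-stable-svc
spec:
  selector:
    app: my-app # Must match spec.template.metadata.labels in the Rollout
  ports:
  - protocol: TCP
    port: 80 # Assumed, mirroring the Blue/Green Services
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-canary-svc
spec:
  selector:
    app: my-app # Same base selector; the controller adds the pod-template hash
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
```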

But once it’s set up, it’s transformative. You deploy with confidence, not hope. You catch failures when 10% of your users are affected, not 100%. And that, my friend, is how you graduate from sledgehammers to scalpels.