Right, so you’ve decided you don’t want your entire application to just fall over and die because a single server gets the sniffles. Good call. Welcome to Failover Routing in Route 53, the digital equivalent of having a backup generator that automatically kicks in. The concept is beautifully simple: you have a primary endpoint (the one you want to handle all the traffic) and a secondary endpoint (the one that sits around, sipping margaritas, until the primary catches on fire). Route 53, playing the role of a hyper-vigilant fire marshal, uses health checks to decide which one to send users to.

The magic, and the complexity, is all in the details. Let’s get into it.

How It Actually Works: The Record Itself

You configure this by creating records that use the Failover routing policy. There’s no special “failover” record type to hunt for in the console; Failover is simply one of the choices in the routing policy dropdown when you create a record. You’ll create two records with the same name and type: one with a Primary failover role and one with a Secondary failover role.

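Here’s the Terraform for a classic scenario: failing over from a primary site to a standby endpoint, such as a static “we’re down for maintenance” page. The example uses plain IP addresses to keep things short; an S3 or CloudFront target would use an alias record instead.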

# Health check for the primary application (e.g., an EC2 instance)
resource "aws_route53_health_check" "primary_app_check" {
  fqdn              = "primary-app.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  request_interval  = 30
  failure_threshold = 2
}

# The primary record. This is the star of the show.
resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "myapp.example.com"
  type    = "A"
  ttl     = 60

  records = ["192.0.2.10"] # Your primary server's IP

  set_identifier = "primary-server"
  failover_routing_policy {
    type = "PRIMARY"
  }
  health_check_id = aws_route53_health_check.primary_app_check.id
}

# The secondary record. This is the understudy.
resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "myapp.example.com"
  type    = "A"
  ttl     = 60

  records = ["192.0.2.99"] # Your failover server's IP (or S3/CloudFront alias)

  set_identifier = "secondary-failover-page"
  failover_routing_policy {
    type = "SECONDARY"
  }
}

Route 53’s health checkers now send requests to the primary’s health check endpoint at the interval you set (request_interval). Once enough consecutive checks fail (failure_threshold), Route 53 automatically stops answering queries with the primary IP and starts answering with the secondary IP instead.
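If your secondary really is a static page behind CloudFront rather than a server with a fixed IP, the secondary becomes an alias record. Here’s a minimal sketch of what that might look like. It would replace the secondary record above, and it assumes a hypothetical aws_cloudfront_distribution.maintenance_page resource defined elsewhere, with myapp.example.com configured as an alternate domain name on that distribution.

```hcl
# Alternative secondary: an alias to a CloudFront distribution serving a static page.
# aws_cloudfront_distribution.maintenance_page is a hypothetical resource, not shown here.
resource "aws_route53_record" "secondary_alias" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "myapp.example.com"
  type    = "A"

  # Alias records take an alias block instead of ttl and records.
  alias {
    name                   = aws_cloudfront_distribution.maintenance_page.domain_name
    zone_id                = aws_cloudfront_distribution.maintenance_page.hosted_zone_id
    evaluate_target_health = false
  }

  set_identifier = "secondary-failover-page"
  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Note there’s no ttl here: for alias records, Route 53 uses the TTL of the alias target.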

The Health Check: This is Where You Get Burned

The health check is the brains of the operation, and it’s also the most common point of failure. You can’t just check if a server is powered on; you need to check if your application is actually healthy. That /health endpoint you specified? For a plain HTTP or HTTPS check, it must respond with a 2xx or 3xx status code within two seconds of the connection being established. If you use one of the string-matching check types, the string you specify must also appear in the first 5,120 bytes of the response body. No successful response? No healthy status. It’s brutally simple.
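String matching is a separate health check type in Terraform rather than an extra flag on a plain HTTPS check. A rough sketch, assuming your /health endpoint includes the literal text "status: ok" in its body (that string and the resource name are made up for illustration):

```hcl
# Variant of the primary health check that also inspects the response body.
resource "aws_route53_health_check" "primary_app_string_check" {
  fqdn              = "primary-app.example.com"
  port              = 443
  type              = "HTTPS_STR_MATCH"
  resource_path     = "/health"
  search_string     = "status: ok" # hypothetical body text; must appear early in the response
  request_interval  = 30
  failure_threshold = 2
}
```

This is handy when a misbehaving app or proxy might return a successful status code with an error page as the body.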

Here’s the pitfall: your health check endpoint needs to be lightweight, and it shouldn’t blindly depend on every downstream service your application uses. If your /health endpoint tries to connect to a database that’s down, it might hang or return a 500, marking the entire endpoint as unhealthy. This can cause a failover when the web server is perfectly fine; it’s just the database that’s sad. Design your health checks to be as narrow and self-contained as possible.

The Subtle Art of Failback

So the primary went down, traffic failed over to the secondary, and you’ve now frantically fixed the primary. What happens? This is called failback, and it’s not instantaneous. Route 53 waits for the primary’s health check to pass several times in a row (failure_threshold counts consecutive passes as well as consecutive failures) before it marks the primary healthy and switches traffic back. This is a good thing! It prevents your users from being bounced back and forth like a ping-pong ball if the primary is flaky and intermittently recovering.

But you need to be aware of TTL here. DNS resolvers and caches across the internet will have stored the secondary’s IP address for the duration of the TTL you set on the record (60 seconds in our example). Even after Route 53 starts pointing back to the primary, it can take up to that TTL for everyone in the world to see the change. Never set a failover TTL higher than 60 seconds unless you have a very good reason and understand the recovery lag you’re introducing.
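To put rough numbers on that lag, here’s a back-of-the-envelope sketch using Terraform locals and the values from the example above. The local names are made up for illustration, and the result is only an estimate: health checkers in several regions probe independently, and some resolvers hold on to records longer than the TTL says they should.

```hcl
locals {
  request_interval  = 30 # seconds between checks, as in the health check above
  failure_threshold = 2  # consecutive results needed to flip the status
  record_ttl        = 60 # TTL on the failover records

  # Time for the health check status to flip: 30 * 2 = 60 seconds
  detection_seconds = local.request_interval * local.failure_threshold

  # Plus the time for cached answers to expire: 60 + 60 = 120 seconds
  estimated_switch_seconds = local.detection_seconds + local.record_ttl
}

output "estimated_switch_seconds" {
  value = local.estimated_switch_seconds
}
```

The same arithmetic applies in both directions, so budget roughly two minutes for failover and another two for failback with these settings.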

The Secondary’s Job Isn’t to Be a Perfect Clone

A crucial, often-missed insight: your secondary endpoint does not need to be an identical, full-scale replica of your primary. In fact, for many use cases, it’s a colossal waste of money. Its job is purely business continuity. It could be:

  • A static “Sorry, we’re experiencing issues” page on S3/CloudFront (the example above).
  • A read-only mode of your application.
  • A completely different application stack in another region.

The health check for the secondary is optional but highly recommended. If you don’t set one, Route 53 will assume the secondary is always healthy and can always receive traffic. If you do set one and the secondary also fails, don’t expect a clean DNS error: when both records are unhealthy, Route 53 falls back to answering with the primary record. So a secondary health check won’t force a hard fail, but it does give you visibility. You can alarm on it and find out your understudy is broken before the night you actually need it, which beats discovering a silent, broken failover in the middle of an outage.
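Wiring up a health check for the secondary looks just like the primary. A sketch, assuming the failover endpoint answers at a hypothetical failover.example.com hostname and exposes the same /health path:

```hcl
# Health check for the secondary endpoint (hostname is hypothetical).
resource "aws_route53_health_check" "secondary_check" {
  fqdn              = "failover.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  request_interval  = 30
  failure_threshold = 2
}

# The secondary record from earlier, now with the health check attached.
resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "myapp.example.com"
  type    = "A"
  ttl     = 60

  records = ["192.0.2.99"]

  set_identifier = "secondary-failover-page"
  failover_routing_policy {
    type = "SECONDARY"
  }
  health_check_id = aws_route53_health_check.secondary_check.id
}
```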

So, to recap: design bulletproof health checks, mind your TTLs, and build a secondary that’s good enough to handle the traffic, not necessarily a perfect clone. Do that, and you’ve just installed that backup generator. Now go get yourself a margarita; you’ve earned it.