24.5 Health Checks: Endpoint, Calculated, and CloudWatch Alarm Checks

Right, let’s talk about Route 53 Health Checks. This is where DNS stops being a simple, dumb phonebook and starts getting a brain. The core idea is gloriously simple: if an endpoint is sick, stop sending people to it. The implementation, however, has more knobs and levers than a spaceship cockpit, and some of them are just as confusing. I’m here to guide you through it so you don’t accidentally eject yourself into space.

The Three Flavors of Health Checks

Route 53 offers you three distinct ways to determine if something is healthy. Picking the right one is 90% of the battle.

First, you’ve got your Endpoint Checks. This is the classic, “can I call this thing and get a good response?” check. You give Route 53 a domain name or an IP address, a port, and a path, and it’ll periodically send HTTP, HTTPS, or TCP requests to it. For HTTP(S), the checker has to establish a TCP connection within four seconds and then get back a 2xx or 3xx status code within two seconds of connecting. Those timeouts are fixed; you don’t get to tune them. For plain TCP checks, it only needs to connect within ten seconds. Simple, right? This is your go-to for checking on individual web servers, load balancers, or any resource that has a directly accessible endpoint.

Next up are Calculated Checks. This is where it gets clever. You don’t check the endpoint itself; instead, you create a parent check that monitors the status of other health checks—a mix of endpoint checks and even other calculated checks. The parent check becomes healthy based on a formula you define, like “be healthy if 3 out of my 5 backend health checks are healthy.” This is your tool for building redundancy. You wouldn’t use a single calculated check for a single server; that’s like using a calculator to figure out 1+1. You use it to represent the health of an entire region or a multi-AZ application.

Finally, there’s the CloudWatch Alarm Check. This is Route 53’s way of saying, “I don’t do the checking myself; I’ll just ask the expert.” You point the health check at an existing CloudWatch alarm, and the health check tracks it: OK means healthy, ALARM means unhealthy, and you choose what INSUFFICIENT_DATA should mean (healthy, unhealthy, or last known status). Under the hood, Route 53 watches the alarm’s metric data stream rather than the alarm state itself, so the health check can change status slightly out of step with the alarm. This is how you bridge the gap between infrastructure-level metrics (like CPU utilization, latency, or a custom metric from your application) and DNS-based failover. If your app is throwing a “database is on fire” error metric into CloudWatch, you can create an alarm for that and then a Route 53 health check that fails over DNS when the alarm triggers.
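Since the alarm-based flavor is the one people rarely see spelled out, here is a minimal sketch of what creating one might look like. The alarm name and Region are hypothetical placeholders, and the choice of LastKnownStatus for missing data is just one reasonable assumption, not a recommendation from AWS. The script only prints the config; the actual API call sits in a function you invoke yourself.

```shell
#!/usr/bin/env bash
# Hypothetical alarm name and Region; substitute your own.
ALARM_NAME="db-error-rate-high"
ALARM_REGION="us-east-1"

# Emit the health check config as JSON.
alarm_check_config() {
  cat <<EOF
{
  "Type": "CLOUDWATCH_METRIC",
  "AlarmIdentifier": {
    "Region": "${ALARM_REGION}",
    "Name": "${ALARM_NAME}"
  },
  "InsufficientDataHealthStatus": "LastKnownStatus"
}
EOF
}

# Create the health check. Requires AWS credentials; not invoked by this script.
create_alarm_check() {
  aws route53 create-health-check \
    --caller-reference "alarm-check-$(date +%s)" \
    --health-check-config "$(alarm_check_config)"
}

# Print the config for review before creating anything.
alarm_check_config
```

Once the printed JSON looks right, running create_alarm_check performs the real call. InsufficientDataHealthStatus also accepts Healthy and Unhealthy if you would rather fail open or fail closed when the metric goes quiet.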

Configuring an Endpoint Check: The Devil’s in the Details

Let’s build a basic HTTP endpoint check with the AWS CLI. It looks straightforward until you start thinking about the edge cases.

aws route53 create-health-check \
  --caller-reference my-app-health-check-$(date +%s) \
  --health-check-config '
{
  "IPAddress": "192.0.2.10",
  "Port": 443,
  "Type": "HTTPS",
  "ResourcePath": "/health",
  "FullyQualifiedDomainName": "api.example.com",
  "RequestInterval": 30,
  "FailureThreshold": 3,
  "MeasureLatency": true,
  "EnableSNI": true
}
'

Now, let’s pick apart the important bits everyone glosses over. The CallerReference must be unique for every new health check, hence the date +%s trick. Reuse one with different settings, and the API will yell at you with a HealthCheckAlreadyExists error.

RequestInterval: This is how often each Route 53 health checker hits your endpoint. 30 seconds is standard, 10 seconds is available but costs more. Here’s the first “questionable choice”: the checkers don’t come from a single IP. They come from a massive, global pool of IPs published in a… drumroll …JSON file, ip-ranges.json, under the ROUTE53_HEALTHCHECKS service. You must allowlist those ranges in your security groups or network ACLs. It’s absurd, but it’s the only way.
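To pull the checker ranges out of AWS’s published ip-ranges.json, something like the following should do. It is a rough sketch that assumes jq is installed; the function name checker_cidrs is my own invention, and it only covers the IPv4 prefixes.

```shell
#!/usr/bin/env bash
# Read ip-ranges.json on stdin and print the IPv4 CIDR blocks
# that Route 53 health checkers use. Requires jq.
checker_cidrs() {
  jq -r '.prefixes[]
         | select(.service == "ROUTE53_HEALTHCHECKS")
         | .ip_prefix'
}

# Usage (needs network access):
#   curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | checker_cidrs
```

Feed the output into whatever manages your security group rules, and re-run it periodically, since the published ranges can change.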

FailureThreshold: This is crucial. This isn’t “how many failed requests.” It’s “how many consecutive failed checking intervals.” With a 30-second interval and a threshold of 3, your endpoint can be down for nearly 90 seconds (30s * 3) before Route 53 declares it unhealthy. This is a good thing—it prevents flapping from a single network blip.
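The arithmetic is trivial, but it is worth having in front of you when you pick values. This sketch uses a made-up helper, detection_window, and treats the result as a rough upper bound on how long an outage can last before the check flips; it ignores everything downstream of the health check, such as resolver caching.

```shell
#!/usr/bin/env bash
# Approximate seconds of continuous failure before a check turns unhealthy.
# $1 = RequestInterval in seconds, $2 = FailureThreshold
detection_window() {
  local interval="$1" threshold="$2"
  echo $(( interval * threshold ))
}

detection_window 30 3   # standard interval: prints 90
detection_window 10 3   # fast interval:     prints 30
```

Dropping to the 10-second interval cuts the window to a third at the same threshold, which is the main thing you are paying for.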

EnableSNI: If you’re using HTTPS and your endpoint serves multiple domains (virtual hosting), you MUST set this to true and provide the FullyQualifiedDomainName. If you don’t, the TLS handshake will fail because the Route 53 checker won’t know which certificate to ask for, and your health check will fail miserably. I’ve seen this burn people more times than I can count.

The Calculated Check: Boolean Logic for Your Infrastructure

Let’s say you have three identical application servers behind a load balancer. You want to fail DNS away if a majority of them fail. First, create health checks for each individual instance (the IDs used below, hc-1a2b3c4d, hc-5e6f7g8h, and hc-9i0j1k2l, are shortened placeholders; real health check IDs are UUIDs). Then, create the calculated check that acts as the parent.

aws route53 create-health-check \
  --caller-reference calculated-parent-$(date +%s) \
  --health-check-config '
{
  "Type": "CALCULATED",
  "Inverted": false,
  "HealthThreshold": 2,
  "ChildHealthChecks": ["hc-1a2b3c4d", "hc-5e6f7g8h", "hc-9i0j1k2l"]
}
'

The magic is in HealthThreshold: 2. This means the calculated check will be healthy if at least 2 of the 3 child checks are healthy. You’re defining the minimum viable pool size. If you set it to 3, it becomes an “ALL” check. Set it to 1, and it’s an “ANY” check (which is a terrible idea for failover). This is how you build resilience without making your DNS failover overly sensitive.
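If the threshold logic feels abstract, here is a tiny local model of it. This is my own illustration of the rule described above, not anything Route 53 ships; calculated_status is a hypothetical helper that takes the threshold, the Inverted flag, and a list of child statuses.

```shell
#!/usr/bin/env bash
# calculated_status THRESHOLD INVERTED CHILD_STATUS...
# Each CHILD_STATUS is "healthy" or "unhealthy".
calculated_status() {
  local threshold="$1" inverted="$2"
  shift 2
  local healthy=0 s result="unhealthy"
  # Count the healthy children.
  for s in "$@"; do
    if [ "$s" = "healthy" ]; then
      healthy=$(( healthy + 1 ))
    fi
  done
  # Healthy when at least THRESHOLD children are healthy.
  if [ "$healthy" -ge "$threshold" ]; then
    result="healthy"
  fi
  # Inverted flips the final answer.
  if [ "$inverted" = "true" ]; then
    if [ "$result" = "healthy" ]; then result="unhealthy"; else result="healthy"; fi
  fi
  echo "$result"
}

calculated_status 2 false healthy healthy unhealthy     # prints healthy
calculated_status 2 false healthy unhealthy unhealthy   # prints unhealthy
```

Playing with the threshold here (3 for “ALL”, 1 for “ANY”) is a cheap way to sanity-check your failover rule before you wire it into DNS.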

Pitfalls and Best Practices: Listen to Your Brilliant Friend

Here’s the real-world advice you won’t find in the vanilla docs.

The Silence Problem: A health check can fail for two reasons: your server said “I’m sick” (4xx/5xx response) or your server said nothing at all (timeout). You must handle both. The timeout has nothing to do with RequestInterval; it’s fixed. For HTTP and HTTPS, the checker needs a TCP connection within four seconds and a response within two seconds after that; for TCP checks, it needs a connection within ten seconds. If your /health endpoint gets overloaded and can’t answer in that window, it will time out and count as a failure. Make your health endpoint stupidly lightweight and fast.
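You can approximate what a checker sees from your own machine with curl. Treat this as a rough sketch: probe and classify_status are names I made up, the limits assume the documented four seconds to connect plus two seconds to respond, and curl’s --max-time caps the whole request at six seconds rather than timing the response separately.

```shell
#!/usr/bin/env bash
# Classify an HTTP status code the way an HTTP(S) check would:
# 2xx and 3xx are healthy, everything else (including curl's 000
# for "no response at all") is unhealthy.
classify_status() {
  local code="$1"
  if [ "$code" -ge 200 ] 2>/dev/null && [ "$code" -lt 400 ]; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}

# Probe a URL with limits resembling the checker's.
# Redirects are not followed, so a 3xx counts as healthy on its own.
probe() {
  local url="$1" code
  code="$(curl -s -o /dev/null -w '%{http_code}' \
            --connect-timeout 4 --max-time 6 "$url")" || true
  classify_status "${code:-000}"
}

# Usage (hypothetical endpoint):
#   probe https://api.example.com/health
```

If probe reports unhealthy while your browser happily loads the page, the usual suspects are a slow health handler or a firewall that treats unknown source IPs differently.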

The Cascade of Failure: Never, ever point a health check at the same DNS record that it’s supposed to be protecting. You’ll create a feedback loop. If your record app.example.com fails over, and your health check is configured to check app.example.com, the health checker will now be following the DNS failover and checking the new endpoint. It will see it as healthy and flip the original record back, causing flapping. Always health check a stable endpoint, like a specific load balancer DNS name or an IP, not the aliased record you’re managing.

Latency Measurements: If you enable MeasureLatency, Route 53 will log the time it takes to get a response and publish it to CloudWatch. This is fantastic data. But remember, these measurements are from the AWS health checker network to your endpoint. It’s a great relative measure (is latency spiking?) but not an absolute measure of your user’s experience, as they aren’t coming from the same locations.

Use health checks to fail over geographically. Attach a health check to each regional record in a latency-based (or geolocation) routing setup: if your main region is having a bad day and its check goes unhealthy, Route 53 stops answering with that record and sends those users to the next-best healthy region. One caveat: latency-based routing relies on AWS’s own latency data between users and Regions, not the MeasureLatency numbers from your health checks. It’s one of the most powerful, yet underutilized, features. Just don’t overcomplicate it right out of the gate. Start simple, then add the spaceship levers one at a time.