8.3 Health Checks: EC2 vs ELB

Right, let’s talk about health checks. This is where your ASG decides which of its children are pulling their weight and which ones are secretly napping on the job. It’s a brutal, automated process, and if you get it wrong, your application will be the one suffering the silent, inexplicable failures. So pay attention.

You have two main choices here, and the one you pick dictates the entire philosophy of your scaling group’s existence. Is it merely a group of machines that have booted up (an EC2 check), or is it a group of machines that are actually serving traffic correctly (an ELB check)?

The EC2 Health Check: “Is It Breathing?”

This is the basic, no-frills check. The ASG asks EC2 two things: is the instance still in the running state, and are its status checks passing? Boiled down, the question is “Has the hardware or the hypervisor lost this instance?” It’s the equivalent of holding a mirror under the instance’s nose to see if it fogs up.

If the instance fails this check (e.g., it kernel panicked, the underlying hardware failed), the ASG will summarily execute it and launch a replacement. It’s brutal, efficient, and profoundly stupid. It doesn’t care if your web server is in a restart loop, if your database connection is borked, or if your application has been returning 500 errors for the last ten minutes. If the instance is running and its status checks pass, it passes.

You’d use this if you’re running something that isn’t a web service, like a batch processing worker that doesn’t have an external-facing endpoint. But for 95% of you, this is the wrong choice.

The ELB Health Check: “Is It Actually Working?”

This is the check you actually want. When you enable this, the ASG delegates its life-and-death decisions to the attached Elastic Load Balancer. The ELB will periodically ping a path you specify (e.g., /health) on each registered instance.

Now, the ASG isn’t asking “Is it breathing?” but “Is it successfully responding to HTTP requests?” This is a world of difference. If your /health endpoint thoughtfully checks the database connection, the cache, and the disk space before returning a happy 200, then “healthy” actually means the application can do its job, and the ASG leaves the instance alone. Conversely, if the instance itself is perfectly fine but your app server has crashed, the health check fails, and the ASG mercy-kills the instance.

This is how you ensure real, functional capacity. This is the way.

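Here’s the catch: the ELB needs time to make up its mind. It doesn’t just ping once and call it a day. It probes at a fixed interval and changes its verdict only after a configured number of consecutive passes or failures. A freshly launched instance needs time too, and for that you configure a grace period on the ASG.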
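
To make the “needs time to make up its mind” part concrete, here is a toy model of how a load balancer might turn a stream of probe results into a verdict. Treat it as a simplified sketch, not the ELB’s actual implementation: the function name target_states and the default thresholds are invented for illustration, and the only assumption is that a target changes state after a run of consecutive passes or failures.

```python
from typing import Iterable, List


def target_states(
    results: Iterable[bool],
    healthy_threshold: int = 3,    # hypothetical value, not a recommendation
    unhealthy_threshold: int = 2,  # hypothetical value, not a recommendation
) -> List[str]:
    """Replay probe results (True = probe passed) and return the state after each one.

    The state only flips after enough *consecutive* passes or failures,
    so a single blip does not change the verdict.
    """
    state = "initial"
    passes = 0
    fails = 0
    states = []
    for ok in results:
        if ok:
            passes += 1
            fails = 0
            if passes >= healthy_threshold:
                state = "healthy"
        else:
            fails += 1
            passes = 0
            if fails >= unhealthy_threshold:
                state = "unhealthy"
        states.append(state)
    return states


# One failed probe is forgiven; two in a row are not.
print(target_states([True, True, True, False, True, False, False]))
```

With a probe interval of, say, 30 seconds, that last flip to unhealthy takes roughly a minute of sustained failure, which is the delay the section is warning you about.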

Configuring the Grace Period: The Art of Patience

This is the most common pitfall, so listen up. When a new instance launches, it takes time. Time to boot, time for the caches to warm, time for your Docker container to pull the image and start. If the ASG acts on ELB health check results the second the instance is registered, the instance will fail immediately, causing the ASG to think “well, this one’s junk” and terminate it. This leads to a heartbreaking loop of instances being born only to be immediately killed, like some kind of tragic AWS Greek myth.

You prevent this with the HealthCheckGracePeriod setting on the ASG. This is the number of seconds the ASG waits after a new instance comes into service before it acts on a failed health check. The ELB keeps probing in the meantime; the ASG just doesn’t pull the trigger yet. Set it too low, and you get the death spiral. Set it too high, and a genuinely broken instance sits in the group unreplaced, counting as capacity it doesn’t provide.

For a typical web application, 300 seconds (5 minutes) is a sane starting point. Adjust based on how long it truly takes your boot scripts to get the app fully online.
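
If you’d rather derive the number than guess it, a back-of-the-envelope calculation is enough. The sketch below adds up measured startup phases and pads the total with a safety margin. The function name suggest_grace_period, the phase names, the timings, and the 25% margin are all made up for illustration; plug in what you actually measure.

```python
import math
from typing import Dict


def suggest_grace_period(phase_seconds: Dict[str, float], safety_margin: float = 0.25) -> int:
    """Sum the measured startup phases and pad the total by a safety margin."""
    total = sum(phase_seconds.values())
    return math.ceil(total * (1 + safety_margin))


# Hypothetical measurements from a single instance launch.
startup = {
    "os_boot": 45,
    "image_pull": 90,
    "app_start": 60,
    "cache_warm": 45,
}

print(suggest_grace_period(startup))  # 240 s measured, padded by 25% -> 300
```

The result is what you would put in HealthCheckGracePeriod. Re-measure whenever the image or boot scripts change noticeably.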

Here’s how you set this up properly in a CloudFormation template. Notice we’re using the ELB health check type and defining that critical grace period.

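WebServerAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    # ... (other properties elided)
    HealthCheckType: ELB          # also act on the load balancer's verdict
    HealthCheckGracePeriod: 300   # seconds before failed checks count against a new instance
    TargetGroupARNs:
      - !Ref MyTargetGroup
    # ... (other properties elided)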

Crafting a Sane Health Check Endpoint

Your health check endpoint is the linchpin of this whole operation. It must be lightweight, but it must also be meaningful. Do not just return a static 200. That’s a lie, and you’re better than that.

A good /health endpoint should:

Return a 200 status code quickly when everything checks out, and a non-200 code (such as 503) when it doesn’t.
Check critical internal dependencies (is the database connection pool healthy? can we talk to Redis?).
NOT be behind authentication. The ELB can’t sign your requests.
NOT do heavy logic. It’s a health check, not a benchmark.

A bad health check is worse than no health check at all because it breeds false confidence.
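
The checklist above can be sketched with nothing but the Python standard library. This is one possible shape, not a prescribed implementation: check_database and check_cache are hypothetical placeholders you would replace with cheap real probes, and the port and the choice of 503 for failure are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Hypothetical placeholder: swap in a cheap query such as SELECT 1."""
    return True


def check_cache() -> bool:
    """Hypothetical placeholder: swap in a cache PING."""
    return True


CHECKS = {"database": check_database, "cache": check_cache}


def evaluate_health(checks=CHECKS):
    """Run every check and return (HTTP status, per-check results).

    A check that raises counts as a failure, so a broken dependency
    can never accidentally produce a 200.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health without authentication and without heavy logic."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = evaluate_health()
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the server; the port is an assumption, match it to your health check port."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Call serve() from your entry point to run it. Keeping evaluate_health separate from the handler means you can unit-test the pass/fail logic without opening a socket.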

The Fallback You Didn’t Know You Had

Here’s a fun bit of trivia that’ll save your bacon one day: when using an ELB health check, the ASG doesn’t ignore the EC2 status. The EC2 checks stay on, and the ELB check is layered on top of them, so failing either one marks the instance unhealthy. If your instance fails the EC2 status check (hardware failure), the ASG will still replace it, even if it’s somehow passing the ELB health check. It’s a good safety net. The ELB check is for your application’s health; the EC2 check is for the infrastructure’s health. You get both, which is honestly one of the few things in AWS that feels thoughtfully designed.
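
Reduced to a truth table, the rule looks something like the sketch below. It models only the decision described in this section and ignores the grace period and everything else the ASG weighs; the function name asg_considers_unhealthy is invented for illustration.

```python
def asg_considers_unhealthy(health_check_type: str, ec2_ok: bool, elb_ok: bool) -> bool:
    """Simplified model: EC2 checks always count, ELB checks count only when enabled."""
    if not ec2_ok:
        return True  # infrastructure failure is fatal for either check type
    return health_check_type == "ELB" and not elb_ok


# Same sick application, two different outcomes depending on the check type.
print(asg_considers_unhealthy("EC2", ec2_ok=True, elb_ok=False))  # False: app is down, ASG shrugs
print(asg_considers_unhealthy("ELB", ec2_ok=True, elb_ok=False))  # True: app is down, instance replaced
```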

So, to be absolutely clear: for any service that takes HTTP traffic, set HealthCheckType: ELB, define a sensible HealthCheckGracePeriod, and make your /health endpoint actually mean something. It’s the single biggest thing you can do to make your ASG resilient instead of just a fancy way to start and stop EC2 instances.