35.3 CloudWatch Alarms: Threshold, Anomaly Detection, and Composite Alarms

Right, CloudWatch Alarms. This is where we move from passively watching your infrastructure’s weird little performance art piece to actually yelling at it when it misbehaves. A metric alarm is a small state machine (OK, ALARM, or INSUFFICIENT_DATA) that watches a single metric, or a single metric math expression, and takes an action when the value stays on the wrong side of a threshold for a number of periods you choose. It’s your system’s way of tapping you on the shoulder and saying, “Hey, I think I’m on fire. Or maybe I’m just cold. You should probably look into that.”

The genius—and the frustration—of CloudWatch Alarms is their simplicity. They are deliberately dumb. They don’t reason about causality; they just compare a number to another number. Your job is to make that simple comparison meaningful, which is where most people screw it up.

The Anatomy of a Standard Threshold Alarm

Let’s break down the crucial bits you’ll configure. It’s not just “CPU > 90%”.

Metric & Statistic: You’re not alarming on a single data point. You’re alarming on a statistic (Average, Maximum, Sum, etc.) calculated from all the data points published during your chosen period. A period of 5 minutes with an Average statistic means it collects all the data points for 5 minutes, averages them, and then uses that single averaged value for the threshold check. This is your first line of defense against noisy, flappy metrics.
Threshold: The actual number. Seems obvious, but the magic is in the comparison operator. Is it GreaterThanOrEqualToThreshold or GreaterThanThreshold? This matters at the boundary. I always use GreaterThanOrEqualToThreshold because I’m paranoid and want to know if we hit the number, not just exceed it.
Datapoints to Alarm & Evaluation Periods: This is the most important part and the one everyone gets wrong. Together they define how consistent the breach needs to be before the alarm fires, usually written “M out of N”: Evaluation Periods is N, the number of most recent periods examined, and Datapoints to Alarm is M, how many of those must be in breach. “2 out of 2” means the metric must be in breach for two consecutive periods, without a single good datapoint in between, to trigger ALARM. “3 out of 5” is more forgiving of flappy metrics; it still fires when up to two good datapoints are mixed in among the last five periods. For most production systems, you want at least “2 out of 3” or “3 out of 5” to prevent a single spike from paging you at 3 AM.
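To make the “M out of N” idea concrete, here is a minimal Python sketch of the comparison. It is a toy model, not CloudWatch’s actual evaluator: the function names `breaches` and `alarm_fires` are made up for illustration, and each list entry stands for one already-aggregated period statistic.

```python
from typing import Sequence


def breaches(value: float, threshold: float, inclusive: bool = True) -> bool:
    """Mimic GreaterThanOrEqualToThreshold (inclusive) or GreaterThanThreshold."""
    return value >= threshold if inclusive else value > threshold


def alarm_fires(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> bool:
    """Simplified M-out-of-N check over the most recent N period statistics."""
    window = datapoints[-evaluation_periods:]          # the last N periods
    breaching = sum(1 for v in window if breaches(v, threshold))
    return breaching >= datapoints_to_alarm            # at least M must breach


# One Average value per 5-minute period, oldest first (hypothetical data).
cpu = [40, 95, 50, 92, 91]

print(alarm_fires(cpu, 90, evaluation_periods=2, datapoints_to_alarm=2))      # True: 92 and 91 both breach
print(alarm_fires(cpu[:3], 90, evaluation_periods=2, datapoints_to_alarm=2))  # False: only 95 breaches
print(alarm_fires(cpu, 90, evaluation_periods=5, datapoints_to_alarm=3))      # True: 95, 92, 91 breach
print(breaches(90, 90), breaches(90, 90, inclusive=False))                    # True False: the boundary case
```

Note how the lone spike to 95 never fires the “2 out of 2” alarm on its own, which is the whole point of asking for more than one datapoint.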

Here’s a classic example: an alarm that triggers if the average CPU utilization of an EC2 instance is high for 10 straight minutes.

# A CloudFormation example because it's clearer than the console clicks.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  CpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm if CPU too high for 10 minutes, likely needs attention"
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: i-1234567890abcdef0
      Statistic: Average
      Period: 300 # 5 minutes. This is the granularity of the check.
      EvaluationPeriods: 2 # How many periods to check? 2 * 5min = 10min.
      DatapointsToAlarm: 2 # How many of those periods must be bad? 2.
      Threshold: 90
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: breaching # Missing data counts as a breach. A deliberate choice; see below.

The “TreatMissingData” Minefield

This setting is a masterpiece of under-the-radar importance. What happens if your instance is stopped, your application crashes and stops emitting metrics, or a network blip causes a data gap? The default behavior is missing, which means missing points are simply left out of the evaluation. CloudWatch looks a little further back for real datapoints to evaluate instead, and if it finds none at all, the alarm moves to INSUFFICIENT_DATA rather than ALARM. This is often disastrous: unless you’ve attached an action to the INSUFFICIENT_DATA state, a service that dies and stops reporting goes quiet instead of paging you.

For alarms where no data is a bad thing (like a heartbeat check), you must set TreatMissingData: breaching. For alarms where no data is probably fine (like CPU on a stopped instance), notBreaching might be better. For flappy services, ignore keeps the previous state. Think about this. Hard.
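The four settings are easier to compare side by side. The sketch below is a deliberately simplified model, not CloudWatch’s real algorithm (which also looks back over an extended evaluation range and has more nuanced rules for partially missing windows). The `evaluate` function and its parameters are hypothetical; `None` marks a period in which no data arrived.

```python
from typing import Optional, Sequence

TREATMENTS = {"missing", "breaching", "notBreaching", "ignore"}


def evaluate(window: Sequence[Optional[float]], threshold: float,
             datapoints_to_alarm: int, treat_missing: str,
             previous_state: str = "OK") -> str:
    """Toy evaluation of one window under a given TreatMissingData setting."""
    if treat_missing not in TREATMENTS:
        raise ValueError(f"unknown TreatMissingData value: {treat_missing}")

    # Nothing reported at all: 'missing' and 'ignore' never reach ALARM here.
    if all(v is None for v in window):
        if treat_missing == "missing":
            return "INSUFFICIENT_DATA"
        if treat_missing == "ignore":
            return previous_state

    breaching = 0
    for value in window:
        if value is None:
            # Only 'breaching' counts a gap as a bad datapoint.
            if treat_missing == "breaching":
                breaching += 1
        elif value >= threshold:
            breaching += 1
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# A service that has gone completely silent for two periods.
silent = [None, None]
for treatment in ("missing", "breaching", "notBreaching", "ignore"):
    print(treatment, "->", evaluate(silent, 90, 2, treatment, previous_state="OK"))
```

Only breaching turns silence into ALARM in this model, which is why it suits heartbeat-style alarms and is a poor fit for instances you stop on purpose.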

Anomaly Detection: When You Have No Idea What “Normal” Is

Setting a static threshold for, say, API latency is a fool’s errand. What’s “bad” on Monday morning is “great” on Saturday night. Anomaly detection uses machine learning (a simple statistical model, don’t get too excited) to learn the hourly, daily, and weekly patterns of a metric and creates a band of “expected” behavior.

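You alarm when the metric crosses outside that band. It’s fantastic for metrics with periodic patterns. The key thing to remember: the model trains on up to two weeks of the metric’s history. On a metric that has only just started reporting, the band will be rough until it has seen a few cycles of the pattern, so don’t start publishing on a Tuesday and expect a trustworthy band on Wednesday.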

# AWS CLI command to create the anomaly detector (the model, not the alarm itself)
aws cloudwatch put-anomaly-detector \
    --namespace "MyCustomNamespace" \
    --metric-name "ApiRequestLatency" \
    --dimensions Name="ApiName",Value="PaymentService" \
    --stat "Average"

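Then you create an alarm that triggers when the metric is outside the anomaly detection band. Instead of a fixed Threshold, the alarm points ThresholdMetricId at an ANOMALY_DETECTION_BAND(...) expression and uses a band-aware comparison operator such as LessThanLowerOrGreaterThanUpperThreshold. The console makes this relatively straightforward, but the CLI/CFN is a bit more verbose.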
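For a sense of what “more verbose” means, here is a sketch that builds the parameters for such an alarm as a plain Python dictionary, in the shape the PutMetricAlarm API expects as far as I recall it. The helper name `anomaly_alarm_params`, the alarm name, and the defaults (a band of 2 standard deviations, “2 out of 3” periods) are assumptions for illustration; check the API reference before relying on the exact fields.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions,
                         band_width=2, period=300,
                         evaluation_periods=3, datapoints_to_alarm=2):
    """Build keyword arguments for an anomaly detection alarm (hypothetical helper)."""
    return {
        "AlarmName": alarm_name,
        # Fire when the metric leaves the band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
        # No fixed Threshold: the band expression below is the threshold.
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": period,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "ReturnData": True,
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
            },
        ],
    }


params = anomaly_alarm_params(
    alarm_name="PaymentServiceLatencyAnomaly",  # made-up name
    namespace="MyCustomNamespace",
    metric_name="ApiRequestLatency",
    dimensions=[{"Name": "ApiName", "Value": "PaymentService"}],
)
print(params["Metrics"][1]["Expression"])  # ANOMALY_DETECTION_BAND(m1, 2)
```

With boto3 installed and credentials configured, the dictionary would be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`; the sketch stops short of that call so it runs anywhere.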

Composite Alarms: Alarming on Other Alarms

This is where we ascend to a higher plane of laziness and efficiency. A composite alarm is a Boolean expression that combines the states of other alarms. Why is this brilliant?

Reduce Noise: Instead of getting 10 pages for 10 different alarms during a deployment, you can create a composite like (ALARM on CPU OR ALARM on Memory) AND (NOT ALARM on DeploymentInProgress).
Cross-Resource Correlations: “Alert me if the database CPU is high AND the application error rate is also high.” This points you toward the root cause immediately.

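The rule is written in a small expression language of CloudWatch’s own, not CloudFormation’s: the functions ALARM(), OK(), and INSUFFICIENT_DATA() test the state of a child alarm, and AND, OR, and NOT combine the results. The hard part is getting the logic right.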

DeploymentAwareCpuAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmDescription: "CPU is high, but only if we're not currently deploying"
    AlarmRule: |
      (ALARM("arn:aws:cloudwatch:us-east-1:123456789012:alarm:MyCpuAlarm"))
      AND
      (NOT ALARM("arn:aws:cloudwatch:us-east-1:123456789012:alarm:DeploymentInProgressFlag"))
    ActionsEnabled: True

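The biggest gotcha? You’re now dependent on the other alarms. The composite only ever sees their states, so a child alarm that is misconfigured or parked in INSUFFICIENT_DATA quietly changes what the rule evaluates to. It’s a dependency tree, so manage it like one.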
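It helps to walk the rule through the child states by hand. The sketch below mirrors the AlarmRule above in plain Python, using the alarm names from the ARNs; `composite_state` is a hypothetical helper, and it simplifies by treating the composite as either ALARM or OK.

```python
def composite_state(child_states: dict) -> str:
    """Evaluate ALARM(MyCpuAlarm) AND NOT ALARM(DeploymentInProgressFlag)."""
    def alarm(name: str) -> bool:
        # Mirrors the ALARM("...") function: true only in the ALARM state.
        return child_states[name] == "ALARM"

    rule = alarm("MyCpuAlarm") and not alarm("DeploymentInProgressFlag")
    return "ALARM" if rule else "OK"


scenarios = [
    {"MyCpuAlarm": "ALARM", "DeploymentInProgressFlag": "OK"},                 # real problem
    {"MyCpuAlarm": "ALARM", "DeploymentInProgressFlag": "ALARM"},              # suppressed during deploy
    {"MyCpuAlarm": "ALARM", "DeploymentInProgressFlag": "INSUFFICIENT_DATA"},  # flag lost its data
    {"MyCpuAlarm": "INSUFFICIENT_DATA", "DeploymentInProgressFlag": "OK"},     # CPU child went blind
]
for states in scenarios:
    print(states, "->", composite_state(states))
```

The last two scenarios are the dependency problem in miniature: a flag alarm with no data no longer suppresses anything, and a CPU child with no data can never raise the composite.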

The bottom line: Alarms are simple tools. Their power comes from your thoughtful application of them. Choose your periods wisely, handle your missing data with intent, and for the love of all that is holy, use composite alarms to build context and silence noise. Your phone battery will thank you.