26.7 Alertmanager: Routing and Silencing Alerts

Alright, let’s get our hands dirty with Alertmanager. You’ve set up Prometheus, it’s firing alerts, and now your inbox is getting flooded because InstanceDown is pinging for that one dev node everyone knows fails every Tuesday. This is where Alertmanager earns its keep. It’s not just a dumb forwarder; it’s the traffic cop, the bouncer, and the notification router for your entire alerting system. Its job is to take the firehose of alerts from Prometheus and route them to the correct people, in the correct way, and only when it absolutely should.

The Almighty Routing Tree

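At the heart of Alertmanager is the route tree, defined in your alertmanager.yml file. Think of it as a flowchart for your alerts. Every alert enters at the root route, which matches everything, and then works its way down. At each level the child routes are checked in order against labels like alertname, severity, or any custom label you’ve added, and the first child that matches takes the alert (unless that child sets continue: true, in which case later siblings get a look too). The alert keeps descending until no child matches, and the deepest route it reached decides where it goes.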

Why a tree? Because it lets you build complex routing logic from simple, nested rules, and because child routes inherit any setting they don’t override, such as receiver or group_by. You set your default, catch-all policies at the root, and then you carve out more specific exceptions as you go deeper. Here’s a basic but functional example:

route:
  receiver: 'default-team-email'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    - matchers:
        - severity="page"
      receiver: 'oncall-pagerduty'
      repeat_interval: 1h # Page me every hour until it's fixed, thanks.

    - matchers:
        - team="frontend"
      receiver: 'frontend-slack-channel'
      routes:
        - matchers:
            - alertname="HighRequestLatency"
          repeat_interval: 30m # Bug the frontend team more often about latency.

See what we did there? The root catches everything and sends it to email. But if an alert has severity="page", it gets routed to PagerDuty instead. Next comes a route for anything with team="frontend", which goes to Slack. Within that frontend branch, an even more specific rule for the HighRequestLatency alert inherits the Slack receiver and only changes the repeat interval. The tree structure keeps this hierarchy logical and maintainable. The most common pitfall is putting a too-general route above a specific one and having it greedily catch everything: here, a frontend alert that also has severity="page" stops at the PagerDuty route and never reaches Slack. Order matters among siblings. Always put your most specific routes first.
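If you would rather have a paging frontend alert reach both PagerDuty and Slack, the usual tool is continue: true on the earlier sibling. The fragment below is a minimal sketch that reuses the receiver names from the example above; it assumes those receivers are defined elsewhere in the same file.

```yaml
route:
  receiver: 'default-team-email'
  routes:
    - matchers:
        - severity="page"
      receiver: 'oncall-pagerduty'
      continue: true # Notify PagerDuty, then keep checking the siblings below.

    - matchers:
        - team="frontend"
      receiver: 'frontend-slack-channel'
```

Without continue: true, matching stops at the first sibling that fits, which is exactly the greedy behavior described above.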

Silencing: For When You Know the Sky Isn’t Actually Falling

Let’s be honest. Sometimes you need to shut an alert up. You’re rebooting the entire database cluster. You know everything will be down for the next ten minutes. The last thing you need is a symphony of pagers going off. This is where silences come in.

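Silences are temporary muzzles you place on alerts that match a set of label matchers. You create them either through the Alertmanager web UI or its API. They are the right way to handle planned outages; disabling the alert rules entirely is not, because a disabled rule stays disabled until someone remembers to turn it back on, while a silence expires on its own.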

The web UI is dead simple: you click “New Silence,” specify the labels (e.g., instance=~"db-.*", alertname="InstanceDown"), set a duration, and add a comment like “Database reboot, go back to sleep.” The key here is the comment. Always. Add. A Comment. Future-you, or your colleague at 3 AM, will have no idea why this critical silence exists, and they’ll be too scared to remove it. A good comment tells everyone this is a planned, temporary action.

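The programmatic way is more powerful, because you can script it into your maintenance tooling. The timestamps below are UTC placeholders; substitute your own maintenance window: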

curl -X POST -H "Content-Type: application/json" \
  http://alertmanager:9093/api/v2/silences \
  -d '{
    "matchers": [
      { "name": "alertname", "value": "NodeNetworkDown", "isRegex": false },
      { "name": "zone", "value": "us-west-1a", "isRegex": false }
    ],
    "startsAt": "2023-10-26T19:00:00.000Z",
    "endsAt": "2023-10-26T21:00:00.000Z",
    "createdBy": "your-username",
    "comment": "Network hardware maintenance in us-west-1a"
  }'

The biggest “gotcha” with silences? They match on the labels of the alert itself, not on the labels of the underlying metric. An alert only carries the labels that survive the rule’s expression, plus whatever the rule adds under labels:. If the expression aggregates zone away and the rule doesn’t put it back, you can’t silence by it. This is why your alerting rules need to be richly labeled: it gives you the granularity to silence precisely without accidentally muting something important.
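To make that concrete, here is a sketch of a Prometheus alerting rule whose alerts would be matched by the silence above. It is illustrative only: the expression, the for: duration, and the team label are assumptions, and it presumes your scrape targets already carry a zone label.

```yaml
groups:
  - name: network
    rules:
      - alert: NodeNetworkDown
        # Aggregating "by (zone, instance)" keeps both labels on the alert.
        # A bare sum(...) would drop them, and zone could no longer be silenced on.
        expr: sum by (zone, instance) (node_network_up{device!="lo"}) == 0
        for: 5m
        labels:
          severity: page # Added by the rule; also available to silences and routes.
          team: infra    # Hypothetical team label.
        annotations:
          summary: 'No network interface is up on {{ $labels.instance }} in {{ $labels.zone }}'
```

Annotations such as summary are not labels, so they cannot be used in silence matchers.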

Inhibition: Stopping the Alert Cascade

Here’s a brilliantly absurd but common scenario: your entire AWS availability zone catches fire. You get a page for AZDown. Then you get a page for InstanceDown for every node in that AZ. Then you get a page for PodDown for every pod on those nodes. Then you get a page for HighErrorRate for every service those pods ran. Your pager melts. You cry.

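Inhibition is how you stop this. It’s a rule that says, “If this high-severity alert is firing, suppress notifications for other alerts that are a direct consequence of it.” Unlike a silence, nobody has to create it in the middle of the incident, and it lifts on its own once the source alert resolves. You configure this in your alertmanager.yml: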

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['zone', 'cluster']

This rule reads: “If a critical alert fires, inhibit any warning alerts that have the same zone and cluster label values.” So when AZDown (severity: critical) fires, it automatically suppresses all those InstanceDown (severity: warning) alerts in the same zone and cluster, preventing the notification storm. One catch: a missing label counts as an empty value, so if neither alert carries zone or cluster at all, the rule still treats them as equal and the inhibition applies. It’s a brutal but necessary tool for keeping your alert volume sane during a real catastrophe. Use it judiciously; you don’t want to accidentally suppress a warning that’s actually the root cause.
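One way to lower that risk is to key the rule on alert names rather than on severity alone. The fragment below is a sketch, not a recommendation for every setup: it reuses the alert names from the scenario above and assumes every one of those alerts carries a zone label.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="AZDown"
    target_matchers:
      # Only the alerts known to be downstream of an AZ outage.
      - alertname=~"InstanceDown|PodDown"
    # The target is muted only when its zone equals the zone on AZDown.
    equal: ['zone']
```

A narrow rule like this leaves unrelated warnings in the same zone untouched, at the cost of having to extend the target list as you add new downstream alerts.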