35.6 CloudWatch Agent: Collecting System-Level Metrics and Application Logs

Right, let’s talk about the CloudWatch Agent. You’ve probably noticed that the default, out-of-the-box CloudWatch metrics for your EC2 instances are… well, they’re pathetic. A few high-level CPU and network stats every five minutes? That’s like trying to diagnose an engine problem by listening to the car from a block away. It’s useless. The CloudWatch Agent is how you fix that. It’s a little daemon you install on your instances to collect a firehose of detailed system-level metrics (like memory, disk, and processes) and, crucially, ship your application logs directly to CloudWatch. Think of it as giving AWS a direct tap into the vitals of your machine.

Why You Need the Agent (The Default Metrics are a Joke)

I’m not kidding about the default metrics. They’re collected from the hypervisor, not the OS inside your instance. This means they have no visibility into memory pressure, disk swap, disk I/O wait times, or anything happening inside a container. If your application is memory-bound and thrashing swap, the default “CPUUtilization” metric will look perfectly fine while your users are screaming. The agent collects these missing metrics from the OS itself, giving you the real picture. It’s the difference between a weather report and having a barometer in the room.
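To make the hypervisor-versus-OS point concrete, here is a minimal sketch of the kind of data that only exists inside the guest. It reads /proc/meminfo on Linux and derives a memory-used percentage and swap usage. The function names are made up for illustration, and the formula (total minus available) is an assumption; the agent’s own mem_used_percent calculation may differ in detail.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_percent(meminfo):
    """Memory in use as a percentage, using total minus available (an assumed formula)."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


if __name__ == "__main__":
    proc = Path("/proc/meminfo")
    if proc.exists():  # Linux only; the hypervisor never sees this file
        info = parse_meminfo(proc.read_text())
        if "MemAvailable" in info:
            print(f"memory used: {used_percent(info):.1f}%")
        if "SwapTotal" in info and "SwapFree" in info:
            print(f"swap used:   {swap_used_kb(info)} kB")
```

None of these numbers are visible from outside the instance, which is exactly why the default metrics can’t report them.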

Installing and Configuring the Beast

A yum install alone won’t get you there. AWS, in its infinite wisdom, requires a multi-step process. First, you need to attach an IAM role to your instance that has the CloudWatchAgentServerPolicy managed policy. This gives the agent permission to actually send your data. No role, no data. Nothing shows up in CloudWatch and nothing tells you why unless you go digging in the agent’s own log, which is a fantastic user experience.
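If you’d rather script the role than click through the console, the steps look roughly like this. It’s a sketch, not the one true way: the function name and the idea of reusing the role name for the instance profile are my own choices, and the iam argument is assumed to be a boto3 IAM client (boto3.client("iam")) that you create yourself.

```python
import json

# AWS-managed policy named in the text above.
AGENT_POLICY_ARN = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"


def ec2_trust_policy():
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_agent_role(iam, role_name):
    """Create a role plus instance profile carrying the agent's managed policy.

    `iam` is assumed to be a boto3 IAM client. The instance profile reuses the
    role name purely for convenience (a hypothetical convention).
    """
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=AGENT_POLICY_ARN)
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(
        InstanceProfileName=role_name, RoleName=role_name
    )
```

Calling create_agent_role(boto3.client("iam"), "my-app-cloudwatch-agent") (a made-up role name) leaves you with an instance profile you still have to attach to the instance, either at launch or afterwards.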

Next, you need to install the agent. On Amazon Linux 2, the package is in the standard yum repositories. This is the least painful way:

sudo yum install -y amazon-cloudwatch-agent

For Ubuntu and Debian, it’s a .deb download. For most other distributions, it’s an .rpm you download from AWS yourself. Consult the docs, but grumble while you do it.

Now, the heart of the operation: the configuration file. The agent uses a JSON file that defines everything it should collect. You can use the amazon-cloudwatch-agent-config-wizard to generate a basic one, but you’ll outgrow it instantly. Let’s look at a more realistic, hand-rolled example. Save this as /opt/aws/amazon-cloudwatch-agent/bin/config.json (the same path the wizard writes to; the agent only reads it when you point the fetch-config command at it).

{
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait"
        ],
        "metrics_collection_interval": 60,
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "/",
          "/app"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      }
    },
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/my-app/app.log",
            "log_group_name": "/my-app/ec2/app.log",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          },
          {
            "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/my-app/ec2/nginx/access.log",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 7
          }
        ]
      }
    }
  }
}

This config does two things:

Metrics: Collects CPU idle/iowait, disk usage for / and /app, and memory usage, all every 60 seconds.
Logs: Ships two log files to CloudWatch Logs, putting them in their own log groups with a sensible retention policy and using the instance ID as the log stream name.

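See the {instance_id} and ${aws:InstanceId}? Those are placeholders, not literal strings. The agent resolves them from the instance’s metadata at runtime, so every log stream and metric dimension is stamped with the ID of the instance it came from, which is brilliant for slicing your data later.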
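Hand-rolled JSON is easy to get subtly wrong, so it’s worth a quick sanity check before you hand the file to the agent. The sketch below is not an official validator; the function name and the three rules it applies are my own house rules, based on the fields used in the example config. It catches malformed JSON, sub-minute collection intervals, and log entries with no file_path or no retention_in_days.

```python
import json


def review_agent_config(text):
    """Return a list of warnings for a CloudWatch Agent config given as JSON text.

    Raises ValueError if the text is not valid JSON (for example, a trailing comma).
    """
    config = json.loads(text)
    warnings = []

    # Metrics: flag anything collected more often than once a minute.
    collected = config.get("metrics", {}).get("metrics_collected", {})
    for plugin, settings in collected.items():
        interval = settings.get("metrics_collection_interval")
        if interval is not None and interval < 60:
            warnings.append(f"{plugin}: {interval}s interval is below 60 seconds")

    # Logs: every entry needs a file to read, and should carry a retention setting.
    files = config.get("logs", {}).get("logs_collected", {}).get("files", {})
    for entry in files.get("collect_list", []):
        path = entry.get("file_path")
        if path is None:
            warnings.append("log entry has no file_path")
            path = "<unknown file>"
        if "retention_in_days" not in entry:
            warnings.append(f"{path}: retention_in_days is not set")
    return warnings


# Example usage (path taken from the text above):
#   with open("/opt/aws/amazon-cloudwatch-agent/bin/config.json") as fh:
#       for warning in review_agent_config(fh.read()):
#           print(warning)
```

An empty list means the file parsed and passed these particular checks, nothing more. The agent does its own validation when it loads the config.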

Starting, Stopping, and Not Screwing It Up

With the config file in place, you start it:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json

This command fetches the config, starts the agent, and sets it up to run on boot. The most common pitfall here? File permissions. The agent runs as root unless you tell it otherwise, but the sensible setup is to drop privileges by setting run_as_user to cwagent in the config’s agent section. Once you do that, if the cwagent user can’t read your application log file (e.g., if it’s owned by root:root with 640 permissions), your logs will never leave the instance. Always chmod and chown your log files appropriately. Check the agent’s own logs at /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log for any permission denied errors.
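You can check the permission question before the agent ever starts. The sketch below applies the classic Unix owner/group/other rules to a file’s mode bits. The function names are invented for this example, and it deliberately ignores ACLs and the execute bits on parent directories, both of which can still block access, so treat a True result as “probably fine” rather than proof.

```python
import grp
import os
import pwd
import stat


def mode_allows_read(mode, file_uid, file_gid, uid, gids):
    """Apply owner/group/other read rules for a user with the given uid and group ids."""
    if uid == 0:
        return True  # root bypasses ordinary permission bits
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)  # owner bits win, even if group bits are looser
    if file_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def user_can_read(path, username):
    """Check whether `username` can read `path` according to its mode bits only."""
    user = pwd.getpwnam(username)  # raises KeyError if the user does not exist
    gids = {user.pw_gid}
    gids.update(g.gr_gid for g in grp.getgrall() if username in g.gr_mem)
    st = os.stat(path)
    return mode_allows_read(st.st_mode, st.st_uid, st.st_gid, user.pw_uid, gids)


# Example usage on an instance where the cwagent user exists:
#   user_can_read("/var/log/my-app/app.log", "cwagent")
```

For the root:root, 640 case from the text, this returns False for cwagent, which matches the symptom: the agent is running, the log group may even exist, and nothing arrives.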

The Gotchas and Best Practices

Don’t Go Nuts: It’s tempting to collect every metric at a 1-second interval. Don’t. You will drown in data and charges. Start with 60-second intervals and only decrease it for truly critical, volatile metrics.
Centralized Configuration is King: Manually SSH-ing into instances to edit JSON files is for masochists. Use SSM Parameter Store to hold your agent configuration. Then your User Data script can pull the config on boot. This makes changing the collection profile across a fleet of instances a one-change operation.
Mind the Log Volume: Application logs can get chatty. CloudWatch Logs charges by ingestion and storage. That retention_in_days field in the config isn’t a suggestion; it’s a cost control. Be ruthless with it. Use 7 days for debug logs, 30 for access logs, and maybe 90 for audit logs. For high-volume apps, consider shipping to S3 and using Athena instead.
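The SSM Agent is Your Friend: If you install and update the CloudWatch Agent through Systems Manager (Run Command or State Manager, which is the sane way to do it across a fleet), everything rides on the SSM Agent. If your SSM Agent is old or broken, those rollouts stop working. Keep it updated.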
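The centralized-configuration advice can be sketched in a few lines. This assumes ssm_client is a boto3 SSM client (boto3.client("ssm")) and that you name the parameter with the AmazonCloudWatch- prefix, which, as far as I know, is the pattern the CloudWatchAgentServerPolicy lets the instance read; verify that against the policy in your account. The function names and the parameter name in the usage note are hypothetical.

```python
import json

# Standard-tier Parameter Store values are limited to 4 KB.
STANDARD_TIER_LIMIT_BYTES = 4096

AGENT_CTL = "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl"


def publish_agent_config(ssm_client, name, config):
    """Store an agent config (a dict) in Parameter Store as compact JSON.

    `ssm_client` is assumed to be a boto3 SSM client. Returns the stored string.
    """
    value = json.dumps(config, separators=(",", ":"))
    size = len(value.encode("utf-8"))
    if size > STANDARD_TIER_LIMIT_BYTES:
        raise ValueError(f"config is {size} bytes, too large for a standard parameter")
    ssm_client.put_parameter(Name=name, Value=value, Type="String", Overwrite=True)
    return value


def fetch_config_command(name):
    """Argument list for the instance-side command that loads the config from SSM."""
    return ["sudo", AGENT_CTL, "-a", "fetch-config", "-m", "ec2", "-s", "-c", f"ssm:{name}"]
```

Publish once with something like publish_agent_config(boto3.client("ssm"), "AmazonCloudWatch-my-app", config), then have User Data run the command from fetch_config_command on each instance. It is the same fetch-config invocation shown earlier, with ssm: in place of file: as the config source.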