9.8 ALB Access Logs and CloudWatch Metrics

Right, let’s talk about visibility. You’ve deployed your ALB, traffic is flowing, and everything seems fine. “Seems fine” isn’t evidence, though, and you don’t have to fly blind. You’ve got two phenomenal tools to figure out exactly what your load balancer is doing: Access Logs, which are the raw, unfiltered truth of every single request, and CloudWatch Metrics, which are the digested, high-level summary. One is the detailed transaction history; the other is your monthly bank statement. You need both to get the full picture.

First up, the gossip column of your infrastructure: Access Logs.

Enabling and Configuring Access Logs

This isn’t on by default because, well, if AWS logged every request for everyone by default, they’d need to build a new data center just to store the log files. So you have to opt in, and the logs land in a bucket you own and pay for. You’ll point the logs to an S3 bucket in the same region as the ALB, which is the right place for them. It’s durable, cheap, and you can analyze them later with Athena or whatever tool strikes your fancy.

Here’s how you do it with the CLI. Notice how you need to give the ALB permission to write to the bucket? That’s a classic “I just spent 45 minutes debugging a permission error” moment, so let’s get it right the first time.

# First, create the bucket (if you haven't already)
aws s3 mb s3://my-alb-logs-bucket

# Create a bucket policy that grants the ELB log-delivery account permission to write to it.
# 127311923021 is the ELB account for us-east-1; other regions use a different account ID.
# The resource path starts with the same prefix (my-app) that is set on the ALB below.
# Replace <YOUR_ACCOUNT_ID> with your own 12-digit account ID.
cat > bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::127311923021:root"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-alb-logs-bucket/my-app/AWSLogs/<YOUR_ACCOUNT_ID>/*"
    }
  ]
}
EOF

# Apply the policy to the bucket
aws s3api put-bucket-policy --bucket my-alb-logs-bucket --policy file://bucket-policy.json

# Now, enable the logs on the ALB itself. You need the ARN of your load balancer.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef \
  --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=my-alb-logs-bucket Key=access_logs.s3.prefix,Value=my-app

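The most important part there is that Principal ARN. It looks like you’re delegating access to some random AWS account, but that’s the account AWS uses for ELB log delivery in us-east-1. It is region-specific: every older region has its own account ID, and the newest regions use the logdelivery.elasticloadbalancing.amazonaws.com service principal instead. Look up the right one for your region in the ELB documentation before you copy this.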
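If you manage more than one ALB, it can help to generate the policy rather than hand-edit JSON. Here is a minimal Python sketch of that idea. The function name build_log_bucket_policy and the ELB_ACCOUNT_IDS table are illustrative, and the table only lists a few regions; treat the IDs as assumptions and check them against the AWS documentation for your region.

```python
import json

# ELB log-delivery account IDs for a few regions.
# Assumption: verify these against the AWS documentation before relying on them.
ELB_ACCOUNT_IDS = {
    "us-east-1": "127311923021",
    "us-west-2": "797873946194",
    "eu-west-1": "156460612806",
}


def build_log_bucket_policy(bucket, account_id, region, prefix=""):
    """Return a bucket policy dict that lets the regional ELB account write logs."""
    elb_account = ELB_ACCOUNT_IDS[region]  # raises KeyError for regions not listed
    cleaned = prefix.strip("/")
    key_prefix = f"{cleaned}/" if cleaned else ""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{elb_account}:root"},
                "Action": "s3:PutObject",
                # The ALB writes under <prefix>/AWSLogs/<account-id>/ in the bucket.
                "Resource": f"arn:aws:s3:::{bucket}/{key_prefix}AWSLogs/{account_id}/*",
            }
        ],
    }


if __name__ == "__main__":
    policy = build_log_bucket_policy(
        "my-alb-logs-bucket", "123456789012", "us-east-1", prefix="my-app"
    )
    print(json.dumps(policy, indent=2))
```

Redirect the output to bucket-policy.json and apply it with the put-bucket-policy command shown above. The point of building the Resource from the prefix is that the policy and the ALB's access_logs.s3.prefix attribute cannot drift apart.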

Interpreting the Goldmine of Data in a Log Entry

Once this is enabled, the ALB publishes gzipped log files to your S3 bucket roughly every five minutes, so expect the first ones within a few minutes. A single line looks intimidating, but it’s a treasure trove. Let’s break down the most useful fields, because type and elb are about as exciting as plain toast.

http 2023-10-27T19:15:34.123456Z app/my-alb/1234567890abcdef 192.0.2.1:3456 192.0.2.202:80 0.000 0.002 0.000 200 200 0 100 "GET http://example.com:80/static/image.jpg HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0)" - - arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890abcdef "Root=1-12345678-1234567890abcdef12345678" "-" "-" 0 2023-10-27T19:15:34.123000Z "forward" "-" "-" "192.0.2.202:80" "200" "-" "-"

request_processing_time (0.000): The time from the ALB receiving the request to the ALB sending it to a target. This is almost always minuscule. If it’s high, the ALB itself is choking, which is rare and very bad. A value of -1 means the request never made it to a target.
target_processing_time (0.002): This is the money field. It is the time from the ALB sending the request to the target until the target starts sending response headers. This is your application’s latency. You care about this a lot.
response_processing_time (0.000): The time from the ALB receiving the response headers from the target to the ALB starting to send the response to the client. Also usually tiny.
elb_status_code and target_status_code (both 200): The first is the status code the ALB sent to the client. The second is what your application sent to the ALB. When the target answered, these normally match. If they don’t, you’ve got a problem. An elb_status_code of 502, 503, or 504 with a target_status_code of - means the ALB generated the error itself: the target closed or refused the connection, there was no healthy target to send to, or the target timed out (maybe your app is overwhelmed and can’t accept sockets?).
received_bytes and sent_bytes (0 and 100): How much data went in and out. Great for basic traffic analysis.
request (“GET http://…”): The full request line, including the host. Crucial for debugging.
user_agent (“Mozilla/5.0…”): What the client claimed to be. Useful for filtering out bot traffic.
target_group_arn: Which target group served this request. Essential if you have multiple behind one ALB.
trace_id (“Root=1-…”): The contents of the X-Amzn-Trace-Id header. If you’re using X-Ray (and you should for distributed tracing), this links the ALB request to the rest of the trace.
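To make those fields easier to poke at, here is a minimal Python sketch that splits a log line into a dictionary. It assumes the 29-field layout of the sample entry above and uses shlex to handle the quoted fields; the names ALB_FIELDS and parse_alb_log_line are illustrative, and real-world user agents with unusual quoting may need a sturdier parser.

```python
import shlex

# Field names in the order they appear in the sample entry (29 fields).
# Newer log entries may append extra fields; zip() simply ignores them.
ALB_FIELDS = [
    "type", "time", "elb", "client_port", "target_port",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent", "ssl_cipher",
    "ssl_protocol", "target_group_arn", "trace_id", "domain_name",
    "chosen_cert_arn", "matched_rule_priority", "request_creation_time",
    "actions_executed", "redirect_url", "error_reason", "target_port_list",
    "target_status_code_list", "classification", "classification_reason",
]

TIMING_FIELDS = (
    "request_processing_time",
    "target_processing_time",
    "response_processing_time",
)

SAMPLE_LINE = (
    'http 2023-10-27T19:15:34.123456Z app/my-alb/1234567890abcdef '
    '192.0.2.1:3456 192.0.2.202:80 0.000 0.002 0.000 200 200 0 100 '
    '"GET http://example.com:80/static/image.jpg HTTP/1.1" '
    '"Mozilla/5.0 (Windows NT 10.0)" - - '
    'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/1234567890abcdef '
    '"Root=1-12345678-1234567890abcdef12345678" "-" "-" 0 '
    '2023-10-27T19:15:34.123000Z "forward" "-" "-" "192.0.2.202:80" "200" "-" "-"'
)


def parse_alb_log_line(line):
    """Split one ALB access log line into a dict keyed by field name."""
    entry = dict(zip(ALB_FIELDS, shlex.split(line)))
    for name in TIMING_FIELDS:
        # Timings are seconds; -1 means the request never reached a target.
        entry[name] = float(entry[name])
    return entry


if __name__ == "__main__":
    entry = parse_alb_log_line(SAMPLE_LINE)
    print(entry["request"], entry["elb_status_code"], entry["target_processing_time"])
```

Status codes are left as strings on purpose, because target_status_code is a bare - when the ALB never got an answer from the target.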

Correlating with CloudWatch Metrics

Access logs are for deep, request-level forensics. CloudWatch Metrics are for answering questions like “Is my API getting slower?” or “Are we getting a ton of 5xx errors right now?” at a glance.

The ALB emits a ton of metrics, but you’ll live in these few:

RequestCount: The total number of requests. Duh.
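HTTPCode_ELB_5XX_Count: Errors generated by the ALB itself (e.g., it can’t find a healthy target). This is a critical alarm. An occasional blip happens, but if this stays non-zero, the ALB is not getting usable responses from your targets, which usually means your application is down, overloaded, or timing out.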
HTTPCode_Target_5XX_Count: Errors generated by your application. You should also alarm on a high rate of these.
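TargetResponseTime: The time from the request leaving the ALB until the target starts sending response headers, which is the same thing target_processing_time records in the logs. This is your primary latency metric. Graph it, and look at p95 or p99 as well as the average, because averages hide slow outliers. Set alarms on it if it pokes above a threshold.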
HealthyHostCount / UnHealthyHostCount: The state of your targets. Another top-tier alarm candidate.
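As a rough illustration of alarming on the metrics above, the following Python sketch builds the arguments for CloudWatch's put_metric_alarm call. The alarm names, thresholds, and evaluation windows are placeholder assumptions to tune for your own traffic, and the helper names are hypothetical. The boto3 call is kept in a separate function so the builder can be inspected without AWS credentials.

```python
def lb_dimension_from_arn(load_balancer_arn):
    """Return the LoadBalancer dimension value, e.g. app/my-alb/1234567890abcdef."""
    return load_balancer_arn.split(":loadbalancer/", 1)[1]


def build_alb_alarms(lb_dimension, latency_threshold_seconds=1.0):
    """Return keyword-argument dicts for CloudWatch put_metric_alarm."""
    common = {
        "Namespace": "AWS/ApplicationELB",
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension}],
        "Period": 60,             # seconds per datapoint
        "EvaluationPeriods": 5,   # assumed: five consecutive minutes must breach
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not an error
    }
    return [
        {**common, "AlarmName": "alb-elb-5xx",
         "MetricName": "HTTPCode_ELB_5XX_Count",
         "Statistic": "Sum", "Threshold": 0},
        {**common, "AlarmName": "alb-target-5xx",
         "MetricName": "HTTPCode_Target_5XX_Count",
         "Statistic": "Sum", "Threshold": 10},  # assumed tolerance per minute
        {**common, "AlarmName": "alb-latency-p95",
         "MetricName": "TargetResponseTime",
         "ExtendedStatistic": "p95", "Threshold": latency_threshold_seconds},
    ]


def create_alarms(alarms):
    """Create the alarms in CloudWatch; needs boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    cloudwatch = boto3.client("cloudwatch")
    for kwargs in alarms:
        cloudwatch.put_metric_alarm(**kwargs)


if __name__ == "__main__":
    arn = ("arn:aws:elasticloadbalancing:us-east-1:123456789012:"
           "loadbalancer/app/my-alb/1234567890abcdef")
    for alarm in build_alb_alarms(lb_dimension_from_arn(arn)):
        print(alarm["AlarmName"], alarm["MetricName"], alarm["Threshold"])
```

These alarms have no actions attached, so they only change state. In practice you would add an AlarmActions entry pointing at an SNS topic, and a HealthyHostCount alarm would also need the TargetGroup dimension.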

The key insight is to use them together. You see a spike in TargetResponseTime on your CloudWatch dashboard? Jump into the Athena-powered access logs for that same 5-minute period and figure out which requests were slow. Was it all of them? Just ones to a specific endpoint? Maybe just for a specific client? The logs will tell you. The metrics tell you that something happened; the logs tell you what exactly it was.
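If you have already downloaded the log files for the window in question, you don't even need Athena for a first look. Below is a minimal Python sketch, assuming the field order of the sample entry earlier (time at position 2, target_processing_time at position 7, request at position 13). The function name slowest_paths is illustrative.

```python
import shlex
from collections import defaultdict
from datetime import datetime, timezone
from urllib.parse import urlsplit


def parse_log_time(stamp):
    """Parse an ALB log timestamp such as 2023-10-27T19:15:34.123456Z (UTC)."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def slowest_paths(lines, start, end, top=5):
    """Rank (method, path) pairs by worst target latency within [start, end)."""
    latencies = defaultdict(list)
    for line in lines:
        fields = shlex.split(line)
        when = parse_log_time(fields[1])
        latency = float(fields[6])      # target_processing_time, -1 if no target
        request = fields[12].split(" ")  # "METHOD URL PROTOCOL"
        if not (start <= when < end) or latency < 0 or len(request) != 3:
            continue
        method, url, _protocol = request
        latencies[(method, urlsplit(url).path)].append(latency)
    ranked = sorted(latencies.items(), key=lambda item: max(item[1]), reverse=True)
    return [(method, path, len(vals), max(vals))
            for (method, path), vals in ranked[:top]]


if __name__ == "__main__":
    # Made-up entries, truncated after ssl_protocol, just to show the shape.
    template = ('http {ts} app/my-alb/1234567890abcdef 192.0.2.1:3456 '
                '192.0.2.202:80 0.000 {latency} 0.000 200 200 0 100 '
                '"GET http://example.com:80{path} HTTP/1.1" "curl/8.0" - -')
    lines = [
        template.format(ts="2023-10-27T19:15:34.123456Z", latency="0.002",
                        path="/static/image.jpg"),
        template.format(ts="2023-10-27T19:16:00.000000Z", latency="2.500",
                        path="/api/report"),
    ]
    window_start = datetime(2023, 10, 27, 19, 15, tzinfo=timezone.utc)
    window_end = datetime(2023, 10, 27, 19, 20, tzinfo=timezone.utc)
    for method, path, count, worst in slowest_paths(lines, window_start, window_end):
        print(method, path, count, worst)
```

To feed it real data, open each downloaded file with gzip.open(path, "rt") and pass the file object as lines. Ranking by the maximum is a deliberate simplification; for anything beyond a quick look, a percentile per path is more robust.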