35.4 CloudWatch Logs: Log Groups, Log Streams, and Retention Policies

Right, let’s talk about CloudWatch Logs. This is where your application’s hopes, dreams, and, more importantly, its panicked error messages go to live. It’s the system of record for everything that happens in your AWS universe, but it’s not just a dumb text file in the sky. It has a specific, occasionally infuriating, structure you need to grasp.

At its core, CloudWatch Logs is built on two concepts: Log Groups and Log Streams. Think of a Log Group as a folder for a specific type of log. You might have a log group for /api/app, another for /api/auth, and another for your Lambda function my-broke-function (which Lambda names /aws/lambda/my-broke-function by default). The log group is where you set the big, important policies, like retention.

Inside each log group, you have Log Streams. A log stream is a sequence of log events that share the same source. If a Log Group is the folder, a Log Stream is the individual log file within that folder. For an EC2 instance, each instance will typically create its own log stream inside the group. For a Lambda function, every new execution environment (i.e., every cold start) spins up a new log stream. This is brilliant for isolation but can be maddening when you’re trying to trace a single request through a system that’s constantly creating new streams.
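When a request is scattered across a pile of streams, you don’t have to open them one by one. Here’s a minimal sketch that searches every stream in a group at once with filter-log-events. The function name find_in_group, the group name, and the request ID are all made up for illustration, and it assumes your credentials and region are already configured.

```shell
# Search every stream in a log group for a literal string (for example a request ID).
# find_in_group is a hypothetical helper name, not an AWS CLI command.
find_in_group() {
  local group="$1" needle="$2" minutes="${3:-60}"
  # filter-log-events expects --start-time in milliseconds since the epoch.
  local start_ms=$(( ( $(date +%s) - minutes * 60 ) * 1000 ))
  aws logs filter-log-events \
    --log-group-name "$group" \
    --start-time "$start_ms" \
    --filter-pattern "\"$needle\"" \
    --query 'events[].[logStreamName, message]' \
    --output text
}

# Example (placeholder names): look back 30 minutes for one request ID.
# find_in_group "/aws/lambda/my-broke-function" "3f2a9c1e" 30
```

Each output line starts with the stream the event came from, so you can see which execution environment handled which part of the request.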

The Nuts and Bolts of Groups and Streams

You rarely create streams manually. AWS agents (like the CloudWatch Agent on EC2) or services (like Lambda) create them on the fly when they need to send log data. Your job is to create the log group. Whatever service is logging will often create it for you, but it’s better to be explicit. Here’s how you’d do it with the AWS CLI:

aws logs create-log-group --log-group-name "/my/applications/super-app"

Why would you do this manually? Control. When a service creates it for you, it often uses default settings, like the dreaded “Never Expire” retention policy. Creating it yourself lets you set sane defaults from the get-go.
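If you want that control without caring whether the group already exists, something like the following sketch does the job. The name ensure_log_group and the 30-day default are my assumptions, not anything AWS ships. It checks for an exact name match first, because create-log-group fails if the group is already there.

```shell
# Create a log group only if it is missing, then pin its retention.
# ensure_log_group is a hypothetical helper; the 30-day default is an assumption.
ensure_log_group() {
  local group="$1" days="${2:-30}" existing
  # The prefix search can return sibling groups, so filter for an exact name match.
  existing=$(aws logs describe-log-groups \
    --log-group-name-prefix "$group" \
    --query "logGroups[?logGroupName=='$group'].logGroupName" \
    --output text)
  if [ -z "$existing" ]; then
    aws logs create-log-group --log-group-name "$group" || return 1
  fi
  # Setting retention is safe to repeat; it simply overwrites the old value.
  aws logs put-retention-policy \
    --log-group-name "$group" \
    --retention-in-days "$days"
}

# Example (placeholder name):
# ensure_log_group "/my/applications/super-app" 30
```

Run it before the service first logs and the group never spends a day on Never Expire.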

Now, sending a log event. You can’t send an event to just a log group; you must specify a log stream. If the stream doesn’t exist, the call fails with a ResourceNotFoundException. So the standard practice is to check whether the stream exists and create it if it doesn’t. The AWS SDKs expose these calls but leave the sequencing to you, which is exactly why the agents exist: to abstract away this slightly clunky dance.

# This is the kind of logic the agent handles. First, find or create the stream.
# The prefix search also matches longer names (e.g. instance-i-1234), so filter for an exact match.
STREAM_NAME=$(aws logs describe-log-streams --log-group-name "/my/app" --log-stream-name-prefix "instance-i-123" --query "logStreams[?logStreamName=='instance-i-123'].logStreamName" --output text)

if [ -z "$STREAM_NAME" ]; then
  # Stream doesn't exist, create it
  aws logs create-log-stream --log-group-name "/my/app" --log-stream-name "instance-i-123"
  STREAM_NAME="instance-i-123"
fi

# Now you can put a log event into the stream.
# Timestamps are milliseconds since the epoch. %3N needs GNU date; on macOS/BSD use "$(date +%s)000".
aws logs put-log-events \
  --log-group-name "/my/app" \
  --log-stream-name "$STREAM_NAME" \
  --log-events "timestamp=$(date +%s%3N),message='This is a log event. I hope you are happy.'"

See? A bit tedious. Use the agent.

Taming the Storage Beast: Retention Policies

This is the most important setting on your log group and the one most often forgotten until you get a terrifying bill. By default, log groups are set to Never Expire. This is AWS’s way of saying, “We’re happy to let you store petabytes of debug logs from 2016 forever, just please don’t complain about the invoice.”

You must, must, MUST set a retention policy. It’s trivial to do. The policy automatically deletes any log events older than the specified period. It’s set-and-forget data hygiene.

# Set your log group to retain logs for only 30 days. Be ruthless.
aws logs put-retention-policy --log-group-name "/my/app" --retention-in-days 30

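Your options are: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096, 1827, 2192, 2557, 2922, 3288, and 3653 days. Nothing in between is accepted, so asking for 45 days gets you a validation error rather than a rounded value. Yes, 3653 is roughly 10 years. I have no idea who needs a 10-year retention policy for CloudWatch Logs, but I suspect they are also the reason we can’t have nice things.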
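Because the API only takes specific values, it’s worth catching a bad number before it reaches AWS. The sketch below is one way to do that. The helper names are hypothetical, and the list of values is my copy of the PutRetentionPolicy reference at the time of writing, so check the current documentation before trusting it.

```shell
# Reject retention values the API would refuse, before calling AWS.
# Assumption: this list mirrors the PutRetentionPolicy reference at the time of writing.
VALID_RETENTION_DAYS="1 3 5 7 14 30 60 90 120 150 180 365 400 545 731 1096 1827 2192 2557 2922 3288 3653"

# Returns 0 if the given number of days is in the accepted list, 1 otherwise.
is_valid_retention() {
  local candidate="$1" days
  for days in $VALID_RETENTION_DAYS; do
    [ "$candidate" = "$days" ] && return 0
  done
  return 1
}

# set_retention is a hypothetical wrapper name, not part of the AWS CLI.
set_retention() {
  local group="$1" days="$2"
  if ! is_valid_retention "$days"; then
    echo "unsupported retention: $days days" >&2
    return 1
  fi
  aws logs put-retention-policy --log-group-name "$group" --retention-in-days "$days"
}

# Example (placeholder name):
# set_retention "/my/app" 30
```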

The best practice? Start aggressive. 30 days is plenty for most debugging and operational needs. For audit-related logs, maybe push to 365. Anything longer and you should be asking if this data shouldn’t be shipped to a cheaper, long-term storage solution like S3 Glacier, because CloudWatch Logs is decidedly not cheap for archive.
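Starting aggressive is easier if you can find the groups that slipped through. A group with no policy simply has no retentionInDays field in the describe-log-groups output, which the sketch below leans on. The function names and the 30-day default are assumptions for illustration, and you’ll want to read the list before letting anything delete data on your behalf.

```shell
# List log groups that still have no retention policy ("Never Expire").
# A group without a policy has no retentionInDays field in the API response.
list_unbounded_groups() {
  aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' \
    --output text | tr '\t' '\n'
}

# Apply a default retention to each of them. The 30-day default is an assumption.
cap_unbounded_groups() {
  local days="${1:-30}" group
  list_unbounded_groups | while IFS= read -r group; do
    [ -n "$group" ] || continue
    echo "setting $group to $days days"
    # Redirect stdin so the CLI cannot consume the list the loop is reading.
    aws logs put-retention-policy \
      --log-group-name "$group" \
      --retention-in-days "$days" < /dev/null
  done
}

# Example: review first, then cap.
# list_unbounded_groups
# cap_unbounded_groups 30
```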

The Quirks and How to Live With Them

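First, the sequence token. Notice I omitted it from the put-log-events example? That was intentional. Historically, every upload returned a nextSequenceToken that you had to pass back on the next call to the same stream, and getting it wrong earned you an InvalidSequenceTokenException. AWS has since relaxed this: PutLogEvents now ignores the sequence token and accepts the call regardless, so the old token bookkeeping is no longer required. What remains is batch discipline. Events in a batch must be in chronological order, a batch tops out at 10,000 events or 1,048,576 bytes (each event counts as its message size plus 26 bytes), and a single batch can’t span more than 24 hours. The agents, again, handle this for you. If you’re rolling your own integration, be prepared to manage the batching yourself. It’s a pain.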
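If you do roll your own uploader, batch limits are another piece of state you end up tracking. Here’s a rough sketch of the arithmetic, assuming the PutLogEvents limits as documented at the time of writing: at most 10,000 events per batch, and at most 1,048,576 bytes, where each event costs its UTF-8 message length plus 26 bytes. The helper names are hypothetical.

```shell
# Check whether a set of messages fits in one PutLogEvents batch.
# The limits below are assumptions taken from the API reference; verify them.
MAX_BATCH_EVENTS=10000
MAX_BATCH_BYTES=1048576
EVENT_OVERHEAD_BYTES=26

# Reads one message per line on stdin; prints "<events> <bytes>".
batch_size() {
  local events=0 bytes=0 line len
  while IFS= read -r line || [ -n "$line" ]; do
    # wc -c counts bytes, so multi-byte UTF-8 characters are sized correctly.
    len=$(( $(printf '%s' "$line" | wc -c) ))
    events=$(( events + 1 ))
    bytes=$(( bytes + len + EVENT_OVERHEAD_BYTES ))
  done
  echo "$events $bytes"
}

# fits_in_one_batch is a hypothetical helper: exit status 0 means "send it".
fits_in_one_batch() {
  set -- $(batch_size)
  [ "$1" -le "$MAX_BATCH_EVENTS" ] && [ "$2" -le "$MAX_BATCH_BYTES" ]
}

# Example (app.log is a placeholder file):
# fits_in_one_batch < app.log || echo "split this file into smaller batches"
```

This only covers the size and count limits. Ordering events by timestamp is still up to you.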

Second, the search. Don’t expect to grep a petabyte of logs in seconds. CloudWatch Logs Insights is powerful, but it’s a query engine, not a real-time tail. The latency between an event happening and it being searchable can be several seconds to minutes. This is the trade-off for managed, scalable log ingestion. It’s not a flaw, just a design choice you need to be aware of when you’re frantically debugging a production issue at 3 AM. Plan your logging strategy accordingly—log thoughtfully, with structure (JSON!), so you can actually find what you need without resorting to brute-force scans.
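For the record, here’s roughly what a Logs Insights query looks like from the CLI. Queries are asynchronous: start-query hands back an ID, and you poll get-query-results until the status settles. This is a sketch, not a hardened tool. The function name, the group name, the query, and the polling interval are all placeholders, and the example query assumes your events are JSON with a level field.

```shell
# Run a Logs Insights query and wait for the result.
# run_insights_query is a hypothetical helper; POLL_SECONDS is an optional override.
run_insights_query() {
  local group="$1" query="$2" minutes="${3:-60}"
  local now query_id status
  now=$(date +%s)
  # start-query takes its time range in seconds since the epoch.
  query_id=$(aws logs start-query \
    --log-group-name "$group" \
    --start-time $(( now - minutes * 60 )) \
    --end-time "$now" \
    --query-string "$query" \
    --query 'queryId' --output text) || return 1

  # Poll until the query is no longer queued or running.
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep "${POLL_SECONDS:-2}" ;;
      *) break ;;
    esac
  done
  echo "query $query_id finished with status: $status" >&2
  aws logs get-query-results --query-id "$query_id" --output json
}

# Example: the 20 most recent events whose JSON "level" field is "error".
# run_insights_query "/my/app" \
#   'fields @timestamp, @message | filter level = "error" | sort @timestamp desc | limit 20' 15
```

The filter on level only works because the events are structured, which is the whole argument for logging JSON in the first place.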