35.5 Logs Insights: Querying Logs with a SQL-Like Language
Alright, let’s talk about Logs Insights. This is the part where we stop just collecting logs and start actually using them. You’ve been dumping text into a log group for ages, treating it like a black box that you only open during a five-alarm fire. No more. Logs Insights gives you a SQL-ish language to crack that box open and ask it pointed questions. It’s not full SQL, mind you—the CloudWatch team took SQL out back, did some… modifications… and brought back something that’s both powerful and occasionally infuriatingly different. But we work with what we have.
The magic here is that you’re not just grepping text files. Logs Insights automatically discovers fields in common log formats (JSON, logs vended by AWS services such as Lambda or VPC Flow Logs, or even your custom stuff if you help it out) and tags every event with system fields like @timestamp, @message, and @logStream. Suddenly, that string of text becomes a queryable row. It’s the difference between looking for a needle in a haystack and having a magnet.
The Basic Anatomy of a Query
Every query follows a simple pattern, with commands chained together by the pipe character (|): fields picks the columns you want to see, filter cuts the mountain of logs down to the ones you care about, and sort and limit keep the results from melting your browser. The stats command is where the real power lies, letting you aggregate and calculate like a pro.
Let’s say you have a Lambda function spitting out JSON logs. A simple query to pull the 20 most recent events that mention ERROR would look like this:
fields @timestamp, @message, @logStream
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
But that’s brute force. Since your logs are structured JSON, you can do better. Let’s assume your log event has a level field and a requestId field.
fields @timestamp, @message, requestId
| filter level = "ERROR"
| sort @timestamp desc
| limit 20
See the difference? No regex needed. We’re querying on a discrete field. This is why structuring your logs (hint: just print JSON) is the single biggest favor you can do for your future self.
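If you’re wondering what “just print JSON” looks like in practice, here is a minimal sketch in Python. The log_event helper and the requestId value are made up for illustration. The only real idea is one JSON object per line, with level and requestId as top-level keys so the query above can filter on them.

```python
import json
import sys


def log_event(level, message, **fields):
    """Write one JSON object per line so each key can become a queryable field."""
    record = {"level": level, "message": message, **fields}
    line = json.dumps(record)
    # In Lambda, anything written to stdout ends up in the function's log group.
    print(line, file=sys.stdout)
    return line


# Hypothetical usage inside a request handler:
log_event("ERROR", "Failed to connect to database", requestId="req-123")
```

Returning the line is only there to make the helper easy to test; in real code you would probably wrap your logging library instead.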
Aggregating with stats - This is Where You Earn Your Pay
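The stats command is the workhorse. You can group by any field and calculate counts, averages, percentiles: the usual suspects. Let’s find the number of errors per log stream. Note that the query itself says nothing about time; to cover the last hour, set the time range selector to one hour before you run it.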
filter level = "ERROR"
| stats count() by @logStream
Want to see the average duration of your Lambda requests, but only for successful ones? Easy, assuming your events carry a numeric duration field and successful requests are logged at level INFO.
filter level = "INFO" and duration > 0
| stats avg(duration) by bin(5m)
That bin(5m) is crucial. It groups your data into 5-minute time buckets, giving you a time series right there in your results.
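To make the bucketing concrete, here is roughly what bin(5m) combined with avg() does, sketched in plain Python. This illustrates the idea and is not CloudWatch’s implementation. It assumes timestamps are epoch milliseconds and that buckets are aligned to multiples of the bin width. The avg_by_bin name is hypothetical.

```python
from collections import defaultdict

BIN_MS = 5 * 60 * 1000  # 5 minutes, in milliseconds


def avg_by_bin(events, bin_ms=BIN_MS):
    """Average duration per time bucket for (timestamp_ms, duration) pairs."""
    buckets = defaultdict(list)
    for ts_ms, duration in events:
        bucket_start = ts_ms - (ts_ms % bin_ms)  # floor to the start of the bucket
        buckets[bucket_start].append(duration)
    # One row per bucket, ordered by time: a small time series.
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

Every event is floored to the start of its bucket, then the aggregate is computed per bucket. Shrink the bin and you get more, noisier points; grow it and the series smooths out.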
The Quirks and Rough Edges
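Now, the part the official docs politely ignore. First, the language is case-sensitive. fields works, FIELDS will give you a cryptic syntax error. The same care applies to field names and string comparisons: level = "error" will not match an event whose level is "ERROR". It’s a small thing that will bite you at 3 AM.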
Second, the parsing isn’t psychic. If your application logs plain text like [ERROR] Failed to connect to database, you’ll be stuck using regex patterns in your filter clauses forever. The parse command can help rescue you from this hell. It lets you use a glob pattern to rip fields out of unstructured messages.
fields @message
| parse @message "[*] *" as logLevel, errorMessage
| filter logLevel = "ERROR"
| display errorMessage
It’s a lifeline, but it’s clunky. Just log JSON. I’m not kidding.
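If the glob syntax feels opaque, this Python sketch mimics what a pattern like "[*] *" does: each * captures a chunk of text and the captures are assigned to the names in order. It is an approximation for building intuition, not the service’s actual matcher, and glob_parse is a hypothetical helper.

```python
import re


def glob_parse(message, pattern, names):
    """Extract one named field per * in a glob-style pattern, or None if no match."""
    literals = pattern.split("*")
    if len(literals) - 1 != len(names):
        raise ValueError("need exactly one name per * in the pattern")
    # Literal text is escaped; every * becomes a capture group.
    regex = "(.*?)".join(re.escape(lit) for lit in literals)
    match = re.fullmatch(regex, message, flags=re.DOTALL)
    return dict(zip(names, match.groups())) if match else None
```

Called with the message "[ERROR] Failed to connect to database" and the pattern "[*] *", it yields a logLevel of ERROR and the rest of the line as errorMessage. A message that doesn’t fit the pattern yields nothing.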
Third, and this is a big one, the query time range is your secret weapon. Logs Insights only scans log events inside the time range you give it: the time picker in the console, or the start and end times if you call the API. Need a faster query? Make the time range smaller. You pay for the volume of data scanned, and a bigger scan also means a longer wait, so be surgical. Don’t run a query over 30 days if you’re just debugging yesterday’s deploy.
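Outside the console, the same discipline applies when you start queries from code. Here is a sketch using the boto3 CloudWatch Logs client (start_query and get_query_results). The helper names, the log group, and the one-hour lookback are assumptions for illustration. boto3 is imported lazily so the parameter builder can be exercised without AWS credentials.

```python
import time
from datetime import datetime, timedelta, timezone


def build_query_params(log_group, query, end, lookback):
    """Build start_query arguments for a deliberately narrow time window."""
    start = end - lookback
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
    }


def run_query(params, poll_seconds=1.0):
    """Start the query and poll until it leaves the Scheduled/Running states."""
    import boto3  # lazy import: only needed when actually calling AWS

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)
```

A call such as build_query_params("/aws/lambda/my-function", 'filter level = "ERROR" | stats count() by @logStream', datetime.now(timezone.utc), timedelta(hours=1)) keeps the scan to the last hour. The response from get_query_results also carries a statistics section, including how much data was scanned, which is a handy sanity check on how surgical you were.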
Finally, remember this is a query language, not a database. There is no JOIN. A query can run against a single log group or a set of them, but with a set you simply get the matching events from every group in one result; there is no join operation to pair an event from one group with an event from another. If you need that kind of correlation across sources, you’ll be exporting to S3 and using something like Athena. It’s the classic AWS trade-off: incredibly fast and integrated for a specific scope, but a walled garden.
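For small result sets there is a cheaper stopgap than a full export: run two queries and stitch the rows together yourself. This is a different technique from the S3 and Athena route described above, and it only makes sense when both results fit comfortably in memory. The sketch assumes rows in the shape returned by get_query_results (a list of field/value cells per row) and a shared requestId field. Both helper names are hypothetical.

```python
def rows_to_dicts(results):
    """Flatten rows of {'field': ..., 'value': ...} cells into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dicts on a shared key, e.g. requestId."""
    index = {}
    for row in right_rows:
        if key in row:
            index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row.get(key), []):
            # On conflicting field names, the left row's value wins.
            joined.append({**match, **row})
    return joined
```

Keep a tight limit and time range on both queries, or this turns into the slow, expensive scan you were trying to avoid.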
Use it for what it’s brilliant at: rapid, ad-hoc investigation and building those critical dashboards that tell you the why behind your metrics. Stop guessing. Start querying.