36.8 CloudTrail Lake: Querying CloudTrail Events with SQL

Right, so you’ve got your CloudTrail logs flowing into a Lake. Congratulations, you’ve successfully moved your digital haystack from one barn (S3) to a slightly more organized barn (Lake). But now what? You’re staring at petabytes of JSON blobs thinking, “There has to be a a better way to find this one specific API call than grep.” There is. It’s called SQL, and CloudTrail Lake’s query feature is your new best friend. It lets you interrogate that mountain of audit data without having to load it into another service or, heaven forbid, download it. Let’s cut through the marketing fluff and get to how it actually works.

The Anatomy of a Lake Query

First, understand what you’re working with. A CloudTrail Lake event is still just a JSON object, but the query engine flattens it into a SQL table for you. The magic is in how it structures that table. The main event data lives in a pseudo-table called eventData, and it’s got all the usual suspects: eventName, eventSource, userIdentity.arn, requestParameters, responseElements.

Here’s the kicker: those requestParameters and responseElements fields are the wild west. They are dynamic JSON structures that change based on the API call. Querying them requires a bit of finesse. You don’t just select them; you use JSON functions to extract the specific key you care about, like requestParameters.bucketName for a GetObject call.

A basic query to find who deleted a specific S3 bucket looks like this:

SELECT
    eventTime,
    userIdentity.arn,
    eventName,
    json_extract_scalar(requestParameters, '$.bucketName') as bucketName
FROM eventData
WHERE
    eventSource = 's3.amazonaws.com'
    AND eventName = 'DeleteBucket'
    AND json_extract_scalar(requestParameters, '$.bucketName') = 'my-precious-data-bucket'

Notice the json_extract_scalar function? That’s your key to unlocking those nested JSON fields. The $ denotes the root of the JSON object, and then you use dot notation to traverse it.

Why JSON_EXTRACT and Not Just Dots?

You might be screaming at your screen, “Why can’t I just write requestParameters.bucketName?!” I feel your pain. The reason is annoyingly logical: the schema of requestParameters isn’t fixed. For a RunInstances call, it’s filled with imageId and instanceType; for a CreateFunction, it’s runtime and handler. The engine can’t pre-define a column for every possible parameter across all of AWS’s thousands of APIs. So, it dumps the whole mess into a JSON string and makes you use JSON functions to pick out the bits you need. It’s a little more typing, but it’s infinitely more flexible.

Taming the Wild West with Common Fields

Don’t get so lost in the JSON weeds that you forget about the powerful, consistent fields you can filter on easily. These are your first and most important lines of defense:

SELECT
    eventTime,
    eventSource,
    eventName,
    userIdentity.type,
    userIdentity.arn,
    errorCode,
    errorMessage
FROM eventData
WHERE
    eventTime >= '2023-10-01 00:00:00'
    AND eventTime < '2023-10-02 00:00:00'
    AND eventSource = 'iam.amazonaws.com'
    AND errorCode IS NOT NULL

This query is gold. It finds all IAM API failures from a specific day. Maybe someone tried to do something they didn’t have permission for? This is how you find out. Filtering on eventTime, eventSource, eventName, and errorCode is fast and efficient. Always start your queries with these filters to narrow down the dataset before you use a JSON function, which is more computationally expensive.

The Gotchas: Performance and Pricing

Let’s be direct: this is not a free tool, and it’s not a Spark cluster. You pay per GB of data scanned by your query. This is where most people get a nasty surprise on their bill.

Pitfall #1: A SELECT * FROM eventData query for a month across all regions will scan terabytes of data and cost you hundreds of dollars. Don’t do that. Always, always include a tight WHERE clause on eventTime. Make it a habit.

Pitfall #2: Using a JSON function on the left side of a WHERE clause. This forces the engine to scan every single record to evaluate the function, murdering your performance and your wallet.

-- 😡 TERRIBILE. DON'T DO THIS.
SELECT *
FROM eventData
WHERE json_extract_scalar(requestParameters, '$.bucketName') = 'my-bucket'

-- 😊 MUCH BETTER. Filter on indexed fields first.
SELECT *
FROM eventData
WHERE
    eventSource = 's3.amazonaws.com'
    AND eventName = 'DeleteBucket'
    AND json_extract_scalar(requestParameters, '$.bucketName') = 'my-bucket'

The second query uses the highly optimized eventSource and eventName fields to narrow down the dataset to just S3 DeleteBucket calls before it even thinks about applying the JSON function. We’re talking about a query that might scan megabytes instead of gigabytes.

Beyond the Basics: Joins and CTEs

Yes, you can even do joins. A powerful pattern is joining your CloudTrail data with AWS Organizations data to see events by OU or account name.

WITH suspicious_events AS (
    SELECT *
    FROM eventData
    WHERE eventName = 'AssumeRole'
    AND eventTime > '2023-10-01 00:00:00'
)
SELECT
    s.eventTime,
    s.userIdentity.arn,
    s.requestParameters.roleArn,
    o.AccountName
FROM suspicious_events s
INNER JOIN my_org_data o ON o.AccountId = s.recipientAccountId

This assumes you’ve ingested your Org data into Lake, which is a whole other topic, but you get the idea. The power is immense. You’re not just querying logs; you’re building a narrative of what’s happening across your entire estate. Just remember to mind your filters, or you’ll also be building a narrative for your finance department on why your cloud bill spiked.