34.8 Amazon Macie: Discovering and Protecting Sensitive Data in S3
Right, let’s talk about Macie. You’ve probably got a ton of data in S3. So do I. And if you’re anything like me, you’ve occasionally dumped a file into a bucket and thought, “I’ll deal with the permissions later,” only to develop a form of data amnesia so profound you’d forget your own password. Macie is the expensive, slightly judgy friend that shows up and tells you that your “later” has arrived and it’s not pretty.
In a nutshell, Amazon Macie is a data security service that uses machine learning and pattern matching to automatically discover and classify sensitive data in S3, and to flag buckets whose access settings put that data at risk. It’s looking for Personally Identifiable Information (PII), financial data, credentials, you name it. It’s the thing that notices when you’ve accidentally made a bucket called company-payroll-stuff public because you were tired and clicked the wrong button.
How Macie Actually Works: It’s Not Magic, It’s Regex (Mostly)
Don’t let the ML marketing fool you into thinking it’s pure AI sorcery. A lot of the heavy lifting is done by good old-fashioned regular expressions and keyword matching, with machine learning helping on the more nuanced detections. Macie does its work in two layers:
- Bucket inventory and monitoring: This is your scout. Once enabled, Macie builds and maintains an inventory of your S3 buckets and evaluates their security and access settings (public access, encryption, sharing outside your organization). It looks at bucket metadata here, not file contents. This is your first and most important step. You can’t protect what you don’t know you have.
- Sensitive data discovery jobs: These are the specialists. Called classification jobs in the API and CLI, they open the actual objects in the buckets you select (all of them, or a subset you specify) and inspect the contents for sensitive data. This is where it opens your PDFs, Word docs, and CSVs to check inside.
Here’s the kicker: you don’t just turn it on and walk away. You have to tell it what to look for. Thankfully, AWS provides a set of managed data identifiers for common things like US social security numbers, credit card numbers, and AWS credentials. You can also create your own custom data identifiers for things unique to your business, like your employee ID format or a project codename that should never leave the internal network.
Let’s say you have a custom patient ID that follows the pattern PAT-1234-ABCD. You could define a custom identifier for it.
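# Example using the AWS CLI to create a custom data identifier.
# --maximum-match-distance only takes effect together with --keywords, and it is
# measured from the end of a keyword to the END of the match, so it has to be
# larger than the 13-character ID itself. The keywords here are illustrative.
aws macie2 create-custom-data-identifier \
    --name "Patient-ID-Pattern" \
    --regex "PAT-[0-9]{4}-[A-Z]{4}" \
    --keywords "patient" "patient id" \
    --description "Detects our custom patient identifier format" \
    --maximum-match-distance 30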
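Before you register a pattern, it's worth sanity-checking the regex locally. Here's a minimal sketch using Python's re module against made-up sample strings. Macie supports only a subset of PCRE syntax, so treat a local pass as a first filter, not a guarantee that Macie will behave identically. The function name and samples are mine, not anything Macie provides.

```python
import re
from typing import List

# Same pattern as the custom data identifier above
PATIENT_ID = re.compile(r"PAT-[0-9]{4}-[A-Z]{4}")


def find_patient_ids(text: str) -> List[str]:
    """Return every substring of `text` that matches the patient ID pattern."""
    return PATIENT_ID.findall(text)


# Hypothetical sample lines: some should match, some should not
samples = [
    "Patient ID: PAT-1234-ABCD admitted on Tuesday",  # genuine ID
    "PAT-12-ABCD",                                    # too few digits
    "pat-1234-abcd",                                  # wrong case
    "XPAT-1234-ABCDE",                                # ID-shaped text inside a longer token
]

for line in samples:
    print(repr(line), "->", find_patient_ids(line))
```

Note that the last sample still matches, because the pattern has no boundaries around it. That's exactly the kind of false positive you want to catch on your laptop rather than in a bill.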
The Brutal Truth About Cost and Performance
Let’s have the direct talk AWS’s pricing page won’t. Macie is expensive. Sensitive data discovery is billed by the amount of data it inspects, per GB, on top of a smaller monthly charge for each bucket it monitors. A discovery job that scans 100TB of data will cost you a small fortune. And no, you can’t just point it at your petabyte-scale data lake and hit “go” without getting a call from finance.
The best practice, and I cannot stress this enough, is to be surgical. Don’t run a full discovery on every bucket every week. Start with a targeted scope.
- Identify critical buckets: Use AWS Config, IAM Access Analyzer, or even your own gut feeling to find buckets that might contain sensitive data. Start with your finance, hr, and logs buckets.
- Run a one-time discovery job on just those buckets to get a baseline.
- Schedule recurring jobs only on buckets that are frequently updated or are high-risk. For static, archived data, a one-time scan is often enough.
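To make "know what you're about to scan" concrete, here's a rough back-of-the-envelope estimator. It's a sketch under loud assumptions: the per-GB rate is a parameter you fill in from the current pricing page for your region (the number in the example is a placeholder, not a quoted price), it ignores pricing tiers and per-bucket charges, and it treats a job's sampling percentage as if it scaled bytes linearly, even though sampling picks objects rather than bytes.

```python
def estimate_scan_cost(total_gb: float, price_per_gb: float,
                       sampling_percentage: int = 100) -> float:
    """Rough cost of inspecting `total_gb` of S3 data at a flat per-GB rate.

    `price_per_gb` is whatever rate you look up yourself; nothing here is
    an official AWS price. `sampling_percentage` is assumed to reduce the
    inspected volume proportionally, which is only an approximation.
    """
    if total_gb < 0 or price_per_gb < 0:
        raise ValueError("total_gb and price_per_gb must be non-negative")
    if not 1 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 1 and 100")
    return total_gb * sampling_percentage / 100 * price_per_gb


# Placeholder rate for illustration only; look up the real one before trusting this
PLACEHOLDER_RATE = 1.00

full_scan = estimate_scan_cost(100_000, PLACEHOLDER_RATE)         # ~100 TB, everything
sampled_scan = estimate_scan_cost(100_000, PLACEHOLDER_RATE, 10)  # same data, 10% sample
print(f"full: {full_scan:,.2f}  sampled: {sampled_scan:,.2f}")
```

Even with a made-up rate, running the numbers for a full scan next to a sampled or narrowly scoped one is usually enough to end the "let's just scan everything" conversation.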
Here’s how you’d create a targeted discovery job using the CLI, because the console will try to tempt you into scanning everything:
# Create a one-time discovery job for two specific buckets
aws macie2 create-classification-job \
--job-type ONE_TIME \
--name "Targeted-HR-Scan" \
--s3-job-definition '{"bucketDefinitions": [{"accountId": "123456789012", "buckets": ["hr-applications", "payroll-export-backups"]}]}'
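Hand-writing that JSON blob gets error-prone once the bucket list grows. A small helper like the one below (my own sketch, not part of any AWS SDK) can build the same structure from a list of bucket names. The output should be usable as the value for --s3-job-definition, or as the s3JobDefinition argument if you create the job through boto3 instead.

```python
import json
from typing import Dict, List


def build_s3_job_definition(account_id: str, buckets: List[str]) -> Dict:
    """Build a job definition that scopes a scan to an explicit list of buckets."""
    if not buckets:
        # An empty list would defeat the whole point of a targeted scan
        raise ValueError("at least one bucket name is required")
    return {
        "bucketDefinitions": [
            {
                "accountId": account_id,
                # De-duplicate and sort so the output is stable across runs
                "buckets": sorted(set(buckets)),
            }
        ]
    }


definition = build_s3_job_definition(
    "123456789012",
    ["hr-applications", "payroll-export-backups"],
)

# Paste the printed JSON into --s3-job-definition '<json>'
print(json.dumps(definition))
```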
The “So What?” Factor: Findings and Automation
Okay, great. Macie found 10,000 files containing PII. Now what? This is where Macie stops being just a fancy search tool and starts earning its keep. Its findings can be published to AWS Security Hub, and each one is also emitted as an Amazon EventBridge event.
This is your golden ticket to automation. You can set up rules that automatically remediate issues. For example, if Macie finds a bucket containing PII that is inexplicably public, you can have an EventBridge rule trigger a Lambda function to lock it down immediately.
# A simplistic Lambda function triggered by an EventBridge rule for a Macie finding
import boto3

s3 = boto3.client('s3')


def lambda_handler(event, context):
    # For Macie events, the finding itself is the EventBridge event's 'detail'
    finding = event['detail']

    # Check if the finding type is about a publicly accessible bucket
    if finding.get('type') == 'Policy:IAMUser/S3BucketPublic':
        # The affected bucket is described under resourcesAffected.s3Bucket
        bucket_name = finding['resourcesAffected']['s3Bucket']['name']

        # Apply a block public access policy to the bucket
        s3.put_public_access_block(
            Bucket=bucket_name,
            PublicAccessBlockConfiguration={
                'BlockPublicAcls': True,
                'IgnorePublicAcls': True,
                'BlockPublicPolicy': True,
                'RestrictPublicBuckets': True
            }
        )
        print(f"Locked down public bucket: {bucket_name}")

    return {'statusCode': 200}
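The Lambda is only half of the wiring; the EventBridge rule needs an event pattern that picks out the findings you care about. Here's a sketch that builds one in Python. It assumes Macie events arrive with the source aws.macie and the detail-type "Macie Finding", which matches my understanding of how the service publishes them, but do check it against a real event from your account before relying on it. The helper name is hypothetical.

```python
import json
from typing import Dict, Iterable


def macie_event_pattern(finding_types: Iterable[str]) -> Dict:
    """Build an EventBridge event pattern that matches the given Macie finding types."""
    return {
        "source": ["aws.macie"],           # assumed event source for Macie
        "detail-type": ["Macie Finding"],  # assumed detail-type for findings
        "detail": {
            # Only forward the finding types the Lambda knows how to handle
            "type": list(finding_types),
        },
    }


pattern = macie_event_pattern(["Policy:IAMUser/S3BucketPublic"])

# This JSON is what you would supply as the rule's event pattern
print(json.dumps(pattern, indent=2))
```

Filtering on the finding type in the rule, rather than only inside the function, means the Lambda isn't invoked (and billed) for every one of those 10,000 PII findings.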
The Rough Edges and Questionable Choices
Macie is powerful, but it has its quirks. The console interface for browsing findings can be painfully slow when you have thousands of them. The temptation to over-tune custom identifiers is real: you’ll spend hours tweaking regex only to get a million false positives. And the cost model feels punitive if you’re not extremely careful. The designers also made the classic AWS move of creating a service (the original Macie, later renamed Macie Classic) and then effectively replacing it with a new, slightly different one (the current Macie, which is macie2 in the API and CLI), so watch out for outdated blog posts.
The bottom line? Use Macie, but use it wisely. Start small, automate your responses, and always, always know what you’re about to scan before you hit that button. It’s the closest thing you’ll get to a conscience for your S3 storage, and like a good conscience, it’s occasionally annoying but usually right.