37.7 CloudFormation Drift Detection

Right, so you’ve deployed your beautiful, pristine stack. It’s a perfect snowflake of infrastructure, exactly as your template intended. You high-fived your team, closed the ticket, and moved on. A week later, someone logs into the console—shudder—and fat-fingers a change on a Security Group, maybe to “just quickly test something.” A month after that, an automated script updates an AMI on an EC2 instance. Your infrastructure is now a liar. It claims to be one thing in your version-controlled template, but in reality, it’s something else. This, my friend, is drift. And it’s the silent killer of your “Infrastructure as Code” religion.

CloudFormation Drift Detection is the feature that acts as your infrastructure’s conscience. It’s the process of asking AWS, “Hey, for this stack I gave you, can you go out and actually check every resource to see if its live configuration still matches what I said it should be in the template?” The answer is often depressing, but always enlightening.

How to Trigger a Drift Detection

You can do this via the AWS CLI, which is my preferred method because it’s scriptable and doesn’t require clicking through a UI that was clearly designed by someone who has never actually debugged a 50-resource drift failure.

aws cloudformation detect-stack-drift --stack-name my-precious-stack

This command doesn’t give you the results; it returns a StackDriftDetectionId and starts an asynchronous detection process. It’s like asking a particularly pedantic accountant to audit your books. They’ll get back to you. To check on that accountant, pass the ID back in:

aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <your-detection-id>
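
If you would rather script the start-then-wait dance than paste detection IDs around, here is a minimal sketch in Python. It assumes `client` is a boto3 CloudFormation client (`boto3.client("cloudformation")`); the function name, the polling interval and the retry cap are illustrative choices of mine, not anything AWS prescribes.

```python
import time


def wait_for_drift_detection(client, stack_name, poll_seconds=5.0,
                             max_polls=60, sleep=time.sleep):
    """Start drift detection on a stack and poll until it stops running.

    `client` is assumed to be a boto3 CloudFormation client. Returns the
    final describe_stack_drift_detection_status response.
    """
    # Kick off the asynchronous detection and keep the ID it hands back.
    detection_id = client.detect_stack_drift(
        StackName=stack_name
    )["StackDriftDetectionId"]

    for _ in range(max_polls):
        status = client.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        # DetectionStatus moves from DETECTION_IN_PROGRESS to either
        # DETECTION_COMPLETE or DETECTION_FAILED.
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            return status
        sleep(poll_seconds)

    raise TimeoutError(
        f"drift detection {detection_id} still running after {max_polls} polls"
    )

# Usage (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   result = wait_for_drift_detection(boto3.client("cloudformation"),
#                                     "my-precious-stack")
#   print(result["DetectionStatus"], result.get("StackDriftStatus"))
```

The `sleep` parameter exists only so the loop can be exercised without actually waiting; in normal use you would leave it alone.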

Once it’s complete, you can get the full, gory details:

aws cloudformation describe-stack-resource-drifts --stack-name my-precious-stack

The output will show you, resource by resource, whether it is MODIFIED, DELETED, or (the holy grail) IN_SYNC, along with the specific property differences: the property path, the expected value, the actual value, and the type of difference. It’s a JSON blob, so get comfortable with jq to parse it.
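
If jq is not your thing, the same blob is easy to flatten in a few lines of Python. This is a rough sketch that assumes the response shape returned by describe-stack-resource-drifts (a StackResourceDrifts list whose entries carry PropertyDifferences); the function name and the sample data are made up for illustration.

```python
def summarize_drifts(response):
    """Return readable lines for every resource that is not IN_SYNC."""
    lines = []
    for drift in response.get("StackResourceDrifts", []):
        status = drift["StackResourceDriftStatus"]
        if status == "IN_SYNC":
            continue  # nothing to report for resources that match
        lines.append(
            f'{drift["LogicalResourceId"]} ({drift["ResourceType"]}): {status}'
        )
        # DELETED resources typically carry no property differences.
        for diff in drift.get("PropertyDifferences", []):
            lines.append(
                f'  {diff["PropertyPath"]}: expected {diff["ExpectedValue"]!r}, '
                f'actual {diff["ActualValue"]!r} [{diff["DifferenceType"]}]'
            )
    return lines


# Hypothetical response, trimmed to the fields the function reads.
SAMPLE_RESPONSE = {
    "StackResourceDrifts": [
        {
            "LogicalResourceId": "AppBucket",
            "ResourceType": "AWS::S3::Bucket",
            "StackResourceDriftStatus": "MODIFIED",
            "PropertyDifferences": [
                {
                    "PropertyPath": "/VersioningConfiguration/Status",
                    "ExpectedValue": "Enabled",
                    "ActualValue": "Suspended",
                    "DifferenceType": "NOT_EQUAL",
                }
            ],
        },
        {
            "LogicalResourceId": "AppQueue",
            "ResourceType": "AWS::SQS::Queue",
            "StackResourceDriftStatus": "IN_SYNC",
            "PropertyDifferences": [],
        },
    ]
}

for line in summarize_drifts(SAMPLE_RESPONSE):
    print(line)
```

In a real script you would feed it the parsed CLI output (or the boto3 response) instead of the sample.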

What Actually Gets Checked (The Fine Print)

Here’s where they get you. Not all properties are drift-enabled. For a resource to support drift detection, the service team behind it (e.g., the EC2 team, the RDS team) had to do the work to map the CloudFormation properties to the actual live properties of the resource. This leads to a frankly absurd situation where some properties of a resource are checked, and others are just ignored.

For example, CloudFormation only compares property values that are explicitly set in the template or supplied through template parameters. If your AWS::EC2::Instance relies on a default for some property and someone later changes that property on the live instance, Drift Detection has nothing to compare it against and will happily report IN_SYNC. I wish I were joking. Coverage also varies by resource type, so you must always, always check the documentation for each resource type you depend on to see what is actually covered. It’s a mess, and it’s AWS’s fault. This isn’t a theoretical limitation; it’s a prioritization and implementation choice they made, and it severely hobbles the feature’s usefulness.
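
One cheap sanity check is to compare the properties you care about against what a drift result says it expected. The sketch below assumes each entry from describe-stack-resource-drifts carries an ExpectedProperties field holding a JSON string of the template-defined values; the helper name and the watch list are hypothetical. A property missing from ExpectedProperties was not part of the comparison. A property present there is not proof of coverage, so treat this as a hint, not a guarantee.

```python
import json


def unchecked_properties(drift, properties_you_care_about):
    """Return watched top-level properties absent from ExpectedProperties.

    `drift` is one entry of the StackResourceDrifts list. ExpectedProperties
    is assumed to be a JSON-encoded string of the template-defined values.
    """
    expected = json.loads(drift.get("ExpectedProperties") or "{}")
    return sorted(set(properties_you_care_about) - set(expected))


# Hypothetical drift entry for an instance whose template sets two properties.
sample_drift = {
    "LogicalResourceId": "WebServer",
    "ResourceType": "AWS::EC2::Instance",
    "ExpectedProperties": json.dumps(
        {"ImageId": "ami-0123456789abcdef0", "InstanceType": "t3.micro"}
    ),
}

watch_list = ["ImageId", "InstanceType", "Monitoring"]
print(unchecked_properties(sample_drift, watch_list))
```

Anything this prints is a property you are watching that the drift report cannot vouch for, usually because the template never set it.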

The Inevitable “Cannot Be Drift Detected” Problem

You’ll run drift detection and find some resources have a drift status of NOT_CHECKED. The usual reason? The resource type doesn’t support it. There are still plenty of resource types that haven’t been wired up. The other, more mundane reason is that the resource simply hasn’t been checked yet, for example because detection was only run against selected resources (with detect-stack-resource-drift, or detect-stack-drift with --logical-resource-ids). Either way, NOT_CHECKED means “nobody looked,” not “nothing changed,” which makes it dangerously easy to mistake for good news.
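
To see exactly which resources fell through the cracks, you can read the per-resource drift status from describe-stack-resources. A minimal sketch, assuming the response has a StackResources list whose entries include a DriftInformation block; the function name and sample data are illustrative only.

```python
def not_checked_resources(response):
    """List (logical ID, resource type) pairs whose drift status is NOT_CHECKED."""
    return [
        (res["LogicalResourceId"], res["ResourceType"])
        for res in response.get("StackResources", [])
        if res.get("DriftInformation", {}).get("StackResourceDriftStatus")
        == "NOT_CHECKED"
    ]


# Hypothetical response, trimmed to the fields the function reads.
sample_resources = {
    "StackResources": [
        {
            "LogicalResourceId": "AppBucket",
            "ResourceType": "AWS::S3::Bucket",
            "DriftInformation": {"StackResourceDriftStatus": "IN_SYNC"},
        },
        {
            "LogicalResourceId": "CustomThing",
            "ResourceType": "Custom::Thing",
            "DriftInformation": {"StackResourceDriftStatus": "NOT_CHECKED"},
        },
    ]
}

print(not_checked_resources(sample_resources))
```

For big stacks you would probably page through list-stack-resources instead, which reports the same kind of drift information per resource.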

Best Practices and Pitfalls

Use It Religiously, But Don’t Trust It Blindly: Make drift detection part of your regular audit process, perhaps in a weekly or monthly CI/CD job that fails and alerts you if any drift is detected. But remember the caveats: because of the spotty property coverage, a clean bill of health from drift detection does not mean your stack is perfectly in sync.
There’s No “Revert” Button: This is the biggest letdown. Drift Detection is purely informational. It tells you something is wrong but does absolutely nothing to fix it. You now have a triage problem: Do you update your template to match reality (if the change was intentional and good)? Or do you manually revert the change in the console or CLI? Don’t expect a stack update with the same template to do the reverting for you: CloudFormation compares the new template against the previous one, not against the live resource, so it sees nothing to change. There’s no easy answer.
Drift is a People Problem: The technology is just exposing a process failure. If drift keeps happening, your team doesn’t have a shared understanding of the sanctity of the template. The console is a siren song of quick fixes. The solution is cultural: enforce permissions (using IAM to deny permissions to modify CloudFormation-managed resources outside of CFN itself) and double down on the principle that the template is the single source of truth.
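
For the scheduled CI/CD job mentioned above, the gate itself can be tiny. Here is a sketch that maps a describe-stack-drift-detection-status response to an exit code; the function name and the choice of exit codes are my own assumptions, so adapt them to whatever your pipeline expects.

```python
def drift_exit_code(detection_status):
    """Map a drift detection status response to a CI exit code.

    0 = detection completed and the stack reports IN_SYNC
    1 = detection completed and the stack reports DRIFTED
    2 = detection did not complete, or the result is anything else
    """
    if detection_status.get("DetectionStatus") != "DETECTION_COMPLETE":
        return 2
    stack_status = detection_status.get("StackDriftStatus")
    if stack_status == "IN_SYNC":
        return 0
    if stack_status == "DRIFTED":
        return 1
    return 2  # UNKNOWN, NOT_CHECKED, or a missing field


# Hypothetical responses, trimmed to the fields the function reads.
print(drift_exit_code({"DetectionStatus": "DETECTION_COMPLETE",
                       "StackDriftStatus": "DRIFTED"}))
print(drift_exit_code({"DetectionStatus": "DETECTION_FAILED"}))

# In a pipeline you would finish with something like:
#   import sys
#   sys.exit(drift_exit_code(status))
```

Treating anything other than a clean IN_SYNC as a failure is deliberate: given the coverage gaps, a job that quietly passes on “unknown” would defeat the point. And remember that even exit code 0 only means the checked properties match.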

In the end, Drift Detection is a flawed but essential tool. It’s like a smoke alarm that sometimes fails to detect electrical fires but is great for burnt toast. You’d be a fool not to have one, but you’d be a bigger fool to assume it makes you completely safe. Use it, curse its limitations, and let it guide you back to the one true path: the template.