43.7 Sustainability: Understanding Impact, Establishing Goals, Maximizing Utilization

Alright, let’s talk sustainability. You’ve probably heard it called “green IT” and pictured someone hugging a tree while their CI/CD pipeline deploys a carbon-spewing monolith. It’s more nuanced than that. In the AWS context, sustainability is about squeezing every last drop of useful work out of the energy your systems consume. It’s not just good for the planet; it’s a fantastic proxy for cost efficiency and performance. Waste less energy, pay less money. It’s a beautiful, beautiful alignment of incentives.

The core idea is to understand that your code doesn’t run on magic; it runs on physical servers in physical data centers that consume real watts and joules. Our job is to make sure those joules aren’t being spent on heating up a room in Northern Virginia for no reason.

Understanding Your Impact: It Starts with Measurement

You can’t improve what you don’t measure. AWS provides the big, blunt instrument: the Customer Carbon Footprint Tool. It reports your estimated carbon emissions month by month, in metric tons of carbon dioxide equivalent (MTCO₂e). It’s a great high-level view, but it’s retrospective. To get tactical, you need to get granular.

This is where your old friends, Amazon CloudWatch and the AWS Cost and Usage Report (CUR), become your sustainability dashboard. The key is to stop thinking just about cost ($) and start thinking about resource utilization (%). A CPU running at 5% utilization is a crime against efficiency; it’s burning energy mostly to idle.

Let’s say you want to find your most inefficient EC2 instances. You can use the AWS CLI to pull CPU utilization metrics from CloudWatch. Start with the average, as in the command that follows, but don’t stop there: swap --statistics Average for --extended-statistics p90 (or p99; the API accepts one option or the other, not both) to see if you have instances that spike to 100% for a minute and then sit idle for an hour, a classic sign of terrible utilization.

# Get average CPU utilization for a specific instance over 24 hours
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2023-10-27T00:00:00Z \
  --end-time 2023-10-28T00:00:00Z \
  --period 3600 \
  --statistics Average \
  --output json

The output will be a sobering list of numbers. Anything consistently below 10% is a prime candidate for termination or right-sizing.
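
The same check can be scripted across a fleet. Here is a minimal sketch in Python with boto3 that pulls both the hourly average and the p90 of CPUUtilization through get_metric_data and sorts an instance into a rough bucket. The function names, the bucket labels, and the 80% spike threshold are assumptions for illustration; the 10% idle threshold simply reuses the rule of thumb above.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def classify_utilization(avg_values, p90_values, idle_pct=10.0, spike_pct=80.0):
    """Bucket an instance by its hourly CPU numbers (thresholds are assumptions)."""
    if not avg_values or not p90_values:
        return "no-data"
    overall_avg = mean(avg_values)
    peak_p90 = max(p90_values)
    if peak_p90 < idle_pct:
        return "idle"  # never busy: candidate for termination or right-sizing
    if overall_avg < idle_pct:
        # Low on average; the peak decides whether it bursts or just idles along.
        return "spiky" if peak_p90 >= spike_pct else "underused"
    return "ok"


def fetch_cpu_stats(instance_id, hours=24, period=3600):
    """Return (avg_values, p90_values) for one instance. Needs AWS credentials."""
    import boto3  # imported here so classify_utilization works without boto3

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    metric = {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
    }
    # One query per statistic, both over the same metric and period.
    queries = [
        {"Id": stat_id, "MetricStat": {"Metric": metric, "Period": period, "Stat": stat}}
        for stat_id, stat in (("avg", "Average"), ("p90", "p90"))
    ]
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=queries, StartTime=start, EndTime=end
    )
    values = {r["Id"]: r["Values"] for r in response["MetricDataResults"]}
    return values.get("avg", []), values.get("p90", [])
```

Calling classify_utilization(*fetch_cpu_stats("i-1234567890abcdef0")) would label the example instance from the CLI command. Keeping the classification separate from the API call means you can tune the thresholds against saved data without touching AWS.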

Establishing Goals: Beyond “Be Greener”

“Let’s be more sustainable” is a wish, not a goal. You need a specific, measurable, and arguably obnoxious key performance indicator (KPI). Here are a few good ones:

Average CPU Utilization: Aim for a minimum of 40% across your non-production fleet and 60-70% for production. This forces you to think about density.
Workload Carbon Intensity: Grams of CO₂e per transaction/request. This is the gold standard. It measures efficiency of your actual business output.
Percentage of workloads on ARM (Graviton): Graviton processors use an Arm-based architecture that, for many workloads, does more work per watt than comparable x86 instances. Migrating x86 workloads to Graviton is often the single biggest lever you can pull for a performance boost and energy reduction at little extra cost. I’m not kidding; the performance-per-watt gains can be absurd. If you’re not even trying Graviton, you’re leaving a lot of money and efficiency on the table.

Set a goal like: “Increase our fleet’s average CPU utilization by 20 percentage points and migrate 50% of eligible workloads to Graviton by end of year.” Now that’s a goal.
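
To make KPIs like these trackable, it helps to pin down the arithmetic. The following is a minimal Python sketch, not a standard formula from AWS. The function names and every input number are hypothetical, and it assumes you have already converted your emissions estimate to kilograms of CO₂e and can count requests for the same period.

```python
def carbon_intensity_g_per_request(emissions_kg_co2e, request_count):
    """Grams of CO2e per request for one reporting period."""
    if request_count <= 0:
        raise ValueError("request_count must be positive")
    return emissions_kg_co2e * 1000 / request_count


def graviton_share_pct(graviton_workloads, eligible_workloads):
    """Percentage of eligible workloads already running on Graviton."""
    if eligible_workloads <= 0:
        raise ValueError("eligible_workloads must be positive")
    return 100 * graviton_workloads / eligible_workloads


if __name__ == "__main__":
    # Hypothetical month: 1,500 kg CO2e and 60 million requests.
    print(carbon_intensity_g_per_request(1500, 60_000_000))  # 0.025

    # Hypothetical fleet: 18 of 40 eligible workloads migrated; target is 50%.
    share = graviton_share_pct(18, 40)
    print(share, share >= 50)  # 45.0 False
```

Tracking the intensity number per period, rather than total emissions, keeps the KPI honest when traffic grows: total emissions can rise while grams per request fall.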

Maximizing Utilization: The Art of Not Wasting Joules

This is where the rubber meets the road. You’ve measured your waste and set a goal to reduce it. Now, how?

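1. Right-Sizing: This is the lowest-hanging fruit. That c5.4xlarge you fired up for a test two years ago and forgot about? It’s still running. AWS Compute Optimizer gives you right-sizing recommendations. Use it. But also think dynamically: a fixed instance size is only right for one load level, so where demand varies, let capacity scale with it instead of provisioning for the peak.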

2. Embracing Serverless: Let’s be clear: Lambda and Fargate aren’t inherently “green”; they’re hyper-efficient because AWS operates them at a utilization rate you could never achieve in your own account. They’re the ultimate expression of maximizing utilization. When your function isn’t running, you aren’t holding idle capacity in reserve for it. You’re sharing the underlying hardware with everyone else, which is vastly more efficient than everyone running their own half-empty servers.

3. Shutting Up Shop: Non-production environments don’t need to run 24/7. A development environment sitting idle over a weekend consumes enough energy to power a small town’s… well, probably not, but it’s a lot. Use Instance Scheduler on AWS or simple Lambda functions triggered by Amazon EventBridge to turn things off.

Here’s a simple Lambda function (Python) that stops all running EC2 instances tagged Environment: Dev. Trigger it from an EventBridge schedule at 7 PM daily.

import boto3

# Create the client once so warm invocations can reuse it.
ec2 = boto3.client('ec2')


def lambda_handler(event, context):
    # Find all running instances with the tag 'Environment' = 'Dev'.
    # describe_instances is paginated, so walk every page of results.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'instance-state-name', 'Values': ['running']},
            {'Name': 'tag:Environment', 'Values': ['Dev']}
        ]
    )

    instance_ids = [
        instance['InstanceId']
        for page in pages
        for reservation in page['Reservations']
        for instance in reservation['Instances']
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped instances: {', '.join(instance_ids)}")
    else:
        print("No running Dev instances found.")

    return {'stopped': instance_ids}
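
The function above only does the stopping; the “7 PM daily” part comes from an EventBridge rule. A hedged sketch of that wiring with boto3 follows. The rule name, statement ID, and target ID are placeholders, and function_arn is whatever ARN your deployed function has. Note that EventBridge rule cron expressions are evaluated in UTC, so 19 here means 7 PM UTC unless you shift it for your time zone.

```python
def daily_cron(hour_utc, minute=0):
    """Build an EventBridge cron expression that fires once a day (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} * * ? *)"


def schedule_daily_stop(function_arn, rule_name="stop-dev-instances", hour_utc=19):
    """Create the rule, allow it to invoke the Lambda, and attach the target."""
    import boto3  # imported here so daily_cron works without boto3

    events = boto3.client("events")
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=daily_cron(hour_utc),
        State="ENABLED",
    )["RuleArn"]

    # EventBridge needs explicit permission to invoke the function.
    boto3.client("lambda").add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",  # placeholder statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "stop-dev", "Arn": function_arn}],
    )
    return rule_arn
```

Running schedule_daily_stop a second time with the same rule name will fail at add_permission because the statement ID already exists, so treat this as one-time setup or manage the rule through your infrastructure-as-code tool of choice. A matching morning rule that calls start_instances is the natural counterpart.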

The Rough Edge: The biggest pitfall is cultural. Engineers are often rewarded for shipping features quickly, not for spending a day decommissioning old resources or right-sizing. You have to make sustainability a part of your definition of “done.” A new feature isn’t complete until its infrastructure is efficient. It’s that simple.