43.4 Reliability: Foundations, Workload Architecture, Change Management, Failure Management

Right, let’s talk about keeping your stuff running. Not just “it didn’t crash” running, but “it actually does what you told users it would do” running. That’s Reliability. It is one of the Framework’s pillars in its own right, and the Framework breaks it down into four sensible, if slightly dry-sounding, areas. Let’s breathe some life into them.

Foundations

Before you even think about your fancy application code, you need to build on stable ground. This is the unsexy, absolutely critical plumbing of your AWS existence. In the Framework’s own terms that means network topology and service quotas; I’d put IAM in the same bucket, because a broken permission takes you down just as thoroughly as a broken route. Get these wrong, and your beautifully architected microservice is just a very expensive, very confused brick.

Think of your network setup (VPC, subnets, route tables) as the foundation of a house. You don’t want to pour the concrete after you’ve built the walls. Use infrastructure-as-code for this. Always. Here’s a bare-minimum Terraform example that creates a VPC with one public and one private subnet. This is your starting point, not the finished product: there is no internet gateway, NAT, or route table in it yet.

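# The VPC itself, with DNS resolution and DNS hostnames switched on.
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name = "Production-Vault"
  }
}

# Public in name only until an internet gateway and a default route exist.
resource "aws_subnet" "public" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a" # hardcoded AZ name

  tags = {
    Name = "Public-Traffic"
  }
}

# Private subnet for workers that should not be reachable from the internet.
resource "aws_subnet" "private" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.100.0/24"
  availability_zone = "us-east-1a" # hardcoded AZ name

  tags = {
    Name = "Private-Workers"
  }
}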

The pitfall here? Hardcoding AZ names like us-east-1a. AZ names are mapped to physical zones per account, so your 1a might be my 1b; only AZ IDs such as use1-az1 mean the same thing everywhere. Use a data source to look up the available AZs, then use count or for_each with CIDR math to spread subnets across several of them. Two subnets in one AZ, as in the example, give you no protection when that AZ has a bad day.

IAM is the other half. The principle of least privilege isn’t a suggestion; it’s the only way to avoid a future headline with your name in it. Your EC2 instance does not need full AdministratorAccess. I promise.
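Going back to the subnet layout for a moment: the “CIDR math” part is easy to get wrong by hand, so here is a minimal Python sketch of the idea using only the standard library. The function name `plan_subnets`, the `private_offset` parameter, and the AZ list are assumptions made for illustration; in Terraform you would more likely get the AZ names from the `aws_availability_zones` data source and the ranges from the `cidrsubnet` function.

```python
import ipaddress


def plan_subnets(vpc_cidr, az_names, new_prefix=24, private_offset=100):
    """Assign one public and one private subnet CIDR to each AZ.

    Public subnets take the low blocks of the VPC range and private subnets
    start `private_offset` blocks in, mirroring the 10.0.1.0/24 versus
    10.0.100.0/24 split in the Terraform example.
    """
    blocks = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix))
    if len(az_names) >= private_offset:
        raise ValueError("public and private ranges would overlap")
    if private_offset + len(az_names) > len(blocks):
        raise ValueError("VPC range is too small for this layout")

    plan = {}
    for index, az in enumerate(az_names):
        plan[az] = {
            # index + 1 skips block 0, so the first public subnet is x.x.1.0/24
            "public": str(blocks[index + 1]),
            "private": str(blocks[private_offset + index]),
        }
    return plan


if __name__ == "__main__":
    # The AZ names are hardcoded here only to keep the sketch self-contained.
    for az, cidrs in plan_subnets("10.0.0.0/16", ["az-one", "az-two"]).items():
        print(az, cidrs)
```

The point of the design is that adding a third AZ means adding one name to a list, not inventing a new pair of CIDR blocks and hoping they don’t collide.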

Workload Architecture

This is where you design your actual application to be resilient. The key words here are scaling and decoupling. Your system must handle 10 users or 10,000 without you frantically SSHing in at 3 AM to restart Apache.

For scaling, you use services that do it for you. This isn’t 2005. Don’t run your web server on a single EC2 instance. Use an Application Load Balancer with an Auto Scaling Group. Here’s a CloudFormation snippet showing the magic bit: the scaling policy. It uses target tracking, so the group adds instances when average CPU climbs above the 60% target and removes them again when it falls back.

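# Target tracking scales out and in, so one policy covers both directions.
WebServerCpuTargetTrackingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WebServerGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      # Keep the group's average CPU utilization near 60%.
      TargetValue: 60.0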

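For decoupling, use SQS. Let’s say you have a process that handles image uploads. Instead of making the user wait for the image to be thumbnailed, watermarked, and scanned for inappropriate content, just toss a message into an SQS queue and let a background worker handle it. The user gets a response instantly. Your service stays snappy even under load. The queue acts as a shock absorber. It’s genius, and you should use it anywhere the caller doesn’t need the result right now. Two things to plan for: a standard queue can deliver the same message more than once, so make the worker idempotent, and give messages that keep failing a dead-letter queue to land in.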
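To make the image-upload example concrete, here is a minimal Python sketch of the two halves. It assumes you pass in a boto3 SQS client (for example `boto3.client("sqs")`); `send_message`, `receive_message`, and `delete_message` are real client methods, while the function names, the message shape, and the `handle_image` callback are hypothetical.

```python
import json


def enqueue_upload(sqs, queue_url, bucket, key):
    """Request path: record the work to be done and return immediately."""
    body = json.dumps({"bucket": bucket, "key": key})
    response = sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]


def drain_once(sqs, queue_url, handle_image):
    """Worker path: process one batch of messages and report how many."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling; 20 seconds is the maximum
    )
    processed = 0
    # The "Messages" key is absent when the queue had nothing to return.
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        handle_image(job["bucket"], job["key"])
        # Delete only after success. If handle_image raises, the message stays
        # in the queue and reappears after its visibility timeout.
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        processed += 1
    return processed
```

A worker would call `drain_once` in a loop. Deleting after processing, rather than before, is what turns a crashed worker into a retry instead of a lost thumbnail.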

Change Management

Changes will break your system. The goal is not to prevent change, but to make it safe. This means automation and immutable infrastructure. You don’t “patch” a server. That’s how you get configuration drift, which is a fancy term for “why does it work in staging but not production?!” You burn it down and deploy a new one from a known-good AMI or container image.

This is where CI/CD pipelines come in. Your pipeline runs tests on the new code, builds a new image, deploys it to a subset of instances, runs integration tests, and only then rolls it out completely. AWS provides the tools (CodePipeline, CodeBuild, CodeDeploy), but you have to build the process. A common pitfall is having no automated rollback. If your deployment health check fails, the system should roll back without human intervention. Humans panic. Code doesn’t.
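The control flow behind “deploy to a subset, check health, roll back automatically” is small enough to sketch in Python. This is not how CodeDeploy is implemented; it is a toy model under the assumption that you can supply three callables (`deploy`, `healthy`, `rollback`), all of which are hypothetical names.

```python
def staged_rollout(stages, deploy, healthy, rollback):
    """Deploy stage by stage and roll everything back on the first failure.

    stages   -- ordered stage names, e.g. ["canary", "half", "all"]
    deploy   -- callable(stage) that releases the new version to that stage
    healthy  -- callable(stage) returning True when health checks pass
    rollback -- callable(stage) that restores the previous version there
    """
    completed = []
    for stage in stages:
        deploy(stage)
        completed.append(stage)
        if not healthy(stage):
            # Undo the most recent stage first, then work backwards.
            for done in reversed(completed):
                rollback(done)
            return {"status": "rolled_back", "failed_stage": stage}
    return {"status": "succeeded", "failed_stage": None}


if __name__ == "__main__":
    outcome = staged_rollout(
        ["canary", "half", "all"],
        deploy=lambda stage: print("deploying to", stage),
        healthy=lambda stage: stage != "half",  # simulate a failing stage
        rollback=lambda stage: print("rolling back", stage),
    )
    print(outcome)
```

Notice that no human appears anywhere in the failure branch. That is the whole idea.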

Failure Management

Assume failure. It’s not if, it’s when. The cloud is a fleet of computers the size of a small country; things are constantly failing in ways you’ve never imagined. Your job is to be ready for it.

First, you need to know it’s happening. That means setting up alarms on everything that matters—error rates, latency, queue depth. But don’t stop at “the CPU is high.” You need to know why. This is where you move from basic CloudWatch alarms to something like Contributor Insights or simply well-structured logs.
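As one example of an alarm on something that matters, here is a hedged Python sketch that builds the arguments for a queue-depth alarm. The keys match the parameters of CloudWatch’s `put_metric_alarm` API, and `AWS/SQS` with `ApproximateNumberOfMessagesVisible` is the standard SQS backlog metric; the function name, the alarm naming scheme, and the 5-minute, 3-period timing are assumptions you should tune.

```python
def queue_depth_alarm(queue_name, threshold, sns_topic_arn):
    """Build keyword arguments for a CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": queue_name + "-backlog",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,           # seconds per evaluation period
        "EvaluationPeriods": 3,  # backlog must stay high for 3 periods in a row
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle queue that reports no data should not page anyone.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed and credentials configured, you would apply it with something like `boto3.client("cloudwatch").put_metric_alarm(**queue_depth_alarm(name, 1000, topic_arn))`. Requiring several consecutive periods is a deliberate choice: a brief spike is the queue doing its job, while a sustained backlog means the workers are losing.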

Second, you practice. You use AWS Fault Injection Service (FIS) to deliberately break your own system in a controlled way. Terminate an instance in your ASG. Does a new one come up? Does your ALB notice and stop sending it traffic? Inject latency into a dependency. Does your service time out gracefully, or does it cascade into a failure? This is called chaos engineering, and it’s the most direct way to earn real confidence in your architecture. Run these experiments in staging first, obviously, unless you have a truly masochistic streak. The rough edge here is that FIS is powerful; a poorly designed experiment can cause a real outage. Scope the targets narrowly, attach a stop condition tied to a CloudWatch alarm so the experiment halts itself when things go sideways, and test your chaos scripts carefully before you unleash them. Your goal is to learn, not to get paged at 2 AM on a Saturday.
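On the “does your service time out gracefully” question, this is roughly what the graceful version looks like in Python: bound the wait and fall back to something degraded but useful. It is a sketch using only the standard library; `call_with_timeout`, `slow_dependency`, and the fallback value are hypothetical, and a real service would usually also set timeouts on the underlying HTTP or database client.

```python
import concurrent.futures
import time

# One shared pool, created once, so a slow call never blocks on pool shutdown.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(dependency, timeout_seconds, fallback):
    """Run dependency() but stop waiting after timeout_seconds."""
    future = _pool.submit(dependency)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running; we only stop waiting for it.
        future.cancel()  # succeeds only if the call had not started yet
        return fallback
    except Exception:
        # The dependency failed outright; degrade the same way.
        return fallback


def slow_dependency():
    time.sleep(0.5)  # stands in for latency injected by an experiment
    return "fresh recommendations"


if __name__ == "__main__":
    print(call_with_timeout(slow_dependency, 0.05, "cached recommendations"))
```

The caller gets an answer within the time budget whether or not the dependency behaves, which is exactly the property a latency-injection experiment is meant to verify.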