Observability | mikePietsch.com

36.8 CloudTrail Lake: Querying CloudTrail Events with SQL

Right, so you’ve got your CloudTrail logs flowing into a Lake. Congratulations, you’ve successfully moved your digital haystack from one barn (S3) to a slightly more organized barn (Lake). But now what? You’re staring at petabytes of JSON blobs thinking, “There has to be a better way to find this one specific API call than grep.” There is. It’s called SQL, and CloudTrail Lake’s query feature is your new best friend. It lets you interrogate that mountain of audit data without having to load it into another service or, heaven forbid, download it. Let’s cut through the marketing fluff and get to how it actually works.

36.7 Trail Configuration: Management Events, Data Events, and Insights Events

Alright, let’s talk about configuring a CloudTrail trail. This is where you go from just having logs to actually having a useful logging setup. Think of it as the difference between a firehose of raw data and a precision instrument. We’re going to wire that hose to a sprinkler system, not just point it at the wall and hope for the best. The core of your trail configuration is telling CloudTrail what you want it to actually record. AWS breaks this down into three categories, and getting this wrong is the number one reason people either drown in log noise or miss the one critical event they needed. Let’s demystify them.

36.6 CloudTrail: API Call Logging for Audit and Compliance

Right, let’s talk about CloudTrail. This is the service that saves your bacon. It’s the security camera in the hallway of your AWS account, meticulously recording who came in, what door they used, and what they tried to do. API calls made by a user, role, or service get logged here: management events by default, and data events once you opt in on a trail. If you ever need to answer the questions “What happened?” or “Who did it?”, this is your first and last stop.

36.5 X-Ray Analytics: Filtering and Aggregating Traces

Right, so you’ve got X-Ray set up and your traces are flowing in. It’s a beautiful mess of data, a veritable firehose of every single thing your system is doing. Staring at the raw trace list is like trying to drink from that firehose. You’ll get water everywhere and probably hurt yourself. This is where X-Ray Analytics comes in—it’s the fancy nozzle and cup that turns that chaotic stream into something you can actually use.

36.4 X-Ray Sampling Rules: Controlling Trace Volume

Right, let’s talk about sampling. You’ve enabled X-Ray, and suddenly your trace data is… a lot. Like, “could-fund-a-small-nation’s-coffee-supply” a lot. That’s because by default, the X-Ray SDK records the first request each second and five percent of any additional requests. It’s a decent starting point, but it’s about as subtle as a sledgehammer. For high-throughput services, this default can generate a staggering, expensive, and frankly useless volume of traces. You don’t need a trace for every single health check or load balancer ping. This is where sampling rules come in: they’re your finely-tuned control panel for this firehose of data.
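
To put a rough number on that default, here is a minimal sketch of the arithmetic, assuming a single rule with a reservoir of one request per second and a five percent fixed rate on everything beyond it. The function name and the traffic figures are made up for illustration, and real deployments apply rules per service and per host, so treat the output as a ballpark.

```python
def expected_sampled_per_second(requests_per_second: float,
                                reservoir: float = 1.0,
                                fixed_rate: float = 0.05) -> float:
    """Estimate traces recorded per second under a reservoir-plus-rate rule."""
    # The reservoir is served first: up to `reservoir` requests each second.
    from_reservoir = min(requests_per_second, reservoir)
    # The fixed rate applies only to requests beyond the reservoir.
    beyond_reservoir = max(requests_per_second - reservoir, 0.0)
    return from_reservoir + beyond_reservoir * fixed_rate


# Hypothetical traffic levels, purely illustrative.
for rps in (0.5, 10, 1000):
    per_second = expected_sampled_per_second(rps)
    per_day = per_second * 86_400
    print(f"{rps:>6} req/s -> {per_second:.2f} traces/s, ~{per_day:,.0f} traces/day")
```

The takeaway is that the five percent term dominates as soon as traffic gets heavy, which is why the fixed rate is usually the first knob to tune for noisy endpoints.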

36.3 Service Maps: Visualizing Request Flow and Latency

Alright, let’s talk about visualizing the absolute chaos of your AWS architecture. You’ve got a dozen services whispering to each other across the globe, and when something goes wrong, you’re left staring at a dozen different logs in a dozen different consoles, feeling like a detective with amnesia. This is where X-Ray and CloudTrail stop being buzzwords and start being your brilliant, over-caffeinated partners in crime. Think of it this way: CloudTrail is the who, what, and when. It’s the meticulous security guard logging every single API call made by a user, role, or service in your account. “User Alice called s3:GetObject on my-stupid-bucket at 3:42 PM.” It’s essential for auditing and security, but it’s a flat list of events. It doesn’t show you the conversation between services.

36.2 X-Ray SDK: Instrumenting Lambda, EC2, ECS, and API Gateway

Alright, let’s talk about making your distributed mess… I mean, your distributed application… actually traceable. You’ve built this beautiful, decoupled thing with Lambda functions firing off events, ECS tasks chatting with DynamoDB, and API Gateway tying it all together. It’s glorious until something breaks, and then you’re left staring at CloudWatch logs like a detective without a case file, trying to correlate random timestamps. That’s where X-Ray and its SDK come in—to be your detective partner.

36.1 X-Ray: Distributed Tracing for AWS Applications

Right, let’s talk about X-Ray. You’ve probably heard the term “distributed tracing” thrown around at meetups and felt a slight sense of dread. It sounds complex, and honestly, it can be. But here’s the secret: X-Ray is just a glorified, hyper-organized detective that follows a single user request as it stumbles through the absolute maze of services you’ve built on AWS. It pieces together the story of what happened, where it got stuck, and who (or what service) is to blame. I use it less for routine check-ups and more for when I get a frantic Slack message that says “THE APP IS SLOW” and I need to prove it’s not my code for once.

36. X-Ray and CloudTrail

35.8 CloudWatch Embedded Metrics Format (EMF): Logging Custom Metrics

Right, let’s talk about getting your custom metrics out of your application logs and into CloudWatch where they belong. You see, CloudWatch is a bit of a diva. It loves metrics, but it demands they be presented in a very specific, structured way. You could use the PutMetricData API call from your application code, but that’s a great way to drown yourself in network calls, SDK overhead, and code that’s more about telemetry than business logic.
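
For a feel of what that “very specific, structured way” looks like, here is a minimal sketch that builds one EMF record by hand and prints it as a single JSON log line. The namespace, dimension, and metric names are invented for illustration, and the helper `emf_record` is hypothetical, not part of any AWS library. It assumes the line ends up in CloudWatch Logs (for example via Lambda’s stdout or the CloudWatch agent).

```python
import json
import time


def emf_record(namespace, dimensions, metric_name, value,
               unit="None", timestamp_ms=None):
    """Build one Embedded Metric Format record as a plain dict."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF timestamps are epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the dimension *names* only.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by metric name.
        metric_name: value,
    }
    # Dimension values also live at the top level, keyed by dimension name.
    record.update(dimensions)
    return record


# Illustrative values only; swap in your own namespace and names.
line = emf_record("MyApp/Checkout", {"Service": "checkout"},
                  "OrderLatency", 123.4, unit="Milliseconds")
print(json.dumps(line))
```

The appeal of this design is that the application only writes a log line; no SDK call or network round trip happens in the request path.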

35.7 CloudWatch Dashboards: Visualizing Metrics Across Accounts and Regions

Right, so you’ve got alarms screaming and logs streaming. Fantastic. But staring at a single metric in a single account is like trying to understand a symphony by listening to one violin. It’s time to conduct the whole orchestra. Enter CloudWatch Dashboards: your single pane of (sometimes frustrating) glass for visualizing the glorious chaos of your multi-account, multi-region infrastructure. The promise is simple: a customizable homepage for your operational sanity. The reality is a powerful tool with some quirks you need to understand, lest you build a beautiful, auto-refreshing monument to a lie.

35.6 CloudWatch Agent: Collecting System-Level Metrics and Application Logs

Right, let’s talk about the CloudWatch Agent. You’ve probably noticed that the default, out-of-the-box CloudWatch metrics for your EC2 instances are… well, they’re pathetic. A few high-level CPU and network stats every five minutes? That’s like trying to diagnose an engine problem by listening to the car from a block away. It’s useless. The CloudWatch Agent is how you fix that. It’s a little daemon you install on your instances to collect a firehose of detailed system-level metrics (like memory, disk, and processes) and, crucially, ship your application logs directly to CloudWatch. Think of it as giving AWS a direct tap into the vitals of your machine.

35.5 Logs Insights: Querying Logs with a SQL-Like Language

Alright, let’s talk about Logs Insights. This is the part where we stop just collecting logs and start actually using them. You’ve been dumping text into a log group for ages, treating it like a black box that you only open during a five-alarm fire. No more. Logs Insights gives you a SQL-ish language to crack that box open and ask it pointed questions. It’s not full SQL, mind you—the CloudWatch team took SQL out back, did some… modifications… and brought back something that’s both powerful and occasionally infuriatingly different. But we work with what we have.

35.4 CloudWatch Logs: Log Groups, Log Streams, and Retention Policies

Right, let’s talk about CloudWatch Logs. This is where your application’s hopes, dreams, and, more importantly, its panicked error messages go to live. It’s the system of record for everything that happens in your AWS universe, but it’s not just a dumb text file in the sky. It has a specific, occasionally infuriating, structure you need to grasp. At its core, CloudWatch Logs is built on two concepts: Log Groups and Log Streams. Think of a Log Group as a folder for a specific type of log. You might have a log group for /api/app, another for /api/auth, and another for your Lambda function my-broke-function. The log group is where you set the big, important policies, like retention.

35.3 CloudWatch Alarms: Threshold, Anomaly Detection, and Composite Alarms

Right, CloudWatch Alarms. This is where we move from passively watching your infrastructure’s weird little performance art piece to actually yelling at it when it misbehaves. An alarm is a state machine that watches a single metric and does something when that metric crosses a threshold for a certain period. It’s your system’s way of tapping you on the shoulder and saying, “Hey, I think I’m on fire. Or maybe I’m just cold. You should probably look into that.”
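
The “state machine” idea is easier to see in a few lines of code. What follows is a deliberately simplified sketch, not how CloudWatch actually evaluates alarms: it assumes a greater-than comparison, requires every one of the last N datapoints to breach, and treats any missing datapoint as insufficient data. The function name and sample values are hypothetical.

```python
from typing import Optional, Sequence


def alarm_state(datapoints: Sequence[Optional[float]],
                threshold: float,
                evaluation_periods: int) -> str:
    """Return OK, ALARM, or INSUFFICIENT_DATA for the most recent periods."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = list(datapoints)[-evaluation_periods:]
    # Simplification: too few datapoints, or any gap, means we cannot decide.
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    # Simplification: every recent datapoint must breach the threshold.
    if all(p > threshold for p in recent):
        return "ALARM"
    return "OK"


# Hypothetical CPU percentages, one datapoint per period.
print(alarm_state([50, 85, 90, 95], threshold=80, evaluation_periods=3))
print(alarm_state([50, 85, 70, 95], threshold=80, evaluation_periods=3))
print(alarm_state([90], threshold=80, evaluation_periods=3))
```

Requiring several consecutive breaches is what keeps a single noisy datapoint from paging anyone.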

35.2 Custom Metrics: PutMetricData via CLI and SDK

Alright, let’s talk about getting your own data into CloudWatch. The built-in metrics are great for a quick look, but the moment you need to track something specific to your business—like “number of times a user uploaded a cat picture that was actually a dog,” or “internal queue backlog depth”—you’re in the land of custom metrics. This is where you graduate from watching your cloud to actually instrumenting it. The workhorse here is the PutMetricData API. Don’t let the name fool you; it’s less about “putting” a single data point and more about publishing a batch of them efficiently. You’ll use this through the AWS CLI or an SDK. I almost always recommend the SDK for anything in production—it’s more robust, you get proper error handling, and you can bake it right into your application logic.
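
To make the batching point concrete, here is a minimal sketch of publishing datapoints in chunks. It assumes a client object that exposes `put_metric_data(Namespace=..., MetricData=[...])`, which is the shape of the boto3 CloudWatch client call. The `RecordingClient` class is a hypothetical stand-in so the example runs offline, the namespace and metric names are invented, and the batch size of 20 is a conservative assumption, so check the current documented limit before relying on it.

```python
from typing import Any, Dict, Iterable, List


def chunked(items: List[Dict[str, Any]], size: int) -> Iterable[List[Dict[str, Any]]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def publish_metrics(client: Any, namespace: str,
                    data: List[Dict[str, Any]], batch_size: int = 20) -> int:
    """Send datapoints in batches and return the number of requests made."""
    calls = 0
    for batch in chunked(data, batch_size):
        client.put_metric_data(Namespace=namespace, MetricData=batch)
        calls += 1
    return calls


class RecordingClient:
    """Hypothetical stand-in for a CloudWatch client; it only records requests."""

    def __init__(self) -> None:
        self.requests = []

    def put_metric_data(self, Namespace, MetricData):
        self.requests.append((Namespace, MetricData))


# 45 illustrative datapoints, one per made-up queue.
datapoints = [
    {
        "MetricName": "QueueBacklogDepth",
        "Dimensions": [{"Name": "Queue", "Value": f"q{i}"}],
        "Value": float(i),
        "Unit": "Count",
    }
    for i in range(45)
]

client = RecordingClient()
print(publish_metrics(client, "MyApp/Queues", datapoints))  # requests sent
```

In real code you would pass a client created by the SDK instead of the stand-in, and wrap the call in whatever retry and error handling your application already uses.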

35.1 CloudWatch Metrics: Namespaces, Dimensions, and Resolution

Alright, let’s talk about CloudWatch Metrics, the beating heart of your AWS observability. Think of it as the system that collects all the vital signs from your infrastructure and applications. It’s powerful, but it has its own quirky logic. You’re not just learning a tool; you’re learning to think in its particular, dimension-obsessed language. First, the basic unit: a metric is just a time series of data points. CPU at 45% at 12:04:32. Request count at 1,203 at 12:04:33. You get the idea. But AWS doesn’t just throw these numbers into a big, unsorted bucket. They’re organized using three core concepts: Namespaces, Dimensions, and Resolution. Get these right, and you’re a wizard. Get them wrong, and you’re in for a world of confusion.
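
One way to internalize the “dimension-obsessed” part: a metric’s identity is its namespace, its name, and its exact set of dimensions. The sketch below models that identity as a plain tuple. The helper `metric_key` and the sample names are hypothetical, and resolution is left out because it describes how often datapoints are stored rather than which metric they belong to.

```python
from typing import Dict, FrozenSet, Tuple

MetricKey = Tuple[str, str, FrozenSet[Tuple[str, str]]]


def metric_key(namespace: str, name: str, dimensions: Dict[str, str]) -> MetricKey:
    """Model a metric's identity: namespace + name + the exact dimension set."""
    return (namespace, name, frozenset(dimensions.items()))


a = metric_key("MyApp", "RequestCount", {"Service": "api"})
b = metric_key("MyApp", "RequestCount", {"Service": "api", "Stage": "prod"})
c = metric_key("MyApp", "RequestCount", {"Stage": "prod", "Service": "api"})

print(a == b)  # False: an extra dimension makes it a different metric
print(b == c)  # True: dimension order does not matter, only the set
```

This is why publishing with one extra dimension quietly creates a second metric instead of adding to the first.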

35. CloudWatch: Metrics, Alarms, Logs Insights, and Dashboards

28.7 Correlating Traces with Logs and Metrics

Right, so you’ve got your traces. Beautiful, waterfall diagrams that show you exactly where your 500ms latency spike came from. But traces don’t live in a vacuum. They’re the “what,” but rarely the “why.” That “why” is almost always buried in a log line or screamed by a metric. The real magic happens when you stitch these three pillars of observability together. Without this correlation, you’re just a detective with three separate, incomplete case files.

28.6 Tempo: Grafana's Trace Storage Backend

Right, so you’ve got your application instrumented, your spans are flying, and your OpenTelemetry Collector is dutifully collecting. Fantastic. But that telemetry data has to go somewhere. You can’t just shout your traces into the void and hope for the best (though I’ve seen teams try). This is where Tempo comes in. Think of it as Grafana’s purpose-built, highly scalable, and refreshingly simple parking garage for your trace data. It’s not trying to be a general-purpose database; it’s built from the ground up to do one thing incredibly well: store and retrieve traces, fast.

28.5 Jaeger: Open-Source Distributed Tracing Backend

Right, so you’ve got OpenTelemetry instrumenting your code and sending out all these lovely spans. Fantastic. But that telemetry data has to go somewhere unless you’re just shouting into the void, which is a terrible architectural pattern. This is where Jaeger comes in. Think of it as your dedicated, high-performance storage and analysis garage for your trace data. It’s open-source, it’s a CNCF graduate (so you know it’s not just some fly-by-night project), and it’s probably the most common backend you’ll hook up to your OpenTelemetry SDK.

28.4 The OpenTelemetry Collector: Pipeline for Traces, Metrics, Logs

Right, so you’ve instrumented your code. Congratulations, you’re now emitting beautiful, pristine telemetry data. Which is a bit like carefully crafting a message in a bottle and throwing it into the ocean. The OpenTelemetry Collector is the fleet of ships and satellites you deploy to actually find those bottles, read them, and radio the contents back to headquarters. It’s the unsung hero, the plumbing, the data bus. You don’t strictly need it, but life without it is a messy, manual, and frankly amateurish affair.

28.3 Instrumenting Applications with OpenTelemetry SDKs

Right, let’s get our hands dirty. You’ve decided you want to know what your distributed system is actually doing, not just what you hope it’s doing. That’s what instrumentation is for. It’s the process of adding observability code—the stuff that generates telemetry—directly into your application. Think of it like adding a flight data recorder to your code. We’re not just logging when it crashes; we’re recording its every operation. With OpenTelemetry, you do this using language-specific Software Development Kits (SDKs). The beauty here is that the API (the interfaces you code against) is separate from the SDK (the implementation that sends data somewhere). This means you can instrument your code today, decide where to send it tomorrow, and change your mind next week without touching a line of application code. It’s a genuinely good design choice, and I don’t say that lightly.

28.2 OpenTelemetry: The Vendor-Neutral Observability Framework

Right, so you’ve decided you want to know what your software is actually doing. Not what you think it’s doing, not what it did in the pristine isolation of your localhost, but what it’s doing right now, in production, while being pummeled by real users and network gremlins. Welcome. The only way to get that picture without losing your mind is with distributed tracing, and the only sane way to implement it in 2024 is with OpenTelemetry.

28.1 Why Distributed Tracing: Finding Latency Across Services

Right, let’s talk about latency. You’ve probably stared at a dashboard full of green checkmarks for every service—your API gateway, your user service, your recommendation engine, your database—and yet, your end-user is complaining that the app is “slow.” You know the feeling. It’s the infrastructure equivalent of a mystery novel where everyone has an alibi. The individual service metrics are useless because the crime—the latency—was committed between them, in the network calls. This is why we need a detective, and that detective is distributed tracing.
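
Here is a small sketch of that mystery in numbers. It assumes a trace is just a root span plus child spans with start and end times in milliseconds; the `Span` class, the service names, and the timings are all invented for illustration. The function adds up how much of the root span no child accounts for, which is time spent in the caller itself, on the network, or waiting in a queue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def uncovered_ms(root: Span, children: List[Span]) -> float:
    """Time inside the root span that no child span accounts for."""
    covered = 0.0
    cursor = root.start_ms  # everything before the cursor is already counted
    for child in sorted(children, key=lambda s: s.start_ms):
        start = max(child.start_ms, cursor)     # skip any overlap with earlier children
        end = min(child.end_ms, root.end_ms)    # ignore time past the end of the root
        if end > start:
            covered += end - start
            cursor = end
    return root.duration_ms - covered


# Hypothetical request: every service looks fast on its own dashboard.
root = Span("api-gateway", 0, 500)
children = [
    Span("user-service", 20, 60),
    Span("recommendation-engine", 180, 230),
    Span("database", 400, 440),
]

print(max(c.duration_ms for c in children))  # slowest single service
print(uncovered_ms(root, children))          # time hiding between the services
```

No single service in this made-up trace takes more than 50 ms, yet the request takes 500 ms, and only a view across the whole request shows where the rest went.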

28. Distributed Tracing and OpenTelemetry

26.8 kube-prometheus-stack: The Batteries-Included Helm Chart

Right, so you’ve decided you want metrics. Good choice. Staring at a wall of log files to figure out why your application is having a conniption is like trying to read a book by smelling it. You need numbers, graphs, and a way to ask “what changed five minutes before everything caught on fire?” You could assemble this whole monitoring stack yourself: deploy Prometheus, then Grafana, then the various exporters, then the custom resource definitions (CRDs) for service monitors, then figure out the permissions… it’s a lot. It’s the kind of project that starts on a Friday afternoon and ruins your entire weekend. The kube-prometheus-stack Helm chart is the antidote to that self-inflicted pain. It’s the “batteries-included” approach, and frankly, it’s brilliant.

26.7 Alertmanager: Routing and Silencing Alerts

Alright, let’s get our hands dirty with Alertmanager. You’ve set up Prometheus, it’s firing alerts, and now your inbox is getting flooded because InstanceDown is pinging for that one dev node everyone knows fails every Tuesday. This is where Alertmanager earns its keep. It’s not just a dumb forwarder; it’s the traffic cop, the bouncer, and the notification router for your entire alerting system. Its job is to take the firehose of alerts from Prometheus and route them to the correct people, in the correct way, and only when it absolutely should.

26.6 Grafana Dashboards: Importing and Building

Right, so you’ve got Prometheus scraping all those lovely metrics. Congratulations, you now have a firehose of data pointed directly at your face. Grafana is how you put a nozzle on that hose and actually see what’s going on. It’s the difference between staring at a spreadsheet of numbers and looking at a beautifully rendered graph that tells you, “Hey, your service is on fire.” Let’s get you from data to dashboard.

26.5 PromQL: Querying Kubernetes Metrics

Right, let’s talk PromQL. You’ve got Prometheus scraping all sorts of juicy data from your Kubernetes cluster. That’s step one. But staring at a list of metrics is like staring at a parts bin for a race car—impressive, but useless unless you know how to assemble them into something that tells you how fast you’re going or when you’re about to blow a gasket. That’s where PromQL comes in. It’s the language you use to ask pointed, intelligent questions of your metric data. It’s deceptively simple-looking, but it has a few quirks that will drive you absolutely mad until you understand its internal logic.

26.4 ServiceMonitor and PodMonitor: Prometheus Operator CRDs

Right, so you’ve got Prometheus installed via its Operator. Good for you. That was the easy part. Now comes the actual magic trick: telling the thing what to scrape. You could go back to the dark ages of manually editing a prometheus.yml file, but you installed the Operator for a reason. It’s time to use its superpowers: ServiceMonitor and PodMonitor. Think of these as your translators, converting your application’s cry for attention (“Here are my metrics!”) into a language the Prometheus server actually understands.

26.3 node-exporter: Node-Level Hardware and OS Metrics

Right, let’s talk about node_exporter. This is the workhorse, the foundation, the thing that goes out and gets the dirt-under-its-fingernails metrics from the machine your software is running on. It’s not glamorous, but without it, you’re flying blind. Think of it as a highly specific, incredibly diligent intern who runs around your server with a clipboard, meticulously counting everything from CPU cycles to disk I/O, and then formats it all for Prometheus to consume.

26.2 kube-state-metrics: Cluster-Level Metrics from API Objects

Right, so you’ve got Prometheus scraping your nodes and pods. That’s a great start, but it’s like knowing the engine RPM and fuel levels of every single car in a massive parking lot without knowing which ones are actually driving, who’s driving them, or if any of them are about to run out of gas and stall in the middle of the highway. For that, you need to understand the state of your Kubernetes API objects—the Deployments, DaemonSets, StatefulSets, and so on. This is where kube-state-metrics (KSM) comes in. It’s the translator that sits between the abstract world of the Kubernetes API and the concrete, number-crunching world of Prometheus.

26.1 Prometheus Architecture: Scrape, Store, Query, Alert

Right, let’s get this party started. Prometheus isn’t some magical black box that just “knows” about your services. It’s more like a meticulous, slightly obsessive librarian who only knows about the books you explicitly tell it to go and read the title of, at very specific times. Its entire worldview is built on a simple, brutal cycle: scrape, store, query, alert. Miss one beat of this rhythm, and the whole symphony falls apart.