Tracing | mikePietsch.com

36.8 CloudTrail Lake: Querying CloudTrail Events with SQL

Right, so you’ve got your CloudTrail logs flowing into a Lake. Congratulations, you’ve successfully moved your digital haystack from one barn (S3) to a slightly more organized barn (Lake). But now what? You’re staring at petabytes of JSON blobs thinking, “There has to be a better way to find this one specific API call than grep.” There is. It’s called SQL, and CloudTrail Lake’s query feature is your new best friend. It lets you interrogate that mountain of audit data without having to load it into another service or, heaven forbid, download it. Let’s cut through the marketing fluff and get to how it actually works.

36.7 Trail Configuration: Management Events, Data Events, and Insights Events

Alright, let’s talk about configuring a CloudTrail trail. This is where you go from just having logs to actually having a useful logging setup. Think of it as the difference between a firehose of raw data and a precision instrument. We’re going to wire that hose to a sprinkler system, not just point it at the wall and hope for the best. The core of your trail configuration is telling CloudTrail what you want it to actually record. AWS breaks this down into three categories, and getting this wrong is the number one reason people either drown in log noise or miss the one critical event they needed. Let’s demystify them.

36.6 CloudTrail: API Call Logging for Audit and Compliance

Right, let’s talk about CloudTrail. This is the service that saves your bacon. It’s the security camera in the hallway of your AWS account, meticulously recording who came in, what door they used, and what they tried to do. Nearly every API call made by a user, role, or service can be recorded here (management events by default, data events once you opt in). If you ever need to answer the questions “What happened?” or “Who did it?”, this is your first and last stop.

36.5 X-Ray Analytics: Filtering and Aggregating Traces

Right, so you’ve got X-Ray set up and your traces are flowing in. It’s a beautiful mess of data, a veritable firehose of every single thing your system is doing. Staring at the raw trace list is like trying to drink from that firehose. You’ll get water everywhere and probably hurt yourself. This is where X-Ray Analytics comes in—it’s the fancy nozzle and cup that turns that chaotic stream into something you can actually use.

36.4 X-Ray Sampling Rules: Controlling Trace Volume

Right, let’s talk about sampling. You’ve enabled X-Ray, and suddenly your trace data is… a lot. Like, “could-fund-a-small-nation’s-coffee-supply” a lot. That’s because by default, the X-Ray SDK records the first request each second and five percent of any additional requests. It’s a decent starting point, but it’s about as subtle as a sledgehammer. For high-throughput services, this default can generate a staggering, expensive, and frankly useless volume of traces. You don’t need a trace for every single health check or load balancer ping. This is where sampling rules come in: they’re your finely-tuned control panel for this firehose of data.
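To make that default concrete (one request per second, plus five percent of the rest), here is a minimal sketch of a "reservoir plus fixed rate" decision in plain Python. It is not the X-Ray SDK's code: the `ReservoirSampler` class, its parameters, and the simulated traffic are all hypothetical, and the clock is passed in by hand so the behavior is easy to reason about.

```python
import random


class ReservoirSampler:
    """Toy sampler: trace the first N requests each second, then a share of the rest.

    Hypothetical illustration only; not the X-Ray SDK implementation.
    """

    def __init__(self, reservoir=1, rate=0.05, rng=None):
        self.reservoir = reservoir   # requests always traced per one-second window
        self.rate = rate             # fraction of additional requests traced
        self.rng = rng or random.Random()
        self._second = None          # the one-second window currently being counted
        self._used = 0               # reservoir slots already used in that window

    def should_sample(self, now):
        """Return True if the request arriving at time `now` (in seconds) is traced."""
        second = int(now)
        if second != self._second:   # new window: refill the reservoir
            self._second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        return self.rng.random() < self.rate


sampler = ReservoirSampler(rng=random.Random(0))
# Simulate 1,000 requests spread evenly over 10 seconds (100 per second).
decisions = [sampler.should_sample(i / 100) for i in range(1000)]
print(sum(decisions), "of", len(decisions), "requests traced")
```

The first request in each second is always kept, so quiet services still produce traces, while the percentage keeps busy services from tracing everything.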

36.3 Service Maps: Visualizing Request Flow and Latency

Alright, let’s talk about visualizing the absolute chaos of your AWS architecture. You’ve got a dozen services whispering to each other across the globe, and when something goes wrong, you’re left staring at a dozen different logs in a dozen different consoles, feeling like a detective with amnesia. This is where X-Ray and CloudTrail stop being buzzwords and start being your brilliant, over-caffeinated partners in crime. Think of it this way: CloudTrail is the who, what, and when. It’s the meticulous security guard logging every single API call made by a user, role, or service in your account. “User Alice called s3:GetObject on my-stupid-bucket at 3:42 PM.” It’s essential for auditing and security, but it’s a flat list of events. It doesn’t show you the conversation between services.

36.2 X-Ray SDK: Instrumenting Lambda, EC2, ECS, and API Gateway

Alright, let’s talk about making your distributed mess… I mean, your distributed application… actually traceable. You’ve built this beautiful, decoupled thing with Lambda functions firing off events, ECS tasks chatting with DynamoDB, and API Gateway tying it all together. It’s glorious until something breaks, and then you’re left staring at CloudWatch logs like a detective without a case file, trying to correlate random timestamps. That’s where X-Ray and its SDK come in—to be your detective partner.

36.1 X-Ray: Distributed Tracing for AWS Applications

Right, let’s talk about X-Ray. You’ve probably heard the term “distributed tracing” thrown around at meetups and felt a slight sense of dread. It sounds complex, and honestly, it can be. But here’s the secret: X-Ray is just a glorified, hyper-organized detective that follows a single user request as it stumbles through the absolute maze of services you’ve built on AWS. It pieces together the story of what happened, where it got stuck, and who (or what service) is to blame. I use it less for routine check-ups and more for when I get a frantic Slack message that says “THE APP IS SLOW” and I need to prove it’s not my code for once.

36. X-Ray and CloudTrail

28.7 Correlating Traces with Logs and Metrics

Right, so you’ve got your traces. Beautiful, waterfall diagrams that show you exactly where your 500ms latency spike came from. But traces don’t live in a vacuum. They’re the “what,” but rarely the “why.” That “why” is almost always buried in a log line or screamed by a metric. The real magic happens when you stitch these three pillars of observability together. Without this correlation, you’re just a detective with three separate, incomplete case files.
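One common way to do that stitching is to stamp every log line with the ID of the trace it belongs to. The sketch below uses only the standard library. The `current_trace_id` variable and the hard-coded ID are hypothetical stand-ins for whatever your tracing library exposes.

```python
import contextvars
import io
import logging

# Hypothetical: in a real service the tracing library would supply the current trace ID.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # stands in for a log file or stdout
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("trace_id=%(trace_id)s %(levelname)s %(message)s")
)

log = logging.getLogger("correlation_demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the demo output out of the root logger

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # made-up example ID
log.info("charging card")

line = stream.getvalue().strip()
print(line)
```

With the ID in every line, a slow span in the trace view becomes a search term for the log store.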

28.6 Tempo: Grafana's Trace Storage Backend

Right, so you’ve got your application instrumented, your spans are flying, and your OpenTelemetry Collector is dutifully collecting. Fantastic. But that telemetry data has to go somewhere. You can’t just shout your traces into the void and hope for the best (though I’ve seen teams try). This is where Tempo comes in. Think of it as Grafana’s purpose-built, highly scalable, and refreshingly simple parking garage for your trace data. It’s not trying to be a general-purpose database; it’s built from the ground up to do one thing incredibly well: store and retrieve traces, fast.

28.5 Jaeger: Open-Source Distributed Tracing Backend

Right, so you’ve got OpenTelemetry instrumenting your code and sending out all these lovely spans. Fantastic. But that telemetry data has to go somewhere unless you’re just shouting into the void, which is a terrible architectural pattern. This is where Jaeger comes in. Think of it as your dedicated, high-performance storage and analysis garage for your trace data. It’s open-source, it’s a CNCF graduate (so you know it’s not just some fly-by-night project), and it’s probably the most common backend you’ll hook up to your OpenTelemetry SDK.

28.4 The OpenTelemetry Collector: Pipeline for Traces, Metrics, Logs

Right, so you’ve instrumented your code. Congratulations, you’re now emitting beautiful, pristine telemetry data. Which is a bit like carefully crafting a message in a bottle and throwing it into the ocean. The OpenTelemetry Collector is the fleet of ships and satellites you deploy to actually find those bottles, read them, and radio the contents back to headquarters. It’s the unsung hero, the plumbing, the data bus. You don’t strictly need it, but life without it is a messy, manual, and frankly amateurish affair.

28.3 Instrumenting Applications with OpenTelemetry SDKs

Right, let’s get our hands dirty. You’ve decided you want to know what your distributed system is actually doing, not just what you hope it’s doing. That’s what instrumentation is for. It’s the process of adding observability code—the stuff that generates telemetry—directly into your application. Think of it like adding a flight data recorder to your code. We’re not just logging when it crashes; we’re recording its every operation. With OpenTelemetry, you do this using language-specific Software Development Kits (SDKs). The beauty here is that the API (the interfaces you code against) is separate from the SDK (the implementation that sends data somewhere). This means you can instrument your code today, decide where to send it tomorrow, and change your mind next week without touching a line of application code. It’s a genuinely good design choice, and I don’t say that lightly.

28.2 OpenTelemetry: The Vendor-Neutral Observability Framework

Right, so you’ve decided you want to know what your software is actually doing. Not what you think it’s doing, not what it did in the pristine isolation of your localhost, but what it’s doing right now, in production, while being pummeled by real users and network gremlins. Welcome. The only way to get that picture without losing your mind is with distributed tracing, and the only sane way to implement it in 2024 is with OpenTelemetry.

28.1 Why Distributed Tracing: Finding Latency Across Services

Right, let’s talk about latency. You’ve probably stared at a dashboard full of green checkmarks for every service—your API gateway, your user service, your recommendation engine, your database—and yet, your end-user is complaining that the app is “slow.” You know the feeling. It’s the infrastructure equivalent of a mystery novel where everyone has an alibi. The individual service metrics are useless because the crime—the latency—was committed between them, in the network calls. This is why we need a detective, and that detective is distributed tracing.

28. Distributed Tracing and OpenTelemetry

15. Distributed Tracing Concepts

16. Jaeger Architecture and Deployment

17. Grafana Tempo: Trace Storage

22. LangSmith: Tracing, Evaluation, and Monitoring

72.9 faulthandler: Diagnosing Crashes and Segfaults

Right, so you’ve written some Python. It’s beautiful, it’s elegant, and then—without warning—it exits. No traceback, no KeyboardInterrupt, just a sudden, silent return to the comforting glow of your terminal prompt. Or worse, it spits out Segmentation fault (core dumped) and mocks you from the history log. This, my friend, is where your usual Python tools tap out. print() statements? Useless. The logging module? Never got the message. pdb? Didn’t even get to wake up. Your code has crashed in the C layers, far beneath the comfortable Python runtime where exceptions are raised and caught. This is the realm of dangling pointers, buffer overflows, and corrupted memory. And to diagnose this, you need a different kind of tool. You need faulthandler.
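As a minimal sketch of what turning it on looks like: the snippet below enables `faulthandler` and points it at a log file. The file name `faulthandler_demo.log` is just an example. The same handler can be switched on without touching code via `python -X faulthandler` or the `PYTHONFAULTHANDLER` environment variable.

```python
import faulthandler
import os
import tempfile

log_path = os.path.join(tempfile.gettempdir(), "faulthandler_demo.log")

# The file must stay open for as long as the handler is enabled, because
# faulthandler writes to its file descriptor directly when a fatal signal arrives.
crash_log = open(log_path, "w")

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL, dump the Python traceback
# of every thread to crash_log before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)
print("enabled:", faulthandler.is_enabled())

# The same dump can also be requested on demand, which is handy for hangs.
faulthandler.dump_traceback(file=crash_log)
crash_log.flush()
print("bytes written:", os.path.getsize(log_path))
```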

72.8 sys.settrace(): Writing Your Own Debugger or Profiler

Right, so you’ve graduated from print() statements and you’re ready to get serious. You’ve used pdb and maybe even a fancy IDE debugger, and a little voice in your head asked, “How do they do that?” The answer, my friend, is sys.settrace(). It’s the arcane, powerful, and slightly terrifying incantation that allows you to hook into the CPython interpreter’s execution flow. It’s how debuggers, profilers, and coverage tools are born.
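For a taste of the hook itself, here is a minimal sketch of a trace function that records which Python functions get called. The `helper` and `work` functions are throwaway examples. A real debugger would return a local trace function to receive per-line events instead of returning `None`.

```python
import sys

calls = []  # names of the functions entered while tracing is active


def tracer(frame, event, arg):
    """Global trace function: receives a 'call' event for each new Python frame."""
    if event == "call":
        calls.append(frame.f_code.co_name)
    # Returning None means "do not trace inside this frame" (no per-line events).
    return None


def helper():
    return 21


def work():
    return helper() * 2


sys.settrace(tracer)
try:
    result = work()
finally:
    sys.settrace(None)  # always uninstall the hook, even if work() raises

print(result, calls)
```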

72.7 Remote Debugging with debugpy (VS Code, PyCharm)

Right, so your code is misbehaving. But it’s misbehaving on a remote server, in a Docker container, or inside a virtual environment so alien it might as well be on the dark side of the moon. You can’t just slap a print("got here lol") statement in there and run it locally. This is where we graduate from caveman debugging to something with a bit more finesse: remote debugging. We’re going to use debugpy, Microsoft’s brilliantly capable debugger for Python, which speaks the Debug Adapter Protocol. It’s what lets VS Code’s debugger do its magic, and it plays nicely with PyCharm and other modern IDEs too.

72.6 breakpoint() and PYTHONBREAKPOINT

Right, so you’ve graduated from print("got here") to actual debugging. Congratulations, we’re all very proud. But let’s be honest, fumbling with import pdb; pdb.set_trace() is the digital equivalent of trying to start a fire with two wet sticks. It works, but it’s clumsy, it leaves a mess, and there’s a much better way. Enter breakpoint(). This isn’t just a new function; it’s a cultural shift in Python debugging, and it’s about damn time.
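The mechanics are small enough to sketch: `breakpoint()` simply calls `sys.breakpointhook`, and the default hook reads `PYTHONBREAKPOINT` each time it runs. In the example below, `recording_hook` is a hypothetical stand-in for a real debugger so the script never stops for input.

```python
import os
import sys

hits = []


def recording_hook(*args, **kwargs):
    """Stand-in for a debugger: just note that a breakpoint was reached."""
    hits.append("hit")


# 1) breakpoint() calls whatever sys.breakpointhook currently points to.
sys.breakpointhook = recording_hook
breakpoint()

# 2) The default hook consults PYTHONBREAKPOINT on every call; the value "0"
#    turns breakpoint() into a no-op. Skipped when Python was started with
#    -E or -I, because environment variables are ignored in that mode.
sys.breakpointhook = sys.__breakpointhook__
if not sys.flags.ignore_environment:
    os.environ["PYTHONBREAKPOINT"] = "0"
    breakpoint()  # returns immediately, no debugger starts
    del os.environ["PYTHONBREAKPOINT"]

print(hits)
```

Setting `PYTHONBREAKPOINT` to a dotted name such as `pdb.set_trace` selects which callable the default hook imports and runs.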

72.5 pdb: Setting Breakpoints and Inspecting State

Right, so print() statements have failed you. They always do. Welcome to the big leagues. When your code is doing something so profoundly idiotic that you can’t even begin to guess why, you need to stop it mid-execution, climb inside its brain, and have a look around. That’s what pdb, the Python debugger, is for. It’s your surgical tool for figuring out what the hell is actually happening, not what you think is happening.

72.4 Structured Logging with structlog

Right, let’s talk about making your logs actually useful. You’ve probably been there: staring at a text file that looks like a frantic, unstructured diary entry written by a machine on three cups of espresso. Timestamp, log level, some vague message… good luck finding the one error in that mess. The default logging module is fine for telling you that something happened, but it’s terrible at telling you the story of why it happened. That’s where structlog comes in. It’s not just a library; it’s a philosophy for turning your logs from a liability into a debuggable, queryable asset.

72.3 Logging to Files, Rotating Handlers, and External Services

Right, so you’ve graduated from print() statements. Good for you. Now let’s talk about doing it properly. Logging to the console is fine for a quick script, but for anything that runs longer than five minutes, you need persistence. You need logs that survive a reboot, that you can grep through at 2 AM when things are on fire, and that don’t fill up your disk and bring the whole operation to a grinding halt. Let’s get into it.
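Here is a minimal sketch of size-based rotation with the standard library's `RotatingFileHandler`. The tiny `maxBytes` value, the temporary directory, and the `rotation_demo` logger name are chosen only to make the rollover visible in a demo. Real services would use far larger limits.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

handler = logging.handlers.RotatingFileHandler(
    log_path,
    maxBytes=1024,   # roll over before the file would reach about 1 KiB
    backupCount=3,   # keep app.log.1 through app.log.3, discard anything older
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("rotation_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo records out of the root logger

for i in range(200):
    logger.info("event number %d", i)

handler.close()
log_files = sorted(os.listdir(log_dir))
print(log_files)
```

Because old backups are deleted once `backupCount` is reached, disk usage is bounded at roughly `maxBytes * (backupCount + 1)`.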

72.2 basicConfig() vs Manual Configuration

Right, let’s settle this. You’ve seen basicConfig() everywhere. It’s the logging equivalent of a friendly “Easy” button. And you’ve probably also seen people creating Logger objects, Handler objects, Formatter objects… and thought, “Why would anyone do that the hard way?” I’m here to tell you that basicConfig() is a fantastic one-night stand, but for a serious, long-term relationship with your application’s logs, you need to do things manually. Let’s break down why.

72.1 The logging Module: Levels, Loggers, Handlers, Formatters

Right, let’s talk about logging. You’ve been using print() statements to debug your code since you wrote your first Hello, World. I get it. It’s immediate, it’s simple, and when your script is three lines long, it’s perfect. But you’re not writing three-line scripts anymore, are you? You’re building applications. And when your application is running on a server at 2 AM and something goes horribly wrong, you’re not going to SSH in to tail -f a bunch of print() output. You need a system. A robust, configurable, and frankly, adult system for understanding what your code is doing when you’re not there to watch it. That system is the logging module.
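The four pieces named in the title fit together in a few lines. This is a minimal sketch. The logger name `shop.checkout` and the messages are made up, and a `StringIO` buffer stands in for a file or the console.

```python
import io
import logging

buffer = io.StringIO()  # stands in for a file or the console

handler = logging.StreamHandler(buffer)  # handler: where records are sent
handler.setLevel(logging.WARNING)        # handler-level threshold
handler.setFormatter(                    # formatter: what each record looks like
    logging.Formatter("%(levelname)s:%(name)s:%(message)s")
)

logger = logging.getLogger("shop.checkout")  # logger: named entry point in a dotted hierarchy
logger.setLevel(logging.DEBUG)               # logger-level threshold
logger.addHandler(handler)
logger.propagate = False                     # do not also hand records to the root logger

logger.debug("cart loaded")        # accepted by the logger, dropped by the handler
logger.warning("payment retried")
logger.error("payment failed")

output = buffer.getvalue()
print(output, end="")
```

A record has to clear both thresholds: the logger's level decides whether it is created at all, and each handler's level decides whether that destination receives it.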