28.7 Correlating Traces with Logs and Metrics

Right, so you’ve got your traces. Beautiful waterfall diagrams that show you exactly where your 500ms latency spike came from. But traces don’t live in a vacuum. They’re the “what,” but rarely the “why.” That “why” is almost always buried in a log line or screamed by a metric. The real magic happens when you stitch these three pillars of observability together. Without this correlation, you’re just a detective with three separate, incomplete case files.
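To make the stitching concrete, here is a minimal sketch of the usual trick: stamp every log line with the ID of the trace that was active when it was written, so the log backend can be searched by trace ID. It uses only the Python standard library. The `current_trace_id` variable, the `checkout` logger name, and the sample trace ID are assumptions made up for illustration; a real tracing SDK keeps the active trace in its own context and would supply the ID for you.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID. A real tracing SDK tracks this itself.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get() or "-"
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log backend can index trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    """Return a logger whose output lines all carry a trace_id field."""
    logger = logging.getLogger("checkout")  # illustrative service name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)

# Inside a request: pretend the tracer has started a trace with this ID.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.warning("payment provider timed out")
current_trace_id.reset(token)

# Outside any request there is no trace, so the field falls back to "-".
log.info("idle")

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

With the ID in every line, jumping from a slow span to the logs written during it becomes a plain filter on `trace_id` rather than a guess based on timestamps.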

28.6 Tempo: Grafana's Trace Storage Backend

Right, so you’ve got your application instrumented, your spans are flying, and your OpenTelemetry Collector is dutifully collecting. Fantastic. But that telemetry data has to go somewhere. You can’t just shout your traces into the void and hope for the best (though I’ve seen teams try). This is where Tempo comes in. Think of it as Grafana’s purpose-built, highly scalable, and refreshingly simple parking garage for your trace data. It’s not trying to be a general-purpose database; it’s built from the ground up to do one thing incredibly well: store and retrieve traces, fast.

28.5 Jaeger: Open-Source Distributed Tracing Backend

Right, so you’ve got OpenTelemetry instrumenting your code and sending out all these lovely spans. Fantastic. But that telemetry data has to go somewhere unless you’re just shouting into the void, which is a terrible architectural pattern. This is where Jaeger comes in. Think of it as your dedicated, high-performance storage and analysis garage for your trace data. It’s open-source, it’s a CNCF graduate (so you know it’s not just some fly-by-night project), and it’s probably the most common backend you’ll hook up to your OpenTelemetry SDK.

28.4 The OpenTelemetry Collector: Pipeline for Traces, Metrics, Logs

Right, so you’ve instrumented your code. Congratulations, you’re now emitting beautiful, pristine telemetry data, which is a bit like carefully crafting a message in a bottle and throwing it into the ocean. The OpenTelemetry Collector is the fleet of ships and satellites you deploy to actually find those bottles, read them, and radio the contents back to headquarters. It’s the unsung hero, the plumbing, the data bus. You don’t strictly need it, since the SDKs can export straight to a backend, but life without it is a messy, manual, and frankly amateurish affair.

28.3 Instrumenting Applications with OpenTelemetry SDKs

Right, let’s get our hands dirty. You’ve decided you want to know what your distributed system is actually doing, not just what you hope it’s doing. That’s what instrumentation is for. It’s the process of adding observability code (the stuff that generates telemetry) directly into your application. Think of it like adding a flight data recorder to your code. We’re not just logging when it crashes; we’re recording its every operation. With OpenTelemetry, you do this using language-specific Software Development Kits (SDKs). The beauty here is that the API (the interfaces you code against) is separate from the SDK (the implementation that sends data somewhere). This means you can instrument your code today, decide where to send it tomorrow, and change your mind next week without touching a line of application code; only the SDK setup and exporter configuration change. It’s a genuinely good design choice, and I don’t say that lightly.
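The API/SDK split is easier to appreciate in miniature. What follows is a toy sketch of the idea in plain Python, not OpenTelemetry's actual classes: `Tracer`, `NoopTracer`, `RecordingTracer`, `set_tracer`, `get_tracer`, and `handle_order` are all hypothetical names invented for this example. The point it tries to show is that the application function is written once against an interface, and swapping the implementation behind it changes what gets recorded without the function changing at all.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
import time


class Tracer(ABC):
    """The 'API': the only thing application code is allowed to touch."""

    @abstractmethod
    def start_span(self, name):
        """Return a context manager that wraps one unit of work."""


class NoopTracer(Tracer):
    """Default implementation: records nothing, so instrumentation is harmless."""

    @contextmanager
    def start_span(self, name):
        yield


class RecordingTracer(Tracer):
    """A stand-in 'SDK' that keeps finished spans in memory instead of exporting them."""

    def __init__(self):
        self.finished = []  # list of (span name, duration in seconds)

    @contextmanager
    def start_span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.finished.append((name, time.perf_counter() - start))


_tracer = NoopTracer()


def set_tracer(tracer):
    """Called once at startup to choose the implementation."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


def handle_order(order_id):
    # Application code: written against the API, unaware of any backend.
    with get_tracer().start_span("handle_order"):
        with get_tracer().start_span("charge_card"):
            return f"order {order_id} charged"


# Same function, no implementation installed: it still works, nothing is recorded.
result_without_sdk = handle_order(1)

# Install an implementation at startup: the function body is untouched.
recorder = RecordingTracer()
set_tracer(recorder)
result_with_sdk = handle_order(2)

# Spans are recorded as they finish, so the inner span comes first.
span_names = [name for name, _ in recorder.finished]
print(span_names)
```

In OpenTelemetry for Python the same separation shows up as two packages, `opentelemetry-api` and `opentelemetry-sdk`; the toy above only mirrors the shape of that arrangement.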

28.2 OpenTelemetry: The Vendor-Neutral Observability Framework

Right, so you’ve decided you want to know what your software is actually doing. Not what you think it’s doing, not what it did in the pristine isolation of your localhost, but what it’s doing right now, in production, while being pummeled by real users and network gremlins. Welcome. The only way to get that picture without losing your mind is with distributed tracing, and the only sane way to implement it in 2024 is with OpenTelemetry.

28.1 Why Distributed Tracing: Finding Latency Across Services

Right, let’s talk about latency. You’ve probably stared at a dashboard full of green checkmarks for every service—your API gateway, your user service, your recommendation engine, your database—and yet, your end-user is complaining that the app is “slow.” You know the feeling. It’s the infrastructure equivalent of a mystery novel where everyone has an alibi. The individual service metrics are useless because the crime—the latency—was committed between them, in the network calls. This is why we need a detective, and that detective is distributed tracing.
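Here is a small sketch of what the detective actually does with the evidence. Given the spans of one trace, it works out how much of each span's time is not covered by any child span. The `Span` fields, the span IDs, and every timestamp below are invented for illustration; real span data carries far more, and this is only one simple way to slice it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time inside `span` that is not covered by any of its direct children."""
    children = sorted(
        (s for s in spans if s.parent_id == span.span_id),
        key=lambda s: s.start_ms,
    )
    covered, cursor = 0, span.start_ms
    for child in children:
        # Clip each child to the parent and skip time already counted,
        # so overlapping children are not counted twice.
        start = max(child.start_ms, cursor)
        end = min(child.end_ms, span.end_ms)
        if end > start:
            covered += end - start
            cursor = end
    return span.duration_ms - covered


# Made-up trace: each downstream service is quick, yet the request takes 500 ms.
trace = [
    Span("a", None, "api-gateway", 0, 500),
    Span("b", "a", "user-service", 10, 50),
    Span("c", "a", "recommendation-engine", 360, 400),
    Span("d", "c", "database", 365, 395),
]

report = {s.service: self_time_ms(s, trace) for s in trace}
print(report)
```

In this invented trace, 420 of the 500 ms sits in the gateway span outside any downstream call, which is exactly the kind of gap that per-service dashboards for the user service, recommendation engine, and database would never show.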


...