28.7 Correlating Traces with Logs and Metrics

Right, so you’ve got your traces. Beautiful waterfall diagrams that show you exactly where your 500ms latency spike came from. But traces don’t live in a vacuum. They’re the “what,” but rarely the “why.” That “why” is almost always buried in a log line or screamed by a metric. The real magic happens when you stitch these three pillars of observability together. Without this correlation, you’re just a detective with three separate, incomplete case files.

The goal is simple: to move seamlessly from a slow span in a trace, to the specific log line where your function complained about a slow database query, and then over to the dashboard showing that database’s CPU was indeed pegged at 100% at that exact moment. No more grepping through a million log files or trying to mentally align timestamps. It should feel like one coherent story.

The Golden Key: The Trace Context

This entire correlation scheme hinges on one thing: propagating a shared context. In the OpenTelemetry world, this is the Trace Context, primarily composed of a trace_id and span_id. The trace_id is the unique identifier for the entire request’s journey. The span_id identifies each individual operation within that trace.
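To make the shape of these identifiers concrete, here is a minimal sketch that pulls them out of a W3C traceparent header, the format OpenTelemetry uses by default to carry the context between services. The function name parse_traceparent is made up for illustration; in practice your propagator does this for you. The third field of the header is the caller’s span_id.

```python
import re

# W3C traceparent layout: version-trace_id-span_id-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields (illustrative helper)."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("trace_id and span_id must be non-zero")
    return {
        "version": version,
        "trace_id": trace_id,  # 16 bytes, shared by every span in the request
        "span_id": span_id,    # 8 bytes, unique to one operation
        "sampled": bool(int(flags, 16) & 0x01),
    }


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["span_id"], ctx["sampled"])
```

Those two hex strings are exactly what needs to show up in your logs and next to your metrics.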

The trick isn’t just generating these IDs; it’s about injecting them into your logs and metrics. Every log statement emitted during a traced request should carry the current trace_id and span_id. Every metric you record about that request (e.g., duration, error count) should be tagged with those same identifiers.

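Here’s the catch: none of this happens unless something wires it up. Some ecosystems ship instrumentation packages that inject these fields for you (for Node.js, @opentelemetry/instrumentation-winston adds trace_id, span_id, and trace_flags to winston records), but coverage varies by language and logging library, and the logging side of OpenTelemetry matured later than tracing did. So it pays to know how to do it by hand. It’s a bit of a chore, but the payoff is enormous.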

Injecting Context into Your Logs

Let’s get practical. You need to grab the current span context and slap it into your log structure. Here’s how you do it in a Node.js application using the ubiquitous winston logger. The principle is the same in any language.

// logger.js
const { context, trace } = require('@opentelemetry/api');
const winston = require('winston');

// Custom format that copies the active span's IDs onto each log record
const injectTraceContext = winston.format((info) => {
  // getSpan() returns undefined when no span is active in this context
  const currentSpan = trace.getSpan(context.active());
  if (currentSpan) {
    const spanContext = currentSpan.spanContext();
    info.trace_id = spanContext.traceId;
    info.span_id = spanContext.spanId;
    info.trace_flags = spanContext.traceFlags;
  }
  return info;
});

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    injectTraceContext(), // Our custom format goes first
    winston.format.timestamp(), // Adds the ISO 8601 "timestamp" field
    winston.format.json() // Then output as JSON
  ),
  transports: [new winston.transports.Console()],
});

module.exports = logger;

Now, when you use logger.info('Database query was slow') inside a traced function, the resulting log output will be a structured JSON object like:

{
  "level": "info",
  "message": "Database query was slow",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": 1,
  "timestamp": "2023-10-27T12:00:00.000Z"
}

This is the crucial first step. Your logging backend (e.g., Loki, Elasticsearch, Splunk) now has the keys to link this log directly back to the trace in your tracing backend (e.g., Tempo, Jaeger).
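The same idea carries over to other languages. As a rough sketch, here is what it might look like with Python’s standard logging module. To keep the example runnable without any dependencies, the “current span” is faked with a ContextVar named current_span_ids (a stand-in invented for this sketch); with the OpenTelemetry API installed you would read the IDs from opentelemetry.trace.get_current_span().get_span_context() instead. The class names are likewise illustrative.

```python
import contextvars
import io
import json
import logging

# Stand-in for the tracer's "current span" (hypothetical). In a real service,
# read opentelemetry.trace.get_current_span().get_span_context() instead.
current_span_ids = contextvars.ContextVar("current_span_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace_id/span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = current_span_ids.get()
        if ids is not None:
            trace_id, span_id = ids
            # IDs are integers internally; logs carry them as fixed-width hex.
            record.trace_id = format(trace_id, "032x")
            record.span_id = format(span_id, "016x")
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render the record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname.lower(), "message": record.getMessage()}
        for key in ("trace_id", "span_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("correlation_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_span_ids.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
logger.info("Database query was slow")
print(stream.getvalue().strip())
```

Attaching the filter to the handler rather than to one logger means every record that passes through that handler gets enriched, whichever module emitted it.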

Tying Metrics to Traces

Metrics are a different beast. They’re aggregated, so you can’t attach a trace_id as a label to every individual data point: each request would spawn its own time series, and that would defeat the purpose of summarization. Instead, the trace context rides along on a handful of sample measurements.

When you record a measurement, like the duration of a critical function, while a span is active, the current trace_id and span_id can be stored alongside it as an exemplar. An exemplar is a sample measurement that carries the trace context, allowing you to drill down from an aggregated metric (e.g., a spike in the 99th percentile latency) directly to a specific, problematic trace that contributed to that data point.

# Sketch using the OpenTelemetry Python API
import time

from opentelemetry import metrics, trace

meter = metrics.get_meter(__name__)
tracer = trace.get_tracer(__name__)

# Create a histogram to track function duration
function_duration = meter.create_histogram(
    "function.duration",
    description="Duration of the expensive function",
    unit="ms",
)

def expensive_function():
    with tracer.start_as_current_span("expensive_operation"):
        start_time = time.perf_counter()
        # ... do the expensive thing ...
        duration = (time.perf_counter() - start_time) * 1000  # convert to ms

        # Record while the span is still current. Keep trace_id and span_id
        # out of the attributes: as labels they would create one time series
        # per request. An SDK with exemplar support reads the active span
        # from the context and stores its IDs as an exemplar instead.
        function_duration.record(duration, {"operation": "expensive_operation"})

Not every metrics system supports exemplars yet, but Prometheus can store them and Grafana can turn them into clickable links to the matching trace. This is the future.
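If exemplars still feel abstract, a toy model may help. The sketch below is not how any particular SDK or backend implements them; ExemplarHistogram is a made-up class that only illustrates the idea: bucket counts stay free of IDs, while each bucket remembers one sample measurement together with its trace context.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Exemplar:
    value: float
    trace_id: str
    span_id: str


@dataclass
class ExemplarHistogram:
    """Toy histogram: one count and at most one exemplar per bucket."""

    bounds: list  # inclusive upper bounds in ms, sorted ascending
    counts: list = field(init=False)
    exemplars: list = field(init=False)

    def __post_init__(self):
        # One extra bucket catches everything above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)
        self.exemplars = [None] * (len(self.bounds) + 1)

    def record(self, value, trace_id=None, span_id=None):
        bucket = bisect.bisect_left(self.bounds, value)
        self.counts[bucket] += 1  # the aggregate itself carries no IDs
        if trace_id is not None:
            # Keep the latest traced measurement as this bucket's exemplar.
            self.exemplars[bucket] = Exemplar(value, trace_id, span_id)


hist = ExemplarHistogram(bounds=[100, 250, 500])
hist.record(42.0)  # recorded outside any trace
hist.record(87.5, "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331")
hist.record(612.3, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

slowest = hist.exemplars[-1]  # exemplar for the "above 500 ms" bucket
print(hist.counts, slowest.trace_id)
```

The slowest bucket now points at one concrete trace_id, which is the jump from a latency spike on a dashboard to a trace you can actually open.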

The Rough Edges and Pitfalls

Let’s be honest, the logging story in OTel is still maturing. The biggest pitfall is inconsistency. You must ensure this context injection happens for every single log, across every service and language. If one service drops the ball, you have a broken link in your chain. Use structured logging (JSON) religiously—it’s non-negotiable for this.
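One way to catch a service that drops the ball is to check its output mechanically. The following is a small sketch of such a check, assuming JSON logs with trace_id and span_id fields as in the example earlier; the function name uncorrelated_lines is invented here. You would only want to run it over logs emitted while handling requests, since startup and background logs legitimately have no trace.

```python
import json
import re

TRACE_ID = re.compile(r"[0-9a-f]{32}")
SPAN_ID = re.compile(r"[0-9a-f]{16}")


def uncorrelated_lines(log_lines):
    """Yield (line_number, reason) for lines that cannot be linked to a trace."""
    for number, line in enumerate(log_lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            yield number, "not JSON"
            continue
        if not isinstance(record, dict):
            yield number, "not a JSON object"
        elif not TRACE_ID.fullmatch(str(record.get("trace_id", ""))):
            yield number, "missing or malformed trace_id"
        elif not SPAN_ID.fullmatch(str(record.get("span_id", ""))):
            yield number, "missing or malformed span_id"


sample = [
    '{"message": "ok", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}',
    '{"message": "context dropped"}',
    "plain text line",
]
problems = list(uncorrelated_lines(sample))
print(problems)
```

Run against a sample of each service’s logs in CI or a staging smoke test, a check like this turns a silent broken link into a visible failure.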

Another gotcha: sampling. If you’re sampling traces (you probably are; full-trace collection is expensive), you’re going to have logs whose trace_id points at a trace that was never stored. Your logging backend, and anyone following its links, needs to be cool with that. The correlation still works for the traces you did sample, and the trace_flags value you logged tells you whether the sampled bit was set when the log line was written.
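A log viewer or helper script can use that flag to avoid offering dead links. The sketch below assumes log records shaped like the JSON shown earlier; trace_link and the URL template are hypothetical and would need to match your own tracing backend.

```python
from typing import Optional

SAMPLED_FLAG = 0x01  # lowest bit of trace_flags in W3C Trace Context

# Hypothetical URL template; substitute your tracing backend's real one.
TRACE_URL = "https://tracing.example.internal/trace/{trace_id}"


def trace_link(log_record: dict) -> Optional[str]:
    """Return a trace URL for a log record, or None if no trace was kept."""
    trace_id = log_record.get("trace_id")
    flags = log_record.get("trace_flags", 0)
    if not trace_id or not (flags & SAMPLED_FLAG):
        return None  # untraced or unsampled: show the log without a link
    return TRACE_URL.format(trace_id=trace_id)


sampled = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "trace_flags": 1}
unsampled = {"trace_id": "0af7651916cd43dd8448eb211c80319c", "trace_flags": 0}
print(trace_link(sampled))
print(trace_link(unsampled))
```

Bear in mind that the flag reflects the decision made when the span started. If a collector applies tail-based sampling afterwards, a log marked as sampled can still point at a trace that was later discarded.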

The best practice? Automate it. Build the context injection directly into your logger configuration and your metric recording utilities. Make it impossible for a developer to forget to do it. This correlation is what transforms a handful of separate signals into a unified superpower for debugging. It’s the difference between knowing that your system is broken and knowing precisely why.