28.1 Why Distributed Tracing: Finding Latency Across Services

Right, let’s talk about latency. You’ve probably stared at a dashboard full of green checkmarks for every service—your API gateway, your user service, your recommendation engine, your database—and yet, your end-user is complaining that the app is “slow.” You know the feeling. It’s the infrastructure equivalent of a mystery novel where everyone has an alibi. The individual service metrics are useless because the crime—the latency—was committed between them, in the network calls. This is why we need a detective, and that detective is distributed tracing.

Distributed tracing is the art and science of following a single user request as it prances through your entire ecosystem of services. It gives you a complete story, not just a collection of isolated chapter summaries. You see not just that service A took 50ms and service B took 45ms, but that service A spent 45 of its 50ms just waiting for service B to respond. That’s the insight that changes everything.
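To make that "45 of its 50ms" arithmetic concrete, here is a minimal sketch of how a tracing UI might derive a span's self time from timing data. The span records, field names, and the selfTime helper are made up for illustration and are not part of any library.

```javascript
// Hypothetical span records; times are milliseconds since the request started.
const spans = [
  { id: 'a', parentId: null, name: 'service-a: handle request', start: 0, end: 50 },
  { id: 'b', parentId: 'a', name: 'service-a: call service-b', start: 3, end: 48 },
];

// Self time = a span's duration minus the time covered by its direct children.
// Assumes children do not overlap each other, which keeps the sketch simple.
function selfTime(span, allSpans) {
  const childTime = allSpans
    .filter((s) => s.parentId === span.id)
    .reduce((total, child) => total + (child.end - child.start), 0);
  return span.end - span.start - childTime;
}

const root = spans[0];
console.log(`total: ${root.end - root.start}ms, self: ${selfTime(root, spans)}ms`);
// prints "total: 50ms, self: 5ms"
```

Only 5ms of service A's 50ms is its own work. The rest is waiting, which is exactly what a per-service latency metric hides.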

The Core Concepts: Spans, Traces, and Context

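Think of a Trace as the entire story of one request. It’s the full narrative, from the moment the user clicks a button to the moment the final byte of the response is sent. Formally, a trace is a directed acyclic graph of Spans connected by parent/child relationships. Since each span has at most one parent, in practice it looks like a tree with a single root span at the top.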

A Span represents a single, logical unit of work within that story. It’s a chapter in our novel. “Called the database,” “processed the user data,” “rendered the template.” Each span has a name, a start time, an end time, and a set of key-value attributes (like http.status_code=200 or db.query="SELECT * FROM users").
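If it helps to see the shape, here is a rough sketch of how a flat list of spans becomes that tree. The record layout (spanId, parentSpanId, startMs, endMs) and both helper functions are assumptions chosen for illustration; real backends store more fields and handle missing parents more carefully.

```javascript
// Hypothetical flat span records, roughly as a backend might receive them.
const flatSpans = [
  { spanId: 's1', parentSpanId: null, name: 'GET /checkout', startMs: 0, endMs: 120 },
  { spanId: 's2', parentSpanId: 's1', name: 'call user-service', startMs: 5, endMs: 40 },
  { spanId: 's3', parentSpanId: 's1', name: 'call database', startMs: 45, endMs: 110 },
  { spanId: 's4', parentSpanId: 's3', name: 'SELECT users', startMs: 50, endMs: 105 },
];

// Link every span to its parent and return the root (the span with no parent).
function buildTraceTree(spans) {
  const nodes = new Map(spans.map((s) => [s.spanId, { ...s, children: [] }]));
  let root = null;
  for (const node of nodes.values()) {
    const parent = nodes.get(node.parentSpanId);
    if (parent) parent.children.push(node);
    else if (node.parentSpanId === null) root = node;
  }
  return root;
}

// Render the tree as indented lines: one line per span, with its duration.
function renderTree(node, depth = 0) {
  const line = `${'  '.repeat(depth)}${node.name} (${node.endMs - node.startMs}ms)`;
  return [line, ...node.children.flatMap((child) => renderTree(child, depth + 1))];
}

console.log(renderTree(buildTraceTree(flatSpans)).join('\n'));
```

The indentation in the output is the parent/child relationship, and it is essentially what the waterfall view in a tracing UI draws.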

The magic that stitches these spans together into a coherent trace is called Context Propagation. This is the crucial part everyone gets wrong at first. When Service A calls Service B, it must pack up the current trace context (specifically, a trace_id and a span_id) and send it along. Usually, this is done via HTTP headers. Service B unpacks it, says “Ah, I’m part of this existing story!”, and creates a new span as a child of Service A’s span. Without this explicit propagation, you just get a bunch of orphaned spans staring sadly at each other, unaware of their shared purpose.
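What actually travels in those headers? With the W3C Trace Context format, which OpenTelemetry propagates by default, it is a single traceparent header of the form version-traceid-parentid-flags. The sketch below builds and parses one by hand. The function names are invented for illustration, it only handles version 00, and in real code you would let the propagation API do this for you.

```javascript
// Build a W3C traceparent header value from the caller's current span.
function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

// Parse a traceparent header on the receiving side.
// Returns null for anything malformed, so the caller can start a fresh trace.
function parseTraceparent(header) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header ?? '');
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, parentSpanId: spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Service A side: describe the current span in a header.
const header = formatTraceparent({
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true,
});

// Service B side: recover the trace ID and the parent span ID.
console.log(header);
console.log(parseTraceparent(header));
```

Service B keeps the trace ID, generates a new span ID for itself, and records the parsed span ID as its parent. That one line of text is the whole "shared story".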

Instrumenting Your Code: A Practical Example

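Let’s get our hands dirty. Here’s a painfully simple example using the OpenTelemetry JavaScript API. Notice how we manually propagate the context. This is what happens under the hood of fancy auto-instrumentation libraries. One caveat: the API package on its own is a no-op, so these calls only produce real spans and headers once an OpenTelemetry SDK with a propagator has been registered at startup.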

// service-a.js
import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api';
import axios from 'axios';

const tracer = trace.getTracer('service-a-tracer');

async function makeRequestToServiceB() {
  // Start a new span. This becomes the active span in our context.
  return tracer.startActiveSpan('call-service-b', async (span) => {
    try {
      // Add some useful attributes to the span
      span.setAttribute('http.method', 'GET');
      span.setAttribute('http.url', 'http://service-b:3001/process');

      // Prepare a carrier object; inject() writes the trace context headers into it
      const headers = {};
      propagation.inject(context.active(), headers);

      // Make the outgoing HTTP call, sending the headers
      const response = await axios.get('http://service-b:3001/process', { headers });

      span.setAttribute('http.status_code', response.status);
      return response.data;
    } catch (error) {
      // Crucial: record the error on the span!
      span.recordException(error);
      // Mark the span as failed so the backend can flag and filter it
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // Always end the span, no matter what.
      span.end();
    }
  });
}

// service-b.js
import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api';
import express from 'express';

const app = express();
const tracer = trace.getTracer('service-b-tracer');

app.get('/process', async (req, res) => {
  // Extract the trace context from the incoming request headers
  const extractedContext = propagation.extract(context.active(), req.headers);
  // Start a new span that is a child of the extracted context
  return tracer.startActiveSpan('process-data', { attributes: { 'http.method': req.method } }, extractedContext, async (span) => {
    try {
      // Simulate some work
      await someProcessingWork();

      span.setStatus({ code: SpanStatusCode.OK });
      res.send('Done!');
    } catch (error) {
      // Record the exception and mark the span as failed
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      res.status(500).send('Error');
    } finally {
      span.end();
    }
  });
});

async function someProcessingWork() {
  // You can create nested spans for internal logic too!
  return tracer.startActiveSpan('some-processing-work', async (span) => {
    try {
      await new Promise(resolve => setTimeout(resolve, 100)); // fake work
    } finally {
      // End the span even if the work throws
      span.end();
    }
  });
}

// Listen on the port that service A calls
app.listen(3001);

Common Pitfalls and How to Avoid Them

The biggest pitfall is not propagating context correctly. If your HTTP client or server framework doesn’t automatically inject/extract headers (and many don’t without specific OpenTelemetry instrumentation libraries), your traces will break. You’ll get a new trace per service instead of one connected trace. Always test this first.
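One cheap way to "test this first" is to send a single request through the system, collect the spans every service emitted for it, and count the distinct trace IDs. The sketch below assumes you can get those spans as plain objects with a traceId field (from an in-memory exporter in a test, say). The data and the helper name are hypothetical.

```javascript
// Count how many distinct traces a set of spans belongs to.
// For a single request with working propagation, the answer should be 1.
function distinctTraceIds(spans) {
  return new Set(spans.map((span) => span.traceId)).size;
}

// Hypothetical spans captured for one request when propagation works...
const healthy = [
  { service: 'service-a', name: 'call-service-b', traceId: 'aaaa' },
  { service: 'service-b', name: 'process-data', traceId: 'aaaa' },
];

// ...and when service B never saw the headers and started its own trace.
const broken = [
  { service: 'service-a', name: 'call-service-b', traceId: 'aaaa' },
  { service: 'service-b', name: 'process-data', traceId: 'bbbb' },
];

console.log(distinctTraceIds(healthy)); // 1
console.log(distinctTraceIds(broken)); // 2
```

Anything other than one trace ID for one request means a hop somewhere is dropping or ignoring the headers.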

Another classic is ignoring errors and status codes. A span’s status defaults to Unset, and span.recordException(error) only attaches the exception as an event without changing that status, so a failure you never flag looks just like a success in your backend. Always call both span.recordException(error) and span.setStatus() with an error code in your catch blocks. It turns a useless “this span took 500ms” into a critical “this span took 500ms and failed with a DatabaseConnectionError.”
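Since this is easy to forget, one option is to centralize the pattern in a small wrapper. The sketch below is an assumption-laden illustration rather than a library API: runInSpan and createFakeSpan are made-up names, the span is any object with recordException, setStatus, and end methods, and the StatusCode constant copies the numeric values of SpanStatusCode from @opentelemetry/api only so the example runs without the package installed.

```javascript
// Stand-in for SpanStatusCode from @opentelemetry/api (same numeric values).
const StatusCode = { UNSET: 0, OK: 1, ERROR: 2 };

// Run some async work inside an already-started span, making sure the span
// always gets a status, any exception is recorded, and the span is ended.
async function runInSpan(span, work) {
  try {
    const result = await work();
    span.setStatus({ code: StatusCode.OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: StatusCode.ERROR, message: error.message });
    throw error; // let the caller decide how to respond
  } finally {
    span.end();
  }
}

// Hypothetical fake span that just logs which methods were called, in order.
function createFakeSpan() {
  const calls = [];
  return {
    calls,
    recordException: (error) => calls.push(`exception:${error.message}`),
    setStatus: (status) => calls.push(`status:${status.code}`),
    end: () => calls.push('end'),
  };
}

const demoSpan = createFakeSpan();
runInSpan(demoSpan, async () => { throw new Error('DatabaseConnectionError'); })
  .catch(() => console.log(demoSpan.calls));
```

With a real tracer you would pass the span handed to you by startActiveSpan. Whether to set an explicit OK status on success or leave it Unset is a judgment call; the wrapper sets OK to match the service B handler above.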

Over-instrumenting is a less common but still annoying problem. Don’t create a span for every single line of code. You’re not writing a line-by-line commentary. Create spans for meaningful operations: external calls, complex computations, and important business logic. Your tracing backend will thank you, and your bills will be lower.

Finally, there’s the mistake of not using sampling in production. If you try to record every single span for every single request, you will drown your tracing system and generate a truly hilarious bill. Use head-based sampling (e.g., “sample 5% of all requests”) or, even better, tail-based sampling (e.g., “only save full traces for requests that are slow or erroneous”) at your collector level.
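The difference between the two strategies is mostly about when the decision is made. Here is a simplified sketch of both, with made-up function names and span fields. The head-based version decides up front from the trace ID alone, so every service reaches the same verdict without coordinating. The tail-based version waits until the whole trace is available and then looks at what happened. Real samplers and collectors are more involved than this.

```javascript
// Head-based: decide from the trace ID before any work happens.
// Uses the first 8 hex characters as a number in [0, 2^32) and keeps the
// trace if it falls below the chosen ratio of that range.
function headSample(traceId, ratio) {
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 0x100000000;
}

// Tail-based: decide after the trace is complete.
// Keeps the trace if any span failed or the whole request was slow.
function tailSample(spans, slowThresholdMs) {
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  const failed = spans.some((s) => s.error === true);
  return failed || end - start >= slowThresholdMs;
}

// Hypothetical completed traces.
const fastAndFine = [{ startMs: 0, endMs: 40, error: false }];
const slow = [{ startMs: 0, endMs: 900, error: false }];
const failedQuickly = [{ startMs: 0, endMs: 20, error: true }];

console.log(headSample('00000000aaaaaaaaaaaaaaaaaaaaaaaa', 0.05)); // true
console.log(headSample('ffffffffaaaaaaaaaaaaaaaaaaaaaaaa', 0.05)); // false
console.log(tailSample(fastAndFine, 500)); // false
console.log(tailSample(slow, 500)); // true
console.log(tailSample(failedQuickly, 500)); // true
```

Head-based sampling is cheap but blind: it will happily throw away the one trace you needed. Tail-based sampling keeps the interesting ones, at the cost of buffering every trace until it finishes.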