33.6 Istio Observability: Kiali, Jaeger, and Prometheus Integration
Right, let’s pull back the curtain on Istio’s observability stack. This is where the rubber meets the road. You didn’t deploy a service mesh just to say you did; you did it to see what the hell is going on inside your system. Istio gives you a frankly staggering amount of data out of the box, which is both its greatest strength and its most overwhelming curse. We’re going to focus on the big three: Kiali for the high-level “what’s connected to what” view, Jaeger for the “why is this request taking 5000ms” deep dive, and Prometheus, the silent workhorse that powers it all. Buckle up.
The Observability Power Trio: How They Fit Together
First, let’s demystify how these three play together, because it’s a common point of confusion. They aren’t three separate, disconnected tools; they’re an integrated system.
- Prometheus is the foundation. It’s the time-series database. The Istio sidecars (Envoy proxies) continuously expose a mountain of metrics on port 15090. Prometheus is configured to scrape these endpoints at a regular interval, hoovering up data about every byte sent, every request received, and every millisecond of latency. It’s the raw, unfiltered truth of your network traffic.
- Jaeger is the distributed tracer. When a request zips through five different services, Jaeger collects the timing and metadata for each hop (called a “span”) and stitches them together into a single “trace.” This lets you follow a single user’s request across the entire microservice graph. It builds those traces from the span data that the sidecars generate and send to the Jaeger collector.
- Kiali is the visual dashboard. It’s the pretty face on the machinery. It queries Prometheus (and, for tracing, Jaeger) to build interactive graphs of your service topology, show you traffic flow, and highlight which poor service is currently vomiting HTTP 500 errors.
Think of it like this: Prometheus counts the cars on the highway. Jaeger follows one specific car from on-ramp to off-ramp, timing every lane change. Kiali shows you a Google Maps-style overlay of the entire highway system, complete with traffic jams and accident reports.
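To make the division of labor concrete, here is a minimal sketch of the kind of PromQL a dashboard could send to Prometheus to work out how much of a service’s traffic is failing. It is not Kiali’s actual query. The helper name error_ratio_query and the 5m window are assumptions for illustration, while istio_requests_total, destination_service_name, and response_code are standard Istio metric and label names.

```python
def error_ratio_query(service: str, window: str = "5m") -> str:
    """Build a PromQL expression for the share of 5xx responses sent by a service."""
    selector = f'destination_service_name="{service}"'
    # Rate of requests that ended in any 5xx response code
    errors = f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    # Rate of all requests to the same service
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


# "reviews" is a hypothetical service name
print(error_ratio_query("reviews"))
```

Paste the printed expression into the Prometheus UI and you get the same kind of number a topology graph uses to decide which edge to paint red.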
Getting Your Hands Dirty with Kiali
Kiali is the first place you’ll go when you get a Slack alert that something’s on fire. Its service graph is pure genius for understanding dependencies. But here’s the first “questionable choice”: its default install is often a bit anemic. To get the real, glorious, animated graph with live traffic, you need to ensure you’ve enabled the right metrics. Istio’s default telemetry gives you rich request-level metrics for HTTP/gRPC, but only connection-level TCP metrics (bytes sent and received, connections opened and closed) for everything else. Want to see database calls or Redis interactions broken down in more detail? You need to mess with Telemetry resources.
Let’s say you want more detailed metrics for a specific workload. You might apply a custom telemetry configuration like this:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: api-auth-metrics
  namespace: my-app
spec:
  # Apply only to workloads labeled app: api-auth
  selector:
    matchLabels:
      app: api-auth
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
            # mode belongs to the match block: report from both client and server sidecars
            mode: CLIENT_AND_SERVER
          tagOverrides:
            destination_port:
              value: "string(destination.port)"
            db_name:
              # Illustrative pseudo-expression only; a real value must be a valid CEL expression
              value: "request.url_path | urlpath | substring | '?' | cut(d= '/', f=2) | cut(d='?', f=1)"
This YAML monstrosity is telling Istio to customize the REQUEST_COUNT metric (exposed to Prometheus as istio_requests_total) in both client and server reporting modes for workloads labeled app: api-auth, and to add custom tags for the destination port and even try to parse a database name out of the URL path. This is powerful, but it’s also where you can absolutely murder your Prometheus server’s memory footprint if you get too tag-happy, because every distinct tag value becomes its own time series. Best practice: be surgical. Only add the tags you actually need for debugging.
Tracing a Request with Jaeger
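Jaeger is your microservices detective. The magic here is context propagation. Istio’s sidecars automatically generate spans and, when a request arrives without them, mint the tracing headers (x-request-id plus the B3 set, such as x-b3-traceid and x-b3-spanid). The pitfall? A sidecar has no way to tell which outbound call belongs to which inbound request, so it can’t carry those headers through your application for you. If your code doesn’t copy them onto its outgoing calls, each hop shows up as its own disconnected trace, and the same thing happens the moment you call a service that isn’t part of the mesh. The trace just… stops. It’s frustratingly common.
To fix this, you need to propagate those headers in your application code. Here’s a simplistic example in Python using the jaeger_client library, which implements the OpenTracing API and is configured here to read and write the same B3 headers the sidecars use: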
from flask import Flask, request
import requests
from jaeger_client import Config
from opentracing.propagation import Format

app = Flask(__name__)

# Initialize Jaeger tracer (setup code would typically be in a separate config)
def init_tracer(service):
    config = Config(
        config={
            'sampler': {'type': 'const', 'param': 1},
            'propagation': 'b3',  # use the B3 headers that the Istio sidecars emit
        },
        service_name=service,
    )
    return config.initialize_tracer()

tracer = init_tracer('my-awesome-service')

@app.route('/process')
def process_data():
    # Extract the trace context from the incoming headers (None if there is none)
    parent_ctx = tracer.extract(Format.HTTP_HEADERS, dict(request.headers))
    # Start a span for this request as a child of the incoming trace
    with tracer.start_span('process_data', child_of=parent_ctx) as span:
        # Prepare headers for the outgoing call to the non-meshed legacy service
        outgoing_headers = {}
        # Inject the span context into the headers for the next call
        tracer.inject(span.context, Format.HTTP_HEADERS, outgoing_headers)
        # Make the call to the external service, passing the tracing headers
        legacy_response = requests.get('http://legacy-service.internal/data', headers=outgoing_headers)
        return f"Processed: {legacy_response.text}", 200
This code picks up the incoming trace, creates a new span for its work, and then manually injects the tracing context into the headers of the request to the non-meshed legacy-service. This allows Jaeger to continue the trace across the mesh boundary. The best practice is to bake this kind of logic into your common HTTP client libraries so developers don’t have to think about it.
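If you don’t need application-level spans, a lighter way to bake propagation into a shared HTTP client is simply to copy the tracing headers from the incoming request onto every outgoing one. The sketch below is a different, simpler technique than the tracer-based code above: it forwards headers and creates no spans of its own. The helper name forward_trace_headers is hypothetical, and the header list is an assumption covering x-request-id, the B3 headers, and the W3C traceparent/tracestate pair.

```python
# Headers that carry trace context between services
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def forward_trace_headers(incoming):
    """Return only the tracing headers found in `incoming`, keyed in lowercase."""
    # HTTP header names are case-insensitive, so normalize before matching
    lowered = {key.lower(): value for key, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Hypothetical incoming headers; unrelated ones such as Cookie are dropped
incoming = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312", "Cookie": "session=abc"}
print(forward_trace_headers(incoming))
```

In a Flask handler you would pass request.headers as the argument and hand the result to requests.get(..., headers=...). Copying only an allow-list of headers also avoids leaking cookies or auth tokens to downstream services.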
Prometheus: The Unseen Giant
You don’t “use” Prometheus directly here as much as you thank it for its service. The key insight is that the Prometheus addon bundled with Istio comes pre-configured with all the right scrape jobs and labels, although it is a quick-start setup rather than a tuned production deployment. Your job is to not break it. The most common pitfall is not understanding its data retention and cardinality.
Every unique combination of metric name and label values creates a new time series. That custom Telemetry resource we made earlier? Each unique db_name and destination_port creates a new series. Worse, the effect is multiplicative: if you have 1000 different database names, you’ve just multiplied the series count for that one metric by up to 1000, because each name can pair with every existing combination of the other labels. Do that for a few metrics and you can easily overwhelm Prometheus. The best practice is to use labels you plan to actually query on. If you’re not going to graph by db_name, don’t add it as a label.
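The arithmetic behind that warning is worth seeing once. In the worst case, the number of series for a metric is the product of the number of distinct values each label can take. The sketch below uses made-up label counts (20 workloads, 5 response codes, 1000 database names) purely for illustration, and the function name estimated_series is hypothetical. Real numbers are usually lower, since not every combination actually occurs.

```python
from math import prod


def estimated_series(label_cardinalities: dict) -> int:
    """Worst-case series count for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for a single metric
baseline = {"source_workload": 20, "destination_workload": 20, "response_code": 5}
with_db_name = {**baseline, "db_name": 1000}

print(estimated_series(baseline))      # 2000
print(estimated_series(with_db_name))  # 2000000
```

One extra label took the upper bound from two thousand series to two million, which is the difference between a happy Prometheus and an out-of-memory kill.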
To see what Istio is actually feeding Prometheus, you can always curl the metrics endpoint of any sidecar:
kubectl exec deploy/my-app -c istio-proxy -- curl -s localhost:15090/stats/prometheus | grep 'istio_requests_total'
You’ll get a wall of text that looks like istio_requests_total{connection_security_policy="mutual_tls",destination_app="reviews",...} 12345. Every one of those key-value pairs in the curly braces is a label that Prometheus will store. This is the raw data that makes the whole observability empire run. Respect it, and it will give you unparalleled insight. Abuse it, and it will grind to a halt and take your dashboards with it.
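To see exactly which labels a given sample carries, you can pick one of those lines apart. The following is a minimal sketch, not a full parser for the Prometheus exposition format: it assumes a single sample line with no trailing timestamp, and the function name parse_sample is hypothetical.

```python
import re

# Matches key="value" pairs, allowing backslash-escaped characters inside the value
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Split one sample line into (metric name, labels dict, value)."""
    # The value is whatever follows the last space on the line
    head, _, value = line.rpartition(" ")
    # Everything before the first "{" is the metric name
    name, _, rest = head.partition("{")
    labels = dict(LABEL_RE.findall(rest))
    return name, labels, float(value)


sample = 'istio_requests_total{connection_security_policy="mutual_tls",destination_app="reviews"} 12345'
name, labels, value = parse_sample(sample)
print(name, sorted(labels), value)
```

Counting the keys in labels across a few real samples is a quick sanity check on how many dimensions each metric is carrying before you add more.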