41.2 Health Checks: Liveness and Readiness Endpoints
Right, let’s talk about your service’s pulse. It’s not enough that your code compiles and your tests pass. In the chaotic, distributed world of microservices, your service needs to constantly tell the world, “I’m here, I’m okay, and I’m ready to do work.” If it can’t do that, the platform running it (Kubernetes, Nomad, etc.) will assume the worst and kill it with extreme prejudice. Health checks are our way of preventing this digital murder. We do this with two fundamental endpoints: /healthz/liveness and /healthz/readiness. They sound similar, but confusing them is a classic rookie mistake that leads to very exciting, very bad outages.
The Crucial Difference: Alive vs. Ready
Think of it like this: Liveness is a heartbeat. Readiness is a raised hand.
A liveness probe answers one simple question: “Is this process still running, or is it a zombie?” It’s a basic check that your application hasn’t locked up completely. If this fails, the platform kills the container and restarts it (in Kubernetes, the kubelet does this according to the pod’s restart policy). It’s the ultimate “turn it off and on again.”
A readiness probe answers a more nuanced question: “Are you prepared to accept traffic right now?” This is where the real finesse comes in. A service might be alive (its process is running) but not ready for a variety of reasons: it’s still starting up, loading a large cache, lost its connection to the database, or is overwhelmed. If this fails, the platform doesn’t kill your service; it just stops sending it new traffic. This prevents requests from being sent to a service that can’t handle them, which is a fantastic way to avoid cascading failures.
Mixing these up is disastrous. If you put your “database is down” check in the liveness probe, Kubernetes will merrily restart all your pods every time the database hiccups, turning a partial outage into a complete cluster-wide meltdown. Don’t do that.
Implementing Bare-Metal Health Checks in Go
You don’t need a fancy framework for this. The net/http package is more than capable. Let’s build these endpoints from the ground up. We’ll start simple and get more sophisticated.
First, the basic structure. I like to put these on a dedicated port (like 8081) separate from my main application port. This allows you to expose the health endpoints to the cluster’s orchestration system without exposing them to the public internet. But for simplicity, we’ll start on the same server.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"
)

// A global variable for our DB connection (for demonstration purposes only!)
var db *sql.DB

func main() {
	// ... your app setup code, including initializing `db` ...

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz/liveness", livenessHandler)
	mux.HandleFunc("/healthz/readiness", readinessHandler)

	// Main app handlers here
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, world!")
	})

	fmt.Println("Server starting on :8080")
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8080", mux))
}

// Liveness: Are we running?
func livenessHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}

// Readiness: Are we ready to serve traffic?
func readinessHandler(w http.ResponseWriter, r *http.Request) {
	// 1. Check database connection with a context timeout
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		http.Error(w, "Database not ready", http.StatusServiceUnavailable)
		return
	}

	// 2. Check other crucial dependencies here (e.g., Redis, gRPC connections)
	// if !cacheClient.IsConnected() {
	// 	http.Error(w, "Cache not ready", http.StatusServiceUnavailable)
	// 	return
	// }

	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}
This is the absolute baseline. The liveness check is brain-dead simple. The readiness check does the minimum responsible thing: it verifies it can talk to its most critical dependency, the database, with a sensible timeout to avoid hanging.
Leveling Up: Structs, Metrics, and Timeouts
The global variable db is a code smell. Let’s fix that and add some useful patterns, like measuring how long these checks take and using a struct to hold our dependencies.
type HealthChecker struct {
	db *sql.DB
	// Add other dependencies like a cache client, HTTP client, etc.
}

func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))

	// Log or emit a metric for latency, e.g.:
	// healthCheckDuration.WithLabelValues("liveness").Observe(time.Since(start).Seconds())
	_ = start // keeps the compiler quiet until instrumentation is wired in
}

func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	status := http.StatusOK
	message := "OK"

	ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second) // Use the request's context
	defer cancel()
	if err := h.db.PingContext(ctx); err != nil {
		status = http.StatusServiceUnavailable
		message = "Database ping failed"
	}

	// Add more checks here. To let the first failure win, skip later checks
	// once status is no longer http.StatusOK.

	w.WriteHeader(status)
	w.Write([]byte(message))

	// healthCheckDuration.WithLabelValues("readiness").Observe(time.Since(start).Seconds())
	// healthCheckStatus.WithLabelValues("readiness", fmt.Sprintf("%d", status)).Inc()
	_ = start // keeps the compiler quiet until instrumentation is wired in
}

func main() {
	var db *sql.DB // ... initialize your database ...
	healthChecker := &HealthChecker{db: db}

	mux := http.NewServeMux()
	mux.HandleFunc("/healthz/liveness", healthChecker.LivenessHandler)
	mux.HandleFunc("/healthz/readiness", healthChecker.ReadinessHandler)
	// ... rest of your setup ...
}
Now we’re cooking. We’ve encapsulated our dependencies, we’re using the request’s context (which is good practice), and we’ve commented where you’d add instrumentation. Emitting metrics on check latency and status is incredibly valuable for debugging. You’ll quickly see if your database ping time is creeping up, warning you of impending doom before it becomes a full-blown failure.
Common Pitfalls and Battle-Hardened Advice
- Don’t Check Too Much: Your readiness probe should only check crucial dependencies. If an optional, best-effort dependency (like a metrics aggregator) is down, your service should probably still be marked ready. Failing readiness because of a non-critical failure is a great way to take your entire service offline unnecessarily.
- Timeouts Are Your Friend: Always, always use a context with a timeout for any downstream call in a health check. A default Ping() with no timeout is a recipe for a stuck health endpoint, which can lead to failed probes and an unnecessary restart. Three to five seconds is a good starting point. Whatever you pick, keep it shorter than the probe timeout configured on the platform side (Kubernetes defaults timeoutSeconds to 1 second), or the platform gives up before your handler does.
- Keep it Light: The health check endpoint should not be computationally expensive. Don’t run a complex query; a simple SELECT 1 or Ping() is perfect. Its job is to check connectivity and basic function, not to run a full integration test suite on every probe.
- Be Careful with Auth: Never put authentication on these endpoints. The cluster’s kubelet needs to access them, and managing auth there is a nightmare you don’t want. Secure them via network policies instead: only allow traffic to the health check port from the orchestration system’s nodes.
- Shutdown Gracefully: When your application receives a SIGTERM (e.g., during a rolling update), you should immediately start failing your readiness probe. This tells the platform to stop sending you new traffic. Then, your liveness probe should still succeed long enough for you to finish handling in-flight requests before you shut down. This is the golden path for graceful shutdowns.