41.5 Circuit Breakers and Retry Logic
Right, let’s talk about keeping your distributed system from setting itself on fire. You’re making calls between services over a network—a notoriously flaky piece of infrastructure invented by humans who clearly never had to debug a cascading failure at 3 AM. The two biggest ways this flakiness will bite you are: a slow or failing service taking down its callers (to whom it is now a dependency), and your own retry logic turning a minor blip into a full-blown DDoS attack against the struggling service.
This is where the Circuit Breaker pattern comes in. It’s not a silver bullet; it’s a surge protector for your service-to-service communication. The core idea is stolen from your home’s electrical panel: if a circuit draws too much current and risks starting a fire, the breaker trips. That stops the flow of electricity, gives the wiring a chance to cool down, and after a bit, you can try resetting it. We’re doing the same thing, but with HTTP requests and gRPC calls instead of electrons.
The Three States of the Breaker
A proper circuit breaker has three distinct states, and it’s crucial you understand all of them:
Closed: This is the happy path. Requests flow freely to the downstream service. The breaker is monitoring for failures (e.g., timeouts, 5xx errors). If failures exceed a configured threshold within a time window, the breaker trips and moves to the Open state.
Open: The breaker is, well, open. No requests are allowed through to the downstream service. This is the core protective mechanism. Instead of making the call, the breaker fails fast, returning a predefined error, a default value, or perhaps a stale cache entry. This gives the failing service room to breathe and recover. After a configured cooldown period (the sleep window), the breaker moves to the Half-Open state.
Half-Open: This is the “probe” state. The breaker allows a limited number of test requests (often just one) to pass through. If that request succeeds, fantastic! The service is healthy again, and the breaker resets to Closed. If it fails, the breaker assumes the service is still on fire and immediately trips back to Open for another cooldown period.
Implementing One Without Losing Your Mind
You could implement this state machine yourself with a sync.Mutex, some counters, and timers. I did it once. I don’t recommend it. The edge cases are fiddly. Instead, let’s use the excellent sony/gobreaker package, which implements a robust version of the pattern.
First, get it: go get github.com/sony/gobreaker
Now, let’s wrap an HTTP client.
    package main

    import (
        "errors"
        "fmt"
        "io"
        "net/http"
        "time"

        "github.com/sony/gobreaker"
    )

    // cb is the circuit breaker guarding calls to the user service.
    var cb *gobreaker.CircuitBreaker

    // httpClient has an explicit timeout. http.Get uses http.DefaultClient, which
    // has none, so a hung upstream would block forever and never count as a failure.
    // The 2-second value is a starting point, not a recommendation.
    var httpClient = &http.Client{Timeout: 2 * time.Second}

    func init() {
        // Settings are crucial. These are reasonable values to start with.
        st := gobreaker.Settings{
            Name:    "HTTP MyUserService", // Great for metrics and debugging
            Timeout: 5 * time.Second,      // How long the breaker stays Open before moving to Half-Open
            ReadyToTrip: func(counts gobreaker.Counts) bool {
                // Trip the breaker after 5 consecutive failures
                return counts.ConsecutiveFailures >= 5
            },
            OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
                // Invaluable for logging. You *must* log state changes.
                fmt.Printf("CircuitBreaker '%s' changed from %s to %s\n", name, from, to)
            },
        }
        cb = gobreaker.NewCircuitBreaker(st)
    }

    // GetUserViaCircuitBreaker makes a protected call to a user service.
    func GetUserViaCircuitBreaker(userID string) ([]byte, error) {
        // The Execute method runs the given function under the protection of the breaker.
        body, err := cb.Execute(func() (interface{}, error) {
            // This is the potentially fragile operation we're protecting.
            url := fmt.Sprintf("http://localhost:8081/users/%s", userID)
            resp, err := httpClient.Get(url)
            if err != nil {
                return nil, err // Counted as a failure (includes client timeouts)
            }
            defer resp.Body.Close()

            if resp.StatusCode >= 500 {
                // CRITICAL: You must treat 5xx as failures!
                return nil, fmt.Errorf("upstream server error: %s", resp.Status)
            }

            // 4xx responses fall through on purpose: they are not breaker failures.
            body, err := io.ReadAll(resp.Body)
            if err != nil {
                return nil, err
            }
            return body, nil
        })
        if err != nil {
            // This error could be from the HTTP call (a failure) OR from the breaker
            // itself (ErrOpenState, or ErrTooManyRequests while Half-Open).
            return nil, err
        }
        return body.([]byte), nil
    }
The key here is that the Execute method handles all the state machine logic for you. Your function is only called if the breaker is Closed or if it’s Half-Open and your call is the lucky probe.
The Retry Rollercoaster
Retries are their own special kind of danger. They are necessary because networks are unreliable, but a naive implementation is a recipe for disaster. The biggest sin is retrying immediately with no delay. If a service is slow and you hammer it with 10 retries in the next 100ms, you’re just making its load problem 10x worse.
You need three things for sane retries:
- A Backoff Strategy: Wait exponentially longer between attempts. First wait 100ms, then 200ms, then 400ms, etc. This gives the service time to recover.
- A Retry Limit: Never retry forever. Know when to give up.
- Idempotency: Only retry operations that are safe to retry. A GET request is idempotent. A POST request that charges a credit card is very much not. The service should ideally be designed with idempotency keys for non-idempotent operations, but that’s a topic for another day.
Let’s combine a breaker with a retry using sony/gobreaker and a simple backoff loop.
    func GetUserWithRetryAndCircuitBreaker(userID string) ([]byte, error) {
        var err error
        var body []byte

        // Configure your retry logic. Be ruthless.
        maxAttempts := 3 // total attempts, including the first call
        retryDelay := 100 * time.Millisecond

        for attempt := 1; attempt <= maxAttempts; attempt++ {
            body, err = GetUserViaCircuitBreaker(userID)
            if err == nil {
                return body, nil // Success!
            }

            // Did the breaker itself reject the call? Don't retry this, it's pointless.
            // ErrTooManyRequests is what gobreaker returns when it is Half-Open and
            // already has its probe in flight.
            if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
                return nil, fmt.Errorf("circuit breaker rejected the call, not retrying: %w", err)
            }

            // Don't sleep after the final attempt; there is nothing left to wait for.
            if attempt == maxAttempts {
                break
            }

            // Log the failed attempt, maybe with the retry count
            fmt.Printf("Attempt %d failed: %v. Retrying in %v...\n", attempt, err, retryDelay)
            time.Sleep(retryDelay)
            retryDelay *= 2 // Exponential backoff
        }
        return nil, fmt.Errorf("failed after %d attempts: %w", maxAttempts, err)
    }
Where This All Goes Sideways
- Tuning is Everything: The default values in any library are guesses. 5 consecutive failures? Might be perfect, might be terrible. You need to tune these parameters (failure threshold, cooldown period) to your specific service’s latency requirements and failure characteristics. Use metrics to observe the trip rate.
- The Fallback is a Feature: What you do in the Open state is a design decision. Failing fast is good, but sometimes returning a slightly stale cached response is better for user experience. Plan this deliberately.
- Don’t Break on 4xx: This is a classic mistake. A 401 Unauthorized or 404 Not Found is a valid response, not a system failure. Your breaker should only trip on things that indicate the service is unhealthy (timeouts, connection refusals, 5xx errors). The sony/gobreaker example above gets this right by explicitly checking the status code.
- It’s Not a Force Field: A circuit breaker protects one client. If you have 100 instances of a service, each has its own view of the circuit state. This is usually what you want, but it means a service might still receive some traffic from clients whose breakers haven’t tripped yet. For a global kill switch, you need a different mechanism.