27.7 Client-Side Retry Logic and Backoff

Right, so you’ve sent your request out into the digital void. Sometimes, the void coughs back an error. The question is: do you just stand there, slack-jawed, and give up? Or do you, like a sensible human (or a particularly determined algorithm), try again? This is retry logic, and doing it well is the difference between a resilient application and a flaky mess that fails the moment the network gets a case of the sniffles.

The first rule of retry club is: not everything should be retried. Blasting POST requests into the ether because you got a 500 error is a fantastic way to accidentally buy 17,000 concert tickets or post the same comment 84 times. Your retry strategy must be selective, and anything it re-sends must be idempotent.

The Idempotency Principle

An idempotent operation is one you can perform multiple times without changing the result beyond the initial application. GET, HEAD, PUT, and DELETE are generally considered idempotent. POST is generally not. A 500 Internal Server Error on a GET request? Sure, retry that. The server had a hiccup, and your request is safe to re-send. A 500 on a POST request that transfers money? You absolute maniac, do not just retry that. You need to know whether the operation succeeded on the server side before the error occurred. This is why APIs will often give you an idempotency key mechanism for non-idempotent actions. Use it.
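As a rough sketch of how an idempotency key is typically used: the client generates one unique key per logical operation and sends that same key on every attempt, so the server can spot duplicates. The header name `Idempotency-Key`, the `send` callable, and the function name below are assumptions for illustration only; your API’s documentation is the authority on the actual mechanism.

```python
import uuid


def post_with_idempotency_key(send, url, payload, retry_on, max_attempts=3):
    """Send a POST-like request, reusing one idempotency key on every attempt.

    `send` is any callable with a requests.post-style signature (hypothetical
    here so the sketch stays library-agnostic). `retry_on` is a tuple of
    exception types the caller considers transient.
    """
    # Generate the key ONCE, outside the loop: a fresh key per attempt would
    # make every retry look like a brand-new operation to the server.
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # header name is an assumption

    last_error = None
    for _ in range(max_attempts):
        try:
            return send(url, json=payload, headers=headers)
        except retry_on as exc:
            last_error = exc  # transient failure: try again with the same key
    raise last_error
```

The detail that matters is where the key is created. A real version would also wait between attempts rather than retrying immediately.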

Here’s the naive, “my-first-network-request” way to do retries. It’s bad, and we’re only showing it so you know what not to do.

# WARNING: TERRIBLE, NAIVE CODE. DO NOT USE.
import requests

def bad_retry_example(url):
    for _ in range(3):  # Just try 3 times, I guess?
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            pass  # Ah yes, the classic "sweep it under the rug" strategy
    return None

This code is guilty of several crimes. It retries on any requests exception, including ones that will never succeed, such as a malformed URL. It retries immediately, contributing to a thundering herd problem if the server is struggling. It retries on every non-200 status code, which might include a 404 Not Found (which will never be found, no matter how many times you ask) or, heaven forbid, a 429 Too Many Requests (where retrying immediately is the worst possible thing you can do). It sets no timeout, so a single hung connection can block forever. And when it finally gives up, it returns None, so the caller has no idea what went wrong.

Retryable vs. Non-Retryable Conditions

Your logic needs to be surgical. You typically only want to retry on:

Specific HTTP status codes: 429 (Too Many Requests), 500 (Internal Server Error), 502 (Bad Gateway), 503 (Service Unavailable), 504 (Gateway Timeout). These often indicate a transient failure.
Specific exceptions: Connection timeouts, connection errors, and read timeouts. These are low-level network issues that might resolve themselves.

You should never retry on:

Most client error codes (4xx): 400 (Bad Request), 401 (Unauthorized), 403 (Forbidden), 404 (Not Found). Your request is the problem. It will still be the problem in 100 milliseconds. The one exception is 429: there the request was fine, you were just too eager.
Non-idempotent methods (POST) without an idempotency key. Just don’t.
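The two lists above can be folded into a single predicate. This is a minimal sketch, not a complete policy: the names `should_retry`, `RETRYABLE_STATUS_CODES`, and `IDEMPOTENT_METHODS` are made up for illustration, and the sets simply mirror the codes and methods named in this section.

```python
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE"}


def should_retry(method, status_code, has_idempotency_key=False):
    """Return True if a failed request looks both safe and worthwhile to retry."""
    # Safety first: never repeat a non-idempotent request unless the server
    # can deduplicate it via an idempotency key.
    safe_to_repeat = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    if not safe_to_repeat:
        return False
    # Only statuses that usually indicate a transient condition qualify.
    return status_code in RETRYABLE_STATUS_CODES
```

For example, `should_retry("GET", 503)` is true, while `should_retry("GET", 404)` and `should_retry("POST", 500)` are both false.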

The Art of the Backoff

When you do retry, you must back off. Hammering a failing server with requests every 100ms is like performing CPR on a patient by repeatedly punching them in the chest. You’re not helping; you’re making everything worse.

The goal is to introduce a delay between attempts that increases exponentially. This is called exponential backoff, often with some jitter. Jitter is a random variation added to the delay to prevent a scenario where many failed clients all wake up at the same time and synchronize their retries, creating a wave of traffic that knocks the recovering server over again.
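Purely to show the arithmetic, here is a small sketch of an exponential delay with jitter. It uses the “full jitter” variant (a random delay between zero and the exponential ceiling), which is only one of several common schemes; the function name and default values are assumptions, not anything prescribed.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based), with full jitter."""
    # Exponential ceiling: base, 2*base, 4*base, ... but never more than `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly between 0 and the ceiling so that clients
    # that failed together do not all retry together.
    return rng.uniform(0, ceiling)
```

A caller would typically do something like `time.sleep(backoff_delay(attempt))` before each new attempt. Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.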

Let’s build a robust retry function using the tenacity library, which exists precisely so you don’t have to hand-roll this complex logic and get it wrong.

import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception,
    retry_if_result,
)

# A function to check if a response is worth retrying
def is_retryable_response(response):
    return response.status_code in {429, 500, 502, 503, 504}

# A function to check if an exception is worth retrying
def is_retryable_exception(exception):
    return isinstance(exception, (
        requests.exceptions.ConnectionError,
        requests.exceptions.Timeout,
    ))

@retry(
    stop=stop_after_attempt(5),  # Give up after 5 attempts
    # Exponential backoff: roughly 1s, 2s, 4s, 8s... plus up to 1s of random jitter, capped at 60s
    wait=wait_exponential_jitter(initial=1, max=60, exp_base=2),
    # retry_if_exception takes a predicate; retry_if_exception_type expects exception classes instead
    retry=retry_if_exception(is_retryable_exception) | retry_if_result(is_retryable_response),
)
def make_request_with_retry(url):
    # Always set a timeout; without one, a hung connection never raises Timeout
    response = requests.get(url, timeout=10)
    # If the response has a retryable status code, tenacity will retry because of `retry_if_result`
    return response

# Usage
try:
    response = make_request_with_retry('https://api.example.com/shaky-endpoint')
    print(f"Success: {response.status_code}")
except Exception as e:
    print(f"Failed after retries: {e}")

This code is smart. It uses a decorator to separate the retry policy from the business logic. It waits longer and longer between tries (wait_exponential_jitter), with random jitter mixed in to avoid synchronized retries. It has a hard stop (stop_after_attempt) so it doesn’t retry forever; once the attempts run out, tenacity raises a RetryError, which the usage example catches. And its retry condition (retry=) is a combination of specific exceptions and specific HTTP status codes. This is a solid starting point for production use.

The final, crucial piece is to always implement a circuit breaker. If an endpoint is failing consistently, continuing to retry is a waste of resources. A circuit breaker trips after a certain threshold of failures and stops all subsequent requests for a period of time, allowing the downstream service to recover. Libraries like pybreaker can handle this for you. Think of it as the retry strategy’s bigger, grumpier cousin who steps in and says, “Enough. We’re done here for a while.”
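To make the idea concrete, here is a deliberately tiny, hand-rolled circuit breaker. It is a sketch under simplifying assumptions (single-threaded, every exception counts as a failure), and the names `SimpleCircuitBreaker` and `CircuitOpenError` are invented for this example. It is not pybreaker’s API, and a maintained library is the better choice for real systems.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max            # consecutive failures before tripping
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_max:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(make_request, url)` with a hypothetical `make_request` function, and treat `CircuitOpenError` as a signal to stop trying for now.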