83.8 Rate Limiting, Retry Logic, and Polite Crawling
Right, let’s talk about not getting kicked in the teeth by a server. You might think your little script is just politely asking for public data, but from the server’s perspective, you look exactly like a drunken DDoS attack slamming the door every half-second. Being a polite crawler isn’t just about good manners; it’s about self-preservation. It’s the difference between getting your data and getting your IP address permabanned into the shadow realm.
The core principle is simple: don’t be a greedy idiot. The implementation is where the fun begins.
The Absolute Basics: time.sleep() is Your Friend (Sometimes)
Let’s start with the simplest, most blunt-force tool in the box: time.sleep(). It’s the crawler’s equivalent of saying, “Alright, I’ll count to three before I ask for the next page.”
    import time
    import requests
    from bs4 import BeautifulSoup

    def naive_crawler(urls):
        for url in urls:
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Do something beautiful with the soup...
            print(f"Scraped {url}")
            # The "be polite" part. This is a full stop.
            time.sleep(2)  # Sleep for 2 seconds between requests
Why would you do this? Because without it, your loop will fire off requests as fast as your network allows, which is a fantastic way to get noticed and shut down. It’s crude, it’s inefficient (you’re spending most of your time just waiting), but for small, personal projects against forgiving targets, it’s often enough. The problem is that the delay is fixed. If the server is struggling, you keep knocking at exactly the same pace instead of backing off. And if a single request raises an exception, nothing catches it and the whole loop dies mid-run.
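One way to take some of the bluntness out of a fixed sleep is to let the pause stretch when the server is slow, add a little random jitter, and stop one failed page from killing the run. What follows is only a sketch: `polite_delay`, `crawl`, and the `base`, `factor`, and `jitter` knobs are made-up names with arbitrary defaults, not part of requests or any standard recipe. The `fetch` argument stands in for whatever function you use to download a page.

```python
import random
import time


def polite_delay(elapsed, base=2.0, factor=2.0, jitter=0.5, rng=random.random):
    """Return how long to pause (seconds) after a request that took `elapsed` seconds.

    The pause is never shorter than `base`, grows when responses are slow,
    and gets up to `jitter` seconds of randomness so requests do not land
    on a perfectly regular beat. All defaults are arbitrary assumptions.
    """
    delay = max(base, factor * elapsed)
    return delay + jitter * rng()


def crawl(urls, fetch, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing adaptively between requests."""
    results = {}
    for url in urls:
        start = time.monotonic()
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad page should not end the run
            results[url] = exc
        sleep(polite_delay(time.monotonic() - start))
    return results
```

In real use you might pass something like `lambda u: requests.get(u, timeout=10)` as `fetch`, then check afterwards which entries in the result are exceptions.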
Respecting robots.txt (It’s Not Just a Suggestion)
I know, I know. It feels like reading the terms and conditions. But ignoring robots.txt isn’t just impolite; for some websites, it’s a legally dubious way to get yourself into hot water. Luckily, Python’s urllib.robotparser makes it painless.
    from urllib.robotparser import RobotFileParser
    import requests

    def can_i_scrape(base_url, path):
        rp = RobotFileParser()
        rp.set_url(base_url + "/robots.txt")
        rp.read()
        return rp.can_fetch("*", base_url + path)

    # Example usage
    base_url = "https://example.com"
    if can_i_scrape(base_url, "/some/page"):
        response = requests.get(base_url + "/some/page")
        # ... proceed with scraping
    else:
        print("The robots.txt file says I'm not allowed here. Abort!")
This is your first line of defense. Check it. It tells you the rules of the house. If it says “don’t crawl these pages,” and you do it anyway, you’re not a clever hacker; you’re just a jerk.
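Some robots.txt files also carry a `Crawl-delay` line, a non-standard but fairly common directive that says how many seconds to leave between requests. `RobotFileParser` exposes it through `crawl_delay()`. The sketch below parses a hypothetical robots.txt from a string so it runs offline; `SAMPLE_ROBOTS`, `load_rules`, `delay_for`, and the 2-second fallback are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_text):
    """Build a parser from robots.txt text instead of fetching it over the network."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


def delay_for(rp, user_agent="*", default=2.0):
    """Use the site's Crawl-delay if it declares one, else fall back to `default` seconds."""
    delay = rp.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = load_rules(SAMPLE_ROBOTS)
print(rules.can_fetch("*", "https://example.com/private/secret"))
print(delay_for(rules))
```

The value from `delay_for` can then replace a hard-coded sleep, so the site gets to set the pace where it has bothered to state one.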
Building a Robust Retry Mechanism
Networks are flaky. Servers have bad days. You will get errors: 500 Internal Server Error, 503 Service Unavailable, 429 Too Many Requests. A production-grade scraper doesn’t just give up at the first hiccup. It gets back up, brushes itself off, and tries again, just not so fast that it makes the problem worse.
Here’s where we move beyond time.sleep and into smarter patterns. The tenacity library is a masterpiece for this. It lets you define retry logic with almost poetic flexibility.
    import requests
    from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

    RETRYABLE_STATUS = {429, 500, 502, 503, 504}

    def is_retryable(exc):
        """Retry on network trouble or a retryable status; give up at once on things like 404."""
        if isinstance(exc, (requests.exceptions.ConnectionError, requests.exceptions.Timeout)):
            return True
        response = getattr(exc, "response", None)
        return response is not None and response.status_code in RETRYABLE_STATUS

    # This decorator is pure gold. It says:
    # - Retry on a network failure (connection error, timeout) or a 429/5xx status code.
    # - Use exponential backoff: wait roughly 1s, then 2s, then 4s, capped at 10s. This is crucial for 429s.
    # - Give up after 5 attempts and re-raise the last error (reraise=True),
    #   instead of wrapping it in tenacity's RetryError.
    @retry(
        stop=stop_after_attempt(5),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        retry=retry_if_exception(is_retryable),
        reraise=True,
    )
    def robust_request(url):
        response = requests.get(url, timeout=5)
        response.raise_for_status()  # Raises an HTTPError for 4xx/5xx responses
        return response

    try:
        response = robust_request("https://a-finnicky-site.com/data")
        # Process the successful response
    except requests.exceptions.RequestException as e:
        print(f"Giving up: {e}")
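Exponential backoff is the key insight here. It’s the web scraping equivalent of “whoa, okay, sorry, I’ll wait a bit longer.” It covers the common rate-limiting case for you: if you get a 429, raise_for_status() raises, tenacity catches the exception, waits, and retries, all without you writing a messy while loop. What it doesn’t do on its own is read any hint the server sends about how long to wait, so treat it as a sensible default rather than a guarantee.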
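Servers that answer with 429 or 503 often include a `Retry-After` header, either as a number of seconds or as an HTTP date. Plain exponential backoff never looks at it. Here is a small sketch of a parser for that header using only the standard library; the name `retry_after_seconds` and the 300-second cap are assumptions for illustration, not part of requests or tenacity.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, cap=300.0):
    """Turn a Retry-After header value into a wait in seconds.

    Returns None when the header is missing or unparseable, so the caller
    can fall back to its usual backoff. `cap` is an arbitrary safety limit.
    """
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        seconds = float(value)  # delta-seconds form, e.g. "120"
    else:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    return min(max(seconds, 0.0), cap)
```

A retry loop could call this with `response.headers.get("Retry-After")` and sleep for the returned value when there is one, falling back to the computed backoff when it returns None.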
Rate Limiting with a Budget: The ratelimit Library
Sometimes you need more precise control. You want to say “I will never make more than 10 requests per minute,” regardless of retries. This is a rate limit you impose on yourself.
    from ratelimit import limits, sleep_and_retry
    import requests

    # Decorate your function with a hard rate limit
    @sleep_and_retry
    @limits(calls=10, period=60)  # 10 calls per 60 seconds
    def call_api(url):
        response = requests.get(url)
        response.raise_for_status()
        return response

    # You can now call this function in a tight loop, and it will
    # automatically sleep to ensure it never exceeds the limit.
    urls = [...]  # placeholder: your own list of ~100 URLs goes here
    for url in urls:
        data = call_api(url)  # At most 10 calls per 60-second period
        # process data
The beauty of this is its simplicity. You define the rule at the function level, and the decorator handles the nasty timing math for you. Combine this with the retry logic from tenacity, and you’ve got an incredibly resilient and polite scraping function.
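If you’re curious what that timing math looks like, here is a rough standard-library sketch of a rolling-window limiter. To be clear, this is not how the ratelimit library is implemented; `WindowLimiter` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the logic can be exercised without actually waiting.

```python
import time
from collections import deque


class WindowLimiter:
    """Allow at most `calls` acquisitions in any rolling `period` seconds."""

    def __init__(self, calls, period, clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()  # times of recent calls, oldest first

    def acquire(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        if len(self.stamps) >= self.calls:
            # Budget spent: wait until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.stamps[0]))
            now = self.clock()
            self.stamps.popleft()
        self.stamps.append(now)
```

Calling `limiter.acquire()` right before each request gives the same "never more than N per period" behaviour in a single-threaded script; it makes no attempt at thread safety.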
The Golden Rule: Cache Everything You Can
The most polite request is the one you never have to make. If you’re debugging your script and running it repeatedly, for the love of all that is holy, cache the responses locally. This saves the server’s resources and makes your development cycle infinitely faster.
    import requests
    import diskcache

    # Create a simple disk-based cache
    cache = diskcache.Cache('./scrape_cache')

    def cached_request(url):
        if url in cache:
            print(f"Cache HIT for {url}")
            return cache[url]
        else:
            print(f"Cache MISS for {url}")
            response = requests.get(url)
            response.raise_for_status()
            cache[url] = response.text  # Store the text content
            return response.text

    # Now you can call this without hammering the site while you tweak your parsing logic.
    html = cached_request("https://example.com/expensive-to-generate-page")
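One refinement worth sketching: a cache that never expires will happily serve you last month’s page forever. The snippet below stores a timestamp next to each body and refetches once an entry is older than a time-to-live. It is a sketch under assumptions: `cached_fetch`, `store`, and the one-hour default `ttl` are illustrative names and values, `fetch` is whatever function downloads a page, and `store` is any dict-like object with `get` and item assignment.

```python
import time


def cached_fetch(url, fetch, store, ttl=3600.0, clock=time.time):
    """Return the body for `url`, calling `fetch` only on a miss or a stale entry.

    `store` maps url -> (saved_at, body). The default `ttl` of one hour
    is an arbitrary assumption; pick whatever suits the site.
    """
    entry = store.get(url)
    if entry is not None:
        saved_at, body = entry
        if clock() - saved_at < ttl:
            return body  # fresh enough: no network traffic at all
    body = fetch(url)
    store[url] = (clock(), body)
    return body
```

With diskcache specifically, passing `expire=` to `cache.set()` is another route to the same effect.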
It’s stunning how many people skip this step. It turns you from a nuisance into a vaguely competent developer.
The bottom line is this: politeness isn’t an afterthought. It’s the foundation of any scraper you expect to survive past its first run. Build it right from the start.