83.3 Scrapy: Spiders, Items, Pipelines, and Middleware

Right, so you’ve graduated from BeautifulSoup. It was great for that one-off script to grab some data from a static page, but you’re thinking bigger. You want to scrape an entire website, handle pagination, manage thousands of items, and do it all without getting your IP banned. Welcome to Scrapy. This isn’t a library; it’s a framework. It expects you to structure your project a certain way, and in return, it gives you an industrial-strength crawling engine. Think of it less as a tool and more as a boss who’s actually competent.

The Spider: Your Chief Crawling Officer

At the heart of every Scrapy project is the Spider. This is a class you define that tells the engine two crucial things: where to start crawling (a start_urls list, or a start_requests method when you need more control over the first requests) and how to extract data from the pages it finds (parse, the default callback).

Let’s say we want to scrape quotes from quotes.toscrape.com. Here’s a basic spider:

import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quote_spider"  # What you call it with `scrapy crawl`
    start_urls = ['http://quotes.toscrape.com']  # The lazy way to start

    def parse(self, response):
        # This method handles the response for each start_url
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Handle pagination like a pro
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Why this is brilliant: The yield keyword is doing all the heavy lifting. When you yield a dictionary, Scrapy treats it as a scraped item and sends it through the item pipelines. When you yield a Request (or use response.follow, its smarter cousin that also accepts relative URLs), Scrapy schedules that URL for fetching. The engine manages the concurrency, scheduling, and retries for all these requests. You just define the flow.
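To make that routing concrete, here is a toy model of the dispatch loop in plain Python. It is emphatically not Scrapy’s engine (there is no concurrency, no retrying, no duplicate filtering), and FAKE_SITE, FakeRequest, and run_toy_crawl are made-up names for this sketch. It only shows the one idea that matters here: the callback yields a mix of things, and the caller decides what each one means.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Hypothetical two-page site: each page has some quotes and maybe a next link.
FAKE_SITE = {
    "/page/1/": {"quotes": ["a", "b"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["c"], "next": None},
}


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request: a URL plus the callback for its response."""
    url: str
    callback: Callable


def parse(page):
    # Same shape as the spider's parse(): yield items, then maybe a follow-up.
    for text in page["quotes"]:
        yield {"text": text}
    if page["next"] is not None:
        yield FakeRequest(page["next"], parse)


def run_toy_crawl(start_url):
    """Route whatever the callback yields: requests are queued, the rest are items."""
    queue = deque([FakeRequest(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        page = FAKE_SITE[request.url]      # pretend to download the page
        for result in request.callback(page):
            if isinstance(result, FakeRequest):
                queue.append(result)       # schedule it for later
            else:
                items.append(result)       # would go to the item pipelines
    return items


scraped = run_toy_crawl("/page/1/")
print(scraped)
```

The spider never calls the queue itself. That separation is why the same parse method keeps working when the real engine runs many requests at once.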

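Pitfall Alert: Don’t be tempted to collect everything into a list and return it at the end. Scrapy will accept any iterable from a callback, so a returned list does work, but it forces every item on the page to be built and held in memory before the engine sees the first one. Yield items individually inside the loop instead. Scrapy processes them as they are yielded, which is far more memory-efficient for large crawls.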
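The memory argument is ordinary generator behavior, and a few lines of plain Python are enough to see it. The sketch below is not Scrapy code; parse_with_yield, parse_with_return, consume, and the log list are illustrative names. It records the order in which items are built and handled.

```python
log = []


def parse_with_yield(n):
    """Generator callback: each item is handed over as soon as it is built."""
    for i in range(n):
        log.append(f"built {i}")
        yield {"id": i}


def parse_with_return(n):
    """List callback: every item is built before the caller sees any of them."""
    items = []
    for i in range(n):
        log.append(f"built {i}")
        items.append({"id": i})
    return items


def consume(results):
    # Plays the role of whatever handles items downstream.
    for item in results:
        log.append(f"handled {item['id']}")


consume(parse_with_yield(2))
lazy_log = list(log)

log.clear()
consume(parse_with_return(2))
eager_log = list(log)

print(lazy_log)   # building and handling interleave
print(eager_log)  # all building happens first
```

With two items the difference is cosmetic. With a page that produces thousands of records, the list version holds all of them at once while the generator version holds one.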

Items: Your Data’s Bodyguard

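Using plain Python dictionaries works, but it’s sloppy. You’ll inevitably typo a key name and spend an hour debugging. Enter the Item. Think of it as a structured data class that formalizes your schema: you declare the fields up front, and assigning to a field you never declared raises a KeyError on the spot instead of quietly creating a new key. It makes your data consistent and gives every later stage a known shape to work with.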

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    # Declare every field you plan to populate; undeclared keys raise KeyError
    author_link = scrapy.Field()

# Now in your spider, use it:
class QuoteSpider(scrapy.Spider):
    name = "quote_spider"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            # The "(about)" link is the <a> that directly follows <small class="author">
            item['author_link'] = quote.css('small.author + a::attr(href)').get()
            yield item

It looks like a minor change, but it’s a major discipline. It protects you from yourself and is essential for the next step.
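Here is the failure mode in miniature. So that the snippet runs without Scrapy installed, it swaps in a standard-library dataclass for scrapy.Item (Scrapy 2.2 and later also accept dataclass instances as items). QuoteRecord is a made-up name. A scrapy.Item reacts to the same typo with a KeyError rather than the TypeError shown here, but the lesson is identical: a declared schema fails loudly where a dict fails silently.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class QuoteRecord:
    """Dataclass counterpart of QuoteItem, for illustration only."""
    text: str
    author: str
    tags: list = field(default_factory=list)
    author_link: Optional[str] = None


# A plain dict accepts the typo 'autor' without complaint.
loose = {"text": "To be...", "autor": "Shakespeare"}
dict_missing_author = "author" not in loose  # the bug goes unnoticed

# The declared schema rejects the same typo immediately.
try:
    QuoteRecord(text="To be...", autor="Shakespeare")
    typo_rejected = False
except TypeError:
    typo_rejected = True

good = QuoteRecord(text="To be...", author="Shakespeare", tags=["plays"])
print(dict_missing_author, typo_rejected)
print(asdict(good))
```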

Pipelines: The Data Processing Factory

So you’ve yielded an item. Now what? That’s where Pipelines come in. They are a series of classes that process items after the spider has scraped them. Typical jobs include:

Cleaning data (e.g., stripping whitespace from your text field)
Validating data (e.g., checking that the author field isn’t empty)
Storing data (e.g., saving to a database, writing to a JSON file)
Deduplicating items

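You activate pipelines in your settings.py by adding them to the ITEM_PIPELINES dictionary. Each entry maps a class path to an integer (conventionally between 0 and 1000), and items flow through the pipelines from the lowest number to the highest.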

# pipelines.py
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # This is how you pull settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'scraping')
        )

    def open_spider(self, spider):
        # Called when the spider opens
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Called when the spider closes
        self.client.close()

    def process_item(self, item, spider):
        # The main event. Do what you need to with the item.
        collection_name = spider.name
        self.db[collection_name].insert_one(dict(item))
        return item  # MUST return the item for other pipelines to process!

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,  # The number defines order (lower first)
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'quotes_db'

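Crucial Insight: Notice that process_item must return the item. Whatever it returns is what the next pipeline receives, so a forgotten return hands None down the chain instead of your data. It’s a common head-scratcher for beginners. When you actually want to discard an item, raise scrapy.exceptions.DropItem rather than returning nothing.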
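The cleaning, validating, and deduplicating jobs listed earlier each fit in a few lines. The sketch below is a simplified model, not Scrapy’s own machinery: run_chain is a toy runner written for this example, the three pipeline class names are hypothetical, and DropItem is defined locally only so the snippet runs on its own. In a real project you would import DropItem from scrapy.exceptions and let Scrapy call process_item for you.

```python
class DropItem(Exception):
    """Local stand-in; a real project uses scrapy.exceptions.DropItem."""


class CleanTextPipeline:
    def process_item(self, item, spider):
        item["text"] = item["text"].strip()
        return item


class RequireAuthorPipeline:
    def process_item(self, item, spider):
        if not item.get("author"):
            raise DropItem("missing author")
        return item


class DedupPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = (item["text"], item["author"])
        if key in self.seen:
            raise DropItem("duplicate quote")
        self.seen.add(key)
        return item


class ForgetfulPipeline:
    def process_item(self, item, spider):
        item["text"] = item["text"].upper()
        # Bug on purpose: no return, so the next stage receives None.


def run_chain(pipelines, items, spider=None):
    """Toy runner: pass each item through the pipelines in order."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
            kept.append(item)
        except DropItem:
            continue  # a dropped item skips the remaining stages
    return kept


raw = [
    {"text": "  Hello  ", "author": "A"},
    {"text": "Hello", "author": "A"},   # duplicate once whitespace is stripped
    {"text": "Orphan", "author": ""},   # fails validation
]
kept = run_chain([CleanTextPipeline(), RequireAuthorPipeline(), DedupPipeline()], raw)
print(kept)

try:
    run_chain([ForgetfulPipeline(), CleanTextPipeline()], [{"text": "x", "author": "A"}])
    forgotten_return_breaks = False
except TypeError:
    forgotten_return_breaks = True  # CleanTextPipeline tried to index None
print(forgotten_return_breaks)

# In settings.py the same order would be expressed with numbers, for example
# (module path hypothetical):
# ITEM_PIPELINES = {
#     'myproject.pipelines.CleanTextPipeline': 100,
#     'myproject.pipelines.RequireAuthorPipeline': 200,
#     'myproject.pipelines.DedupPipeline': 250,
# }
```

Order matters here. Put DedupPipeline before CleanTextPipeline and the first two items stop looking like duplicates, because one still carries its padding.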

Middleware: The Framework’s Hacking Layer

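If Spiders are the what and Pipelines are the then-what, Middleware is the how. This is the most advanced part of Scrapy, allowing you to hook into and modify its core request/response processing. There are two types: Downloader Middleware, which sits between the engine and the downloader and sees every outgoing request and incoming response, and Spider Middleware, which sits between the engine and your spider and sees what goes into and comes out of your callbacks.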

You use Downloader Middleware for things like:

Rotating user agents and proxies to avoid bans.
Handling failed requests with custom retry logic.
Automatically handling cookies and sessions.

Here’s a ridiculously simple example that sets a custom user agent for every request:

# middlewares.py
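class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for every request on its way to the downloader
        request.headers['User-Agent'] = 'My Awesome Scraper (Learning from a brilliant book)'
        # Falling off the end returns None, which means "carry on as normal"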

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomUserAgentMiddleware': 543,  # The number sets its position among the built-in middlewares
}
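Rotating user agents, the first job on the list above, is a small step from there. What follows is a minimal sketch, not a hardened implementation: RotatingUserAgentMiddleware and the USER_AGENT_POOL setting are names invented for this example, and the request at the bottom is a stand-in object so the snippet can run without Scrapy. In a real crawl the request would be a scrapy.Request and the pool would come from settings.py through from_crawler.

```python
import random
from types import SimpleNamespace


class RotatingUserAgentMiddleware:
    """Pick a User-Agent at random for each outgoing request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("need at least one user agent")
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a made-up setting name for this sketch.
        return cls(crawler.settings.getlist("USER_AGENT_POOL"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let the request continue through the chain


# Exercise it with a stand-in request (a real one would be scrapy.Request).
pool = ["agent-one/1.0", "agent-two/1.0"]
middleware = RotatingUserAgentMiddleware(pool)
fake_request = SimpleNamespace(headers={})
result = middleware.process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])

# settings.py would then carry something like (module path hypothetical):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RotatingUserAgentMiddleware': 543,
# }
# USER_AGENT_POOL = ['agent-one/1.0', 'agent-two/1.0']
```

Proxy rotation follows the same pattern, with the choice written somewhere other than the headers.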

The designers made the middleware system incredibly powerful, but the documentation can feel like a maze. The key is to remember that it’s a chain of classes, each getting a chance to process a request or response. You’re not overriding the engine; you’re plugging into its well-defined joints.
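The example above is a downloader middleware. For completeness, here is a sketch of the other kind. A spider middleware can define process_spider_output, which receives everything a callback yields and must return an iterable of what should continue on. RequireTextSpiderMiddleware is a hypothetical name, the sketch assumes plain synchronous callbacks and dict items, and the objects at the bottom are stand-ins so it runs without Scrapy.

```python
class RequireTextSpiderMiddleware:
    """Drop yielded dict items that have no text; pass everything else through."""

    def process_spider_output(self, response, result, spider):
        for element in result:
            if isinstance(element, dict) and not element.get("text"):
                continue  # incomplete item: never reaches the pipelines
            yield element  # complete items and follow-up requests carry on


# Exercise it with stand-ins for what a callback might yield.
follow_up = object()  # plays the role of a scrapy.Request
yielded = [{"text": "kept"}, {"text": ""}, follow_up]
middleware = RequireTextSpiderMiddleware()
passed = list(middleware.process_spider_output(response=None, result=yielded, spider=None))
print(passed)

# Enabled through its own setting (module path hypothetical):
# SPIDER_MIDDLEWARES = {
#     'myproject.middlewares.RequireTextSpiderMiddleware': 543,
# }
```

Whether a check like this belongs here or in a pipeline is a judgment call. A pipeline is usually the simpler home for item validation, and spider middleware earns its keep when the rule also has to see the requests a callback yields.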

The beauty of Scrapy is that once you set this architecture up, the actual act of crawling is almost an afterthought. You define the rules, and it executes them with ruthless efficiency. It’s the difference between hand-painting a sign and building a printing press. Now go build a press.