83.9 Storing Scraped Data: SQLite, CSV, and MongoDB
Alright, let’s get this data off the ephemeral stage of your script’s memory and onto something with a bit more permanence. You’ve done the hard part—luring the data out of the wilderness of the web. Now we need to build a good, sensible cage for it. We’re going to talk about three classic options: the trusty spreadsheet (CSV), the rock-solid local database (SQLite), and the flexible document store for when you’re feeling fancy (MongoDB). Each has its place, and your choice will depend entirely on what you plan to do with your hard-won loot.
The Humble CSV: For Data That Needs to Get Out the Door Fast
Let’s start simple. CSV (Comma-Separated Values) is the digital equivalent of a cardboard box. It’s not glamorous, but it’s incredibly effective for moving stuff from point A to point B. Its greatest strength is its universal compatibility. You can email it to your non-technical boss, open it directly in Excel, Google Sheets, or any data analysis tool on the planet. It’s the lowest common denominator of data storage.
The csv module in Python’s standard library is your best friend here. It handles all the annoying edge cases, like what happens when your data contains a comma or a newline character (hint: it’s not pretty if you try to do it yourself with string splitting).
import csv
from my_scraper import get_scraped_data # Assume this returns a list of dicts
data = get_scraped_data()
with open('scraped_products.csv', 'w', newline='', encoding='utf-8') as csvfile:
# Get the fieldnames (keys) from the first dictionary
fieldnames = data[0].keys()
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(data)
print("Data successfully dumped to CSV. Go impress someone in accounting.")
The Pitfalls: CSVs have no concept of data types. Everything is a string. That price field you scraped? It’ll be written as "29.99", and you’ll have to convert it back to a float later. They also aren’t great for large, complex, or nested datasets. If you’re scraping hierarchical data, flattening it into a CSV can feel like trying to stuff a octopus into a string bag. It’s possible, but it’s messy.
SQLite: The Discreet Powerhouse Living in Your Pocket
If your data is going to hang around for more than five minutes and you need to query it—filter, sort, aggregate—you want SQLite. Don’t let the “Lite” fool you; this isn’t some toy. It’s a full-featured, serverless, zero-configuration SQL database engine that lives in a single file on your disk. It’s the swiss army knife of data persistence.
Why use it? Structure, speed, and power. You define a schema, which forces your data into a consistent shape. This is fantastic for data quality. You can’t accidentally store a product name in a price column because SQLite will stop you.
import sqlite3
# Connect to a database file. It will be created if it doesn't exist.
conn = sqlite3.connect('scraped_data.db')
cursor = conn.cursor()
# Create a table. Do this once.
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
price REAL,
url TEXT UNIQUE, -- Prevents storing the same product twice
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
''')
# Imagine this is a new item we scraped
new_product = {
'name': 'The Amazing Widget',
'price': 29.99,
'url': 'https://example.com/widget'
}
# Insert it, gracefully handling if the URL already exists.
try:
cursor.execute('''
INSERT INTO products (name, price, url)
VALUES (?, ?, ?)
''', (new_product['name'], new_product['price'], new_product['url']))
conn.commit()
print("New product added!")
except sqlite3.IntegrityError:
print("Product URL already exists in the database. Skipping.")
conn.close()
The Gotcha: The schema is a blessing and a curse. If the website’s structure changes and you suddenly get a new field, your existing table schema is now wrong. You’ll need to run an ALTER TABLE command to add the new column. It’s a manageable problem, but it’s a thing you have to manage.
MongoDB: For When Your Data is a Beautiful, Unpredictable Mess
Sometimes, the data you’re scraping is a nested, hierarchical nightmare that would give a relational database architect cold sweats. This is where MongoDB, a NoSQL document database, shines. Instead of tables and rows, you store documents (in BSON, which is like JSON) in collections. The beautiful part? Documents in the same collection don’t need to have the same structure. It’s schema-less.
This makes it perfect for rapid prototyping and for scraping sources where the data model might evolve or is inconsistently applied. Got a product that has ten attributes and another that has twelve? No problem. Just insert the document you scraped.
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError
# Connect to the local MongoDB server. Easy.
client = MongoClient('mongodb://localhost:27017/')
# Get the database and collection. They're created on the fly.
db = client['scraping_db']
collection = db['products']
# Our complex, nested product data
product_doc = {
'name': 'Vintage Coffee Maker',
'price': 129.99,
'url': 'https://example.com/coffee-maker', # We'll use this as a unique key
'specs': {
'power': '1100W',
'color': 'Stainless Steel'
},
'reviews': [
{'user': 'Alice', 'rating': 5},
{'user': 'Bob', 'rating': 4}
]
}
# Create a unique index on the URL to prevent duplicates, just like we did in SQLite.
collection.create_index('url', unique=True)
try:
collection.insert_one(product_doc)
print("Document inserted!")
except DuplicateKeyError:
print("Document with that URL already exists. Smart thinking.")
client.close()
The Trade-off: You trade rigid structure for flexibility. The burden of ensuring data consistency and validity moves from the database (which enforces it) to your application code (which should enforce it). If you’re not careful, you can end up with a messy collection where the same field has different data types in different documents, which is a recipe for headaches later. Use it when you need the flexibility, not as an excuse for laziness.
So, which one do you pick? Need a quick export? CSV. Building a robust, queryable dataset? SQLite. Scraping a complex, ever-changing source and just need to get the raw data stored now? MongoDB. Choose your weapon wisely.