62.3 PyMongo: Connecting to MongoDB, CRUD Operations, and Aggregation

Alright, let’s get our hands dirty with PyMongo. Forget the sterile, corporate documentation for a minute. You and I are going to talk about how to actually use this thing to get work done. MongoDB is that brilliant, chaotic friend who’s amazing at some parties and a complete disaster at others. PyMongo is how we, as responsible adults (mostly), chaperone that friend.

First things first, you need to get it. I’m assuming you have a working Python environment. If not, go handle that—I’ll wait.

pip install pymongo

The Connection: It’s Not Just a String, It’s a Lifeline

Connecting is simple, but let’s do it right from the start. You’ll see people casually throwing connection strings around. Don’t be that person. Use environment variables. Your future self, who hasn’t accidentally committed their database password to a public GitHub repo, will thank you.

import os
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

# Get your credentials from somewhere safe, not your code.
MONGODB_URI = os.getenv('MONGODB_URI', 'mongodb://localhost:27017/')

try:
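    # `connect=False` postpones all connection work until the first operation.
    # (The constructor never blocks on the server either way; by default it
    # just starts connecting in the background.)
    client = MongoClient(MONGODB_URI, connect=False)
    # The ping is the first real round trip, so this is where failures surface.
    client.admin.command('ping')
    print("You're in. Party time.")
except ConnectionFailure:
    print("Server not available. Panic? Maybe a little.")

Why connect=False? Honestly, it matters less than folklore suggests. Since PyMongo 3.0, the MongoClient constructor never blocks waiting for the server, and it won't raise just because the server is down. By default it starts connecting in the background and returns immediately; connect=False only postpones that background work until your first operation. Either way, the failure shows up on the first real round trip, which is why the ping lives inside the try block. If nothing answers, the ping raises ServerSelectionTimeoutError (a subclass of ConnectionFailure) once the server selection timeout expires, which is 30 seconds by default.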

CRUD: Create, Read, Update, Delete (But Mostly Read)

You’ve got a client. Now you need a database and a collection. In MongoDB, these are created lazily. That’s a fancy way of saying “they magically appear when you first insert data into them.” It’s both convenient and a terrifyingly easy way to typo your way into a dozen empty collections.

# This doesn't actually create anything yet. It just sets up pointers.
db = client['my_awesome_database']
books_collection = db['books']
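If the typo problem worries you (it should), one option is to check that a collection really exists before reading from it. This is only a sketch: get_existing_collection is a hypothetical helper of my own, built on the real Database.list_collection_names() method.

```python
def get_existing_collection(db, name):
    """Return db[name], but only if that collection already exists.

    Hypothetical helper for illustration; `db` is a PyMongo Database.
    """
    # list_collection_names() asks the server what is really there.
    existing = db.list_collection_names()
    if name not in existing:
        raise KeyError(f"No collection named {name!r}; found: {sorted(existing)}")
    return db[name]
```

This suits read-only scripts and reports. Don't use it on the code path that performs the very first insert, because the collection legitimately won't exist yet.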

Create (Insert)

Let’s shove some data in there. MongoDB loves documents, which are just Python dictionaries wearing a fancy hat.

book_document = {
    "title": "The PyMongo Guide for the Bewildered",
    "author": "A. N. Expert",
    "year": 2023,
    "tags": ["mongodb", "python", "guide"],
    "isbn": "123-4567890123"
}

# insert_one returns a result object, which contains the generated _id
result = books_collection.insert_one(book_document)
print(f"Document inserted with _id: {result.inserted_id}")

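That _id is the primary key. If you don’t provide one, PyMongo generously generates a unique ObjectId for you before sending the document. It’s almost always what you want. One side effect to know about: the driver adds the new "_id" key to your dictionary in place, so book_document is not quite the same dict it was before the insert.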
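Inserting documents one at a time gets old fast. Here is a small sketch of a batch insert using the real insert_many method. The wrapper name insert_books is made up for illustration, and copying the dicts is my own design choice, so that the caller's originals don't pick up an "_id" key.

```python
def insert_books(collection, books):
    """Insert several book documents and return their _id values in order.

    Hypothetical wrapper; `collection` is a PyMongo Collection.
    """
    # Shallow-copy each dict: the driver adds "_id" to the dicts it is given.
    docs = [dict(book) for book in books]
    if not docs:
        return []  # insert_many rejects an empty list
    result = collection.insert_many(docs)
    return result.inserted_ids
```

Usage would look like ids = insert_books(books_collection, [book_a, book_b]), where book_a and book_b are plain dictionaries like book_document.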

Read (Find)

Finding stuff is where MongoDB’s query language shines. It’s JSON-like, which is intuitive until it suddenly isn’t.

# Find one document by title
book = books_collection.find_one({"title": "The PyMongo Guide for the Bewildered"})
print(book)

# Find all books by our expert author
cursor = books_collection.find({"author": "A. N. Expert"})
for doc in cursor:
    print(doc)

# Find books published after 2020 with a "python" tag
# This combines a comparison operator with an array match. Operators like
# "$gt" are ordinary string keys in Python, so they need the quotes.
fancy_query = {
    "year": {"$gt": 2020},
    "tags": "python" # This checks for "python" in the array. Neat, right?
}
cursor = books_collection.find(fancy_query)
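Since a query is just a dictionary, you can assemble one from optional criteria with plain Python. The sketch below is illustrative: build_book_filter and its parameter names are my own invention, while $gte, $all, and $in are standard MongoDB query operators ($all means "the array contains every one of these", $in means "at least one of these").

```python
def build_book_filter(min_year=None, all_tags=None, any_tags=None):
    """Build a find() filter dict from optional criteria (hypothetical helper)."""
    query = {}
    if min_year is not None:
        query["year"] = {"$gte": min_year}
    if all_tags:
        # Every listed tag must be present in the document's tags array.
        query["tags"] = {"$all": list(all_tags)}
    elif any_tags:
        # At least one listed tag must be present.
        query["tags"] = {"$in": list(any_tags)}
    return query


# Possible usage, with a projection, a sort, and a limit on the cursor:
# cursor = (
#     books_collection
#     .find(build_book_filter(min_year=2021, any_tags=["python"]),
#           {"title": 1, "_id": 0})   # return only the title field
#     .sort("year", -1)               # newest first
#     .limit(5)
# )
```

An empty result from the builder ({}) matches every document, which is fine for find but worth remembering before you hand the same dict to anything destructive.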

Update

Updates are powerful and, consequently, dangerous. update_one changes only the fields you name in update operators such as $set and leaves the rest of the document alone, which is usually what you want. And if you forget $set? With the legacy update method, that silently replaced the whole document, and many tears were shed. Modern PyMongo refuses instead: update_one raises a ValueError when the update document contains no $ operators. If you genuinely want to swap out an entire document, replace_one is the explicit way to say so.

# GOOD: This adds a new field without destroying the rest of the document.
books_collection.update_one(
    {"_id": result.inserted_id},
    {"$set": {"publisher": "Insightful Books Ltd."}}
)

# BAD: No $set. update_one rejects this with a ValueError before anything
# reaches the server, so the document survives your mistake.
try:
    books_collection.update_one(
        {"_id": result.inserted_id},
        {"publisher": "Insightful Books Ltd."}  # NO $set!
    )
except ValueError as exc:
    print(f"Update rejected: {exc}")
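$set is not the only operator worth knowing, and the object that update_one returns tells you what actually happened. Here is a sketch that uses $addToSet to append a tag only if it isn't already there. The function name add_tag and its three return strings are assumptions for illustration; matched_count and modified_count are real attributes of PyMongo's UpdateResult.

```python
def add_tag(collection, book_id, tag):
    """Add `tag` to a book's tags array unless it is already present.

    Returns "missing", "unchanged", or "added" (labels are illustrative).
    """
    result = collection.update_one(
        {"_id": book_id},
        {"$addToSet": {"tags": tag}},  # no duplicate if the tag exists
    )
    if result.matched_count == 0:
        return "missing"  # the filter matched no document
    # matched but not modified means the tag was already in the array
    return "added" if result.modified_count == 1 else "unchanged"
```

Checking matched_count is a cheap habit that catches the classic silent failure: an update whose filter matched nothing at all.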

Delete

Deletion is permanent. There is no “Are you sure?” dialog box. Be certain.

# Delete the document we inserted earlier.
books_collection.delete_one({"_id": result.inserted_id})
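Since nobody is going to ask "Are you sure?", you can ask yourself. The sketch below wraps the real delete_many method with one guard: an empty filter matches every document, so it refuses to run with one. The wrapper name delete_books and the choice of ValueError are my own assumptions.

```python
def delete_books(collection, query):
    """Delete all documents matching `query` and return how many were removed.

    Hypothetical guard; `collection` is a PyMongo Collection.
    """
    if not query:
        # delete_many({}) would empty the whole collection.
        raise ValueError("Refusing to delete with an empty filter")
    result = collection.delete_many(query)
    return result.deleted_count
```

The returned count comes from DeleteResult.deleted_count, so a 0 tells you the filter matched nothing, which is often a typo worth investigating.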

Aggregation: Where the Magic (and the Headaches) Happen

The find method is great, but when you need to group, sort, reshape, and calculate data across multiple documents, you need the aggregation pipeline. Think of it as a series of filters and transformers that your data passes through, one step at a time.

Let’s say we want a list of authors and how many books they’ve published since 2020.

pipeline = [
    {"$match": {"year": {"$gte": 2020}}},  # Step 1: Filter recent books
    {"$group": {"_id": "$author", "count": {"$sum": 1}}},  # Step 2: Group by author and count
    {"$sort": {"count": -1}}  # Step 3: Sort by count, descending
]

results = books_collection.aggregate(pipeline)
for author_stats in results:
    print(f"{author_stats['_id']}: {author_stats['count']} books")

Why is this powerful? Each stage feeds its results into the next. $match filters the documents first, so the $group stage has less work to do. A $match at the very start of the pipeline can also use indexes, just like find. This is crucial for performance on large collections.
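Because a pipeline is just a list of dictionaries, you can build one with a function and reuse it. As a sketch, here is a second pipeline that counts the most common tags instead of authors. The function name top_tags_pipeline and its parameters are hypothetical; $unwind and $limit are standard stages ($unwind emits one document per array element, so a book with three tags becomes three documents).

```python
def top_tags_pipeline(since_year, limit=5):
    """Build a pipeline listing the most-used tags on books since a given year."""
    return [
        {"$match": {"year": {"$gte": since_year}}},          # filter first
        {"$unwind": "$tags"},                                # one doc per tag
        {"$group": {"_id": "$tags", "count": {"$sum": 1}}},  # count each tag
        {"$sort": {"count": -1, "_id": 1}},                  # ties broken by name
        {"$limit": limit},                                   # keep the top few
    ]


# Possible usage:
# for row in books_collection.aggregate(top_tags_pipeline(2020)):
#     print(f"{row['_id']}: {row['count']} books")
```

Keeping the builder free of database calls means you can inspect or unit-test the pipeline without a server in sight.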

The most common mistake? Trying to aggregate a cursor you’ve already filtered. aggregate is a method on the collection; the cursor that find returns doesn’t have one. Do your filtering inside the pipeline with $match, not before it with find.

So there you have it. PyMongo is a straightforward, no-nonsense driver that gives you just enough rope to build something amazing or hang yourself with a wildly inefficient query. It respects MongoDB’s power and quirks in equal measure. Now go build something. And for heaven’s sake, use $set.