53.1 pickle: Serializing Arbitrary Python Objects

The pickle module is Python’s standard mechanism for serializing and deserializing objects. Pickling converts a Python object hierarchy into a byte stream; unpickling reconstructs the hierarchy from that stream. This is useful for saving program state to disk, passing work between processes, and caching computed results. Unlike text-based formats such as JSON, pickle handles a wide range of Python types, including instances of user-defined classes, and it preserves shared and recursive references between objects. Functions and classes are pickled by reference: the stream records their module and qualified name rather than their code, so the same definition must be importable when the data is unpickled.

Basic Usage: dump/load and dumps/loads

The module provides two pairs of functions. The dump() and load() functions write to and read from binary file objects, while dumps() and loads() produce and consume bytes objects (the ’s’ is a holdover from “string”).

import pickle

# Data to be serialized
data = {
    'name': 'Alice',
    'age': 30,
    'hobbies': ['robotics', 'chess'],
    'is_active': True
}

# Serialize to a file
with open('data.pkl', 'wb') as f:  # Note the 'wb' for write-binary
    pickle.dump(data, f)

# Serialize to a bytes object
data_bytes = pickle.dumps(data)
print(f"Serialized bytes: {data_bytes[:50]}...")  # First 50 bytes

# Deserialize from a file
with open('data.pkl', 'rb') as f:  # 'rb' for read-binary
    loaded_from_file = pickle.load(f)

# Deserialize from bytes
loaded_from_bytes = pickle.loads(data_bytes)

print(loaded_from_file == loaded_from_bytes == data)  # Output: True
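
To make “object hierarchies” concrete, the following sketch pickles a structure that contains one list referenced twice and a dict that contains itself. The names shared and container are made up for this illustration. The identity checks show that relationships between objects survive the round trip, while json refuses the same structure because of the cycle.

```python
import json
import pickle

# Hypothetical data for illustration: one list referenced twice,
# plus a dict that holds a reference to itself.
shared = [1, 2, 3]
container = {'first': shared, 'second': shared}
container['self'] = container

restored = pickle.loads(pickle.dumps(container))

# The shared list is restored once and referenced twice, not copied.
print(restored['first'] is restored['second'])
# The recursive reference points back at the restored dict itself.
print(restored['self'] is restored)

# json cannot represent the cycle and raises ValueError.
try:
    json.dumps(container)
    json_ok = True
except ValueError:
    json_ok = False
print(f"json could encode it: {json_ok}")
```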

The Pickle Protocol Versions

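The pickle protocol is the set of rules that defines how the byte stream is formatted. There are six protocols, numbered 0 through 5, and newer ones are generally more compact and more capable. Protocol 4, added in Python 3.4 (PEP 3154), supports very large objects and pickles more kinds of objects; it became the default in Python 3.8. Protocol 5, added in Python 3.8 (PEP 574), adds support for out-of-band data and is the default from Python 3.14 onward. The constant pickle.DEFAULT_PROTOCOL reports the default for the running interpreter.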

# Specify a protocol version
high_protocol_data = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
print(f"Protocol {pickle.HIGHEST_PROTOCOL} bytes length: {len(high_protocol_data)}")

# dump() accepts the same argument (protocol 5 requires Python 3.8 or later)
with open('data_v5.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=5)

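When both sides are under your control, use the highest protocol that both the pickling and the unpickling interpreter support; it usually gives the smallest output and the best performance. A pickle written with a newer protocol cannot be read by a Python version that predates that protocol.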
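
As a rough way to see how protocols differ, the sketch below pickles the same dictionary under every protocol the running interpreter supports and reports the size of each result. It also uses pickletools.genops() to read the first opcode of each stream; protocols 2 and later start with a PROTO opcode that names the protocol. The sample data is an assumption for illustration, and the sizes will vary with the data and the Python version.

```python
import pickle
import pickletools

# Sample data, assumed for this illustration.
data = {'name': 'Alice', 'age': 30, 'hobbies': ['robotics', 'chess']}

print(f"Default protocol here: {pickle.DEFAULT_PROTOCOL}")
print(f"Highest protocol here: {pickle.HIGHEST_PROTOCOL}")

sizes = {}
declared = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    sizes[proto] = len(blob)
    # genops() yields (opcode, argument, position) without executing anything.
    first_op, arg, _pos = next(pickletools.genops(blob))
    # Protocols 0 and 1 carry no PROTO header, so record None for them.
    declared[proto] = arg if first_op.name == 'PROTO' else None
    print(f"protocol {proto}: {sizes[proto]} bytes, declared protocol: {declared[proto]}")

# Every protocol round-trips the same data on this interpreter.
round_trips = all(
    pickle.loads(pickle.dumps(data, protocol=p)) == data
    for p in range(pickle.HIGHEST_PROTOCOL + 1)
)
print(f"All protocols round-trip: {round_trips}")
```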

What Can and Cannot Be Pickled

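Most Python objects can be pickled: built-in types (lists, dicts, strings, integers, etc.), instances of user-defined classes (provided the class is importable in the unpickling environment), and functions and classes defined at the top level of a module. Functions and classes are stored by name only, so lambdas and functions nested inside other functions cannot be pickled. Objects whose state is tied to external resources, such as open file objects, network connections, database cursors, or running threads, cannot be meaningfully reconstructed from a byte stream. Attempting to pickle one typically raises TypeError (for example, “cannot pickle '_io.TextIOWrapper' object”); other unpicklable objects raise pickle.PicklingError.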

import pickle

def my_function(x):
    return x * 2

# This works: a module-level function is pickled by reference (its name, not its code)
pickled_func = pickle.dumps(my_function)
unpickled_func = pickle.loads(pickled_func)
print(unpickled_func(5))  # Output: 10

# This will fail: an open file object cannot be pickled
f = open('test.txt', 'w')
try:
    pickle.dumps(f)
except (pickle.PicklingError, TypeError) as e:  # file objects raise TypeError
    print(f"Error: {e}")
finally:
    f.close()
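
The by-reference rule can be checked directly. The sketch below pickles the built-in len and confirms that the stream contains only its module and name, then shows that a lambda, which has no importable name, is rejected. The variable names are illustrative, and the except clause lists several exception types because the exact type raised has varied between Python versions.

```python
import pickle

# A built-in function is pickled by reference: the stream holds its
# module and name, not its implementation.
blob = pickle.dumps(len)
print(b'builtins' in blob, b'len' in blob)

# Unpickling looks the name up again and returns the very same object.
same_object = pickle.loads(blob) is len
print(f"Same object after round trip: {same_object}")

# A lambda cannot be found by name, so pickling it fails.
double = lambda x: x * 2
try:
    pickle.dumps(double)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    lambda_pickled = False
    print(f"Cannot pickle lambda: {type(exc).__name__}")
```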

Security Implications and the __reduce__ Exploit

This is the most critical pitfall to understand. The pickle module is not secure. Unpickling data from an untrusted source (e.g., a user upload or an unauthenticated network request) is extremely dangerous, because a pickle stream is effectively a small program: it can instruct the unpickler to import any callable and call it with arbitrary arguments. A maliciously crafted stream, for instance one produced by a class whose __reduce__ method returns a callable of the attacker’s choosing, can therefore run arbitrary code during unpickling.

import pickle
import subprocess

class MaliciousClass:
    def __reduce__(self):
        # This will execute a system command when the object is unpickled
        return (subprocess.Popen, (('echo', 'I am a security breach!'),))

# Create the malicious payload
malicious_data = pickle.dumps(MaliciousClass())

# NEVER DO THIS WITH UNTRUSTED DATA
# On a system with an echo executable, this spawns the process,
# which prints "I am a security breach!" to the console
pickle.loads(malicious_data)

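Because of this, you must never unpickle data received from an untrusted or unauthenticated source. For data exchange across trust boundaries, prefer a data-only format such as JSON, whose parser builds plain values and does not call arbitrary code. XML is not automatically safe either, since XML parsers have their own classes of attack. If pickle must be used, authenticate the data first (for example, by verifying a signature computed with the hmac module) or restrict which globals the unpickler is allowed to load.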
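
One way to limit what a pickle can reach is to subclass pickle.Unpickler and override find_class(), the method the unpickler calls whenever the stream asks for a global. The sketch below follows the approach described in the pickle documentation; the allowlist SAFE_BUILTINS and the helper restricted_loads() are assumptions chosen for this example, not a vetted policy. An allowlist narrows the attack surface but does not make untrusted pickles safe in general.

```python
import builtins
import io
import pickle

# Hypothetical allowlist for this sketch: the only globals a stream may load.
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the stream references.
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data and allowlisted builtins load normally.
ok = restricted_loads(pickle.dumps([1, 2, range(15)]))
print(ok)

# A hand-written protocol 0 stream that asks for os.system is rejected
# before anything is called.
payload = b"cos\nsystem\n(S'echo hello world'\ntR."
try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(f"Rejected: {exc}")
```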

Controlling Pickling Behavior with Special Methods

You can define how your custom classes are pickled by implementing the __getstate__() and __setstate__() methods. __getstate__() returns the state object to be pickled (e.g., the instance dictionary with transient or recomputable data removed), and __setstate__() receives that object and uses it to restore the instance’s state upon unpickling. Unpickling does not call __init__(), so any resource that __init__() normally creates must be recreated in __setstate__().

import pickle

class DatabaseConnection:
    def __init__(self, connection_string):
        self.connection_string = connection_string
        self.connection = self._connect_to_db()  # Simulated resource
        self.last_accessed = None

    def _connect_to_db(self):
        print("Establishing expensive connection...")
        return "connection_handle"

    def __getstate__(self):
        # Only pickle the data that needs to be saved, not the live connection.
        state = self.__dict__.copy()
        del state['connection']  # Remove the unpickleable resource
        return state

    def __setstate__(self, state):
        # Restore the instance's dictionary
        self.__dict__.update(state)
        # Recreate the connection; the live handle was deliberately not pickled.
        self.connection = self._connect_to_db()

# Usage
db_obj = DatabaseConnection('my-db-server')  # Output: "Establishing expensive connection..."
pickled_db = pickle.dumps(db_obj)  # No output: __getstate__ drops the connection
new_db_obj = pickle.loads(pickled_db)  # Output: "Establishing expensive connection..." (from __setstate__)
print(f"Reconnected with: {new_db_obj.connection_string}")
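
The same pattern applies to attributes that genuinely cannot be pickled, such as a threading.Lock. The sketch below uses a hypothetical Counter class (not part of any library) that drops its lock in __getstate__() and creates a fresh one in __setstate__(). It assumes the class is defined in a script run directly or in an importable module, since pickle stores instances by class name.

```python
import pickle
import threading

class Counter:
    """Hypothetical example class: a counter guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def increment(self):
        with self._lock:
            self.value += 1

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_lock']  # drop the unpicklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()  # fresh lock for the restored object

counter = Counter()
for _ in range(3):
    counter.increment()

restored = pickle.loads(pickle.dumps(counter))
restored.increment()  # works because __setstate__ recreated the lock

print(f"Original value: {counter.value}, restored value: {restored.value}")
print(f"Locks are distinct objects: {restored._lock is not counter._lock}")
```

The copy protects the live object: deleting the key from self.__dict__ directly would remove the lock from the instance that is still in use.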