53.3 Customizing Pickling: __getstate__ and __setstate__
Python’s pickle module serializes most objects automatically by storing their __dict__ attribute, but this default behavior is insufficient for some objects. An object may contain data that shouldn’t be persisted (like open file handles or database connections), reference other non-serializable objects, or carry a large state of which only a small part is needed to rebuild it. For these scenarios, Python provides the __getstate__() and __setstate__() methods, which let you control what is saved during pickling and how it is restored during unpickling.
The Default Pickling Mechanism and Its Limitations
By default, pickle serializes an object by saving its __dict__. This works well for simple classes but fails or falls short in several common cases:
- Non-Serializable Attributes: If an object contains attributes that cannot be pickled (e.g., a lambda function, an open file object), the entire pickling operation will fail with an exception, typically a PicklingError or a TypeError.
- Transient Data: Objects might have attributes that are only relevant to the current runtime session, such as cached computations, temporary locks, or loggers. Persisting these is unnecessary and can cause errors upon unpickling.
- Security or Privacy Concerns: You may want to deliberately exclude sensitive data (like passwords) from the serialized output.
- Performance Optimization: The full state of an object might be large, but only a small subset is needed to reconstruct it. Custom pickling can store a minimal representation.
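The first limitation is easy to reproduce. The sketch below uses a hypothetical Counter class (not part of this section's running example) that holds a threading.Lock, and shows that one unpicklable attribute is enough to make default pickling of the whole object fail.

```python
import pickle
import threading


class Counter:
    """Hypothetical class holding a lock, which pickle cannot serialize."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()  # runtime-only resource


counter = Counter()

try:
    pickle.dumps(counter)
    outcome = "pickled"
except (TypeError, pickle.PicklingError) as exc:
    # The picklable 'count' attribute does not help: the lock
    # makes the whole dumps() call fail.
    outcome = f"failed: {type(exc).__name__}"

print(outcome)
```

Both exception types are caught because the type raised depends on which kind of object turns out to be unpicklable.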
Taking Control with __getstate__
The __getstate__() method allows you to define exactly what should be pickled. When defined on a class, pickle will call this method instead of using the object’s __dict__. The method can return any picklable object (e.g., a dictionary, a tuple, a string). This returned object becomes the serialized representation of your instance.
import pickle

class DatabaseConnection:
    def __init__(self, connection_string, cache_size=100):
        self.connection_string = connection_string
        self.cache = {}  # Runtime cache, shouldn't be pickled
        self.cache_size = cache_size
        # Simulate an expensive connection setup
        self._establish_connection()

    def _establish_connection(self):
        print(f"Establishing connection to {self.connection_string}")
        # In a real scenario, this would create a DB-API connection.
        # The string below is only a placeholder; a real connection
        # object would not be picklable.
        self.connection = "ACTIVE_CONNECTION"

    def __getstate__(self):
        # Return a state that contains only the data needed for reconstruction.
        # We explicitly exclude the non-picklable 'connection' and the transient 'cache'.
        state = self.__dict__.copy()
        del state['connection']
        del state['cache']
        return state

# Create an instance
db_conn = DatabaseConnection('postgresql://localhost/mydb')

# Pickle it. __getstate__ is called, excluding the problematic attributes.
serialized_data = pickle.dumps(db_conn)
print("Object pickled successfully.")
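The effect of dropping an attribute in __getstate__() can be checked directly. The following is a minimal sketch with a hypothetical Session class and a made-up token value. It defines only __getstate__(), so the default restore logic applies and the excluded attribute is simply missing on the unpickled object.

```python
import pickle


class Session:
    """Hypothetical class that keeps a secret token out of its pickle."""

    def __init__(self, user, token):
        self.user = user
        self.token = token  # sensitive, should not be persisted

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("token", None)  # drop the secret before serializing
        return state


original = Session("alice", "s3cr3t")
payload = pickle.dumps(original)
restored = pickle.loads(payload)

print(restored.user)               # the kept attribute survives
print(hasattr(restored, "token"))  # the excluded attribute does not
print(b"s3cr3t" in payload)        # and it is absent from the bytes
```

Because the restored object lacks the attribute entirely, any code that reads it would raise AttributeError. Rebuilding or defaulting such attributes is the job of __setstate__().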
Reconstructing State with __setstate__
The counterpart to __getstate__() is __setstate__(). This method is called during unpickling and is passed the object that was returned by __getstate__(). Its job is to use this state information to restore the object. Unpickling does not call __init__(), so __setstate__() is responsible for rebuilding any complex or non-serializable parts of the object that were omitted during pickling. Note that if __getstate__() returns a false value (such as an empty dictionary), __setstate__() is not called at all.
class DatabaseConnection:
    # ... __init__ and __getstate__ from previous example ...

    def __setstate__(self, state):
        # Restore the attributes that were pickled by __getstate__
        self.__dict__.update(state)
        # Now, we must manually reconstruct the parts we omitted.
        # Reinitialize the transient cache.
        self.cache = {}
        # Re-establish the database connection, which is the expensive operation
        # we avoided pickling.
        self._establish_connection()

# Unpickle the object. __setstate__ is called with the saved state.
reloaded_db_conn = pickle.loads(serialized_data)
print(f"Unpickled. Connection string: {reloaded_db_conn.connection_string}")
print(f"Cache restored: {reloaded_db_conn.cache}")

# Output:
# Establishing connection to postgresql://localhost/mydb
# Unpickled. Connection string: postgresql://localhost/mydb
# Cache restored: {}
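Two points from this section can be verified in a few lines: the state does not have to be a dictionary, and __init__() is not run when an object is unpickled. The sketch below uses a hypothetical Point class that pickles itself as a tuple and counts its own __init__() calls.

```python
import pickle


class Point:
    """Hypothetical class that pickles itself as a compact tuple."""

    init_calls = 0  # counts how often __init__ runs

    def __init__(self, x, y):
        Point.init_calls += 1
        self.x = x
        self.y = y

    def __getstate__(self):
        # Any picklable object is allowed; a tuple is smaller than a dict.
        return (self.x, self.y)

    def __setstate__(self, state):
        # Receives exactly what __getstate__ returned.
        self.x, self.y = state


p = Point(3, 4)
q = pickle.loads(pickle.dumps(p))

print(q.x, q.y, Point.init_calls)  # prints: 3 4 1
```

The counter stays at 1 after the round trip, which is why any setup normally done in __init__() has to be repeated in __setstate__() if the restored object needs it.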
Common Patterns and Best Practices
Inheritance and super(): When using inheritance, it’s good practice to cooperate with the parent class. You can call the parent’s __getstate__() or __setstate__() if it exists, usually through super(). Since Python 3.11, object supplies a default __getstate__(), so that call always resolves; on older versions it only works if some parent class defines the method.
class SpecializedDBConn(DatabaseConnection):
    def __init__(self, connection_string, special_param):
        super().__init__(connection_string)
        self.special_param = special_param

    def __getstate__(self):
        # Get the state from the parent class
        state = super().__getstate__()
        # The parent's copy of __dict__ already contains special_param;
        # setting it explicitly avoids depending on that detail.
        state['special_param'] = self.special_param
        return state

    def __setstate__(self, state):
        # Extract our specific state before updating
        self.special_param = state.pop('special_param')
        # Let the parent class restore its state
        super().__setstate__(state)
Minimal State: The core principle is to store the minimal information required for faithful reconstruction. This makes the serialized data smaller, more portable, and often more secure.
Versioning: If the structure of your state dictionary changes over time (e.g., you rename an attribute), older pickled files may become unreadable. A common pattern is to include a version number in your state to handle migrations gracefully in __setstate__().
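One way to apply the versioning pattern is sketched below. The UserProfile class, the "_version" key, and the rename from name to full_name are all hypothetical choices made for illustration; state without a version tag is treated as version 1.

```python
import pickle


class UserProfile:
    """Hypothetical class whose 'name' attribute was renamed to 'full_name'."""

    STATE_VERSION = 2

    def __init__(self, full_name):
        self.full_name = full_name

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_version"] = self.STATE_VERSION  # tag the state layout
        return state

    def __setstate__(self, state):
        state = dict(state)                 # avoid mutating the caller's dict
        version = state.pop("_version", 1)  # untagged state counts as v1
        if version < 2:
            state["full_name"] = state.pop("name")  # migrate v1 -> v2
        self.__dict__.update(state)


# Current-format round trip.
current = pickle.loads(pickle.dumps(UserProfile("Ada Lovelace")))

# Simulate state from an old pickle by feeding a version-1 dict directly.
legacy = UserProfile.__new__(UserProfile)
legacy.__setstate__({"name": "Grace Hopper"})

print(current.full_name, "|", legacy.full_name)
```

Keeping each migration step behind a version check lets __setstate__() accept several generations of saved data, at the cost of carrying the old layouts in the code for as long as old pickles may exist.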
Pitfalls and Edge Cases
- Forgetting to Cooperate: A common mistake is to override __getstate__/__setstate__ in a subclass without calling the base class versions, so the parent’s state is silently dropped or never rebuilt.
- Infinite Recursion: Avoid calling pickle.dumps(self) inside __getstate__(). Pickling self calls __getstate__() again, and the recursion continues until Python raises a RecursionError.
- Security Warning Reminder: Custom pickling does not make the pickle module secure. The fundamental warning remains: never unpickle data from untrusted sources. An attacker can craft a malicious pickle that executes arbitrary code as soon as it is loaded, whether or not your classes define __setstate__().
- The __getnewargs_ex__ Alternative: For even finer control, particularly when object creation (__new__) is involved, you can also look into __getnewargs_ex__(), which supplies the positional and keyword arguments passed to __new__() when the object is recreated, before __setstate__() is called.
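The last point can be illustrated with a small sketch. The Temperature class below is hypothetical; its __new__() requires an argument, so recreating the object with a bare __new__(cls) call would fail. The example assumes the default pickle protocol of a current Python 3 release.

```python
import pickle


class Temperature:
    """Hypothetical class whose __new__ requires arguments."""

    def __new__(cls, value, *, unit="C"):
        self = super().__new__(cls)
        self.value = value
        self.unit = unit
        return self

    def __getnewargs_ex__(self):
        # A pair (positional args, keyword args) handed to __new__
        # when the object is recreated during unpickling.
        return (self.value,), {"unit": self.unit}


reading = Temperature(21.5, unit="F")
restored = pickle.loads(pickle.dumps(reading))

print(restored.value, restored.unit)
```

Here the instance __dict__ is still pickled and restored as usual after __new__() returns, so __getnewargs_ex__() complements __getstate__() and __setstate__() rather than replacing them.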