13.8 CPython Hash Table Internals
At the heart of every Python dictionary lies a highly optimized hash table implementation, a data structure designed for near-constant time O(1) average complexity for lookups, insertions, and deletions. CPython’s implementation is a marvel of engineering, balancing memory efficiency with raw speed. Understanding its internals is crucial for writing high-performance Python code and for debugging seemingly bizarre behavior.
The Hash Table Structure
A hash table works by using a hash function to map a key to an integer index in an underlying array. CPython’s dict implementation uses a structure often referred to as “open addressing” with a “pseudo-random” probing sequence to resolve collisions (when two keys hash to the same index). The table is essentially a dense array of entries. Each entry is a C struct containing three fields: the hash of the key, a pointer to the key, and a pointer to the value. A crucial optimization is that the table itself only stores these entry structs, while a separate, smaller array holds the indices to the entries. This indirection allows for more efficient probing and table resizing.
The probing sequence isn’t linear (checking the next slot, then the next) but uses a formula that scatters entries across the table, reducing primary clustering. The index is computed as (hash * 5 + 1 + perturb) % table_size, where perturb is adjusted on each probe step. This design ensures that even if keys have similar hashes, their probe sequences diverge quickly.
# This simplified code illustrates the probing concept, not the exact CPython algorithm.
def find_slot(key, table_size):
hash_val = hash(key)
index = hash_val % table_size
perturb = hash_val
while table[index] is not None and table[index].key != key:
perturb >>= 5 # Shift perturb to use its higher bits next time
index = (5 * index + 1 + perturb) % table_size
return index
Resizing and Its Impact on Performance
The hash table must be resized to maintain efficiency. As the number of entries increases, the table becomes more crowded, leading to longer probe sequences and degraded performance. CPython dynamically resizes the table when it becomes about two-thirds full. The resizing strategy aims to keep the table sparse enough to minimize collisions.
Resizing involves allocating a new, larger array (typically roughly twice the current size, but always a power of two for modulo calculation efficiency) and reinserting every existing entry into this new array. Because the modulo operation (hash % new_size) changes with the new size, the index for every key is recalculated. This is an O(n) operation, meaning the cost of a single insertion can be O(n) in the worst case, but when amortized over many insertions, the average cost remains O(1). This is why inserting a large number of items at once is more efficient than inserting them one by one in a loop that might trigger multiple resizes.
import sys
d = {}
original_size = sys.getsizeof(d)
# Insert items until a resize is triggered
for i in range(20):
d[i] = i
current_size = sys.getsizeof(d)
if current_size != original_size:
print(f"Resize at i={i}: {original_size} -> {current_size} bytes")
original_size = current_size
Memory Overhead and Key Ordering (Python 3.6+)
A significant change in Python 3.6, officially made a language guarantee in 3.7, was the preservation of insertion order. This was a side-effect of a memory optimization. The old implementation maintained a sparse table of 24-byte entries. The new implementation uses two data structures: a dense array that stores entries in insertion order (which is why order is preserved) and a separate index array that points to entries in the dense array.
This design reduces memory overhead significantly. The index array is an array of integers (often 1, 2, or 4 bytes each depending on table size), which is much smaller than an array of 24-byte structs. The entries themselves are stored without wasted space. This makes dictionaries in modern Python versions both faster and more memory-efficient than their predecessors.
Pitfalls: Hash Collisions and Attack Vectors
A fundamental assumption of hash tables is that hash collisions are rare. However, if an attacker can supply keys that all hash to the same value, they can force the table into its worst-case performance scenario, turning O(1) operations into O(n). This is known as a Hash-DoS attack. Early Python versions were vulnerable to this.
Python now uses a process called “hash randomization.” On startup, Python generates a random seed, which is used to perturb the hash function for built-in types like strings. This means the hash value for the same string will be different each time a Python process runs, making it impossible for an attacker to predictably generate colliding keys. This is a crucial security feature.
# Demonstrating hash randomization (output will differ on each run)
# Run this script multiple times to see different hash values.
print("Hash of 'hello':", hash("hello"))
print("Hash of 'world':", hash("world"))
Best Practices for Keys and Performance
- Use Hashable, Immutable Keys: Keys must be immutable types (like
str,int,float,tuple) so their hash value cannot change. If a key’s hash changed after being stored, it would become impossible to find. - Avoid Resizing in Critical Loops: If you know the approximate final size of a dictionary, pre-allocate it using the
dict.fromkeys()method or by passing an initial approximate size to the constructor to minimize resizes. - Understand the Cost of Deletions: Deleting an entry doesn’t trigger a resize, but it leaves a “dummy” entry in the table to ensure the probe sequence for other keys remains valid. A dictionary with many deletions can become sparsely populated, wasting memory. In performance-critical sections, consider creating a new dict instead of modifying one with many deletions.
# Pre-allocating a dictionary
known_size = 100000
# This hint is used to create an appropriately sized table from the start.
pre_sized_dict = dict.fromkeys(range(known_size))
# Alternatively, the initial size hint can be provided.
pre_sized_dict_2 = {}.__init__(known_size)