8.10 String Interning and Performance
In Python, string interning is a memory optimization in which the interpreter keeps a single shared copy of each distinct string value that it interns. Sharing is safe because strings are immutable. Interning is not applied to every string; it targets certain string literals and identifiers. The main motivation is performance: when equal strings are guaranteed to be the same object, the key comparison inside a dictionary lookup (an operation fundamental to Python’s own workings) can succeed on a pointer comparison, and identity checks (is comparisons) become reliable for those strings.
The mechanics rely on a special internal dictionary, known as the “interned” dictionary. When a string is a candidate for interning, Python looks it up in this dictionary. If an equal string is already there, a reference to the existing object is returned. If not, the new string is added to the dictionary and a reference to it is returned. This is why, when a and b are both interned, a == b implies that a is b is True.
Which Strings are Interned?
The rules for which strings are automatically interned have evolved across Python versions and are implementation-specific (CPython vs. PyPy, etc.). However, in modern CPython, the general rules are:
- String Literals: A string literal in source code that looks like an identifier (only ASCII letters, digits, or underscores) is interned at compile time, as are the names used for variables, functions, and attributes. For example, "hello_world" and "spam123" are interned.
- Strings of Length 0 or 1: The empty string and single-character strings are shared objects (in CPython, this covers single characters in the Latin-1 range).
- Explicit Interning: Any string can be manually interned using the sys.intern() function.
Strings that are created programmatically at runtime (e.g., via concatenation, reading from a file, user input, or network operations) are generally not automatically interned.
# Automatic interning of identifier-like literals
a = "hello_world"
b = "hello_world"
print(a is b) # Output: True
# Automatic interning of single characters
c = "x"
d = "x"
print(c is d) # Output: True
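# Non-identifier literals are not interned, so identity is implementation-dependent.
# CPython merges equal constants compiled in the same module, so this prints True
# when run as a script, but it can print False when the two lines are entered
# separately in an interactive session.
e = "hello world!" # Contains a space and punctuation
f = "hello world!"
print(e is f) # Output: True or False (do not rely on either)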
g = "hello"
h = " world"
i = g + h # String created via concatenation at runtime
j = "hello world"
print(i is j) # Output: False
The sys.intern() Function
For scenarios where you have many duplicate string instances that are not automatically interned, you can manually intern them using sys.intern(). This is a powerful tool for optimizing memory usage in applications that might process large volumes of repetitive string data, such as parsing CSV files, processing natural language text, or handling tokens.
import sys
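# Simulating many duplicate strings read from an external source.
# Note: ["processor"] * 10000 would repeat one reference to a single object,
# so each string is built at runtime to get 10,000 separate objects.
data_list = ["".join(("proc", "essor")) for _ in range(10000)]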
# Manually interning them forces them to all point to the same object
interned_data_list = [sys.intern(item) for item in data_list]
# All elements are now the same object, saving significant memory
print(all(interned_data_list[0] is item for item in interned_data_list)) # Output: True
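To put a rough number on the saving, one option is to sum `sys.getsizeof` over the distinct string objects in a list before and after interning. The helper `bytes_held` below is a hypothetical name, and the figures are an approximation that ignores the list's own storage and allocator overhead.

```python
import sys

def bytes_held(strings):
    """Sum sys.getsizeof over the distinct string objects in `strings`."""
    distinct = {id(s): s for s in strings}  # one entry per object, not per value
    return sum(sys.getsizeof(s) for s in distinct.values())

# Equal strings built at runtime, so each element is its own object in CPython.
raw_tokens = ["".join(("proc", "essor")) for _ in range(1000)]
interned_tokens = [sys.intern(s) for s in raw_tokens]

bytes_before = bytes_held(raw_tokens)
bytes_after = bytes_held(interned_tokens)
print(bytes_before, bytes_after)  # the second figure is the size of one string
```

The saving only materializes once the original, non-interned objects are released, for example when `raw_tokens` goes out of scope.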
Performance Implications and Best Practices
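Besides saving memory, interning can speed up dictionary lookups and makes identity checks usable. A string’s hash is computed from its contents and cached on the string object; it is not derived from the memory address. During a lookup, CPython first checks whether the candidate key is the same object as the stored key, and only if it is not does it compare the hashes and then the contents. When both keys are interned, the identity check succeeds immediately and the character-by-character comparison is skipped. An identity check (is) is a pointer comparison and is faster than a value comparison (==) on distinct objects, although == on strings also returns early when both operands are the same object.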
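A quick check confirms that dictionary lookups depend on a string's value rather than its identity: an equal string built at runtime hashes to the same value and finds the same entry. The variable names here are illustrative only.

```python
key_literal = "user_name"
key_runtime = "".join(["user", "_name"])  # equal value, built at runtime

table = {key_literal: 1}

print(key_runtime == key_literal)              # True: same value
print(hash(key_runtime) == hash(key_literal))  # True: the hash comes from the contents
print(table[key_runtime])                      # 1: the lookup does not need identity
```

Interning therefore never changes whether a lookup succeeds; it can only make the key comparison cheaper.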
Best Practice: Use sys.intern() when you have a demonstrable memory or performance problem related to a high volume of duplicate string instances. Don’t pre-emptively intern everything, as the interned dictionary itself consumes memory and the interning process has a cost. Profile your application first to identify if string duplication is a real issue.
Common Pitfall: Do not rely on automatic interning for logic that uses the is operator. Because interning is not guaranteed for all strings, comparing strings with is is a bug, even when it happens to work in testing. CPython 3.8 and later emit a SyntaxWarning when is is used with a string literal. Always use == for value equality checks, and reserve is for singletons such as None.
# DANGER: This is a bug waiting to happen.
def dangerous_function(token):
    if token is "SUPER_ADMIN": # May work sometimes if "SUPER_ADMIN" is interned
        grant_all_access()
    else:
        deny_access()

# SAFE: Always do this instead.
def safe_function(token):
    if token == "SUPER_ADMIN": # Always checks the value correctly
        grant_all_access()
    else:
        deny_access()
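The failure mode is easy to reproduce when the token is built at runtime, as it would be when it arrives from a request or a file. The helpers `check_with_is` and `check_with_eq` are hypothetical stand-ins for the two functions above, written to take the expected value as a parameter.

```python
import sys

def check_with_is(token, expected):
    return token is expected   # identity: are these the same object?

def check_with_eq(token, expected):
    return token == expected   # equality: do these have the same value?

expected = "SUPER_ADMIN"
token = "".join(["SUPER", "_ADMIN"])  # same value, assembled at runtime

print(check_with_eq(token, expected))  # True
print(check_with_is(token, expected))  # False on CPython: two distinct objects

# Explicitly interning both sides makes identity agree with equality.
print(check_with_is(sys.intern(token), sys.intern(expected)))  # True
```

Even with explicit interning, == remains the safer choice, because it stays correct if one side is ever left uninterned.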
When Not to Intern
Interning is not free. The interned dictionary must be maintained, which costs memory and CPU cycles. Interning a string that is never repeated is a net loss, because you have added an entry to the dictionary without saving any memory. Interning is therefore most effective when applied to strings with a high probability of duplication. Very long strings are rarely good candidates: the cost of hashing and comparing them during the intern lookup can outweigh the benefit, and the probability of accidental duplication is low.
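One way to act on this is to intern selectively. The sketch below interns only strings up to a length cutoff; `intern_if_short` and `MAX_INTERN_LENGTH` are hypothetical names, and the value 64 is an arbitrary assumption that should be tuned from profiling rather than taken as a recommendation.

```python
import sys

MAX_INTERN_LENGTH = 64  # assumed cutoff for illustration; tune from profiling

def intern_if_short(s: str, max_len: int = MAX_INTERN_LENGTH) -> str:
    """Intern s only if it is short enough to be a likely duplicate."""
    if len(s) <= max_len:
        return sys.intern(s)
    return s  # long strings are returned unchanged

a = intern_if_short("".join(["status", "_ok"]))
b = intern_if_short("".join(["status", "_ok"]))
long_text = "x" * 1000

print(a is b)                                   # True: both were interned
print(intern_if_short(long_text) is long_text)  # True: returned as-is, not interned
```

Length is only a proxy for how likely a string is to repeat; a cutoff based on measured duplication in your own data would be more precise.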