68.6 Object Interning: Small Integers and String Interning

Right, let’s talk about object interning. It sounds like some kind of corporate punishment, but it’s actually one of those clever, slightly sneaky optimizations that makes you appreciate the hustle of language designers. The core idea is brutally simple: for certain immutable, commonly used objects, don’t create a new one every single time. Instead, maintain a pool of these objects and hand out references to the same one whenever you need it.

Why on earth would we do this? Two huge reasons: it saves memory (obviously), and it lets the interpreter settle many comparisons with a simple identity check (is) instead of a slower value check (==). It’s a way of cheating, legally.

The Case of the Tiny Integers

Let’s start with small integers. You’ve probably heard that in Python, the expression a is b might be True for small numbers but False for larger ones. This isn’t a bug; it’s a deliberate design choice called the “small integer cache.” CPython pre-allocates the most commonly used integers (-5 through 256 in current releases, though this is an implementation detail and other interpreters may differ) when the interpreter starts up.

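a = 42
b = 40 + 2
print(a is b)  # True in CPython. Both are references to the same pre-allocated '42' object.

# Build the larger values at run time. Literals such as 300 and 100 + 200 in the
# same script can be folded and merged by the compiler, which would hide the effect.
c = int("300")
d = int("300")
print(c is d)  # False in CPython. 300 is outside the default cached range.
print(c == d)  # True. Their values are still equal, of course.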

The logic here is rock solid. Numbers in that range are used constantly as loop counters, list indices, small quantities… you name it. It would be wasteful in both memory and time to create a new integer object every time one of them shows up, so CPython doesn’t: it hands back the pre-allocated object. Values outside the range still get a fresh object when they are computed at run time. It’s a brilliantly pragmatic hack. The takeaway? Never use is to compare integers (or any other values). Always use ==. Relying on interning behavior is a one-way ticket to confusing, non-portable bugs.
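
One way to see where the cache starts and stops is to probe it. The sketch below is only an illustration: shares_identity and probe_cache_range are hypothetical helper names, and the probed bounds (-50 to 1000) are arbitrary. Each integer is built from text at run time, because literals in the same script can be merged by the compiler. Whatever it reports describes the interpreter you ran it on, nothing more.

```python
def shares_identity(n: int) -> bool:
    """Return True if two ints built separately from the same text are one object."""
    text = str(n)
    first = int(text)   # built at run time, not a compile-time constant
    second = int(text)  # built again while `first` is still alive
    return first is second


def probe_cache_range(low: int = -50, high: int = 1000) -> list[int]:
    """Collect every value in [low, high] for which identity was shared."""
    return [n for n in range(low, high + 1) if shares_identity(n)]


cached = probe_cache_range()
if cached:
    print(f"shared identity observed from {cached[0]} to {cached[-1]}")
else:
    print("no shared identity observed in the probed range")
```

Treat the printed range as an observation for debugging or curiosity. Production code should still compare integers with ==.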

When Strings Get Pooled

Now, let’s get to string interning. This is where things get both more powerful and more nuanced. CPython automatically interns strings in certain circumstances to avoid storing the same value over and over again in memory. The rules here are less well defined than the integer cache and can change between Python versions, which is why you should understand the why instead of memorizing the what.

The primary trigger for automatic interning is all about identifiers. Names that appear in your source code (variables, functions, attributes) get interned, and in CPython so do string literals that look like identifiers. This makes perfect sense: names need to be compared quickly and often during compilation and execution (e.g., as keys in the dictionaries that hold module globals and object attributes). If every occurrence of "my_very_long_variable_name" were a different object, those lookups would need a full string comparison instead of a quick pointer check.

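# Literals that look like identifiers are often interned automatically (but don't *count* on it)
a = "hello_world"
b = "hello_world"
print(a is b)  # True. This will likely work on any CPython version.

# Literals that look less like identifiers are often NOT interned automatically
c = "hello world!"  # Contains a space and punctuation
d = "hello world!"
print(c is d)  # Often False when typed line by line in the REPL, but True when both
               # lines sit in one script or function, because the compiler merges
               # equal constants within a code object.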
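
Literals are a special case because the compiler sees them. Strings assembled while the program runs are a different story, even when they look exactly like an identifier. Here is a minimal sketch of that, assuming CPython; build_label is a hypothetical helper that exists only to force run-time construction.

```python
def build_label(prefix: str, suffix: str) -> str:
    """Assemble a string at run time so the compiler cannot treat it as a constant."""
    return "_".join((prefix, suffix))


first = build_label("hello", "world")
second = build_label("hello", "world")

# Equality compares the characters, so this holds on every implementation.
print(first == second)

# Identity compares the objects. CPython builds a new string on each call, so this
# is typically False there, but the language does not promise either answer.
print(first is second)
```

This is the situation most real data is in: values read from files, sockets, or user input are built at run time, so automatic interning rarely applies to them.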

You can also manually force a string into the intern pool using the sys.intern() function. This is a power-user move for very specific scenarios, like when you’re processing a massive dataset (e.g., from a log file or NLP pipeline) with tons of repeated string values.

import sys

# Simulate reading a ton of repeated tokens from a file. Each token is built at
# run time, the way a parser would build it, so each one is a separate object.
# (["user_login"] * 1000000 would only repeat a reference to a single object.)
data = ["_".join(("user", "login")) for _ in range(1000000)]

# Without interning, even though the value is the same, each string is a unique object.
# This consumes a massive amount of memory for no good reason.

# With manual interning, we can collapse all duplicates to one object.
interned_data = [sys.intern(item) for item in data]

# Now, every occurrence of "user_login" in interned_data is the exact same object.
# Once data itself is dropped, the duplicates can be freed. We've saved megabytes,
# maybe gigabytes on bigger datasets, for the cost of one pool lookup per item.
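
To check that interning really collapses duplicates, you can count distinct objects before and after. The following is a rough sketch, not a benchmark: make_tokens and distinct_objects are hypothetical helpers, and the list size is arbitrary. Counting id() values is safe here only because every string is still referenced by a list while it is counted.

```python
import sys


def make_tokens(count: int) -> list[str]:
    """Build `count` equal strings at run time, one new object per join() call."""
    return ["_".join(("user", "login")) for _ in range(count)]


def distinct_objects(items) -> int:
    """Count how many different objects the items refer to."""
    return len({id(item) for item in items})


tokens = make_tokens(100_000)
interned_tokens = [sys.intern(token) for token in tokens]

print("distinct objects before interning:", distinct_objects(tokens))
print("distinct objects after interning: ", distinct_objects(interned_tokens))
```

The memory only comes back once the original list is released, so in a real pipeline it makes more sense to intern each token as it is read than to build the full list first.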

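The catch? The intern pool is global, and every sys.intern() call pays for a lookup in it. Interned strings used to be immortal in very old Python versions: intern one and it lived until the interpreter shut down. According to the sys.intern() documentation, that is no longer the case in Python 3: interned strings are not immortal, and you must keep a reference to the returned object to benefit from it. Either way, sys.intern() only pays off when you are certain you have a huge number of repeated strings that will live for a long time. Interning a bunch of unique, transient strings buys you nothing and costs you a lookup every time.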

The Golden Rule and The Gotchas

The golden rule of interning is this: It’s an implementation detail for optimization. Your code should never rely on it for correctness. Your comparisons must always use == unless you are specifically and deliberately testing for object identity.
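
What does “specifically and deliberately testing for object identity” look like? A common case is a sentinel object. The sketch below uses made-up names (_MISSING, get_setting, is_admin) purely for illustration: identity is used where the question is “is this that exact object?”, and equality is used wherever values are compared.

```python
_MISSING = object()  # hypothetical sentinel: a unique object no caller can pass by accident


def get_setting(settings: dict, key: str, default=_MISSING):
    """Return settings[key], or `default` if given, else raise KeyError."""
    value = settings.get(key, _MISSING)
    if value is _MISSING:        # identity: was anything stored at all?
        if default is _MISSING:  # identity: did the caller supply a default?
            raise KeyError(key)
        return default
    return value


def is_admin(role: str) -> bool:
    """Compare by value, so the result does not depend on how `role` was built."""
    return role == "admin"


print(get_setting({"theme": None}, "theme"))  # a stored None is a real value
print(get_setting({}, "theme", default="light"))
print(is_admin("".join(["ad", "min"])))       # built at run time, still equal
```

The sentinel check is correct because there is exactly one _MISSING object by construction, not because an interpreter happened to intern something.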

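The biggest gotcha is assuming the behavior is consistent. The integer cache range has been stable in CPython for a long time, but it is still an implementation detail, and the string interning rules are looser still. Code that accidentally relies on a is b being True for two equal strings might work in your script today and shatter into a million pieces when it runs as part of a larger module tomorrow, or on a different Python version. Recent CPython versions even emit a SyntaxWarning when is is used with a literal, as in "hello" is "hello". The language makes no guarantees, so neither should you.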

Interning is a fantastic trick. It’s your brilliant, lazy friend saying, “Why would we make a hundred copies of the same key when we can just use the one key and pass it around?” Understand it, appreciate it, and use sys.intern() wisely when you’re staring down a memory problem. But otherwise, let the interpreter handle it and keep writing robust, implementation-agnostic code.