46.3 Race Conditions and Why They Happen

A race condition is a flaw in which a program’s output, or the system’s state, depends on the relative timing of events that the program does not control. Most often those events are the unsynchronized, concurrent execution of multiple threads. The core of the problem is the “critical section”: a piece of code that accesses a shared resource (a variable, a file, a data structure) and must not be executed by more than one thread at a time. When multiple threads enter a critical section without coordination, their operations can interleave so that the final state of the shared resource is incorrect, corrupted, or inconsistent.

The fundamental “why” lies in preemptive multitasking and in how high-level code is translated into low-level instructions. An operating system’s thread scheduler can pause (preempt) a thread at any point in its execution to let another thread run, including in the middle of an operation that should be atomic. From a single-threaded perspective, a line of code like counter += 1 appears to be a single, indivisible operation. Under the hood, however, it compiles to multiple bytecode or machine-level instructions:

Read the current value of counter into a register.
Add 1 to the value in the register.
Write the new value from the register back to counter.

When two threads execute this sequence without synchronization, their instructions can interleave, leading to a “lost update.”
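The read, add, and write steps described above can be inspected with the standard dis module. The following is a minimal sketch, assuming CPython and a hypothetical bump() function written only for this illustration; the exact instruction names differ between Python versions, but the load and the store show up as separate instructions.

```python
import dis

counter = 0

def bump():
    """Increment the module-level counter once (hypothetical helper)."""
    global counter
    counter += 1

# Collect the bytecode instruction names for bump().
# The exact names vary by Python version, so none are hard-coded as output here.
opnames = [instr.opname for instr in dis.get_instructions(bump)]
print(opnames)

# The load of counter and the store back to counter are distinct instructions,
# with the addition in between.
load_index = opnames.index("LOAD_GLOBAL")
store_index = opnames.index("STORE_GLOBAL")
print(f"load at position {load_index}, store at position {store_index}")
```

Whether a thread switch can actually occur between those instructions depends on the interpreter, but nothing in the language guarantees that the sequence runs as one unit.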

The Classic Lost Update Problem

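Consider a simple counter shared between two threads. Each thread increments the counter 1,000,000 times, so the expected final value is 2,000,000. Because the threads’ read, add, and write steps can interleave, the final result can be less. How often that happens depends on the interpreter and on where it switches threads: on some versions this exact loop may print the correct total, but the code is unsynchronized and incorrect either way.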

import threading

class UnsafeCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1

def test_counter(counter, num_increments):
    for _ in range(num_increments):
        counter.increment()

# Create a shared counter
unsafe_counter = UnsafeCounter()
num = 1_000_000

# Create two threads that will both increment the counter
thread1 = threading.Thread(target=test_counter, args=(unsafe_counter, num))
thread2 = threading.Thread(target=test_counter, args=(unsafe_counter, num))

thread1.start()
thread2.start()
thread1.join()
thread2.join()

print(f"Expected final value: {2 * num}")
print(f"Actual final value: {unsafe_counter.value}")
# May print less than 2,000,000 (e.g., 1,832,471); the value varies between runs
# and interpreter versions, and some runs may print the correct total.
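Whether the program above actually loses updates depends on where the interpreter happens to switch threads. The following is a minimal sketch of one way to make the race easier to observe, assuming a hypothetical WidenedRaceCounter class that is not part of the original example. It splits the increment into an explicit read and write and calls time.sleep(0) in between, which gives other threads a chance to run inside the critical section.

```python
import threading
import time

class WidenedRaceCounter:  # hypothetical name, for illustration only
    def __init__(self):
        self.value = 0

    def increment(self):
        current = self.value       # read the shared value
        time.sleep(0)              # give other threads a chance to run here
        self.value = current + 1   # write, possibly based on a stale read

def run(counter, num_increments):
    for _ in range(num_increments):
        counter.increment()

widened = WidenedRaceCounter()
per_thread = 2_000
expected_total = 2 * per_thread

workers = [
    threading.Thread(target=run, args=(widened, per_thread))
    for _ in range(2)
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(f"Expected final value: {expected_total}")
print(f"Actual final value: {widened.value}")  # not deterministic; may be lower
```

The sleep does not create the bug; it only widens a window that already exists between the read and the write, so the actual value still varies from run to run.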

The interleaving that causes the lost update might look like this:

Thread 1 reads the value (e.g., 100).
The scheduler preempts Thread 1 and switches to Thread 2.
Thread 2 reads the value (still 100), increments it to 101, and writes 101 back.
The scheduler switches back to Thread 1, which still has the old value (100) in its register.
Thread 1 increments its local value to 101 and writes 101 back to memory.

The increment performed by Thread 2 has been completely lost. Both threads read the same base value, so the final state only reflects a single increment instead of two.
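The interleaving above can be replayed deterministically without real threads. The sketch below is an illustration only: replay() and the schedule lists are hypothetical names, and each thread’s register is modeled as an entry in a dictionary.

```python
def replay(schedule, start=100):
    """Replay (thread, step) pairs against a single shared value."""
    shared = start
    registers = {}  # each thread's private copy of the value
    for thread, step in schedule:
        if step == "read":
            registers[thread] = shared
        elif step == "add":
            registers[thread] += 1
        elif step == "write":
            shared = registers[thread]
        else:
            raise ValueError(f"unknown step: {step}")
    return shared

# Thread 1 is preempted right after its read, as in the walkthrough above.
racy_schedule = [
    ("T1", "read"),
    ("T2", "read"), ("T2", "add"), ("T2", "write"),
    ("T1", "add"), ("T1", "write"),
]

# Each thread finishes all three steps before the other starts.
serial_schedule = [
    ("T1", "read"), ("T1", "add"), ("T1", "write"),
    ("T2", "read"), ("T2", "add"), ("T2", "write"),
]

lost_update_result = replay(racy_schedule)
serial_result = replay(serial_schedule)
print(lost_update_result)  # 101: one increment was lost
print(serial_result)       # 102: both increments are kept
```

The two schedules contain exactly the same six steps; only their order differs, which is what makes the result timing-dependent.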

Not Just Increments: The Check-Then-Act Pitfall

Race conditions are not limited to simple arithmetic. A more subtle and dangerous category is the “check-then-act” race condition. A thread checks a condition (e.g., “does this key exist in the cache?”), but the condition may no longer hold by the time the thread acts on that check (“get the value from the cache”). In the example below, the check is “is this key missing from the cache?” and the act is “compute the value and store it,” so several threads can all pass the check and repeat the expensive work.

import threading
import time

class UnsafeCache:
    def __init__(self):
        self.cache = {}
        self.expensive_calls = 0
        # Guards only the diagnostic call counter; the cache itself is left unprotected.
        self.lock = threading.Lock()

    def get_resource_expensive(self, key):
        with self.lock:
            self.expensive_calls += 1
        # Simulate an expensive operation, like a database query or network call
        time.sleep(0.1)
        return f"Expensive_Data_for_{key}"

    def get_unsafe(self, key):
        if key not in self.cache:  # CHECK
            # Between the "check" and the "act," another thread could modify self.cache
            value = self.get_resource_expensive(key)
            self.cache[key] = value  # ACT
        return self.cache[key]

# Using the unsafe method can lead to the expensive operation being run multiple times.
def worker(cache, key):
    data = cache.get_unsafe(key)
    # print(f"Thread {threading.get_ident()} got: {data}")

unsafe_cache = UnsafeCache()
key = "user_123"

# Multiple threads requesting the same uncached resource simultaneously
threads = []
for _ in range(5):
    t = threading.Thread(target=worker, args=(unsafe_cache, key))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print(f"Expensive method was called {unsafe_cache.expensive_calls} times.")
# Ideally, it should be called once. Because of the race, it is typically called 5 times.
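Because the outcome of the example above depends on timing, it can help to force the bad interleaving when studying it. The following is a minimal sketch, assuming hypothetical load() and racy_get() helpers and a threading.Barrier that holds every thread between its check and its act. It is a teaching device, not a pattern for production code.

```python
import threading

NUM_THREADS = 5
cache = {}
load_calls = []  # one entry is appended per call to load()
all_checked = threading.Barrier(NUM_THREADS)

def load(key):
    """Stand-in for an expensive lookup (hypothetical helper)."""
    load_calls.append(key)
    return f"data_for_{key}"

def racy_get(key):
    missing = key not in cache   # CHECK
    all_checked.wait()           # hold here until every thread has done its check
    if missing:
        cache[key] = load(key)   # ACT on a check that is now stale
    return cache[key]

threads = [
    threading.Thread(target=racy_get, args=("user_123",))
    for _ in range(NUM_THREADS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"load() was called {len(load_calls)} times.")  # 5 with this forced timing
```

No thread can write to the cache until all five have passed the barrier, so every thread sees the key as missing and every thread performs the load.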

Why They Are So Hard to Detect

Race conditions are notoriously non-deterministic. A race might cause a catastrophic failure one time in a thousand test runs, or it might silently corrupt data for months before being discovered. It only manifests under a specific and often rare timing: a thread must be preempted at exactly the wrong moment. This makes races difficult to reproduce, debug, and test for. Changes in system load, CPU speed, or even the version of the Python interpreter can make a race condition appear or disappear. Passing tests therefore cannot show that a race is absent. The dependable defense is to coordinate access to shared state with synchronization primitives such as locks, which ensure that only one thread at a time executes a critical section, as covered in the next section.