47.7 Start Methods: spawn, fork, forkserver

The start method determines how Python’s multiprocessing module creates a new process. This is not a trivial implementation detail: it changes how the child initializes, what resources it inherits, and therefore which pitfalls you may encounter. All three methods ('fork', 'spawn', and 'forkserver') are available on Unix-like systems such as Linux and macOS, while Windows supports only 'spawn'. multiprocessing.get_all_start_methods() lists the methods available on the current platform, with the default first, and multiprocessing.get_start_method() reports the one in use. To select a method, call multiprocessing.set_start_method() once, inside the if __name__ == '__main__' block and before creating any processes; a second call raises RuntimeError unless force=True is passed.
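
As a minimal sketch of these calls, the snippet below collects the available methods and the currently fixed method into a dictionary. The helper names start_method_report and spawn_context are illustrative assumptions, not part of multiprocessing. The second helper uses multiprocessing.get_context(), which returns a context object bound to one start method without changing the global setting.

```python
import multiprocessing


def start_method_report():
    """Collect start-method information for the current platform."""
    return {
        # Every method this platform supports; the first entry is the default.
        "available": multiprocessing.get_all_start_methods(),
        # allow_none=True avoids fixing the global method as a side effect;
        # the value is None if no method has been fixed yet.
        "current": multiprocessing.get_start_method(allow_none=True),
    }


def spawn_context():
    """Return a context bound to 'spawn' without touching the global setting."""
    return multiprocessing.get_context("spawn")


if __name__ == "__main__":
    print(start_method_report())
    ctx = spawn_context()
    print(ctx.get_start_method())
```

A context exposes the same API as the module itself (ctx.Process, ctx.Queue, ctx.Pool), which makes it a reasonable choice for library code that should not override the application's global start method.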

The Fork Start Method

The 'fork' method is the original start method on Unix. It was the default on every Unix platform through Python 3.7 and remained the default on Linux and other non-macOS POSIX systems until Python 3.14, which made 'forkserver' the default there. When a process is started, the parent calls os.fork(), which creates an almost exact copy of the parent at the moment of the fork. The child receives a logical copy of the entire address space, including the Python interpreter, loaded modules, and all program state.

Why it works this way: Forking is fast because the operating system uses copy-on-write (COW). Parent and child initially share the same physical memory pages, and a page is copied only when one of the processes writes to it, so process creation itself is cheap. In CPython the sharing erodes over time, because updating an object's reference count writes to the page that holds the object, even when the object is only read.

import multiprocessing
import os

def worker():
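    # With 'fork', the child sees module globals as they were at fork time.
    # Under 'spawn' this lookup would raise NameError, because the
    # __main__ block that defines shared_value is not re-run in the child.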
    print(f"Child process PID: {os.getpid()}, inherited value: {shared_value}")

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')
    shared_value = 42  # Defined before the fork
    process = multiprocessing.Process(target=worker)
    process.start()
    process.join()

Common Pitfalls: The major danger with 'fork' is inheriting too much state. If file descriptors or network connections are open in the parent, the child has them too, and data corruption or socket errors can follow when both processes read from or write to the same resource. Forking a multithreaded process is also perilous. Only the thread that calls fork() exists in the child. If any other thread holds a lock (for example, a lock in application code or inside a C library) at the moment of the fork, the child receives a copy of that lock in its locked state. The thread that acquired it does not exist in the child, so nothing can ever release it, and the child deadlocks as soon as it tries to acquire that lock. The GIL itself is not the problem here, since CPython reinitializes it in the child. Since Python 3.12, forking a process that has multiple threads emits a DeprecationWarning for this reason.
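
The inherited-lock problem can be reproduced in a few lines. The sketch below is Unix-only and calls os.fork() directly rather than going through multiprocessing, to keep the moving parts visible. The function name child_can_acquire_after_fork and the 0.5 second timeout are illustrative choices. A helper thread holds a lock while the main thread forks; the child then tries to acquire its copy of the lock with a timeout instead of blocking forever.

```python
import os
import threading
import warnings


def child_can_acquire_after_fork(timeout=0.5):
    """Fork while a helper thread holds a lock; return True if the child
    manages to acquire its copy of that lock within `timeout` seconds."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()       # tell the main thread the lock is now held
            release.wait()   # keep holding it until told to let go

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()

    with warnings.catch_warnings():
        # Python 3.12+ warns about forking a multithreaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: the holder thread does not exist here, but the copied
        # lock is still in its locked state.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    # Parent: collect the child's exit status, then let the holder finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    thread.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_can_acquire_after_fork())
```

Without the timeout, the child's acquire() would block indefinitely, which is the deadlock described above.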

The Spawn Start Method

The 'spawn' method is the only method available on Windows and has been the default on macOS since Python 3.8, because fork can crash subprocesses there. It starts a fresh Python interpreter process. The child inherits only the resources necessary to run the target function. It does not inherit the parent’s memory, runtime state, or most file descriptors.

Why it works this way: This method is inherently safer than forking. By starting clean, it avoids the issues of inherited file descriptors, locks, and other potentially problematic state. It ensures a more predictable and isolated execution environment for the child.

import multiprocessing
import os

def worker(queue):
    # This is a new interpreter. It does not know about 'value' from the parent.
    value = queue.get()
    print(f"Child process PID: {os.getpid()}, received value: {value}")

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')
    value = 42
    queue = multiprocessing.Queue()
    queue.put(value)
    process = multiprocessing.Process(target=worker, args=(queue,))
    process.start()
    process.join()

Common Pitfalls & Best Practices: The primary challenge with 'spawn' is that everything the child needs must be pickled and sent to it explicitly. Any object passed to the child (via Queue, Pipe, or as a Process argument) must be picklable, and the target function must be importable by name, which in practice means defining it at the top level of a module. Unpicklable objects (such as file handles, database connections, or certain third-party objects) cause errors. Because the child re-imports the main module, code that starts processes must sit under the if __name__ == '__main__' guard. Startup overhead is also higher than with 'fork', because a new interpreter must be initialized and all necessary modules re-imported.
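
One way to catch such problems early is to test picklability before handing an object to a child. The sketch below is a hedged illustration: is_picklable and read_first_line are hypothetical helpers, not part of multiprocessing. It also shows the usual workaround for file handles, which is to pass the path and let the worker open the file itself.

```python
import os
import pickle
import tempfile


def is_picklable(obj):
    """Return True if pickle.dumps() accepts `obj`, which is what 'spawn'
    relies on to ship arguments to the child."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def read_first_line(path):
    """Worker-style function: takes a path (picklable) and opens the file
    itself instead of receiving an open handle from the parent."""
    with open(path, encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("hello\n")
    try:
        with open(tmp.name, encoding="utf-8") as handle:
            print("open handle picklable:", is_picklable(handle))
        print("path picklable:", is_picklable(tmp.name))
        print("first line:", read_first_line(tmp.name))
    finally:
        os.remove(tmp.name)
```

The same check applies to lambdas and nested functions, which pickle cannot locate by name and which therefore cannot serve as a Process target under 'spawn'.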

The Forkserver Start Method

The 'forkserver' method is a hybrid approach available on Unix. With the start method set to 'forkserver', a single, minimal “forkserver” process is started for your program; in CPython this happens lazily, when the first child is started. Whenever a new process is requested, the parent sends the request to this server, which forks itself to create the child.

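Why it works this way: The fork server is itself a fresh interpreter process, so it does not carry the main process's threads, locks, or open file descriptors. It stays single-threaded unless a preloaded module or a system library starts threads as a side effect, which makes it generally safe to fork from. Children therefore start from a clean, quiescent state rather than from a complex and possibly inconsistent copy of the main process. Because they do not inherit the main process's memory, arguments are still pickled and sent to them, as with 'spawn'. Modules that the server has already imported (see multiprocessing.set_forkserver_preload()) do not have to be imported again in each child, which avoids much of the startup overhead of 'spawn'.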

import multiprocessing
import os

def worker():
    print(f"Child from forkserver, PID: {os.getpid()}")

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    # The fork server is launched lazily by the first start() call below;
    # every child is then forked from it.
    processes = []
    for _ in range(3):
        p = multiprocessing.Process(target=worker)
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

Best Practices: This method is excellent for long-running multiprocessing programs where you want to avoid the per-process startup cost of 'spawn' but also want the safety of a clean state. It ensures that the forking itself is done from a simple, single-threaded process, eliminating the concurrency issues associated with forking from a multithreaded parent.
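
To get the most out of the server's pre-initialized state, modules can be imported once in the server so that every forked child already has them loaded. The following Unix-only sketch assumes json stands in for whatever heavy module your workers need; make_forkserver_context and report are illustrative names.

```python
import multiprocessing
import os


def report(n):
    """Trivial task executed in a child forked from the server."""
    print(f"PID {os.getpid()}: {n} squared is {n * n}")


def make_forkserver_context(preload=("json",)):
    """Return a 'forkserver' context whose server imports `preload` once."""
    ctx = multiprocessing.get_context("forkserver")
    # Only takes effect if called before the server starts, that is,
    # before the first child process is started.
    ctx.set_forkserver_preload(list(preload))
    return ctx


if __name__ == "__main__":
    ctx = make_forkserver_context()
    processes = [ctx.Process(target=report, args=(n,)) for n in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```

Preloading trades a slightly slower server start for cheaper children, so it pays off mainly when many processes are created over the life of the program.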

Choosing the Right Method:

Windows: You must use 'spawn'.
Unix:
- Use 'spawn' for maximum safety and compatibility, especially if your code might be run on both Unix and Windows, or if your main process is multithreaded.
- Use 'fork' only if you fully control the pre-fork state (e.g., no other threads, careful handling of file descriptors) and require the fastest possible process creation.
- Use 'forkserver' for programs that will create many processes and can benefit from the safe, pre-initialized state of the server.
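
The guidance above can be encoded as a small helper that walks a preference list and returns the first method the platform supports. This is a sketch under stated assumptions: pick_context and its default preference order are illustrative, not a recommendation built into multiprocessing.

```python
import multiprocessing


def pick_context(preferred=("forkserver", "spawn")):
    """Return a context for the first method in `preferred` that this
    platform supports."""
    available = multiprocessing.get_all_start_methods()
    for method in preferred:
        if method in available:
            return multiprocessing.get_context(method)
    raise ValueError(f"none of {preferred!r} is available; have {available!r}")


if __name__ == "__main__":
    ctx = pick_context()
    print(ctx.get_start_method())
```

On Windows the default preference list falls through to 'spawn', so the same code runs on every platform without a global set_start_method() call.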