How do you run CPU-bound work efficiently in Python given the GIL?

Detailed Answer

Option 1: separate processes

from concurrent.futures import ProcessPoolExecutor
import math

def is_prime(n):
    if n < 2:
        return False
    # math.isqrt (Python 3.8+) is an exact integer square root, with no float rounding
    return all(n % i for i in range(2, math.isqrt(n) + 1))

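# The guard stops worker processes from re-running this block when they
# import the module (needed with the "spawn" start method on Windows and macOS).
if __name__ == "__main__":
    numbers = list(range(10_000_000, 10_000_100))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(is_prime, numbers))   # genuinely parallel across cores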

Each worker process has its own Python interpreter and its own GIL, so CPU-bound work in different processes truly runs simultaneously on separate cores. The cost: arguments and results must be pickled to cross the process boundary, and starting processes has real overhead. This pays off for coarse-grained, CPU-heavy chunks of work, not for many tiny tasks.
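To make "coarse-grained" concrete, here is a minimal sketch that sends each worker one whole range of numbers instead of one number per task, so the pickling and scheduling cost is paid once per range. The helper names (count_primes, make_chunks, count_primes_parallel) and the chunk count of 8 are illustrative assumptions, not part of the original answer.

```python
from concurrent.futures import ProcessPoolExecutor
import math

def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))

def count_primes(bounds):
    """Count primes in the half-open range [lo, hi). One call is one coarse task."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))

def make_chunks(start, stop, n_chunks):
    """Split [start, stop) into contiguous ranges of near-equal size."""
    step = max(1, math.ceil((stop - start) / n_chunks))
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]

def count_primes_parallel(start, stop, n_chunks=8):
    """Hypothetical helper: one task per range, so only two ints are pickled per task."""
    chunks = make_chunks(start, stop, n_chunks)
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(count_primes, chunks))
```

Call count_primes_parallel(10_000_000, 10_100_000) from under an `if __name__ == "__main__":` guard. If you would rather keep one number per task, pool.map also accepts a chunksize argument that batches items per worker, which reduces the per-task overhead in a similar way.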

Option 2: push the hot loop into native code that releases the GIL

import numpy as np

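# Pure Python loop: single-threaded, GIL-bound the whole time
total = sum(x * x for x in range(1_000_000))

# NumPy: the multiply-and-sum runs in compiled C, which generally releases the GIL
# int64 holds this sum exactly; with 10_000_000 elements it would silently overflow
arr = np.arange(1_000_000, dtype=np.int64)
total = (arr * arr).sum()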

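NumPy (and similar C-extension libraries) do the heavy numeric work inside compiled C code, and for most operations on plain numeric arrays that code releases the GIL while it runs. This is why NumPy-heavy code can benefit from threads even for "CPU-bound" work: the bottleneck has moved out of GIL-held Python bytecode into GIL-free C. Even on a single thread, the vectorized version is usually much faster than the Python loop because it skips per-element interpreter overhead. Cython supports the same idea explicitly via nogil blocks for hand-written extensions.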
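As a rough sketch of threads helping with GIL-releasing work, the example below splits one array across a thread pool and sums the squares of each piece. The function names and the thread count of 4 are assumptions made for illustration, and the actual speedup depends on the NumPy build, the array size, and memory bandwidth, so measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def sum_of_squares(chunk):
    # The multiply and the reduction run inside NumPy's compiled loops,
    # which generally release the GIL for plain numeric dtypes.
    return float((chunk * chunk).sum())

def threaded_sum_of_squares(arr, n_threads=4):
    """Hypothetical helper: one chunk of the array per thread."""
    chunks = np.array_split(arr, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(sum_of_squares, chunks))

# float64 avoids the integer overflow a 10-million-element int64 sum would hit,
# at the cost of floating-point rounding in the final result.
arr = np.arange(10_000_000, dtype=np.float64)
total = threaded_sum_of_squares(arr)
```

Threads share the array in memory, so nothing is pickled here. That is the main advantage over a process pool for this kind of work.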

Why `threading` alone doesn't help here

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(is_prime, numbers))   # NOT faster than serial --
                                                     # still one thread executing
                                                     # Python bytecode at a time

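Since is_prime is pure Python arithmetic (no I/O, no GIL-releasing C call), running it across threads doesn't parallelize the actual work. On a standard GIL-enabled CPython build, the GIL still serializes bytecode execution across all four threads, and the extra thread switching can make the threaded version slightly slower than the serial one.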
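One way to check this on your own machine is a small timing harness like the sketch below. The helpers timed, run_serial and run_threaded are hypothetical names introduced here. No expected timings are shown because they vary by machine and Python build.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))

def run_serial(numbers):
    return [is_prime(n) for n in numbers]

def run_threaded(numbers, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(is_prime, numbers))

def timed(label, fn, *args):
    """Run fn(*args), print the wall-clock time, and return the result."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

numbers = list(range(10_000_000, 10_020_000))
serial = timed("serial", run_serial, numbers)
threaded = timed("4 threads", run_threaded, numbers)
assert serial == threaded  # same answers either way; only the timing differs
```

On a GIL-enabled interpreter the two timings should come out close to each other. Swapping ThreadPoolExecutor for ProcessPoolExecutor (under a main guard) is the comparison that should show a real difference.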

Choosing between the two options

Reach for multiprocessing when the computation is written in plain Python and can be chunked into independent units of work.
Reach for NumPy/Cython/native extensions when the computation is numeric/vectorizable — this usually gives a far larger speedup than multiprocessing alone, since it avoids both GIL contention and Python's general interpreter overhead.

Interview-ready summary: For CPU-bound pure-Python work, use multiprocessing/ProcessPoolExecutor to get separate interpreters (and GILs) running truly in parallel. For numeric/vectorizable work, push the hot loop into a library like NumPy that does the heavy lifting in C and releases the GIL, which often beats multiprocessing's overhead entirely.
