How do you run CPU-bound work efficiently in Python given the GIL?

6 min | advanced | concurrency, gil, multiprocessing, performance

Quick Answer

Move the CPU-bound work to **separate processes** (`multiprocessing`, `concurrent.futures.ProcessPoolExecutor`) so each gets its own interpreter and GIL, achieving true multi-core parallelism. Alternatively, push the hot loop into a **C extension or a library that releases the GIL** during the computation (NumPy, Cython with `nogil`, or a Rust extension), so pure-Python threads around it can still run concurrently.

Detailed Answer

Option 1: separate processes

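from concurrent.futures import ProcessPoolExecutor
import math

def is_prime(n):
    if n < 2:
        return False
    # math.isqrt is exact for integers, unlike int(math.sqrt(n)) for very large n
    return all(n % i for i in range(2, math.isqrt(n) + 1))

# The guard is required where workers start via "spawn" (Windows, macOS):
# each worker re-imports this module and must not create a pool of its own.
if __name__ == "__main__":
    numbers = list(range(10_000_000, 10_000_100))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(is_prime, numbers))   # genuinely parallel across cores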

Each worker process has its own Python interpreter and its own GIL, so CPU-bound work in different processes truly runs simultaneously on separate cores. The cost: arguments and results must be pickled to cross the process boundary, and starting worker processes has real overhead. This pays off for coarse-grained, CPU-heavy chunks of work, not for many tiny tasks.
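The point about coarse-grained work can be made concrete. The following is a minimal sketch, assuming two hypothetical helpers that are not part of the original example: `make_chunks`, which splits a range into a few contiguous pieces, and `count_primes`, which processes one whole piece per task. Each task then pickles two integers on the way in and one on the way out, instead of one message per number. The range and the chunk count are arbitrary choices for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
import math


def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))


def count_primes(bounds):
    """Count primes in [lo, hi). One call is one coarse-grained task."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def make_chunks(lo, hi, parts):
    """Split [lo, hi) into at most `parts` contiguous (start, stop) ranges.

    Assumes hi > lo and parts >= 1.
    """
    step = math.ceil((hi - lo) / parts)
    return [(start, min(start + step, hi)) for start in range(lo, hi, step)]


def main():
    chunks = make_chunks(1_000_000, 1_020_000, parts=8)
    with ProcessPoolExecutor() as pool:
        # Eight tasks in total, each doing a substantial amount of work.
        total = sum(pool.map(count_primes, chunks))
    print(f"primes found: {total}")


if __name__ == "__main__":
    main()
```

If restructuring the function is not convenient, `pool.map` on a `ProcessPoolExecutor` also accepts a `chunksize` argument that batches several items into each message sent to a worker, which reduces per-item pickling overhead in a similar way.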

Option 2: push the hot loop into native code that releases the GIL

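import numpy as np

# With 10_000_000 elements the sum of squares (~3.3e20) would silently
# overflow int64, so N is kept small enough for the result to fit.
N = 1_000_000

# Pure Python loop: single-threaded, GIL-bound the whole time
total = sum(x * x for x in range(N))

# NumPy: the multiply-and-sum runs in C, which can release the GIL
arr = np.arange(N, dtype=np.int64)
total = (arr * arr).sum()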

NumPy and similar C-extension libraries do the heavy numeric work inside C code that releases the GIL for many operations. This is why NumPy-heavy code can benefit from threads even for "CPU-bound" work: the actual bottleneck has moved out of GIL-held Python bytecode into GIL-free C. Cython supports the same idea explicitly via `nogil` blocks for hand-written extensions.
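The same effect can be observed without any third-party package. The sketch below swaps NumPy for the standard library's `hashlib`, whose documentation states that the GIL is released while hashing more than 2047 bytes of data at once. The helper name `digest_block`, the buffer sizes, and the worker count are assumptions chosen for illustration, and whether the threaded run is actually faster depends on the machine and its core count.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib
import time


def digest_block(block):
    # The hashing loop runs in C; for large inputs CPython releases the GIL
    # while it runs, so other threads can hash at the same time.
    return hashlib.sha256(block).hexdigest()


def main():
    # Eight 8 MiB buffers with distinct contents (sizes are arbitrary).
    blocks = [bytes([i]) * (8 * 1024 * 1024) for i in range(8)]

    start = time.perf_counter()
    serial = [digest_block(b) for b in blocks]
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        threaded = list(pool.map(digest_block, blocks))
    threaded_s = time.perf_counter() - start

    assert serial == threaded  # same digests, in the same order
    print(f"serial: {serial_s:.3f}s  threads: {threaded_s:.3f}s")


if __name__ == "__main__":
    main()
```

Threads are enough here because the Python-level code only dispatches work; the time is spent inside a C call that does not hold the GIL.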

Why threading alone doesn't help here

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(is_prime, numbers))   # NOT faster than serial --
                                                     # still one thread executing
                                                     # Python bytecode at a time

Since is_prime is pure Python arithmetic (no I/O, no GIL-releasing C call), running it across threads doesn't parallelize the actual work — the GIL still serializes bytecode execution across all four threads.
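One way to check this claim on a given machine is to time both variants. This is a rough sketch, assuming a standard CPython build with the GIL; the `timed` helper and the input range are hypothetical choices made so that the run takes a noticeable but short amount of time. No specific timings are claimed, since they vary by machine.

```python
from concurrent.futures import ThreadPoolExecutor
import math
import time


def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))


def timed(label, run):
    """Call run(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = run()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


def main():
    # Larger inputs than the example above, so each primality check is slower.
    numbers = list(range(10**10, 10**10 + 2000))

    serial = timed("serial", lambda: [is_prime(n) for n in numbers])

    def threaded_run():
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(is_prime, numbers))

    threaded = timed("4 threads", threaded_run)

    assert serial == threaded  # identical answers; only the timing is of interest


if __name__ == "__main__":
    main()
```

On a GIL-enabled interpreter the two timings are expected to be close, with the threaded run sometimes slightly slower because of thread switching.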

Choosing between the two options

  • Reach for multiprocessing when the computation is written in plain Python and can be chunked into independent units of work.
  • Reach for NumPy/Cython/native extensions when the computation is numeric/vectorizable — this usually gives a far larger speedup than multiprocessing alone, since it avoids both GIL contention and Python's general interpreter overhead.

Interview-ready summary: For CPU-bound pure-Python work, use multiprocessing/ProcessPoolExecutor to get separate interpreters (and GILs) running truly in parallel. For numeric/vectorizable work, push the hot loop into a library like NumPy that does the heavy lifting in C and releases the GIL, which often outperforms multiprocessing and avoids its pickling and startup overhead.