How do you run CPU-bound work efficiently in Python given the GIL?
Quick Answer
Move the CPU-bound work to **separate processes** (`multiprocessing`, `concurrent.futures.ProcessPoolExecutor`) so each gets its own interpreter and GIL, achieving true multi-core parallelism. Alternatively, push the hot loop into a **C extension or a library that releases the GIL** during the computation (NumPy, Cython with `nogil`, or a Rust extension), so pure-Python threads around it can still run concurrently.
Detailed Answer
Option 1: separate processes
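```python
from concurrent.futures import ProcessPoolExecutor
import math

def is_prime(n):
    if n < 2:
        return False
    # math.isqrt avoids float rounding errors for large n
    return all(n % i for i in range(2, math.isqrt(n) + 1))

numbers = list(range(10_000_000, 10_000_100))

# Guard the entry point: under the "spawn" start method (the default on
# Windows and macOS) each worker re-imports this module.
if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(is_prime, numbers))  # genuinely parallel across cores
```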
Each worker process has its own Python interpreter and its own GIL, so CPU-bound work in different processes truly runs simultaneously on separate cores. The cost: arguments and results must be pickled to cross the process boundary, and starting processes has real overhead. The approach pays off for coarse-grained, CPU-heavy chunks of work, not for many tiny tasks.
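The point about coarse-grained work can be sketched as follows. This is a minimal illustration, not a tuned implementation: `make_chunks`, `count_primes_in_range`, and `count_primes_parallel` are hypothetical helper names introduced here, and the default of 8 chunks is an arbitrary assumption. Each task carries only a small `(lo, hi)` tuple across the process boundary and returns a single integer, so pickling cost stays small relative to the computation.

```python
from concurrent.futures import ProcessPoolExecutor
import math


def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))


def make_chunks(start, stop, n_chunks):
    """Split [start, stop) into at most n_chunks contiguous (lo, hi) ranges."""
    step = max(1, -(-(stop - start) // n_chunks))  # ceiling division
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


def count_primes_in_range(bounds):
    """Worker task: receives one small tuple, returns one integer."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def count_primes_parallel(start, stop, n_chunks=8):
    """Hypothetical helper: one task per chunk instead of one task per number."""
    chunks = make_chunks(start, stop, n_chunks)
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(count_primes_in_range, chunks))
```

Call `count_primes_parallel(10_000_000, 10_000_100)` from under an `if __name__ == "__main__":` guard so worker processes can import the module safely. When tasks cannot easily be merged by hand, the `chunksize` argument of `ProcessPoolExecutor.map` batches items per worker in a similar spirit.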
Option 2: push the hot loop into native code that releases the GIL
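```python
import numpy as np

# Pure Python loop: single-threaded, GIL-bound the whole time
total = sum(x * x for x in range(1_000_000))

# NumPy: the actual multiply-and-sum runs in C, releasing the GIL.
# int64 is explicit so the squares fit on every platform; with
# 10_000_000 elements the sum (about 3.3e20) would overflow int64.
arr = np.arange(1_000_000, dtype=np.int64)
total = int((arr * arr).sum())
```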
NumPy and similar C-extension libraries do the heavy numeric work
inside C code that releases the GIL during the computation. This is
why NumPy-heavy code can benefit from threads even for "CPU-bound" work:
the actual bottleneck has moved out of GIL-held Python bytecode into
GIL-free C. Cython supports the same idea explicitly via `nogil` blocks
for hand-written extensions.
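The same mechanism can be observed without any third-party install. The sketch below swaps in the standard library's `hashlib` as a stand-in for NumPy: its documentation notes that the GIL is released while hashing data larger than 2047 bytes, so threads can overlap that work. The helper names `digest` and `hash_all_threaded`, the buffer sizes, and the worker count are illustrative assumptions, and any actual speedup depends on the machine and Python build.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(blob):
    # The hashing loop runs in C; for large inputs hashlib releases the GIL,
    # so other threads can run while this call is busy.
    return hashlib.sha256(blob).hexdigest()


def hash_all_threaded(blobs, workers=4):
    """Hash each blob in a thread pool; result order matches input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, blobs))


if __name__ == "__main__":
    # Four 1 MiB buffers (sizes chosen arbitrarily for illustration)
    blobs = [bytes([i]) * (1024 * 1024) for i in range(4)]
    serial = [digest(b) for b in blobs]
    threaded = hash_all_threaded(blobs)
    print(serial == threaded)
```

Applied to NumPy, the same shape would hand array slices to the pool and let each thread call a vectorized operation on its slice.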
Why threading alone doesn't help here
```python
from concurrent.futures import ThreadPoolExecutor

# Reuses is_prime and numbers from Option 1
with ThreadPoolExecutor(max_workers=4) as pool:
    # NOT faster than serial: still one thread executing Python bytecode at a time
    results = list(pool.map(is_prime, numbers))
```
Since `is_prime` is pure Python arithmetic (no I/O, no GIL-releasing C
call), running it across threads doesn't parallelize the actual work:
the GIL still serializes bytecode execution across all four threads.
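One way to check this on your own machine is to time the same workload serially and across threads. The sketch below is a rough measurement harness under stated assumptions: `run_serial`, `run_threaded`, and `timed` are hypothetical helper names, and the numbers will vary with hardware and Python build, so no expected timings are shown. On a standard CPython build the two timings are typically close.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))


def run_serial(numbers):
    return [is_prime(n) for n in numbers]


def run_threaded(numbers, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(is_prime, numbers))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    numbers = list(range(10_000_000, 10_000_100))
    serial, t_serial = timed(run_serial, numbers)
    threaded, t_threaded = timed(run_threaded, numbers)
    assert serial == threaded  # same answers either way
    print(f"serial:   {t_serial:.3f}s")
    print(f"threaded: {t_threaded:.3f}s")
```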
Choosing between the two options
- Reach for `multiprocessing` when the computation is written in plain Python and can be chunked into independent units of work.
- Reach for NumPy/Cython/native extensions when the computation is numeric/vectorizable. This usually gives a far larger speedup than multiprocessing alone, since it avoids both GIL contention and Python's general interpreter overhead.
Interview-ready summary: For CPU-bound pure-Python work, use
`multiprocessing`/`ProcessPoolExecutor` to get separate interpreters (and
GILs) running truly in parallel. For numeric/vectorizable work, push the
hot loop into a library like NumPy that does the heavy lifting in C and
releases the GIL, which is often faster than multiprocessing and avoids
its pickling and process-startup overhead.