What the GIL actually locks
CPython's memory management relies on reference counting: every object tracks how many references point to it, and is freed when that count hits zero. Incrementing/decrementing a refcount from multiple threads simultaneously, without synchronization, is a data race that could corrupt an object's refcount (leading to premature frees or memory leaks). The GIL solves this crudely but effectively: only one thread runs Python bytecode at a time, so refcount updates are never actually concurrent.
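import threading
counter = 0
def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1  # read-modify-write: several bytecode instructions, not one atomic step
threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # often 4,000,000 on recent CPython, but not guaranteed: updates can be lost
The GIL protects the interpreter's own state, such as refcounts; it does not make compound operations like counter += 1 atomic. A thread switch can land between the read and the write, so a shared counter still needs a threading.Lock to rule out lost updates. Without the GIL (or equivalent fine-grained locking), the interpreter's internal bookkeeping would be exposed to the same kind of race.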
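A read-modify-write such as counter += 1 spans several bytecode instructions, so the dependable way to share a counter between threads is an explicit lock. The following is a minimal sketch; the names counter_lock and run are illustrative, and the iteration count is reduced so it finishes quickly.

```python
import threading

counter = 0
counter_lock = threading.Lock()  # guards every read-modify-write of `counter`


def increment(times):
    global counter
    for _ in range(times):
        with counter_lock:  # only one thread at a time may update the counter
            counter += 1


def run(num_threads=4, times=50_000):
    """Reset the counter, run the workers, and return the final count."""
    global counter
    counter = 0
    threads = [
        threading.Thread(target=increment, args=(times,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run())  # 200000
```

The lock makes the result deterministic regardless of how the interpreter schedules thread switches, at the cost of some per-iteration overhead.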
Why "more threads" doesn't mean "more CPU throughput"
def cpu_bound(n):
    return sum(i * i for i in range(n))
# Running cpu_bound() on 4 threads doesn't run 4x faster:
# with the GIL, only one thread executes Python bytecode at any instant.
For CPU-bound pure-Python work, threads provide concurrency (multiple things making progress, interleaved) but not parallelism (multiple things running simultaneously on separate cores) — the GIL serializes bytecode execution regardless of how many OS threads and CPU cores exist.
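One way to see this on your own machine is to time the same CPU-bound work serially and on a thread pool. The sketch below is an illustration only: the helper names (timed, run_serial, run_threaded) and the workload size are assumptions, and the timings depend on the machine and on whether the interpreter is a GIL-enabled build, so no expected numbers are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n):
    return sum(i * i for i in range(n))


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_serial(n, jobs):
    return [cpu_bound(n) for _ in range(jobs)]


def run_threaded(n, jobs):
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(cpu_bound, [n] * jobs))


serial_result, serial_time = timed(lambda: run_serial(200_000, 4))
threaded_result, threaded_time = timed(lambda: run_threaded(200_000, 4))

# On a GIL-enabled build the two timings are typically close to each other.
print(f"serial:   {serial_time:.3f}s")
print(f"threaded: {threaded_time:.3f}s")
```

Both runs compute identical results; only the wall-clock time is of interest here.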
Why threading still helps for I/O-bound work
import time
def slow_io():
    time.sleep(1)  # releases the GIL while "blocked"
Blocking operations that call into C (file/network I/O, time.sleep,
many library calls) release the GIL while waiting, letting other
Python threads run bytecode in the meantime. This is why
threading/concurrent.futures.ThreadPoolExecutor genuinely speed up
I/O-bound workloads (e.g., many concurrent HTTP requests) even though the
GIL exists — the bottleneck (waiting on the network) isn't CPU work at all.
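The overlap can be observed without any network access by using time.sleep as a stand-in for a blocking I/O call. This is a small sketch under that assumption; slow_io's task_id parameter and the run_overlapped helper are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io(task_id, delay=0.2):
    time.sleep(delay)  # stand-in for a blocking network or disk call
    return task_id


def run_overlapped(num_tasks=5):
    """Run num_tasks blocking calls on a thread pool and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        results = list(pool.map(slow_io, range(num_tasks)))
    return results, time.perf_counter() - start


results, elapsed = run_overlapped()
print(results)             # [0, 1, 2, 3, 4]
print(f"{elapsed:.2f}s")   # close to one delay, not five delays added together
```

Because every worker spends its time waiting rather than executing bytecode, the five waits overlap and the batch takes roughly as long as a single call.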
The real workaround for CPU-bound parallelism: separate processes
Since every process has its own interpreter and therefore its own GIL,
multiprocessing sidesteps the lock entirely by running separate Python
processes, achieving true multi-core parallelism for CPU-bound work at the
cost of inter-process communication overhead (data must be pickled/copied
between processes, not shared directly).
PEP 703: free-threaded (no-GIL) Python
Starting with Python 3.13, an experimental free-threaded build
(python3.13t) removes the GIL, using more fine-grained locking instead —
aiming to give real multi-core parallelism to threaded Python code. As of
this writing it's still opt-in and the ecosystem (C extensions especially)
is still adapting; the standard GIL-enabled build remains the default.
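To check which kind of interpreter a script is running on, something like the following sketch can help. It relies on sysconfig's Py_GIL_DISABLED build variable and on sys._is_gil_enabled(), which exists only on Python 3.13 and later; the gil_status helper name is an illustration, and the fallback for older versions is an assumption that the GIL is always on there.

```python
import sys
import sysconfig


def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Py_GIL_DISABLED is 1 for interpreters compiled as free-threaded builds;
    # it is 0 or missing (None) otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() was added in Python 3.13. On older versions the
    # attribute is absent, and the GIL is always enabled.
    checker = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = checker() if checker is not None else True

    return free_threaded_build, gil_enabled


free_threaded, gil_on = gil_status()
print(f"free-threaded build: {free_threaded}, GIL currently enabled: {gil_on}")
```

The two values can differ: a free-threaded build may still run with the GIL enabled, for example when it is switched back on at startup.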
Interview-ready summary: The GIL is CPython's mutex ensuring only one
thread executes Python bytecode at a time, needed because refcount-based
memory management isn't otherwise thread-safe. It doesn't prevent
threading from helping I/O-bound work (the GIL is released during
blocking calls), but it does prevent threads from speeding up CPU-bound
pure-Python code — for that, use multiprocessing, or Python 3.13+'s
experimental free-threaded build.
The decision framework
| Workload | Best tool | Why |
|---|---|---|
| CPU-bound (heavy computation) | multiprocessing | Bypasses the GIL via separate processes — actual multi-core parallelism |
| I/O-bound, moderate concurrency (10s-100s) | threading | Simple to retrofit onto existing sync code; GIL releases during blocking I/O |
| I/O-bound, very high concurrency (1000s of connections) | asyncio | One thread, no per-task OS thread overhead; scales to far more concurrent tasks |
Threading: easiest retrofit for I/O-bound code
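from concurrent.futures import ThreadPoolExecutor
import requests
urls = ["https://example.com"] * 20  # placeholder list; substitute real URLs
def fetch(url):
    # the timeout keeps a hung server from tying up a worker thread forever
    return requests.get(url, timeout=10).status_code
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))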
Existing synchronous libraries (like requests) work unmodified inside
threads, with no need to rewrite calls as async/await. The downside is that
each thread carries real OS overhead (its own stack, typically reserved in
megabytes, plus scheduler bookkeeping), so this doesn't scale gracefully to
tens of thousands of concurrent tasks.
Multiprocessing: real parallelism for CPU-bound work
from concurrent.futures import ProcessPoolExecutor
def cpu_heavy(n):
    return sum(i * i for i in range(n))
if __name__ == "__main__":  # required wherever workers are spawned (e.g., Windows, macOS)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 4))  # four tasks, in parallel on up to 4 cores
Each process has its own interpreter and GIL, so cpu_heavy genuinely
runs in parallel across cores — at the cost of process startup overhead
and needing to pickle data across the process boundary (no shared memory
by default).
Asyncio: massive I/O concurrency, single thread
import asyncio
import aiohttp
urls = ["https://example.com"] * 100  # placeholder list; substitute real URLs
async def fetch(session, url):
    async with session.get(url) as resp:
        return resp.status
async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))
results = asyncio.run(main(urls))  # can comfortably handle thousands of concurrent requests
A single thread cooperatively switches between thousands of pending
coroutines whenever one is waiting on I/O — no OS thread per task, so
memory/scheduling overhead per concurrent task is far lower than
threading. The catch: it requires an async-compatible library stack
(aiohttp instead of requests, asyncpg instead of a blocking DB
driver) — mixing in a blocking call anywhere freezes the entire event
loop, not just one task.
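When a blocking call cannot be replaced with an async equivalent, one common escape hatch is to hand it to a worker thread with asyncio.to_thread (Python 3.9+), so the loop stays responsive. The sketch below uses time.sleep as a stand-in for the blocking library call; blocking_call and ticker are illustrative names.

```python
import asyncio
import time


def blocking_call():
    time.sleep(0.3)  # stand-in for a blocking library call (e.g., a sync DB driver)
    return "done"


async def ticker(ticks):
    # Keeps recording timestamps for as long as the event loop is free to run it.
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.05)


async def main():
    ticks = []
    tick_task = asyncio.create_task(ticker(ticks))

    # The blocking call runs in a worker thread; the event loop keeps ticking.
    result = await asyncio.to_thread(blocking_call)

    tick_task.cancel()
    try:
        await tick_task
    except asyncio.CancelledError:
        pass
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(result, tick_count)
```

Had blocking_call() been invoked directly inside main(), the ticker would not have advanced at all during those 0.3 seconds.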
Combining them
It's common to combine approaches: use asyncio for I/O concurrency, and
delegate genuinely CPU-bound chunks of work to a ProcessPoolExecutor via
loop.run_in_executor(...) so they don't block the event loop.
Interview-ready summary: Pick multiprocessing for CPU-bound parallelism (the GIL keeps threads from speeding up pure-Python computation), threading for moderate I/O concurrency with minimal code changes, and asyncio when you need very high I/O concurrency and are willing to adopt an async library stack throughout.
async def creates a coroutine function
import asyncio
async def fetch_data():
    print("start")
    await asyncio.sleep(1)  # suspend here; event loop runs other work meanwhile
    print("done")
    return 42
coro = fetch_data()  # nothing has run yet -- just a coroutine object
Calling fetch_data() does not execute the body — like a generator,
it returns a coroutine object that must be driven (via await, asyncio.run,
or scheduled as a task) for its code to actually execute.
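The deferred execution is easy to verify by recording side effects in a list instead of printing. This is a small sketch; the log list and the shortened sleep are illustrative changes to the function above.

```python
import asyncio

log = []


async def fetch_data():
    log.append("start")
    await asyncio.sleep(0)  # a suspension point; the delay itself is irrelevant here
    log.append("done")
    return 42


coro = fetch_data()                        # creates the coroutine object only
is_coroutine = asyncio.iscoroutine(coro)   # True
log_before_run = list(log)                 # [] because the body has not started

result = asyncio.run(coro)                 # now the body actually executes
print(is_coroutine, log_before_run, result, log)
# True [] 42 ['start', 'done']
```

Creating a coroutine object and never driving it is a common bug; CPython reports it with a "coroutine ... was never awaited" RuntimeWarning.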
The event loop: single-threaded cooperative scheduling
import asyncio
async def worker(name, delay):
    print(f"{name} starting")
    await asyncio.sleep(delay)
    print(f"{name} done")
async def main():
    await asyncio.gather(
        worker("A", 2),
        worker("B", 1),
    )
asyncio.run(main())
# A starting
# B starting
# B done <- after ~1s
# A done <- after ~2s (not 3s! -- they ran concurrently)
asyncio.gather schedules both worker coroutines as concurrent tasks.
When worker("A", 2) hits await asyncio.sleep(2), it tells the event
loop "wake me up in 2 seconds, and meanwhile run something else" — the
loop then runs worker("B", 1) until it also suspends. This is why the
total time is ~2s (the max), not ~3s (the sum): the two sleep calls
overlap because a single thread is interleaving them, not running them in
true parallel, but scheduling them so neither blocks the other while
waiting.
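The same behavior can be checked programmatically by recording completion order and elapsed time. In this sketch the delays are shortened to fractions of a second, and the finished list is an illustrative addition.

```python
import asyncio
import time


async def worker(name, delay, finished):
    await asyncio.sleep(delay)
    finished.append(name)  # record the order in which workers complete
    return name


async def main():
    finished = []
    start = time.perf_counter()
    results = await asyncio.gather(
        worker("A", 0.2, finished),
        worker("B", 0.1, finished),
    )
    elapsed = time.perf_counter() - start
    return results, finished, elapsed


results, finished, elapsed = asyncio.run(main())
print(results)    # ['A', 'B']  gather returns results in argument order
print(finished)   # ['B', 'A']  B completes first because its sleep is shorter
print(f"{elapsed:.2f}s")
```

Note the distinction: gather preserves the order of its arguments in the result list, even though the workers finish in a different order.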
What await actually does
await only works on awaitables (coroutines, Tasks, Futures). When the awaited thing is not finished yet, it:
- Suspends the current coroutine at that point, saving its state (much like a generator's suspended frame; coroutines are, in fact, implemented on the same underlying mechanism as generators).
- Registers a callback so the event loop knows to resume this coroutine once the awaited thing completes.
- Returns control to the event loop, which picks another ready task/callback to run.
- When the awaited operation finishes, the event loop resumes the original coroutine exactly where it left off, and await evaluates to the awaited thing's result.
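The suspend-and-resume steps above can be made visible by driving a coroutine by hand, playing the role of the event loop with send(). This is a toy sketch, not how asyncio itself is written: the Suspend class is a hypothetical awaitable invented for illustration, and real event loops expect Futures rather than arbitrary yielded values.

```python
class Suspend:
    """Toy awaitable: hands a marker to whoever drives the coroutine, then waits."""

    def __await__(self):
        # The bare yield is the actual suspension point. The value passed to
        # the next send() becomes the result of the await expression.
        value = yield "suspended"
        return value


async def demo():
    result = await Suspend()  # the coroutine pauses here
    return result * 2


coro = demo()

# Step 1: run the coroutine until its first suspension point.
signal = coro.send(None)

# Step 2: resume it with the "awaited result"; it then runs to completion,
# which surfaces as StopIteration carrying the return value.
try:
    coro.send(21)
except StopIteration as stop:
    final = stop.value

print(signal, final)  # suspended 42
```

An event loop does essentially this bookkeeping for many coroutines at once, resuming each one when the thing it awaited becomes ready.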
Cooperative, not preemptive
Because there's no operating-system-level time-slicing, a coroutine that
never awaits anything (e.g., a tight CPU-bound loop with no suspension
points) blocks the entire event loop: no other coroutine gets to run
until it returns. This is the single most important asyncio rule: control
returns to the loop only at a suspension point (an await that actually has
to wait, or the implicit awaits in async for and async with); ordinary
synchronous code inside an async def function runs to completion without
interruption.
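A short experiment makes the stall concrete. In this sketch, one coroutine makes a blocking time.sleep call (no await), and the other is well-behaved; the events list and the function names are illustrative.

```python
import asyncio
import time

events = []


async def blocker():
    events.append("blocker start")
    time.sleep(0.2)  # blocking call with no await: the event loop is stuck here
    events.append("blocker end")


async def polite():
    events.append("polite start")
    await asyncio.sleep(0)  # a real suspension point
    events.append("polite end")


async def main():
    await asyncio.gather(blocker(), polite())


asyncio.run(main())
print(events)
# ['blocker start', 'blocker end', 'polite start', 'polite end']
```

Even though both coroutines were scheduled together, polite() cannot even start until blocker() has finished, because blocker() never gives the loop a chance to switch.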
Interview-ready summary: The event loop is a single-threaded
scheduler running one coroutine at a time; await is the only point
where control can be voluntarily handed back to the loop, letting other
coroutines make progress while the current one waits on I/O. This
cooperative model gives massive I/O concurrency on one thread, but a
coroutine that blocks without awaiting stalls every other task.
Same suspension mechanism, different intent
import asyncio
def gen():  # generator: produces a SEQUENCE of values
    yield 1
    yield 2
async def coro():  # coroutine: produces ONE eventual result
    await asyncio.sleep(1)
    return 42
Both gen() and coro() return objects representing suspended
computation rather than running immediately. Under the hood, CPython's
native coroutines (async def) are implemented with the same
frame-suspension machinery that powers generators (historically, asyncio
was even built directly on generator-based coroutines, written with
@asyncio.coroutine and yield from, before native coroutine syntax arrived
in Python 3.5).
How they're driven differs
# Generator: driven by iteration
for value in gen():
    print(value)
# Coroutine: driven by the event loop, via await/asyncio.run
result = await coro() # inside another coroutine
result = asyncio.run(coro()) # or, at the top level
You can't for loop over a coroutine (it's not iterable in that sense),
and you can't await a plain generator (unless it's specifically
decorated as a generator-based coroutine, a legacy pattern superseded by
async def). Trying to iterate a coroutine directly, or await a plain
generator, raises a TypeError.
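Both failure modes can be reproduced in a few lines. This sketch collects the exception messages rather than asserting their exact wording, since the text of the messages varies between Python versions; the errors list and await_a_generator are illustrative names.

```python
import asyncio


def gen():
    yield 1


async def coro():
    return 42


errors = []

# 1) Iterating a coroutine object directly is a TypeError.
c = coro()
try:
    for _ in c:
        pass
except TypeError as exc:
    errors.append(str(exc))
finally:
    c.close()  # tidy up the coroutine we never ran


# 2) Awaiting a plain generator is also a TypeError.
async def await_a_generator():
    await gen()


try:
    asyncio.run(await_a_generator())
except TypeError as exc:
    errors.append(str(exc))

for message in errors:
    print(message)
```

Each attempt fails immediately, before any of the generator's or coroutine's body has run.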
Purpose: many values vs. one eventual value
- A generator's job is to lazily produce a sequence: yield each value, potentially infinitely many, consumed one at a time.
- A coroutine's job is to represent a single asynchronous operation that will eventually complete with one result (or raise), conceptually closer to a Future/Promise than to an iterator, even though it's implemented with similar suspension internals.
Async generators: a hybrid
import asyncio
async def async_range(n):
    for i in range(n):
        await asyncio.sleep(0)  # yield control back to the event loop
        yield i
async def main():
    async for i in async_range(5):  # async for is only valid inside an async def
        print(i)
asyncio.run(main())
Python also supports async generators (async def containing
yield), which combine both: they lazily produce a sequence and can
await between values, consumed with async for instead of a plain
for loop — used for streaming data over an async source (e.g., reading
paginated results from an async database driver).
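The pagination use case might look roughly like the sketch below. The fetch_page function is hypothetical: it stands in for a call to a real async driver or HTTP client and simply fabricates three rows per page for the first three pages.

```python
import asyncio


async def fetch_page(page):
    """Hypothetical stand-in for an async database or HTTP call."""
    await asyncio.sleep(0)  # pretend to wait on I/O
    if page >= 3:
        return []  # no more data
    return [page * 10 + i for i in range(3)]


async def paginate():
    """Async generator: awaits each page, then yields its rows one by one."""
    page = 0
    while True:
        rows = await fetch_page(page)
        if not rows:
            return
        for row in rows:
            yield row
        page += 1


async def main():
    # An async comprehension consumes the async generator lazily.
    return [row async for row in paginate()]


rows = asyncio.run(main())
print(rows)  # [0, 1, 2, 10, 11, 12, 20, 21, 22]
```

The consumer never sees page boundaries; it just iterates rows, while the generator fetches the next page only when the previous one has been exhausted.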
Interview-ready summary: Coroutines and generators share the same
suspend/resume mechanism, but generators (yield, driven by for/next)
model lazily producing a sequence of values, while coroutines (await,
driven by the event loop) model a single asynchronous operation resolving
to one eventual result — async generators combine both when you need a
lazily-produced sequence that can also await I/O between items.
Option 1: separate processes
from concurrent.futures import ProcessPoolExecutor
import math
def is_prime(n):
    if n < 2:
        return False
    # math.isqrt gives the exact integer square root, avoiding float rounding
    return all(n % i for i in range(2, math.isqrt(n) + 1))
numbers = list(range(10_000_000, 10_000_100))
if __name__ == "__main__":  # required wherever workers are spawned (e.g., Windows, macOS)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(is_prime, numbers))  # genuinely parallel across cores
Each worker process has its own Python interpreter and its own GIL, so CPU-bound work in different processes truly runs simultaneously on separate cores. The cost: data passed to/from worker processes must be pickled, and process startup has real overhead — this pays off for coarse-grained, CPU-heavy chunks of work, not for many tiny tasks.
Option 2: push the hot loop into native code that releases the GIL
import numpy as np
# Pure Python loop: single-threaded, GIL-bound the whole time
total = sum(x * x for x in range(1_000_000))
# NumPy: the actual multiply-and-sum runs in compiled C loops
arr = np.arange(1_000_000, dtype=np.int64)  # explicit int64; with 10_000_000 elements the sum of squares would overflow it
total = (arr * arr).sum()
NumPy (like similar C-extension libraries) does the heavy numeric work
inside C code, and for many operations it releases the GIL during the
computation. This is why NumPy-heavy code can benefit from threads even for
"CPU-bound" work: the actual bottleneck has moved out of GIL-held Python
bytecode into GIL-free C. Cython supports the same idea explicitly via
nogil blocks for hand-written extensions.
Why threading alone doesn't help here
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=4) as pool:
    # NOT faster than serial: still one thread executing Python bytecode at a time
    results = list(pool.map(is_prime, numbers))
Since is_prime is pure Python arithmetic (no I/O, no GIL-releasing C
call), running it across threads doesn't parallelize the actual work —
the GIL still serializes bytecode execution across all four threads.
Choosing between the two options
- Reach for multiprocessing when the computation is written in plain Python and can be chunked into independent units of work.
- Reach for NumPy/Cython/native extensions when the computation is numeric/vectorizable. This usually gives a far larger speedup than multiprocessing alone, since it avoids both GIL contention and Python's general interpreter overhead.
Interview-ready summary: For CPU-bound pure-Python work, use
multiprocessing/ProcessPoolExecutor to get separate interpreters (and
GILs) running truly in parallel. For numeric/vectorizable work, push the
hot loop into a library like NumPy that does the heavy lifting in C and
releases the GIL, which often outperforms multiprocessing while avoiding
its startup and pickling overhead.