Concurrency, Parallelism & Async

What the GIL actually locks

CPython's memory management relies on reference counting: every object tracks how many references point to it, and is freed when that count hits zero. Incrementing/decrementing a refcount from multiple threads simultaneously, without synchronization, is a data race that could corrupt an object's refcount (leading to premature frees or memory leaks). The GIL solves this crudely but effectively: only one thread runs Python bytecode at a time, so refcount updates are never actually concurrent.

import threading

counter = 0

def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1

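threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # often 4,000,000 on recent CPython, but not guaranteed: "counter += 1" is not atomic

The GIL protects the interpreter's internals (such as refcounts), not your program's invariants. counter += 1 compiles to several bytecodes (load, add, store), and a thread switch between them can lose an update, so shared mutable state still needs a threading.Lock. Without the GIL (or equivalent fine-grained locking), the same unsynchronized update would lose updates far more readily.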
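
A minimal sketch of the usual fix for shared mutable state, assuming an explicit threading.Lock around each update. SafeCounter and run_counter are illustrative names, not from any library:

```python
import threading

class SafeCounter:
    """Counter whose increment is a lock-protected read-modify-write."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:          # only one thread at a time may update value
            self.value += 1

def run_counter(num_threads=4, iterations=100_000):
    """Increment one shared counter from several threads; return the final value."""
    counter = SafeCounter()

    def work():
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place the result does not depend on when the interpreter happens to switch threads, so run_counter(4, 100_000) returns 400,000 on any build.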

Why "more threads" doesn't mean "more CPU throughput"

def cpu_bound(n):
    return sum(i * i for i in range(n))

# Running cpu_bound() on 4 threads doesn't run 4x faster:
# the GIL lets only one thread execute Python bytecode at any instant.

For CPU-bound pure-Python work, threads provide concurrency (multiple things making progress, interleaved) but not parallelism (multiple things running simultaneously on separate cores) — the GIL serializes bytecode execution regardless of how many OS threads and CPU cores exist.
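
One way to check this on your own machine is to time the same jobs serially and on a thread pool. The sketch below assumes a standard GIL-enabled CPython. compare is an illustrative helper, and the exact timings will vary by machine and version:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    return sum(i * i for i in range(n))

def compare(n=200_000, jobs=4):
    """Return (serial_seconds, threaded_seconds, results_match)."""
    start = time.perf_counter()
    serial = [cpu_bound(n) for _ in range(jobs)]
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        threaded = list(pool.map(cpu_bound, [n] * jobs))
    threaded_s = time.perf_counter() - start

    return serial_s, threaded_s, serial == threaded
```

On a GIL-enabled build the two timings typically come out close to each other rather than differing by a factor of jobs; a free-threaded build may behave differently.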

Why threading still helps for I/O-bound work

import time

def slow_io():
    time.sleep(1)   # releases the GIL while "blocked"

Blocking operations that call into C (file/network I/O, time.sleep, many library calls) release the GIL while waiting, letting other Python threads run bytecode in the meantime. This is why threading/concurrent.futures.ThreadPoolExecutor genuinely speed up I/O-bound workloads (e.g., many concurrent HTTP requests) even though the GIL exists — the bottleneck (waiting on the network) isn't CPU work at all.
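
A small sketch of that overlap, using time.sleep as a stand-in for a blocking network call (timed_io is an illustrative helper, not a library function):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_io(delay):
    time.sleep(delay)   # stand-in for blocking I/O; the GIL is released while waiting
    return delay

def timed_io(jobs=4, delay=0.2):
    """Run `jobs` blocking calls on a thread pool; return (elapsed_seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(slow_io, [delay] * jobs))
    return time.perf_counter() - start, results
```

Four 0.2-second waits finish in roughly 0.2 seconds in total rather than 0.8, because the threads spend their time waiting, not holding the GIL.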

The real workaround for CPU-bound parallelism: separate processes

Since each Python process has its own interpreter and therefore its own GIL, multiprocessing sidesteps the problem entirely by running separate processes, achieving true multi-core parallelism for CPU-bound work. The cost is inter-process communication overhead: data must be pickled and copied between processes, not shared directly.

PEP 703: free-threaded (no-GIL) Python

Starting with Python 3.13, an experimental free-threaded build (python3.13t) removes the GIL, using more fine-grained locking instead — aiming to give real multi-core parallelism to threaded Python code. As of this writing it's still opt-in and the ecosystem (C extensions especially) is still adapting; the standard GIL-enabled build remains the default.
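
If you need to know which kind of interpreter your code is running on, a hedged sketch follows. It relies on the Py_GIL_DISABLED build variable and on sys._is_gil_enabled(), which is available from 3.13 onward; gil_status is an illustrative name:

```python
import sys
import sysconfig

def gil_status():
    """Return (built_free_threaded, gil_enabled_now) for the running interpreter."""
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on 3.13+; older versions always run with the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = probe() if probe is not None else True
    return built_free_threaded, gil_enabled_now
```

The two values can differ: a free-threaded build may still run with the GIL re-enabled, for example when an extension module that has not declared free-threading support is imported.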

Interview-ready summary: The GIL is CPython's mutex ensuring only one thread executes Python bytecode at a time, needed because refcount-based memory management isn't otherwise thread-safe. It doesn't prevent threading from helping I/O-bound work (the GIL is released during blocking calls), but it does prevent threads from speeding up CPU-bound pure-Python code — for that, use multiprocessing, or Python 3.13+'s experimental free-threaded build.

The decision framework

Workload | Best tool | Why
CPU-bound (heavy computation) | multiprocessing | Bypasses the GIL via separate processes; actual multi-core parallelism
I/O-bound, moderate concurrency (10s-100s) | threading | Simple to retrofit onto existing sync code; GIL releases during blocking I/O
I/O-bound, very high concurrency (1000s of connections) | asyncio | One thread, no per-task OS thread overhead; scales to far more concurrent tasks

Threading: easiest retrofit for I/O-bound code

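from concurrent.futures import ThreadPoolExecutor
import requests

urls = ["https://example.com"] * 5   # placeholder URLs; substitute your own

def fetch(url):
    # A timeout keeps one stalled server from tying up a worker thread forever
    return requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch, urls))   # results come back in input order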

Existing synchronous libraries (like requests) work unmodified inside threads — no need to rewrite calls as async/await. Downside: each thread has real OS overhead (~MBs of stack space each), so this doesn't scale gracefully to tens of thousands of concurrent tasks.

Multiprocessing: real parallelism for CPU-bound work

from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    return sum(i * i for i in range(n))

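if __name__ == "__main__":   # required where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_heavy, [10**7] * 4))   # up to 4 tasks in parallel, one per free core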

Each process has its own interpreter and GIL, so cpu_heavy genuinely runs in parallel across cores — at the cost of process startup overhead and needing to pickle data across the process boundary (no shared memory by default).

Asyncio: massive I/O concurrency, single thread

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return resp.status

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

urls = ["https://example.com"] * 5   # placeholder URLs; substitute your own
statuses = asyncio.run(main(urls))   # the same pattern scales to thousands of concurrent requests

A single thread cooperatively switches between thousands of pending coroutines whenever one is waiting on I/O — no OS thread per task, so memory/scheduling overhead per concurrent task is far lower than threading. The catch: it requires an async-compatible library stack (aiohttp instead of requests, asyncpg instead of a blocking DB driver) — mixing in a blocking call anywhere freezes the entire event loop, not just one task.
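
The freeze is easy to observe. In the sketch below (heartbeat and ticks_during are illustrative names), a background task records a tick every 50 ms while 0.3 seconds of "work" runs, either as a direct blocking call or handed to a worker thread with asyncio.to_thread (Python 3.9+):

```python
import asyncio
import time

async def heartbeat(ticks, interval=0.05):
    """Record a timestamp every `interval` seconds until cancelled."""
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)

async def ticks_during(style, duration=0.3):
    """Count heartbeat ticks recorded while `duration` seconds of work runs."""
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)                  # let the heartbeat record its first tick
    if style == "blocking":
        time.sleep(duration)                # blocks the loop: no ticks in the meantime
    else:
        await asyncio.to_thread(time.sleep, duration)   # loop stays responsive
    beat.cancel()
    return len(ticks)
```

The blocking variant records a single tick (the one taken before the call), while the to_thread variant keeps ticking throughout. When no async-native library is available, wrapping the blocking call this way is a common stopgap.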

Combining them

It's common to combine approaches: use asyncio for I/O concurrency, and delegate genuinely CPU-bound chunks of work to a ProcessPoolExecutor via loop.run_in_executor(...) so they don't block the event loop.
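
A minimal sketch of that combination, assuming the CPU-bound function is defined at module level so it can be pickled. crunch and main are illustrative names:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    """CPU-bound work; module-level so worker processes can import it."""
    return sum(i * i for i in range(n))

async def crunch(executor, sizes):
    """Run cpu_heavy for each size in `executor` without blocking the event loop."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, cpu_heavy, n) for n in sizes]
    return await asyncio.gather(*futures)   # results in the same order as sizes

async def main():
    with ProcessPoolExecutor() as pool:
        return await crunch(pool, [10**6] * 4)
```

Call asyncio.run(main()) from under an if __name__ == "__main__": guard so that spawned worker processes do not re-run it. Passing None as the executor falls back to the loop's default thread pool, which keeps the loop responsive but does not give multi-core parallelism for pure-Python work.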

Interview-ready summary: Pick multiprocessing for CPU-bound parallelism (under the GIL, threads cannot speed up pure-Python computation), threading for moderate I/O concurrency with minimal code changes, and asyncio when you need very high I/O concurrency and are willing to adopt an async library stack throughout.

async def creates a coroutine function

async def fetch_data():
    print("start")
    await asyncio.sleep(1)   # suspend here; event loop runs other work meanwhile
    print("done")
    return 42

coro = fetch_data()   # nothing has run yet -- just a coroutine object

Calling fetch_data() does not execute the body — like a generator, it returns a coroutine object that must be driven (via await, asyncio.run, or scheduled as a task) for its code to actually execute.
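
This is easy to verify by recording side effects instead of printing them. A small sketch (events and demo are illustrative names):

```python
import asyncio

events = []

async def fetch_data():
    events.append("start")
    await asyncio.sleep(0)
    events.append("done")
    return 42

def demo():
    events.clear()
    coro = fetch_data()            # creates the coroutine object only
    ran_on_call = bool(events)     # False: the body has not started yet
    result = asyncio.run(coro)     # the event loop drives it to completion
    return ran_on_call, result, list(events)
```

demo() returns (False, 42, ["start", "done"]): nothing is recorded at call time, and both events appear only once the loop drives the coroutine.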

The event loop: single-threaded cooperative scheduling

import asyncio

async def worker(name, delay):
    print(f"{name} starting")
    await asyncio.sleep(delay)
    print(f"{name} done")

async def main():
    await asyncio.gather(
        worker("A", 2),
        worker("B", 1),
    )

asyncio.run(main())
# A starting
# B starting
# B done      <- after ~1s
# A done      <- after ~2s (not 3s! -- they ran concurrently)

asyncio.gather schedules both worker coroutines as concurrent tasks. When worker("A", 2) hits await asyncio.sleep(2), it tells the event loop "wake me up in 2 seconds, and meanwhile run something else", and the loop then runs worker("B", 1) until it also suspends. This is why the total time is ~2s (the max), not ~3s (the sum). The two sleeps overlap even though nothing runs in parallel: a single thread interleaves the coroutines, and neither blocks the other while it waits.
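
To measure the overlap rather than read it off the printed output, a sketch along these lines works (timed_gather is an illustrative helper, and the delays are shortened to keep it quick):

```python
import asyncio
import time

async def worker(delay):
    await asyncio.sleep(delay)
    return delay

async def timed_gather(delays):
    """Run one worker per delay concurrently; return (elapsed_seconds, results)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(worker(d) for d in delays))
    return time.perf_counter() - start, results
```

For delays of 0.2 and 0.1 seconds the elapsed time is close to 0.2 (the maximum), not 0.3 (the sum). gather also returns results in argument order, regardless of which worker finished first.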

What await actually does

await only works on awaitables (coroutines, Tasks, Futures). It:

  1. Suspends the current coroutine at that point, saving its state (much like a generator's suspended frame — coroutines are, in fact, implemented on the same underlying mechanism as generators).
  2. Registers a callback so the event loop knows to resume this coroutine once the awaited thing completes.
  3. Returns control to the event loop, which picks another ready task or callback to run.
  4. When the awaited operation finishes, the event loop resumes the original coroutine exactly where it left off, and await evaluates to the awaited thing's result.
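
The four steps can be traced with a bare Future that is completed by hand. This is a sketch for illustration; waiter and demo are hypothetical names:

```python
import asyncio

async def waiter(future, log):
    log.append("waiter: about to await")
    value = await future                          # steps 1-3: suspend, register, yield
    log.append(f"waiter: resumed with {value}")   # step 4: resumed with the result
    return value

async def demo():
    log = []
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    task = asyncio.create_task(waiter(future, log))
    await asyncio.sleep(0)                # let waiter run up to its await
    log.append("main: loop is free while waiter is suspended")
    future.set_result("payload")          # completing the future schedules the resume
    result = await task
    return result, log
```

The log shows waiter stopping at its await, other code running in between, and waiter picking up exactly where it left off once the Future has a result.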

Cooperative, not preemptive

Because there's no operating-system-level time-slicing, a coroutine that never awaits anything (e.g., a tight CPU-bound loop with no suspension points) blocks the entire event loop; no other coroutine gets to run until it returns. This is the single most important asyncio rule: control goes back to the loop only at an await (or an async for / async with step) that actually suspends. Ordinary synchronous code inside an async def function runs to completion without interruption.
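
A sketch that makes the scheduling order visible. Here busy stands in for a CPU-heavy loop, and await asyncio.sleep(0) is used as an explicit suspension point (all names are illustrative):

```python
import asyncio

async def busy(log, steps, cooperative):
    for i in range(steps):
        log.append(f"busy {i}")        # stand-in for a slice of CPU work
        if cooperative:
            await asyncio.sleep(0)     # explicit suspension point

async def other(log):
    log.append("other ran")

async def run_order(cooperative):
    log = []
    await asyncio.gather(busy(log, 3, cooperative), other(log))
    return log
```

Without the suspension point, other cannot run until busy has finished all three steps; with it, other gets its turn right after the first step. For genuinely heavy computation, moving the work to an executor is usually a better fit than sprinkling sleep(0) calls through it.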

Interview-ready summary: The event loop is a single-threaded scheduler running one coroutine at a time; await is the only point where control can be voluntarily handed back to the loop, letting other coroutines make progress while the current one waits on I/O. This cooperative model gives massive I/O concurrency on one thread, but a coroutine that blocks without awaiting stalls every other task.

Same suspension mechanism, different intent

def gen():                  # generator: produces a SEQUENCE of values
    yield 1
    yield 2

async def coro():           # coroutine: produces ONE eventual result
    await asyncio.sleep(1)
    return 42

Both gen() and coro() return objects representing suspended computation rather than running immediately. Under the hood, CPython's native coroutines (async def) are implemented with the same frame-suspension machinery that powers generators. Historically, asyncio was even built directly on generator-based coroutines (generators decorated with @asyncio.coroutine and chained with yield from) before native coroutine syntax arrived in Python 3.5.
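
The shared machinery shows up if you drive a coroutine by hand with send(), the same method generators expose. The sketch below is for illustration only (real code lets the event loop do this); Pause and drive are hypothetical names:

```python
class Pause:
    """Minimal awaitable: suspends once, then evaluates to whatever is sent in."""

    def __await__(self):
        received = yield "paused"      # a plain generator yield does the suspending
        return received

async def coro():
    value = await Pause()
    return f"got {value}"

def drive():
    c = coro()
    first = c.send(None)               # runs until the yield inside Pause.__await__
    try:
        c.send("resumed")              # resume; the coroutine then runs to its return
    except StopIteration as stop:
        return first, stop.value       # the return value rides on StopIteration
```

drive() returns ("paused", "got resumed"): the coroutine suspends at a yield and finishes by raising StopIteration with its return value, just as a generator would.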

How they're driven differs

# Generator: driven by iteration
for value in gen():
    print(value)

# Coroutine: driven by the event loop, via await/asyncio.run
result = await coro()          # inside another coroutine
result = asyncio.run(coro())    # or, at the top level

You can't for loop over a coroutine (it's not iterable in that sense), and you can't await a plain generator (unless it's specifically decorated as a generator-based coroutine, a legacy pattern superseded by async def). Trying to iterate a coroutine directly, or await a plain generator, raises a TypeError.
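
Both failure modes can be reproduced in a few lines. This is a sketch; the exact error messages vary between Python versions, so only the exception type should be relied on:

```python
import asyncio

def gen():
    yield 1

async def coro():
    return 42

def iterate_coroutine():
    """Try to loop over a coroutine object; return the TypeError message."""
    c = coro()
    try:
        for _ in c:                    # coroutine objects are not iterable
            pass
    except TypeError as exc:
        return str(exc)
    finally:
        c.close()                      # avoids a "never awaited" warning

async def await_generator():
    """Try to await a plain generator; return the TypeError message."""
    try:
        await gen()                    # plain generators are not awaitable
    except TypeError as exc:
        return str(exc)
```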

Purpose: many values vs. one eventual value

  • A generator's job is to lazily produce a sequence: yield each value, potentially infinitely many, consumed one at a time.
  • A coroutine's job is to represent a single asynchronous operation that will eventually complete with one result (or raise) — conceptually closer to a Future/Promise than to an iterator, even though it's implemented with similar suspension internals.

Async generators: a hybrid

import asyncio

async def async_range(n):
    for i in range(n):
        await asyncio.sleep(0)   # yield control back to the event loop
        yield i

async def main():
    async for i in async_range(5):   # async for is only valid inside async def
        print(i)

asyncio.run(main())

Python also supports async generators (async def containing yield), which combine both: they lazily produce a sequence and can await between values, consumed with async for instead of a plain for loop — used for streaming data over an async source (e.g., reading paginated results from an async database driver).
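
A sketch of the pagination case, with an in-memory PAGES list and a fetch_page coroutine standing in for a real async driver (both are hypothetical):

```python
import asyncio

PAGES = [[1, 2], [3, 4], [5]]          # hypothetical paginated data

async def fetch_page(index):
    await asyncio.sleep(0)             # stand-in for an async database or HTTP call
    return PAGES[index] if index < len(PAGES) else []

async def iter_rows():
    """Yield rows one at a time, fetching the next page only when it is needed."""
    index = 0
    while True:
        page = await fetch_page(index)
        if not page:
            return                     # an empty page ends the stream
        for row in page:
            yield row
        index += 1

async def collect():
    return [row async for row in iter_rows()]
```

The consumer sees a flat stream of rows and never deals with page boundaries; asyncio.run(collect()) gathers them into [1, 2, 3, 4, 5].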

Interview-ready summary: Coroutines and generators share the same suspend/resume mechanism, but generators (yield, driven by for/next) model lazily producing a sequence of values, while coroutines (await, driven by the event loop) model a single asynchronous operation resolving to one eventual result — async generators combine both when you need a lazily-produced sequence that can also await I/O between items.

Option 1: separate processes

from concurrent.futures import ProcessPoolExecutor
import math

def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))   # isqrt avoids float rounding

if __name__ == "__main__":   # required where workers are spawned (Windows, macOS)
    numbers = list(range(10_000_000, 10_000_100))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(is_prime, numbers))   # genuinely parallel across cores

Each worker process has its own Python interpreter and its own GIL, so CPU-bound work in different processes truly runs simultaneously on separate cores. The cost: data passed to/from worker processes must be pickled, and process startup has real overhead — this pays off for coarse-grained, CPU-heavy chunks of work, not for many tiny tasks.
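
One knob for the "many tiny tasks" problem is the chunksize argument of Executor.map, which sends items to worker processes in batches instead of one at a time. A hedged sketch follows; count_primes and main are illustrative names, and 500 is an arbitrary starting point rather than a tuned value:

```python
import concurrent.futures
import math

def is_prime(n):
    if n < 2:
        return False
    return all(n % i for i in range(2, math.isqrt(n) + 1))

def count_primes(executor, numbers, chunksize=1):
    """Count primes, handing `chunksize` numbers to a worker per round trip."""
    return sum(executor.map(is_prime, numbers, chunksize=chunksize))

def main():
    numbers = range(10_000_000, 10_010_000)
    with concurrent.futures.ProcessPoolExecutor() as pool:
        # Larger chunks mean fewer pickling round trips between processes.
        return count_primes(pool, numbers, chunksize=500)
```

Call main() from under an if __name__ == "__main__": guard. chunksize only has an effect with ProcessPoolExecutor; thread pools accept the argument but ignore it.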

Option 2: push the hot loop into native code that releases the GIL

import numpy as np

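# Pure Python loop: single-threaded, GIL-bound the whole time
total = sum(x * x for x in range(1_000_000))

# NumPy: the multiply-and-sum runs in compiled C loops, which release the GIL for many operations
# (n is kept at 1,000,000 so the sum of squares fits in int64; at 10,000,000 it would silently overflow)
arr = np.arange(1_000_000, dtype=np.int64)
total = (arr * arr).sum()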

NumPy (and similar C-extension libraries) do the heavy numeric work inside compiled code, and many of their operations release the GIL while they run. This is why NumPy-heavy code can benefit from threads even for "CPU-bound" work: the actual bottleneck has moved out of GIL-held Python bytecode into GIL-free C. Cython supports the same idea explicitly via nogil blocks for hand-written extensions.

Why threading alone doesn't help here

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as pool:
    # NOT faster than serial: still one thread executing Python bytecode at a time
    results = list(pool.map(is_prime, numbers))

Since is_prime is pure Python arithmetic (no I/O, no GIL-releasing C call), running it across threads doesn't parallelize the actual work — the GIL still serializes bytecode execution across all four threads.

Choosing between the two options

  • Reach for multiprocessing when the computation is written in plain Python and can be chunked into independent units of work.
  • Reach for NumPy/Cython/native extensions when the computation is numeric/vectorizable — this usually gives a far larger speedup than multiprocessing alone, since it avoids both GIL contention and Python's general interpreter overhead.

Interview-ready summary: For CPU-bound pure-Python work, use multiprocessing/ProcessPoolExecutor to get separate interpreters (and GILs) running truly in parallel. For numeric/vectorizable work, push the hot loop into a library like NumPy that does the heavy lifting in C and releases the GIL, which often beats multiprocessing's overhead entirely.