What are some practical techniques to optimize slow Python code?
Quick Answer
In order of typical impact: fix algorithmic complexity first (an O(n²) loop beats any micro-optimization of an O(n) one); replace `list`-based membership checks with `set`/`dict`; push hot numeric loops into vectorized libraries (NumPy) or compiled extensions (Cython, Rust via PyO3); use built-in functions/comprehensions (implemented in C) over manual Python loops; cache expensive pure computations with `functools.lru_cache`; and only reach for micro-optimizations after profiling confirms where time is actually spent.
Detailed Answer
1. Fix algorithmic complexity before anything else
# O(n^2) -- checking membership in a list, inside a loop over another list
duplicates = [x for x in list_a if x in list_b] # list_b scanned linearly, every time
# O(n) -- convert to a set once, then O(1) average membership checks
set_b = set(list_b)
duplicates = [x for x in list_a if x in set_b]
No amount of micro-optimizing the O(n²) version beats simply removing the quadratic behavior — this single change (converting a repeatedly-scanned list into a set) is often the highest-leverage optimization available in real code.
2. Prefer built-ins and comprehensions over manual Python loops
# Slower -- Python-level loop overhead on every iteration
total = 0
for x in numbers:
total += x * x
# Faster -- sum()/comprehension push the loop into C
total = sum(x * x for x in numbers)
Built-in functions (sum, map, sorted, any, all) and
comprehensions are implemented in C and avoid the per-iteration overhead
of the Python bytecode interpreter loop — a consistent, easy win whenever
applicable.
3. Cache expensive, pure computations
from functools import lru_cache
@lru_cache(maxsize=None)
def expensive_pure_function(n):
...
Memoization turns repeated calls with the same arguments from full
recomputation into O(1) lookups — a large win whenever a pure function is
called repeatedly with overlapping arguments (see the lru_cache
question for caveats).
4. Push numeric/hot loops into vectorized or compiled code
# Pure Python loop -- slow for large arrays
result = [x * 2 + 1 for x in large_list]
# NumPy -- the actual loop runs in C, operating on contiguous memory
import numpy as np
arr = np.array(large_list)
result = arr * 2 + 1
For numeric workloads, NumPy's vectorized operations (and, for more custom logic, Cython or a Rust extension via PyO3) can be one to two orders of magnitude faster than an equivalent pure-Python loop, since they avoid both the per-element interpreter overhead and (for NumPy) operate on tightly-packed memory instead of a list of individually boxed Python objects.
5. Avoid unnecessary object creation in hot paths
# Builds an intermediate list just to throw it away
if [x for x in items if x.active]: # wasteful -- builds the whole list just to check truthiness
...
# any() short-circuits on the first match -- no full list built
if any(x.active for x in items):
...
Small allocation-avoidance changes (generator expressions over list
comprehensions when the full list isn't needed, str.join() over
repeated += string concatenation in a loop) add up in hot code paths.
6. Always profile before and after
Every technique above should be validated against a cProfile/timeit
measurement of the actual bottleneck — optimizing code that isn't on
the hot path adds complexity and risk for no measurable benefit.
Interview-ready summary: Algorithmic complexity dominates all other optimizations — fix O(n²) patterns first. After that, prefer built-ins/ vectorized operations over manual Python loops, cache pure expensive computations, and avoid unnecessary intermediate allocations — but always let profiling data, not intuition, decide where to spend optimization effort.