What are some practical techniques to optimize slow Python code?

Detailed Answer

1. Fix algorithmic complexity before anything else

# O(n^2) -- checking membership in a list, inside a loop over another list
duplicates = [x for x in list_a if x in list_b]        # list_b scanned linearly, every time

# O(n) -- convert to a set once, then O(1) average membership checks
set_b = set(list_b)
duplicates = [x for x in list_a if x in set_b]

No amount of micro-optimizing the O(n²) version beats simply removing the quadratic behavior — this single change (converting a repeatedly-scanned list into a set) is often the highest-leverage optimization available in real code.

2. Prefer built-ins and comprehensions over manual Python loops

# Slower -- Python-level loop overhead on every iteration
total = 0
for x in numbers:
    total += x * x

# Faster -- sum()/comprehension push the loop into C
total = sum(x * x for x in numbers)

Built-in functions (sum, map, sorted, any, all) and comprehensions are implemented in C and avoid the per-iteration overhead of the Python bytecode interpreter loop — a consistent, easy win whenever applicable.

3. Cache expensive, pure computations

from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_pure_function(n):
    ...

Memoization turns repeated calls with the same arguments from full recomputation into O(1) lookups — a large win whenever a pure function is called repeatedly with overlapping arguments (see the lru_cache question for caveats).

4. Push numeric/hot loops into vectorized or compiled code

# Pure Python loop -- slow for large arrays
result = [x * 2 + 1 for x in large_list]

# NumPy -- the actual loop runs in C, operating on contiguous memory
import numpy as np
arr = np.array(large_list)
result = arr * 2 + 1

For numeric workloads, NumPy's vectorized operations (and, for more custom logic, Cython or a Rust extension via PyO3) can be one to two orders of magnitude faster than an equivalent pure-Python loop, since they avoid both the per-element interpreter overhead and (for NumPy) operate on tightly-packed memory instead of a list of individually boxed Python objects.

5. Avoid unnecessary object creation in hot paths

# Builds an intermediate list just to throw it away
if [x for x in items if x.active]:   # wasteful -- builds the whole list just to check truthiness
    ...

# any() short-circuits on the first match -- no full list built
if any(x.active for x in items):
    ...

Small allocation-avoidance changes (generator expressions over list comprehensions when the full list isn't needed, str.join() over repeated += string concatenation in a loop) add up in hot code paths.

6. Always profile before and after

Every technique above should be validated against a cProfile/timeit measurement of the actual bottleneck — optimizing code that isn't on the hot path adds complexity and risk for no measurable benefit.

Interview-ready summary: Algorithmic complexity dominates all other optimizations — fix O(n²) patterns first. After that, prefer built-ins/ vectorized operations over manual Python loops, cache pure expensive computations, and avoid unnecessary intermediate allocations — but always let profiling data, not intuition, decide where to spend optimization effort.