How do generators help with memory efficiency when processing large datasets?
Quick Answer
A generator processes one item at a time and never materializes the full dataset in memory, so you can stream through a file, database cursor, or network response of arbitrary size using **constant memory** instead of O(n). Chaining multiple generator-based transformation steps together builds a lazy pipeline where each item flows through all stages before the next item is even read, rather than each stage completing fully before the next starts.
Detailed Answer
The eager approach: loads everything at once
def read_all_lines(path):
    with open(path) as f:
        return f.readlines()  # entire file in memory as a list of strings

lines = read_all_lines("100gb.log")  # boom -- won't fit in memory
The generator approach: constant memory, streamed
def read_lines(path):
    with open(path) as f:
        for line in f:  # file objects are themselves iterators
            yield line.strip()

for line in read_lines("100gb.log"):  # one line in memory at a time
    process(line)
Since a file object is already a lazy iterator over its lines, wrapping it
in a generator function adds only the small, fixed overhead of the
generator object itself. Only the current line (plus the file's read
buffer and whatever process() needs) is ever resident, regardless of
whether the file is 1 KB or 100 GB.
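
To make the size difference concrete, here is a minimal sketch that compares a fully built list with an equivalent generator. It uses synthetic numbers rather than a file so it runs anywhere; `squares` and `n` are hypothetical names chosen for illustration, and the exact byte counts depend on the Python build.

```python
import sys

def squares(n):
    """Hypothetical generator: yields n squares, one at a time."""
    for i in range(n):
        yield i * i

n = 1_000_000
eager = [i * i for i in range(n)]  # all n results exist at once
lazy = squares(n)                  # nothing has been computed yet

# sys.getsizeof measures only the container object, not the integers it
# refers to, which is already enough to show how each one scales with n.
list_bytes = sys.getsizeof(eager)  # grows with n (several MB here)
gen_bytes = sys.getsizeof(lazy)    # small and independent of n

total = sum(lazy)  # consumes the generator one value at a time
print(list_bytes, gen_bytes, total)
```

The generator's size stays the same whether `n` is ten or ten billion, which is the property the file example relies on.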
Chaining generators into a lazy pipeline
def read_lines(path):
    with open(path) as f:
        yield from (line.strip() for line in f)

def parse(lines):
    for line in lines:
        yield line.split(",")

def filter_valid(records):
    for r in records:
        if len(r) == 3:
            yield r

pipeline = filter_valid(parse(read_lines("data.csv")))
for record in pipeline:
    handle(record)
Crucially, this pipeline processes one row all the way through (read →
parse → filter → handle) before reading the next row; it never builds an
intermediate list at any stage. Compare this to an eager version in which
each stage returns a list (for example, one built with a list
comprehension): the same call chain, filter_valid(parse(read_lines(...))),
would fully read the file, then fully parse every line, then fully filter,
with each stage making a complete pass and holding a complete intermediate
list. For large data, that's the difference between using a few KB of
memory and running out of RAM.
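
The ordering claim can be checked directly. The sketch below records every step in a shared log and runs the same two stages lazily and eagerly; `numbers`, `double`, `run_lazy`, and `run_eager` are hypothetical stand-ins for the read and parse stages, not part of the pipeline above.

```python
def numbers(n, log):
    """Stand-in for the 'read' stage: yields 0..n-1 and logs each read."""
    for i in range(n):
        log.append(f"read {i}")
        yield i

def double(items, log):
    """Stand-in for a transformation stage: logs, then yields x * 2."""
    for x in items:
        log.append(f"double {x}")
        yield x * 2

def run_lazy(n):
    log = []
    for y in double(numbers(n, log), log):  # stages are interleaved
        log.append(f"handle {y}")
    return log

def run_eager(n):
    log = []
    read = list(numbers(n, log))       # stage 1 runs to completion
    doubled = list(double(read, log))  # then stage 2 runs to completion
    for y in doubled:
        log.append(f"handle {y}")
    return log

lazy_log = run_lazy(2)
eager_log = run_eager(2)
print(lazy_log)   # read 0, double 0, handle 0, read 1, double 1, handle 2
print(eager_log)  # read 0, read 1, double 0, double 1, handle 0, handle 2
```

Both versions do the same work and produce the same results; only the order differs, and with it the number of items alive at any moment.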
The tradeoff: no random access, single-pass only
Generators give up len(), indexing, and re-iteration in exchange for
constant memory. That's the right trade for streaming or one-pass
processing. If you genuinely need to look at the data multiple times or
access it by index, you have to materialize it (as a list, say) at some
point, or re-create the generator and pay for a second pass. Generators
simply let you defer or skip materialization when you don't need it.
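
A short sketch of what these limits look like in practice; `count_up` is a hypothetical generator used only for illustration, and `itertools.islice` is shown as one standard-library way to reach a position by skipping items rather than by true indexing.

```python
from itertools import islice

def count_up(n):
    """Hypothetical generator yielding 0..n-1."""
    for i in range(n):
        yield i

gen = count_up(5)
first_pass = list(gen)   # consumes every item
second_pass = list(gen)  # the same generator is now exhausted

# Generators have no length: len() raises TypeError.
try:
    len(count_up(5))
    has_len = True
except TypeError:
    has_len = False

# No gen[2] either; islice skips items 0 and 1 to reach position 2,
# which still costs a walk through everything before it.
third = next(islice(count_up(5), 2, 3))

print(first_pass, second_pass, has_len, third)
```

Calling `count_up(5)` again gives a fresh generator, so a second pass is possible when the underlying source can be re-read; the cost is redoing the work instead of storing the results.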
Interview-ready summary: Generators trade eager, full-memory computation for lazy, one-item-at-a-time computation, which is what lets Python process files, streams, or datasets far larger than available memory. Chaining generators together builds a pipeline that processes each item through every stage before moving to the next, avoiding intermediate list materialization entirely.