How do generators help with memory efficiency when processing large datasets?

Detailed Answer

The eager approach: loads everything at once

def read_all_lines(path):
    with open(path) as f:
        return f.readlines()   # entire file in memory as a list of strings

lines = read_all_lines("100gb.log")   # boom -- won't fit in memory

The generator approach: constant memory, streamed

def read_lines(path):
    with open(path) as f:
        for line in f:            # file objects are themselves iterators
            yield line.strip()

for line in read_lines("100gb.log"):   # one line in memory at a time
    process(line)

Since a file object is already an iterator over its lines, wrapping it in a generator function adds no meaningful memory overhead. Only the current line, the file's small read buffer, and whatever process() needs are resident at any moment, whether the file is 1 KB or 100 GB.
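One way to check this claim on a small scale is to compare peak allocations with the standard-library tracemalloc module. The sketch below is illustrative only: the temporary sample file, its 50,000 lines, and the helpers peak_bytes, count_eager, and count_lazy are assumptions made for the demo, and the exact byte counts will vary by Python version and platform.

```python
import os
import tempfile
import tracemalloc

def read_all_lines(path):
    with open(path) as f:
        return f.readlines()          # whole file as a list of strings

def read_lines(path):
    with open(path) as f:
        for line in f:                # streamed one line at a time
            yield line.strip()

def peak_bytes(fn):
    """Run fn() and return the peak traced allocation in bytes (hypothetical helper)."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def count_eager(path):
    return len(read_all_lines(path))           # holds every line at once

def count_lazy(path):
    return sum(1 for _ in read_lines(path))    # holds one line at a time

# Build a small throwaway file to stand in for a large log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    for i in range(50_000):
        tmp.write(f"event {i}\n")
    sample_path = tmp.name

try:
    eager_peak = peak_bytes(lambda: count_eager(sample_path))
    lazy_peak = peak_bytes(lambda: count_lazy(sample_path))
finally:
    os.remove(sample_path)

print(f"eager peak: {eager_peak:,} bytes")
print(f"lazy peak:  {lazy_peak:,} bytes")
```

The eager peak grows with the number of lines, while the lazy peak should stay roughly flat if you increase the line count.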

Chaining generators into a lazy pipeline

def read_lines(path):
    with open(path) as f:
        yield from (line.strip() for line in f)

def parse(lines):
    for line in lines:
        yield line.split(",")

def filter_valid(records):
    for r in records:
        if len(r) == 3:
            yield r

pipeline = filter_valid(parse(read_lines("data.csv")))
for record in pipeline:
    handle(record)

Crucially, this pipeline pushes one row all the way through (read → parse → filter → handle) before reading the next row; it never builds an intermediate list at any stage. Compare this to an eager version in which each stage returns a list (for example, a list comprehension in place of each yield loop): the same call, filter_valid(parse(read_lines(...))), would read the whole file, then parse every line, then filter every record, with each stage making a full pass and holding a full intermediate list. For large data, that is the difference between memory use that stays roughly constant (about one row at a time) and running out of RAM.
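The interleaving can be made visible by logging each stage as it runs. This is a minimal sketch under assumptions: the events log, the emit_lines stage (which takes an in-memory list instead of a file path), and the sample rows are all hypothetical, added only to trace the order of execution.

```python
events = []   # hypothetical trace log, records the order in which stages run

def emit_lines(lines):
    # Stand-in for read_lines(path): iterates an in-memory list, not a file.
    for line in lines:
        events.append(f"read {line}")
        yield line.strip()

def parse(lines):
    for line in lines:
        events.append(f"parse {line}")
        yield line.split(",")

def filter_valid(records):
    for r in records:
        events.append(f"filter {r[0]}")
        if len(r) == 3:
            yield r

raw = ["a,1,x", "b,2", "c,3,z"]   # the middle row has only two fields

handled = []
for record in filter_valid(parse(emit_lines(raw))):
    events.append(f"handle {record[0]}")
    handled.append(record)

for event in events:
    print(event)
```

The log shows row "a" being read, parsed, filtered, and handled before row "b" is read at all; row "b" is dropped by the filter and never reaches the handling step.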

The tradeoff: no random access, single-pass only

Generators give up len(), indexing, and re-iteration in exchange for constant memory: once a generator has been consumed, it is exhausted and yields nothing further. That is the right trade for streaming, one-pass processing. If you genuinely need to traverse the data several times or access it by index, it has to be materialized (as a list, for example) at some point regardless; generators simply let you defer or skip that step when you don't need it.
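These limits are easy to confirm in a few lines. The sketch below uses a hypothetical three-item generator, numbers(), and a small helper, raises_type_error, both introduced only for illustration; itertools.islice is shown as one standard way to take a prefix without indexing.

```python
from itertools import islice

def numbers():
    # Hypothetical tiny generator used only for this demonstration.
    yield from (1, 2, 3)

gen = numbers()
first_pass = list(gen)     # consumes the generator
second_pass = list(gen)    # nothing left: a generator is single-pass

def raises_type_error(action):
    """Return True if calling action() raises TypeError (hypothetical helper)."""
    try:
        action()
    except TypeError:
        return True
    return False

no_len = raises_type_error(lambda: len(numbers()))     # generators have no len()
no_index = raises_type_error(lambda: numbers()[0])     # and are not subscriptable

# islice pulls only the first two items and leaves the rest unproduced.
first_two = list(islice(numbers(), 2))

print(first_pass, second_pass, no_len, no_index, first_two)
```

Calling numbers() again creates a fresh generator, which is how you re-run a pass when the underlying source can be re-read.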

Interview-ready summary: Generators trade eager, full-memory computation for lazy, one-item-at-a-time computation, which is what lets Python process files, streams, or datasets far larger than available memory. Chaining generators together builds a pipeline that processes each item through every stage before moving to the next, avoiding intermediate list materialization entirely.