Collections & Data Structures

Difficulty

Lists are dynamic arrays, not linked lists

CPython's list stores a contiguous array of pointers to objects, with some extra pre-allocated capacity to amortize the cost of growth. This is why indexing is O(1): lst[i] is just pointer arithmetic into the underlying array, not a traversal.

lst = [1, 2, 3]
lst[1]          # O(1) -- direct array index
lst.append(4)   # O(1) amortized -- usually just writes into pre-allocated space

Complexity cheat sheet

OperationComplexityWhy
lst[i], lst[i] = xO(1)direct array index
lst.append(x)O(1) amortizedwrites to spare capacity; occasional O(n) resize, amortized away
lst.pop() (from end)O(1)no shifting needed
lst.pop(0), lst.insert(0, x)O(n)every remaining element shifts by one slot
x in lstO(n)linear scan, no index structure
len(lst)O(1)length is cached, not recomputed
lst.sort()O(n log n)Timsort
slicing lst[a:b]O(k)k = length of the slice, since a new list is built

Why append is amortized O(1)

When the underlying array runs out of spare capacity, CPython allocates a larger array (roughly growing by ~1.125x plus a constant) and copies existing elements over — an O(n) operation, but one that happens increasingly rarely as the list grows, so the average cost per append across many appends works out to O(1).

Why front-insertion/removal is expensive

lst = [1, 2, 3, 4, 5]
lst.pop(0)      # removes 1, then shifts 2,3,4,5 each one slot left -- O(n)
lst.insert(0, 0) # shifts every element right by one slot -- O(n)

If your workload needs frequent insert/pop from both ends, use collections.deque instead — it's implemented as a doubly-linked list of fixed-size blocks and supports O(1) append/pop from either end, at the cost of O(n) random access (deque[i] is O(n) for large i, unlike list).

Interview-ready summary: Python lists are dynamic arrays: O(1) indexing and end-append/pop, but O(n) for inserting/removing anywhere except the end, and O(n) membership testing. When you need efficient operations at both ends, reach for collections.deque instead of list.

The hash table basics

d = {"a": 1, "b": 2}
d["a"]        # O(1) average -- hash("a") locates the slot directly
d["c"] = 3    # O(1) average insert

Looking up d["a"] computes hash("a"), uses it to find a candidate slot in the internal table, and (after resolving any hash collisions) returns the value — no scanning of all keys, unlike a list.

Insertion order guarantee (Python 3.7+)

d = {}
d["z"] = 1
d["a"] = 2
d["m"] = 3
list(d)   # ['z', 'a', 'm']  -- insertion order, guaranteed since 3.7

This was a CPython implementation detail in 3.6 and became an official language guarantee in 3.7. Internally, CPython separates "which slot does this hash map to" (a sparse array of indices) from "the actual key/value/ hash entries" (a dense array kept in insertion order) — iterating a dict walks the dense array, which is naturally in insertion order, while lookups still use the sparse hash-indexed array for O(1) access.

Why keys must be hashable

d = {[1, 2]: "x"}   # TypeError: unhashable type: 'list'

A dict key's hash is computed once and used to place it in a slot; mutating the key afterward (which is only possible for mutable objects) would silently break the invariant that a key's slot matches its current hash. That's why list/dict/set (all mutable) can't be dict keys, but tuple, str, int, and frozenset (all immutable, and hashable if their contents are) can.

Collision resolution: open addressing

Unlike some languages' hash maps (which chain multiple entries per bucket, e.g. a linked list per slot), CPython dicts use open addressing: on a collision, a perturbation-based probing sequence finds the next candidate slot. This keeps memory more compact and cache-friendly than chaining, at the cost of needing careful resizing (the table is resized — and re-hashed — well before it gets too full, keeping average lookup close to O(1)).

Worst case

Pathological hash collisions can degrade lookups toward O(n) in theory, but CPython uses SipHash for string hashing with a randomized seed per process (PYTHONHASHSEED) specifically to make it infeasible for an attacker to engineer such collisions deliberately (a real denial-of-service vector in older hash table implementations).

Interview-ready summary: dict is a hash table giving average O(1) lookup/insert/delete; since Python 3.7 it's spec-guaranteed to preserve insertion order, achieved by keeping entries in a dense, insertion-ordered array separate from the sparse hash-index table used for O(1) lookups. Keys must be hashable (and therefore effectively immutable) since a key's slot is determined by its hash at insertion time.

Related Resources

What "hashable" requires

hash(42)            # works -- int is hashable
hash("abc")          # works -- str is hashable
hash((1, 2, 3))       # works -- tuple of hashables is hashable
hash([1, 2, 3])       # TypeError: unhashable type: 'list'
hash({1, 2})          # TypeError: unhashable type: 'set'
hash(frozenset({1,2})) # works -- frozenset (immutable set) is hashable

Hashability requires: (1) a __hash__ method that returns the same integer every time for a given object's lifetime, and (2) if __eq__ is defined, a == b implies hash(a) == hash(b) — this is required for correct dict/set behavior (two "equal" keys must land findable in the same bucket).

Why mutable containers are unhashable

lst = [1, 2, 3]
d = {lst: "value"}   # TypeError -- if this worked...

lst.append(4)         # ...and this mutated the key after insertion,
d[lst]                 # the dict's internal slot (based on the OLD hash) would be wrong

If list were hashable and its hash were based on contents, mutating a list already used as a dict key would silently corrupt the dict's internal structure (the key's slot no longer matches its current hash). Python sidesteps this entirely by making mutable containers unhashable.

Custom classes: hashable by default (via identity)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
hash(p)   # works! -- default __hash__ is based on id(), inherited from object

By default, custom classes inherit object.__hash__, based on identity (id()) — two distinct Point(1, 2) instances hash differently even though they'd naturally seem "equal." This is fine until you also define __eq__ for value-based equality:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

hash(Point(1, 2))   # TypeError: unhashable type: 'Point'

Defining __eq__ makes Python set __hash__ = None automatically, because the identity-based default hash would now violate the "equal implies equal hash" contract. Fix by defining a consistent __hash__:

    def __hash__(self):
        return hash((self.x, self.y))

Interview-ready summary: Hashable means "has a stable __hash__, and if __eq__ is defined, equal objects hash equally." Built-in mutable containers are unhashable by design (mutation would corrupt any dict/set using them as keys); custom classes are hashable by identity by default, but overriding __eq__ disables that default hash until you provide a matching __hash__ explicitly.

Quick decision table

NeedStructureWhy
Ordered, mutable collection, duplicates OKlistgeneral-purpose sequence
Fixed-size, immutable record ((x, y), (name, age))tuplesafe to hash, signals "this won't change"
Fast membership testing, deduplicationsetO(1) average in, automatic uniqueness
Lookup by key/namedictO(1) average lookup by key, not position

Concrete examples

# list -- ordered sequence of items, order and duplicates matter
scores = [85, 92, 85, 78]

# tuple -- an immutable, fixed-shape record
point = (3, 4)
person = ("Ada", 36)

# set -- membership and uniqueness, order doesn't matter
seen_ids = {101, 205, 310}
if user_id in seen_ids:   # O(1) average -- much faster than `in a_list` for large data
    ...
unique_tags = set(all_tags)  # dedupe in one line

# dict -- look things up by key
users_by_id = {101: "Ada", 205: "Grace"}
users_by_id[101]   # O(1) average

Why x in set beats x in list at scale

big_list = list(range(1_000_000))
big_set = set(big_list)

999_999 in big_list   # O(n) -- scans up to a million elements
999_999 in big_set     # O(1) average -- direct hash lookup

For any workload doing repeated membership checks against a large collection, converting to a set (or using a dict if you also need associated values) is one of the cheapest, highest-impact optimizations available.

Tuple vs list: signaling intent, not just performance

def get_coordinates():
    return (self.x, self.y)   # a tuple signals "this is a fixed 2-item record"

Beyond being hashable (usable as dict keys/set elements) and slightly more memory-efficient, using a tuple for a fixed-shape value communicates to readers that the shape — not just the values — is meant to be fixed: nobody should expect to .append() to it.

Interview-ready summary: Reach for list for ordered, mutable sequences; tuple for fixed-shape, immutable records (and anything you need to hash); set for uniqueness/fast membership testing; dict whenever you look things up by key rather than by position. The performance difference between O(n) list scans and O(1) set/dict lookups is often the single biggest algorithmic win available in everyday code.

The three comprehension forms

squares = [x * x for x in range(10)]                    # list
evens = {x for x in range(10) if x % 2 == 0}             # set
lookup = {x: x * x for x in range(10)}                   # dict
gen = (x * x for x in range(10))                          # generator expression (lazy!)

Each desugars roughly to a loop that appends/adds/assigns into a new container (except the generator expression, which produces a lazy iterator instead of eagerly building a container — see the iterators/generators topic).

Why they're often faster than an explicit loop

# Comprehension
squares = [x * x for x in range(1000)]

# Equivalent explicit loop
squares = []
for x in range(1000):
    squares.append(x * x)

Both do the same logical work, but the comprehension compiles to specialized bytecode (LIST_APPEND inside an isolated function frame) that avoids repeated attribute lookups (squares.append) on every iteration — in practice this is a modest, not dramatic, speedup, so "use comprehensions for performance" is a secondary benefit, not the main reason to reach for them.

When they hurt readability

# Hard to read -- nested comprehension with a condition
result = [item.strip().lower() for sublist in data
          for item in sublist if item and not item.startswith("#")]

# Clearer as an explicit loop
result = []
for sublist in data:
    for item in sublist:
        if item and not item.startswith("#"):
            result.append(item.strip().lower())

A comprehension nested two or more levels deep, or one combining multiple if conditions and a non-trivial expression, usually reads worse than the equivalent loop — there's no room for named intermediate variables or comments explaining a non-obvious filter, and reviewers have to mentally un-nest the comprehension to understand execution order (which, notably, goes left-to-right the way the for/if clauses are written, not inside- out).

A good rule of thumb

If a comprehension needs more than one for clause or more than one condition to express the logic, or if the transformation expression itself needs a helper function to stay readable, switch to an explicit loop (or a generator function with clear intermediate steps).

Interview-ready summary: Comprehensions are syntax sugar for building a list/set/dict via a loop, and are typically a bit faster due to specialized bytecode — but that's secondary to their real value: concise, readable code for a simple transform-and-filter. Once nesting or conditions pile up, prefer an explicit loop over a comprehension that's technically correct but hard to read.