How does string interning affect performance and `is` comparisons?

Quick Answer

CPython automatically **interns** (caches and reuses) certain strings: names used as identifiers, and compile-time string constants made up only of ASCII letters, digits, and underscores (e.g., `"hello"`). Equal interned strings share one object in memory, which speeds up dict lookups keyed by them, because the lookup can succeed on an identity check before any character-by-character comparison. Interning is a CPython optimization detail, not a language guarantee, so code should never rely on `is` for string equality.

Detailed Answer

Interning in action

a = "hello"
b = "hello"
a is b   # True on CPython -- both literals interned to the same object

c = "hello world!"
d = "hello world!"
c is d   # not guaranteed -- often False in the REPL, often True when both literals are compiled in one file

part = "hello"
e = part + " world!"   # built at runtime -- typically NOT interned ("hello" + " world!" would be folded at compile time)
e is c                 # typically False on CPython -- don't rely on this

CPython auto-interns names (variables, attributes, function arguments) and string constants that look like identifiers (ASCII letters, digits, underscores) and are known at compile time, which covers most literal dict keys and simple string constants. Strings built at runtime (concatenating variables, .format(), f-strings with substitutions, user input) are generally not interned automatically. The exact rules differ between CPython versions and builds.
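A minimal sketch of the difference between a compile-time literal and a string assembled at runtime. The helper name `build` is made up for illustration, and the result of the `is` test is deliberately printed rather than assumed, since it depends on the interpreter:

```python
def build(parts):
    # Joining at run time creates the string while the program executes,
    # so the compiler never sees it as a constant it could intern.
    return "".join(parts)

literal = "hello_world"                  # identifier-like compile-time constant
built = build(["hello", "_", "world"])   # same characters, assembled at run time

same_value = literal == built    # value comparison: always True here
same_object = literal is built   # identity: implementation-dependent

print("equal by value:", same_value)
print("same object:   ", same_object)
```

Only the first line of output is something a program may depend on.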

Why this exists: speeding up dict/attribute lookups

Python uses dicts extensively internally (every module's namespace, every class's attribute table, most objects' __dict__). A dict lookup first compares hashes, then tries a fast identity check (is) on the candidate key, and falls back to a full __eq__ comparison only if the two objects differ. If both occurrences of the string "name" are the same interned object, the identity check succeeds immediately. Because attribute names repeat constantly across a program, interning speeds up this extremely common path.
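The identity-first lookup can be observed from Python with a key type that counts its own `__eq__` calls. This is only a sketch: `CountingKey` is a hypothetical class standing in for a string, and the counts reflect CPython's dict implementation.

```python
class CountingKey:
    """Hashable key that records how often __eq__ is invoked."""
    eq_calls = 0

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.name == other.name


key = CountingKey("name")
table = {key: 1}

# Lookup with the very same object: the identity check succeeds first.
CountingKey.eq_calls = 0
table[key]
same_object_eq_calls = CountingKey.eq_calls

# Lookup with an equal but distinct object: __eq__ has to run.
twin = CountingKey("name")
CountingKey.eq_calls = 0
table[twin]
distinct_object_eq_calls = CountingKey.eq_calls

print(same_object_eq_calls, distinct_object_eq_calls)  # 0 1 on CPython
```

Interned strings put lookups on the first path: the key stored in the dict and the key being looked up are one object, so no character comparison is needed.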

sys.intern(): forcing it explicitly

import sys

a = sys.intern("some repeated string")
b = sys.intern("some repeated string")
a is b   # True -- explicitly interned

If your program builds many dynamic strings that are frequently repeated and compared/used as dict keys (e.g., parsing a large file with many repeated tokens), explicitly interning them can meaningfully reduce memory (many duplicate strings collapse to one object) and speed up comparisons.
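As a rough illustration of the token-parsing case, the sketch below interns every whitespace-separated token from a few made-up log lines. The function name `intern_tokens` and the sample data are assumptions for the example, not part of any library.

```python
import sys

def intern_tokens(lines):
    """Split each line on whitespace and intern every token."""
    return [sys.intern(token) for line in lines for token in line.split()]

log_lines = [
    "GET /index.html 200",
    "GET /about.html 200",
    "POST /login 302",
    "GET /index.html 200",
]

tokens = intern_tokens(log_lines)
distinct_values = len(set(tokens))
distinct_objects = len({id(token) for token in tokens})

# 12 tokens in total, but only 7 distinct strings and 7 distinct objects.
print(len(tokens), distinct_values, distinct_objects)  # 12 7 7
```

After interning, the number of distinct objects matches the number of distinct values, so each repeated token is stored once. Whether that saving matters for a given workload is something to measure rather than assume.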

The crucial caveat: never rely on is for string equality

def check_buggy(x):
    if x is "yes":   # BUG -- identity test; True only when x happens to be the same object
        ...

def check(x):
    if x == "yes":    # correct -- always compares by value
        ...

Interning is a CPython implementation detail. It can vary between Python versions, between CPython and other implementations (PyPy, etc.), and with how a string was constructed. CPython 3.8 and later emit a SyntaxWarning when is is used with a string or number literal, specifically because of this trap. Always use == for value comparison.
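The warning can be captured programmatically by compiling a small snippet, which is one way to check what a given interpreter reports. The sketch assumes CPython 3.8 or later; the variable names and the `<demo>` filename are arbitrary.

```python
import warnings

# Source text containing the problematic pattern; it is compiled, not executed.
source = 'flag = "yes"\nmatched = flag is "yes"\n'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")       # make sure the warning is not filtered out
    compile(source, "<demo>", "exec")     # the compiler emits the SyntaxWarning

messages = [
    str(w.message) for w in caught if issubclass(w.category, SyntaxWarning)
]
for message in messages:
    print(message)
```

The exact wording of the message differs between versions, so tooling should match on the warning category rather than on the text.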

Interview-ready summary: CPython interns many string literals to speed up dict/attribute lookups via cheap identity checks, but this is an implementation detail, not a language guarantee — always compare strings with ==, and reach for sys.intern() explicitly only when you've measured a real memory/comparison benefit from deduplicating many repeated dynamic strings.