How does string interning affect performance and `is` comparisons?
Quick Answer
CPython automatically **interns** (caches and reuses) certain strings: identifier-like literals known at compile time (e.g., `"hello"`, variable and attribute names) and short constants composed only of letters, digits, and underscores. Multiple occurrences of the same literal can then share one object in memory, which speeds up dict lookups keyed by those strings, because comparing two interned strings can short-circuit to an identity check. This is a CPython optimization detail, not a language guarantee, so code should never rely on `is` for string equality.
Detailed Answer
Interning in action
a = "hello"
b = "hello"
a is b  # True on CPython -- both literals interned to the same object
c = "hello world!"
d = "hello world!"
c is d  # often False in the REPL -- not auto-interned; may be True in one file, where equal constants are merged
e = "".join(["hello", " world!"])  # built at runtime -- typically NOT interned
e is c  # typically False even though e == c -- unreliable, don't rely on this
CPython auto-interns string constants in compiled code that look like
identifiers (letters, digits, underscores) and are known at compile time.
This includes most variable names, dict keys defined as literals, and
short simple string constants. Strings built dynamically at runtime
(concatenation of variables, `.format()`, f-strings, user input) are
generally not automatically interned.
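A minimal sketch of the difference between a compile-time literal and an equal string assembled at runtime. The names `literal`, `parts`, and `built` are made up for illustration, and the identity result is deliberately not asserted because it depends on the interpreter and version.

```python
# Compare an identifier-like literal with an equal string built at runtime.
literal = "hello_world"            # compile-time constant, identifier-like
parts = ["hello", "world"]         # hypothetical runtime data
built = "_".join(parts)            # assembled while the program runs

same_value = literal == built      # value comparison: True
same_object = literal is built     # identity: implementation-dependent

print("equal by value:  ", same_value)
print("same object (is):", same_object)
```

On CPython the second line typically prints `False`, but only the value comparison is something a program can depend on.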
Why this exists: speeding up dict/attribute lookups
Python internally uses dicts extensively (module namespaces, class
namespaces, and most instances' `__dict__`). If two occurrences of the
string `"name"` used as a dict key are the same interned object, a
hash-table lookup can succeed on a fast identity check (`is`) without
falling back to a full `__eq__` comparison. Because attribute names
repeat constantly across a program, interning meaningfully speeds up
this extremely common path.
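The identity-first behavior of dict lookups can be observed with a key type that counts its own `__eq__` calls. This is a sketch of CPython's observable behavior, not of its internals; `CountingKey` is a hypothetical class used only to make the comparisons visible (real string keys take a specialized path that likewise checks identity first).

```python
class CountingKey:
    """Hypothetical key type that counts how often __eq__ runs."""
    eq_calls = 0

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        return hash(self.text)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.text == other.text


stored = CountingKey("name")
table = {stored: 1}

table[stored]                              # same object: identity check is enough
calls_same_object = CountingKey.eq_calls

table[CountingKey("name")]                 # equal but distinct object: needs __eq__
calls_equal_object = CountingKey.eq_calls - calls_same_object

print(calls_same_object, calls_equal_object)
```

On CPython the lookup with the very same object finds the entry without calling `__eq__`, while the equal-but-distinct key requires a full comparison. Interning makes repeated string keys fall into the first case.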
sys.intern(): forcing it explicitly
import sys
a = sys.intern("some repeated string")
b = sys.intern("some repeated string")
a is b # True -- explicitly interned
If your program builds many dynamic strings that are frequently repeated and compared/used as dict keys (e.g., parsing a large file with many repeated tokens), explicitly interning them can meaningfully reduce memory (many duplicate strings collapse to one object) and speed up comparisons.
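As a rough illustration of the deduplication effect, the sketch below interns tokens split from a string that is itself built at runtime. The token values and variable names are invented for the example; how many separate objects exist before interning is implementation-dependent, so only the "after" count is meaningful.

```python
import sys

# Build the text at runtime so the tokens are not compile-time constants.
raw = " ".join(["GET", "/index", "GET", "/about", "GET", "/index"])
tokens = raw.split()                           # new string objects per token

distinct_values = len(set(tokens))             # number of different token texts
distinct_objects_before = len({id(t) for t in tokens})

interned = [sys.intern(t) for t in tokens]     # equal strings collapse to one object
distinct_objects_after = len({id(t) for t in interned})

print("distinct values:        ", distinct_values)
print("objects before interning:", distinct_objects_before)
print("objects after interning: ", distinct_objects_after)
```

After interning, the number of distinct objects equals the number of distinct values. Whether that saving matters is something to measure on real data, since `sys.intern()` itself costs a table lookup per call.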
The crucial caveat: never rely on `is` for string equality
# Wrong: identity comparison
def check(x):
    if x is "yes":  # BUG -- works by luck sometimes, breaks other times
        ...

# Right: value comparison
def check(x):
    if x == "yes":  # correct -- always compares by value
        ...
Interning is a CPython implementation detail that can vary between Python
versions, between CPython and other implementations (PyPy, etc.), and even
with how a string was constructed. CPython 3.8 and later emit a
`SyntaxWarning` when `is` is used with a string or int literal, specifically
because of this trap. Always use `==` for value comparison.
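The warning can be observed without running the buggy function by compiling a snippet and recording warnings. This is a sketch assuming CPython 3.8 or later; the `source` string is a made-up example, and the exact warning text differs between versions, so it is printed rather than assumed.

```python
import warnings

# Hypothetical snippet containing the identity-comparison bug.
source = 'def check(x):\n    return x is "yes"\n'

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")          # record every warning, even repeats
    compile(source, "<example>", "exec")     # compile only; nothing is executed

syntax_warnings = [w for w in caught if issubclass(w.category, SyntaxWarning)]
for w in syntax_warnings:
    print(w.category.__name__, "-", w.message)
```

Running a test suite with warnings turned into errors (for example `python -W error`) is one way to keep such comparisons from slipping through.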
Interview-ready summary: CPython interns many string literals to
speed up dict/attribute lookups via cheap identity checks, but this is an
implementation detail, not a language guarantee — always compare strings
with ==, and reach for sys.intern() explicitly only when you've
measured a real memory/comparison benefit from deduplicating many
repeated dynamic strings.