Structure — shared by set and frozenset
Set in Python is implemented as a hash table with open addressing. This means all elements are stored in a single contiguous fixed-size array in memory, rather than in a linked list or a tree.
Each slot in the array can be in one of three states: - Empty — never used. - Occupied — contains an element. - Deleted (dummy) — an element was present but has been removed (tombstone).
Internally, each occupied slot stores two fields:
- hash — the cached hash value of the element (an integer, Py_hash_t).
- key — a pointer to the Python object itself.
Caching the hash is critical: for union, intersection, and lookup operations, the hash doesn't need to be recalculated — it's already stored right next to the element.
The initial table size is 8 slots.
Python maintains a load factor ≤ 2/3 (i.e., no more than ~66% of slots are occupied). As soon as this threshold is exceeded, a resize occurs.
During resize, the table is rebuilt into a new array of size: - ×4 — if the set is small. - ×2 — if the set is already large.
The table size is always a power of two — this allows replacing expensive division operations with a fast bitwise AND when calculating the slot index.
The index of the first candidate is calculated as hash & (size - 1) — a bitwise AND with a mask, which is equivalent to hash % size but significantly faster.
Python does not use pure linear or quadratic probing. Instead, it uses a perturbation algorithm:
At each probing step, the new index is calculated using the formula:
i = (i * 5 + 1 + perturb) & mask
perturb >>= 5
The variable perturb is initialized with the original hash value and shifted right by 5 bits at each iteration. This ensures that eventually all bits of the hash are utilized, not just the lower ones — providing a much better distribution in case of collisions.
Unlike dict, which in 3.6 gained compact split storage (index array + dense entry array), set still uses a classic unified hash table — a sparse array where empty slots occupy memory. This is a deliberate trade-off: set is optimized for lookup speed and set operations, rather than insertion order.
When an element is deleted, it is not cleared completely — the slot is marked as dummy (tombstone). This is necessary for lookup correctness: if you simply clear the slot, a probing chain would be broken, making elements added after the collision unreachable.
Dummy slots are accounted for during resize: the table is rebuilt when too many tombstones accumulate — they occupy space but carry no data.
The union, intersection, and difference operations are implemented at the C level and use internal structures directly, without Python-level iteration:
| Operation | Strategy |
|---|---|
\| union |
iterate over the smaller set, insert into a copy of the larger one |
& intersection |
iterate over the smaller set, check membership in the larger one |
- difference |
iterate over the left set, exclude those found in the right one |
^ symmetric diff |
union of two differences |
The choice to "iterate over the smaller set" is a deliberate optimization: fewer iterations, fewer lookups.
Sets require elements to be hashable (__hash__) and comparable (__eq__). In case of a hash collision, Python first compares the hashes themselves (fast, using the cached value) and only if they match calls __eq__ — this minimizes expensive object comparisons.
Memory estimation — sys.getsizeof returns the size of only the hash table itself, excluding the element objects. Due to its sparsity, a hash table occupies more memory than a corresponding list — this is the direct cost for O(1) lookups.
Empty set: ~200 bytes (8 slots)
After resize: size doubles/quadruples in steps
dict since 3.7.PySetObject — common for set and frozenset)┌──────────────────────────────────────────┐
│ ob_refcnt (8 bytes) │
│ ob_type → frozenset (8 bytes) │
├──────────────────────────────────────────┤
│ fill (8 bytes) ← occupied + dummy│
│ used (8 bytes) ← live elements │
│ mask (8 bytes) ← capacity - 1 │
│ table* (8 bytes) ← pointer to HT │
│ hash (8 bytes) ← cached hash │
│ finger (8 bytes) ← iterator cursor │
│ smalltable (216 bytes)← inline buffer │
│ (8 slotentries × 27 bytes) │
└──────────────────────────────────────────┘
Each slot (setentry):
┌──────────────────┬───────────────────┐
│ hash (8 bytes) │ key* (8 bytes) │
└──────────────────┴───────────────────┘
smalltable is used (inline, no malloc).table switches to an external heap array.i = (5*i + 1 + perturb) % capacity; perturb >>= 5.| set | frozenset | |
|---|---|---|
| Memory structure | identical | identical |
| Mutation | allowed | forbidden (type check) |
| Hashable | no | yes (cached in hash field) |
| Can be dict key | no | yes |
hash Field in PySetObject — ClarificationIn the CPython source code (Objects/setobject.c), a single structure PySetObject is used for both types:
typedef struct {
ob_refcnt
ob_type
fill
used
mask
table*
hash ← present in the structure for both
finger
smalltable[8]
weakreflist
} PySetObject;
The hash field is physically present in memory for both set and frozenset because the structure is the same.
set |
frozenset |
|
|---|---|---|
hash field in memory |
present | present |
| Initial value | -1 (sentinel) |
-1 |
| When calculated | never | upon the first call to hash() |
| Cached | not useful | yes, written to the field |
__hash__ method |
None (explicitly nullified) |
calculates and caches |
The
hashfield exists in both objects at the C structure level, but is only used byfrozenset. Forset, it always remains-1and is never utilized because__hash__ = Noneat the type level.