Python Set — Internal Structure

Basis: Hash Table

Structure — shared by set and frozenset

Set in Python is implemented as a hash table with open addressing. This means all elements are stored in a single contiguous fixed-size array in memory, rather than in a linked list or a tree.

Each slot in the array can be in one of three states: - Empty — never used. - Occupied — contains an element. - Deleted (dummy) — an element was present but has been removed (tombstone).

Storage Structure

Internally, each occupied slot stores two fields: - hash — the cached hash value of the element (an integer, Py_hash_t). - key — a pointer to the Python object itself.

Caching the hash is critical: for union, intersection, and lookup operations, the hash doesn't need to be recalculated — it's already stored right next to the element.

Size and Load Factor

The initial table size is 8 slots.

Python maintains a load factor ≤ 2/3 (i.e., no more than ~66% of slots are occupied). As soon as this threshold is exceeded, a resize occurs.

During resize, the table is rebuilt into a new array of size: - ×4 — if the set is small. - ×2 — if the set is already large.

The table size is always a power of two — this allows replacing expensive division operations with a fast bitwise AND when calculating the slot index.

Slot Search Mechanism (Probing)

Calculating Initial Index

The index of the first candidate is calculated as hash & (size - 1) — a bitwise AND with a mask, which is equivalent to hash % size but significantly faster.

Collision Resolution — Perturbation Probing

Python does not use pure linear or quadratic probing. Instead, it uses a perturbation algorithm:

At each probing step, the new index is calculated using the formula:

i = (i * 5 + 1 + perturb) & mask
perturb >>= 5

The variable perturb is initialized with the original hash value and shifted right by 5 bits at each iteration. This ensures that eventually all bits of the hash are utilized, not just the lower ones — providing a much better distribution in case of collisions.

Compact Storage (Since CPython 3.6+ for dict; set is different)

Unlike dict, which in 3.6 gained compact split storage (index array + dense entry array), set still uses a classic unified hash table — a sparse array where empty slots occupy memory. This is a deliberate trade-off: set is optimized for lookup speed and set operations, rather than insertion order.

Element Deletion and Dummy Slots

When an element is deleted, it is not cleared completely — the slot is marked as dummy (tombstone). This is necessary for lookup correctness: if you simply clear the slot, a probing chain would be broken, making elements added after the collision unreachable.

Dummy slots are accounted for during resize: the table is rebuilt when too many tombstones accumulate — they occupy space but carry no data.

Set Operations

The union, intersection, and difference operations are implemented at the C level and use internal structures directly, without Python-level iteration:

Operation	Strategy
`\\|` union	iterate over the smaller set, insert into a copy of the larger one
`&` intersection	iterate over the smaller set, check membership in the larger one
`-` difference	iterate over the left set, exclude those found in the right one
`^` symmetric diff	union of two differences

The choice to "iterate over the smaller set" is a deliberate optimization: fewer iterations, fewer lookups.

Element Hashability

Sets require elements to be hashable (__hash__) and comparable (__eq__). In case of a hash collision, Python first compares the hashes themselves (fast, using the cached value) and only if they match calls __eq__ — this minimizes expensive object comparisons.

Memory

Memory estimation — sys.getsizeof returns the size of only the hash table itself, excluding the element objects. Due to its sparsity, a hash table occupies more memory than a corresponding list — this is the direct cost for O(1) lookups.

Empty set:        ~200 bytes (8 slots)
After resize:     size doubles/quadruples in steps

Summary: Key Properties

Lookup, insertion, deletion — amortized O(1), worst case O(n) with pathological hashes.
Iteration — O(n) over the table, including empty slots → order is unpredictable.
No Order — intentional, unlike dict since 3.7.
Thread Safety — absent; the GIL only provides atomicity for individual bytecode instructions.

Structure (`PySetObject` — common for set and frozenset)

┌──────────────────────────────────────────┐
│ ob_refcnt           (8 bytes)            │
│ ob_type → frozenset (8 bytes)            │
├──────────────────────────────────────────┤
│ fill        (8 bytes)  ← occupied + dummy│
│ used        (8 bytes)  ← live elements   │
│ mask        (8 bytes)  ← capacity - 1    │
│ table*      (8 bytes)  ← pointer to HT   │
│ hash        (8 bytes)  ← cached hash     │
│ finger      (8 bytes)  ← iterator cursor │
│ smalltable  (216 bytes)← inline buffer   │
│             (8 slotentries × 27 bytes)   │
└──────────────────────────────────────────┘

Each slot (setentry):

┌──────────────────┬───────────────────┐
│ hash (8 bytes)   │ key* (8 bytes)    │
└──────────────────┴───────────────────┘

For ≤ 8 elements, smalltable is used (inline, no malloc).
Upon overflow, table switches to an external heap array.
Load factor is maintained ≤ 2/3 → when exceeded, resize by ×4 (or ×2 for large sets).
Collisions are resolved via perturbation probing: i = (5*i + 1 + perturb) % capacity; perturb >>= 5.

frozenset vs set

	set	frozenset
Memory structure	identical	identical
Mutation	allowed	forbidden (type check)
Hashable	no	yes (cached in `hash` field)
Can be dict key	no	yes

`hash` Field in `PySetObject` — Clarification

How it actually works

In the CPython source code (Objects/setobject.c), a single structure PySetObject is used for both types:

typedef struct {
    ob_refcnt
    ob_type
    fill
    used
    mask
    table*
    hash        ← present in the structure for both
    finger
    smalltable[8]
    weakreflist
} PySetObject;

The hash field is physically present in memory for both set and frozenset because the structure is the same.

Difference in Behavior, Not Structure

	`set`	`frozenset`
`hash` field in memory	present	present
Initial value	`-1` (sentinel)	`-1`
When calculated	never	upon the first call to `hash()`
Cached	not useful	yes, written to the field
`__hash__` method	`None` (explicitly nullified)	calculates and caches

The hash field exists in both objects at the C structure level, but is only used by frozenset. For set, it always remains -1 and is never utilized because __hash__ = None at the type level.