Python Set — Internal Structure

Basis: Hash Table

Structure — shared by set and frozenset

Set in Python is implemented as a hash table with open addressing. This means all elements are stored in a single contiguous fixed-size array in memory, rather than in a linked list or a tree.

Each slot in the array can be in one of three states: - Empty — never used. - Occupied — contains an element. - Deleted (dummy) — an element was present but has been removed (tombstone).


Storage Structure

Internally, each occupied slot stores two fields: - hash — the cached hash value of the element (an integer, Py_hash_t). - key — a pointer to the Python object itself.

Caching the hash is critical: for union, intersection, and lookup operations, the hash doesn't need to be recalculated — it's already stored right next to the element.


Size and Load Factor

The initial table size is 8 slots.

Python maintains a load factor ≤ 2/3 (i.e., no more than ~66% of slots are occupied). As soon as this threshold is exceeded, a resize occurs.

During resize, the table is rebuilt into a new array of size: - ×4 — if the set is small. - ×2 — if the set is already large.

The table size is always a power of two — this allows replacing expensive division operations with a fast bitwise AND when calculating the slot index.


Slot Search Mechanism (Probing)

Calculating Initial Index

The index of the first candidate is calculated as hash & (size - 1) — a bitwise AND with a mask, which is equivalent to hash % size but significantly faster.

Collision Resolution — Perturbation Probing

Python does not use pure linear or quadratic probing. Instead, it uses a perturbation algorithm:

At each probing step, the new index is calculated using the formula:

i = (i * 5 + 1 + perturb) & mask
perturb >>= 5

The variable perturb is initialized with the original hash value and shifted right by 5 bits at each iteration. This ensures that eventually all bits of the hash are utilized, not just the lower ones — providing a much better distribution in case of collisions.


Compact Storage (Since CPython 3.6+ for dict; set is different)

Unlike dict, which in 3.6 gained compact split storage (index array + dense entry array), set still uses a classic unified hash table — a sparse array where empty slots occupy memory. This is a deliberate trade-off: set is optimized for lookup speed and set operations, rather than insertion order.


Element Deletion and Dummy Slots

When an element is deleted, it is not cleared completely — the slot is marked as dummy (tombstone). This is necessary for lookup correctness: if you simply clear the slot, a probing chain would be broken, making elements added after the collision unreachable.

Dummy slots are accounted for during resize: the table is rebuilt when too many tombstones accumulate — they occupy space but carry no data.


Set Operations

The union, intersection, and difference operations are implemented at the C level and use internal structures directly, without Python-level iteration:

Operation Strategy
\| union iterate over the smaller set, insert into a copy of the larger one
& intersection iterate over the smaller set, check membership in the larger one
- difference iterate over the left set, exclude those found in the right one
^ symmetric diff union of two differences

The choice to "iterate over the smaller set" is a deliberate optimization: fewer iterations, fewer lookups.


Element Hashability

Sets require elements to be hashable (__hash__) and comparable (__eq__). In case of a hash collision, Python first compares the hashes themselves (fast, using the cached value) and only if they match calls __eq__ — this minimizes expensive object comparisons.


Memory

Memory estimation — sys.getsizeof returns the size of only the hash table itself, excluding the element objects. Due to its sparsity, a hash table occupies more memory than a corresponding list — this is the direct cost for O(1) lookups.

Empty set:        ~200 bytes (8 slots)
After resize:     size doubles/quadruples in steps

Summary: Key Properties


Structure (PySetObject — common for set and frozenset)

┌──────────────────────────────────────────┐
│ ob_refcnt           (8 bytes)            │
│ ob_type → frozenset (8 bytes)            │
├──────────────────────────────────────────┤
│ fill        (8 bytes)  ← occupied + dummy│
│ used        (8 bytes)  ← live elements   │
│ mask        (8 bytes)  ← capacity - 1    │
│ table*      (8 bytes)  ← pointer to HT   │
│ hash        (8 bytes)  ← cached hash     │
│ finger      (8 bytes)  ← iterator cursor │
│ smalltable  (216 bytes)← inline buffer   │
│             (8 slotentries × 27 bytes)   │
└──────────────────────────────────────────┘

Each slot (setentry):

┌──────────────────┬───────────────────┐
│ hash (8 bytes)   │ key* (8 bytes)    │
└──────────────────┴───────────────────┘

frozenset vs set

set frozenset
Memory structure identical identical
Mutation allowed forbidden (type check)
Hashable no yes (cached in hash field)
Can be dict key no yes

hash Field in PySetObject — Clarification

How it actually works

In the CPython source code (Objects/setobject.c), a single structure PySetObject is used for both types:

typedef struct {
    ob_refcnt
    ob_type
    fill
    used
    mask
    table*
    hash        ← present in the structure for both
    finger
    smalltable[8]
    weakreflist
} PySetObject;

The hash field is physically present in memory for both set and frozenset because the structure is the same.


Difference in Behavior, Not Structure

set frozenset
hash field in memory present present
Initial value -1 (sentinel) -1
When calculated never upon the first call to hash()
Cached not useful yes, written to the field
__hash__ method None (explicitly nullified) calculates and caches

The hash field exists in both objects at the C structure level, but is only used by frozenset. For set, it always remains -1 and is never utilized because __hash__ = None at the type level.