Python `dict` — Internal Structure and Rehashing

General Structure (CPython 3.6+)

Since Python 3.6, dict is implemented as a compact hash table. The storage is divided into two parts:

┌─────────────────────────────────────┐
│  indices[]  — "sparse" index array  │  ← hash table (sparse)
├─────────────────────────────────────┤
│  entries[]  — "dense" entry array   │  ← dense, in insertion order
└─────────────────────────────────────┘

`indices[]` — Sparse Index Array

indices = [EMPTY, 2, EMPTY, 0, 1, EMPTY, EMPTY, 3]
           -1     ↑         ↑   ↑         
                  │         │   └── index in entries[]
                  └─────────┘

Size is always a power of two (8, 16, 32 ...)
Element type is selected based on size: int8 / int16 / int32 / int64
Values: -1 (EMPTY), -2 (DUMMY/deleted), or an index in entries[]

`entries[]` — Dense Array (insertion-ordered!)

Each element is a structure consisting of three fields:

// Objects/dictobject.c
typedef struct {
    Py_hash_t me_hash;   // cached hash of the key (Py_ssize_t)
    PyObject *me_key;    // pointer to the key
    PyObject *me_value;  // pointer to the value
} PyDictKeyEntry;

Key Lookup

d["foo"]

1. hash_val = hash("foo")
2. slot = hash_val & (len(indices) - 1)   # bitmask (faster than mod)
3. idx   = indices[slot]
4. if idx == EMPTY → KeyError
5. if entries[idx].me_hash == hash_val
        AND entries[idx].me_key is key     # check by identity (fast)
        OR  entries[idx].me_key == key     # fallback to equality (slow)
   → found
6. otherwise → probe (iterate through next slots)

Probing Algorithm — Perturbation Probing (not linear!)

// Simplified from dictobject.c
perturb = hash_val;
while (True):
    slot = (slot * 5 + perturb + 1) & mask
    perturb >>= PERTURB_SHIFT   // PERTURB_SHIFT = 5
    // check indices[slot]

This is not linear probing and not double hashing — the formula guarantees visiting all slots in the table for any power-of-2 size.

Rehashing (Resize / Rehash)

When does it trigger?

# Threshold: load factor > 2/3
# Specifically: used > (size * 2) // 3
# where used = number of occupied + deleted (dummy) slots

size=8  → resize when used > 5
size=16 → resize when used > 10
size=32 → resize when used > 21

Size after Resize

// minused = number of actual live keys
// Find the smallest power of two > minused * 4 (if dict is small)
// or > minused * 2 (if large, > 50,000)

new_size = 8
while new_size <= minused * growth_factor:
    new_size <<= 1

Growth factor for small dicts: × 4
Growth factor for large dicts: × 2

What happens during rehashing?

OLD TABLE                   NEW TABLE
indices[8]                  indices[32]   (new empty array)
entries[5 entries]          entries[]     (same objects!)

Steps:
1. Allocate new indices[] of the required size, fill with EMPTY
2. Iterate through entries[] (dense array — NO gaps!)
3. For each entry: new slot = me_hash & new_mask
4. Insert index into the new indices[] (with probing on collisions)
5. entries[] is NOT copied — only indices[] is recalculated!

Key advantage of the compact format: resizing doesn't require moving the data itself — only reindexing indices[]. Insertion order is preserved automatically.

Deletion (Why there's no immediate shrink)

d = {i: i for i in range(100)}
del d[50]  # what happens?

indices[slot] = DUMMY (-2)   ← "tombstone", not EMPTY!
entries[idx].me_key   = NULL
entries[idx].me_value = NULL

Dummy is needed because otherwise probing chains would be broken:

Insert:  A→slot5, B→slot5(collision)→slot7
Delete A: slot5=EMPTY
Search B:  slot5==EMPTY → KeyError!  ← INCORRECT without dummy

Resize down happens only during the next insertion if:

(used - dummy) * 4 <= size

Special Optimizations

Split-table for Object Attributes

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p1 = Point(1, 2)
p2 = Point(3, 4)

Instances of the same class share ONE indices[] + keys[]
and have their own values[]

p1.__dict__ → keys=[shared], values=[1, 2]
p2.__dict__ → keys=[shared], values=[3, 4]

Unicode Keys (dict with str keys)

If all keys are str, CPython uses an optimized lookup (lookdict_unicode) that skips type checks.

Final Workflow

+flowchart TD A[dict.__setitem__] --> B{load factor\n> 2/3?} B -- No --> C[hash key] B -- Yes --> R[dictresize] R --> R1[calculate new_size\nmin power of 2] R1 --> R2[new indices\nof required int type] R2 --> R3[iterate entries\nreindex] R3 --> C C --> D[slot = hash & mask] D --> E{indices slot\nEMPTY?} E -- Yes --> F[write to entries\nupdate indices] E -- No --> G{collision\nor dummy?} G --> H[perturb probe\nnext slot] H --> E

Brief Summary

Aspect	Details
Structure	`indices[]` (sparse) + `entries[]` (dense)
Table Size	Power of 2, minimum 8
Resize Threshold	`used > size * 2/3`
New Size	× 4 (small) / × 2 (> 50k entries)
Probing	Perturbation (`slot = slot*5 + perturb + 1`)
Deletion	DUMMY slots, no immediate shrink
Order Guarantee	Insertion order (official since 3.7)
Lookup Complexity	O(1) amortized

Python dict — Internal Structure and Rehashing