Apache Kafka — Detailed Description

What Kafka Is

Apache Kafka is a distributed event streaming platform, originally developed at LinkedIn and donated to the Apache Software Foundation in 2011. Kafka is neither a traditional message queue nor a database — it belongs to a separate category of systems commonly called a distributed commit log or event log.

The core idea: all events are written into an immutable ordered log, from which consumers read at their own pace without deleting messages.
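The idea can be illustrated with a minimal in-memory sketch. This is a toy model, not Kafka's storage engine; the class name `PartitionLog` and the reader names are hypothetical and chosen only for illustration.

```python
class PartitionLog:
    """Toy append-only log: records are never modified or removed by readers."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; the log is left intact."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# Two independent readers keep their own positions in the same log.
positions = {"billing": 0, "analytics": 2}
billing_view = log.read(positions["billing"])
analytics_view = log.read(positions["analytics"])
```

Because reading does not consume anything, a slow reader and a fast reader can share the same log without affecting each other.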

Core Concepts

Topic

A logical channel into which producers write messages and consumers read them. A topic is an abstraction; physically, it is split into partitions.

Broker

An individual Kafka server. A cluster consists of multiple brokers, and each broker stores a subset of the cluster's partitions. No broker acts as a central “master” for data traffic: partition leadership is spread across brokers, and cluster metadata is coordinated by KRaft controllers (in modern versions) or ZooKeeper (legacy versions).

Producer

Publishes messages to a topic. Decides which partition to write to (or delegates this to Kafka through a partitioner).

Consumer

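Reads messages from a topic. Tracks an offset for each partition it reads, marking its position in that partition (by convention, the committed offset is that of the next message to read). Offsets are stored either in Kafka (the __consumer_offsets topic) or in external storage.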

Offset

A monotonically increasing integer identifier of a message within a partition. There is no global topic offset — only per-partition offsets.

Partitions — Kafka’s Key Mechanism

What a Partition Is

A partition is the physical unit of storage and parallelism. Each partition is an ordered, immutable sequence of messages stored on disk as a set of segment files.

graph LR
  T["Topic: orders"] --> P0["Partition 0\n[msg0, msg1, msg4...]"]
  T --> P1["Partition 1\n[msg2, msg5...]"]
  T --> P2["Partition 2\n[msg3, msg6...]"]
  P0 --> B1["Broker 1\n(Leader)"]
  P1 --> B2["Broker 2\n(Leader)"]
  P2 --> B3["Broker 3\n(Leader)"]
  P0 -.->|replica| B2
  P1 -.->|replica| B3
  P2 -.->|replica| B1

Partition Distribution Across Brokers

Each partition has one leader and N−1 followers, where N = replication factor
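By default, all read and write operations go through the leader (since Kafka 2.4, consumers can optionally fetch from followers)
Followers replicate by fetching from the leader; whether a write waits for replication depends on the producer's acks setting (acks=all waits for the in-sync replicas, acks=1 only for the leader)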
The set of replicas synchronized with the leader is called ISR (In-Sync Replicas)
If the leader fails, one ISR replica automatically becomes the new leader
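The failover rule in the list above can be sketched as a small function. This is a simplified model under stated assumptions: it picks the first assigned replica that is both in the ISR and alive, and it treats the partition as offline when no such replica exists (that is, it assumes unclean leader election is disabled). The function name `elect_leader` is hypothetical.

```python
def elect_leader(replicas, isr, live_brokers):
    """Return the first assigned replica that is in the ISR and alive, else None."""
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    # No eligible replica: the partition stays offline in this simplified model.
    return None


replicas = [1, 2, 3]        # assignment order; broker 1 was the leader
isr = {1, 2}                # broker 3 has fallen behind
new_leader = elect_leader(replicas, isr, live_brokers={2, 3})
```

Restricting the choice to ISR members is what prevents acknowledged writes from being lost when leadership moves.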

How a Producer Chooses a Partition

Explicit key: key hash → partition number. The same key always goes to the same partition as long as the partition count does not change → ordering guarantee for that key
Without a key (null key): round-robin or sticky partitioning
Custom partitioner: arbitrary routing logic
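A minimal sketch of the first two rules follows. It is an illustration only: Kafka's default partitioner hashes keys with murmur2, while this sketch substitutes CRC32 from the standard library, and plain round-robin stands in for the null-key case. The function name `choose_partition` is hypothetical.

```python
import itertools
import zlib

_round_robin = itertools.count()


def choose_partition(key, num_partitions):
    """Map a record key (bytes or None) to a partition number."""
    if key is None:
        # No key: spread records across partitions in turn.
        return next(_round_robin) % num_partitions
    # Stand-in hash: Kafka itself uses murmur2, not CRC32.
    return zlib.crc32(key) % num_partitions


first = choose_partition(b"customer-42", 6)
again = choose_partition(b"customer-42", 6)
```

Since the result is the hash taken modulo the partition count, adding partitions to a topic can send an existing key to a different partition.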

Ordering Guarantees

Ordering is guaranteed only within a single partition
There is no ordering across partitions
This is a fundamental tradeoff: more partitions = more parallelism, but no global ordering

Consumer Groups — Parallel Consumption Mechanism

How It Works

A Consumer Group is a logical group of consumers jointly reading one or more topics. Kafka automatically distributes partitions among group members so that each partition is consumed by exactly one consumer within the group.

graph TD
  T["Topic: orders\n(6 partitions)"]
  subgraph "Consumer Group A — order processing"
    C1["Consumer 1\nP0, P1"]
    C2["Consumer 2\nP2, P3"]
    C3["Consumer 3\nP4, P5"]
  end
  subgraph "Consumer Group B — analytics"
    C4["Consumer 4\nP0, P1, P2"]
    C5["Consumer 5\nP3, P4, P5"]
  end
  T --> C1
  T --> C2
  T --> C3
  T --> C4
  T --> C5

Key Consumer Group Properties

Different groups are independent — each stores offsets separately
The number of active consumers in a group is limited by partition count
Maximum parallelism: consumer count = partition count; additional consumers stay idle
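The partition-count limit can be seen in a small sketch of a range-style assignment for a single topic. The function name `range_assign` and the consumer IDs are hypothetical; real assignors also handle multiple topics and subscriptions.

```python
def range_assign(num_partitions, consumers):
    """Give each consumer a contiguous block of partitions of one topic."""
    consumers = sorted(consumers)
    base, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        # The first `extra` consumers receive one additional partition.
        count = base + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


balanced = range_assign(6, ["c1", "c2", "c3"])
oversized_group = range_assign(2, ["c1", "c2", "c3"])
```

In the second call the group has more consumers than partitions, so one consumer receives an empty assignment and sits idle.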

Rebalancing

When group membership changes, Kafka performs a rebalance.

Strategy	Description	When to Use
Range	Assign partitions by ranges	When locality matters
RoundRobin	Distribute evenly	Uniform consumers
Sticky	Minimize movement	Stateful consumers
CooperativeSticky	Incremental rebalance	Recommended in production

Stop-the-world vs Incremental Rebalance

Classic (eager) rebalance pauses all consumers: every member gives up all of its partitions before the new assignment is handed out.

Cooperative Rebalance (Kafka 2.4+): only moved partitions pause; others continue processing.
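The difference between the two protocols comes down to which partitions each consumer must stop processing. The sketch below compares them for one membership change; it models only the outcome, not the actual join/sync protocol, and the function name `revoked_partitions` is hypothetical.

```python
def revoked_partitions(old, new, cooperative):
    """Return, per consumer, the partitions it must pause during the rebalance."""
    revoked = {}
    for consumer, owned in old.items():
        if cooperative:
            # Only partitions that move to another consumer are given up.
            keep = set(new.get(consumer, []))
            revoked[consumer] = [p for p in owned if p not in keep]
        else:
            # Eager protocol: everything is given up, then reassigned.
            revoked[consumer] = list(owned)
    return revoked


old = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
new = {"c1": [0, 1], "c2": [3, 4], "c3": [2, 5]}  # c3 joined the group
eager = revoked_partitions(old, new, cooperative=False)
coop = revoked_partitions(old, new, cooperative=True)
```

With the cooperative variant, partitions 0, 1, 3 and 4 keep being processed while 2 and 5 move to the new member.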

Group Coordinator

One of the brokers, chosen per consumer group: the leader of the __consumer_offsets partition that the group ID maps to. It handles the group's offset commits and coordinates rebalances.
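How a group is mapped to its coordinator can be approximated as follows. This sketch rests on assumptions that should be checked against the Kafka version in use: that the broker hashes the group ID with Java's String.hashCode(), takes the absolute value, and reduces it modulo the partition count of __consumer_offsets (50 by default via offsets.topic.num.partitions). The function names are hypothetical, and the hash helper is only exact for group IDs made of basic (BMP) characters.

```python
def java_string_hash(s):
    """Replicate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def offsets_partition_for(group_id, num_offsets_partitions=50):
    """Assumed mapping from group ID to a __consumer_offsets partition."""
    h = java_string_hash(group_id)
    # Integer.MIN_VALUE has no positive counterpart; treat it as 0 (assumption).
    non_negative = 0 if h == -(1 << 31) else abs(h)
    return non_negative % num_offsets_partitions


partition = offsets_partition_for("order-processing")
```

The broker that leads the resulting partition acts as the coordinator for that group, which spreads coordination work across the cluster.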

Working with Many Partitions

Choosing the Number of Partitions

This is one of the most important architectural decisions.

Factor	Impact
Desired throughput	More partitions = more parallelism
Number of consumers	Partition count should be at least the number of consumers in the largest group
Replication factor	More files and storage
Rebalance latency	More partitions → slower rebalance
Broker memory	Each partition consumes memory

Rule of thumb: start with max(target_throughput / single_partition_throughput, num_consumers).
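The rule of thumb translates directly into a small helper. The throughput figures below are made-up inputs for illustration, not measurements, and the function name `estimate_partitions` is hypothetical; rounding up is an added assumption so that the result is a whole number of partitions.

```python
import math


def estimate_partitions(target_mb_s, per_partition_mb_s, num_consumers):
    """Starting point for partition count: throughput need vs. consumer count."""
    by_throughput = math.ceil(target_mb_s / per_partition_mb_s)
    return max(by_throughput, num_consumers)


# Hypothetical inputs: 200 MB/s target, 25 MB/s per partition, 6 consumers.
throughput_bound = estimate_partitions(200, 25, 6)
# Here the consumer count dominates: 50 MB/s target needs only 2 partitions.
consumer_bound = estimate_partitions(50, 25, 6)
```

The per-partition throughput has to be measured on the actual hardware and message sizes, so the result is only a starting value to validate under load.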

Hot Partitions

Uneven key distribution can overload one partition.

Solutions:
- Add random suffixes to hot keys (key salting)
- Custom partitioners
- Pre-aggregation
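The random-suffix approach can be sketched in a few lines. The separator `#`, the number of salts, and the function names are assumptions made for this illustration; the separator must not occur in real keys in a way that makes the split ambiguous.

```python
import random


def salted_key(key, num_salts, rng=random):
    """Append a random suffix so one hot key spreads over several partitions."""
    return f"{key}#{rng.randrange(num_salts)}"


def original_key(salted):
    """Strip the suffix again on the consumer side."""
    return salted.rsplit("#", 1)[0]


variants = {salted_key("tenant-1", 4) for _ in range(100)}
restored = {original_key(k) for k in variants}
```

The tradeoff is that records for the original key now land in several partitions, so per-key ordering is lost and consumers have to merge the results.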

Log Compaction

With cleanup.policy=compact, Kafka retains at least the latest message per key and removes older records for that key in the background.
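The effect of compaction can be shown with a toy function over an in-memory list of records. This is a simplification: the real log cleaner works on closed segments in the background, and it keeps delete markers (records with a null value, often called tombstones) for a while before dropping them, whereas this sketch drops them immediately. The function name `compact` is hypothetical.

```python
def compact(records):
    """records: list of (offset, key, value); a value of None marks a delete."""
    latest = {}
    for offset, key, value in records:
        # Later records overwrite earlier ones for the same key.
        latest[key] = (offset, key, value)
    survivors = sorted(latest.values(), key=lambda record: record[0])
    return [record for record in survivors if record[2] is not None]


records = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),
    (3, "user-2", None),  # delete marker for user-2
]
compacted = compact(records)
```

Surviving records keep their original offsets, so offsets in a compacted partition may have gaps.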

Retention

Messages are stored regardless of whether they were consumed.

Deletion policies:
- Time (retention.ms)
- Size (retention.bytes)
- Combination of both
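Retention is applied to whole segment files rather than to individual messages. The sketch below is a simplified model of that behavior under stated assumptions: segments are listed oldest first, the last one is the active segment and is never deleted here, time-based deletion compares the newest timestamp in a segment against retention.ms, and size-based deletion removes the oldest segments only while the remaining log stays at or above retention.bytes. The function name and tuple layout are hypothetical.

```python
def segments_to_delete(segments, now_ms, retention_ms=None, retention_bytes=None):
    """segments: oldest-first list of (max_timestamp_ms, size_bytes)."""
    closed = segments[:-1]  # the active segment is excluded in this model
    doomed = set()
    if retention_ms is not None:
        for index, (max_ts, _) in enumerate(closed):
            if now_ms - max_ts > retention_ms:
                doomed.add(index)
    if retention_bytes is not None:
        excess = sum(size for _, size in segments) - retention_bytes
        for index, (_, size) in enumerate(closed):
            if excess - size < 0:
                break  # deleting this segment would shrink the log below the limit
            excess -= size
            doomed.add(index)
    return sorted(doomed)


segments = [(1000, 100), (2000, 100), (3000, 100), (4000, 50)]
by_time = segments_to_delete(segments, now_ms=5000, retention_ms=2500)
by_size = segments_to_delete(segments, now_ms=5000, retention_bytes=200)
```

When both limits are set, a segment is removed as soon as either one applies.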

Performance and Internal Mechanics

Why Kafka Is Fast

Sequential disk I/O: append-only writes.

Zero-copy: uses sendfile() to move data from the page cache to the network socket without copying it through user space.

Batching: processes batches instead of individual records.

Page Cache: frequently accessed data stays in RAM.

Compression: batch compression with snappy, lz4, zstd, gzip.
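Why batching and compression reinforce each other can be demonstrated with the standard library alone. The sketch compresses the same records once individually and once as a single batch; zlib (DEFLATE) stands in for Kafka's codecs, and the record contents are invented for the example.

```python
import json
import zlib

# 200 small, similar records, as is typical for event streams.
records = [
    json.dumps({"order_id": i, "status": "created", "currency": "USD"}).encode()
    for i in range(200)
]

# Compressing each record alone cannot exploit repetition across records.
per_record_bytes = sum(len(zlib.compress(record)) for record in records)

# Compressing the whole batch lets the codec reuse repeated field names and values.
batch_bytes = len(zlib.compress(b"\n".join(records)))
```

The batch result is considerably smaller because field names and common values repeat across records, which is the reason compression is applied per batch rather than per message.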

Delivery Guarantees

Mode	Description	Risk
At most once	No retries	Message loss
At least once	Retry on failure	Duplicates
Exactly once	Transactions + idempotency	More complexity

Idempotent Producer (enable.idempotence=true) prevents duplicates caused by producer retries.

Transactions allow atomic writes across multiple topics and partitions.
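The deduplication behind the idempotent producer can be sketched as sequence-number tracking on the partition. This is a toy model under stated assumptions: real brokers track sequences per producer ID and epoch at batch granularity, while this sketch tracks one number per record. The class and return values are hypothetical.

```python
class PartitionState:
    """Toy partition that rejects records it has already accepted."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last accepted sequence number

    def append(self, producer_id, seq, value):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"      # a retry of something already written
        if seq != last + 1:
            return "out_of_order"   # a gap means an earlier record is missing
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


partition = PartitionState()
first_try = partition.append(producer_id=7, seq=0, value="order-created")
# The acknowledgement was lost, so the producer sends the same record again.
retry = partition.append(producer_id=7, seq=0, value="order-created")
```

The retry is acknowledged without being written a second time, which turns at-least-once retries into effectively-once writes for that partition.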

Kafka Streams and ksqlDB

Kafka Streams — stream processing library inside the Kafka ecosystem.

ksqlDB — SQL interface built on top of Kafka Streams.

Comparison with Alternatives

{
  "title": {
    "text": "Comparison of Queueing and Streaming Systems",
    "left": "center",
    "top": 20,
    "textStyle": {
      "fontSize": 20,
      "color": "#E5E7EB"
    }
  },

  "tooltip": {
    "trigger": "axis",
    "axisPointer": {
      "type": "shadow"
    }
  },

  "legend": {
    "bottom": 10,
    "textStyle": {
      "fontSize": 13,
      "color": "#CBD5E1"
    },
    "data": [
      "Kafka",
      "RabbitMQ",
      "Redpanda",
      "NATS JetStream",
      "Pulsar"
    ]
  },

  "radar": {
    "radius": "72%",
    "center": ["50%", "45%"],

    "name": {
      "textStyle": {
        "fontSize": 14,
        "color": "#D1D5DB"
      }
    },

    "axisLine": {
      "lineStyle": {
        "color": "#64748B"
      }
    },

    "splitLine": {
      "lineStyle": {
        "color": "#475569"
      }
    },

    "splitArea": {
      "show": true,
      "areaStyle": {
        "color": [
          "rgba(51,65,85,0.10)",
          "rgba(51,65,85,0.18)"
        ]
      }
    },

    "indicator": [
      { "name": "Throughput", "max": 10 },
      { "name": "Low latency", "max": 10 },
      { "name": "Operational simplicity", "max": 10 },
      { "name": "Ordering", "max": 10 },
      { "name": "Replay", "max": 10 },
      { "name": "Ecosystem", "max": 10 }
    ]
  },

  "series": [
    {
      "type": "radar",

      "lineStyle": {
        "width": 2
      },

      "data": [
        { "value": [10, 6, 4, 8, 10, 10], "name": "Kafka" },
        { "value": [6, 9, 8, 7, 3, 7], "name": "RabbitMQ" },
        { "value": [9, 5, 6, 8, 10, 6], "name": "Redpanda" },
        { "value": [7, 8, 7, 5, 2, 5], "name": "NATS JetStream" },
        { "value": [8, 7, 5, 6, 8, 4], "name": "Pulsar" }
      ]
    }
  ]
}

Kafka vs RabbitMQ

Characteristic	Kafka	RabbitMQ
Model	Log-based (pull)	Message broker (push)
Storage	Retention-based	Deletes after ACK
Replay	Yes	No for classic queues (yes with RabbitMQ Streams)
Throughput	Millions/sec	Tens of thousands/sec
Latency	5–15 ms	<1 ms possible
Routing	Topic/partition only	Advanced routing
Ordering	Per partition	Per queue (weakened by competing consumers and requeues)
Complexity	Higher	Lower
Best for	Analytics, event sourcing	Task queues, RPC

Kafka vs Redpanda

Characteristic	Kafka	Redpanda
Language	Java/Scala	C++
Coordination	KRaft	Built-in Raft
Latency	5–15 ms	1–3 ms
Deployment	More complex	Simpler
Ecosystem	Huge	Growing
Maturity	Battle-tested	Younger

Kafka vs Apache Pulsar

Characteristic	Kafka	Pulsar
Storage Architecture	Broker-based	Compute/storage separation
Scaling	Requires partition reassignment	Seamless
Multi-tenancy	Limited	Native
Geo-replication	MirrorMaker	Built-in
Subscriptions	Consumer Groups	Multiple modes
Complexity	High	Very high

Kafka vs NATS JetStream

Characteristic	Kafka	NATS JetStream
Latency	5–15 ms	<1 ms
Simplicity	Complex	Minimal
Throughput	Higher	Lower
Ecosystem	Rich	Smaller
Storage	Long-term	Less scalable

When Kafka Is the Right Choice

✅ Kafka fits when:
- Very high throughput
- Event sourcing
- Multiple independent consumer groups
- Real-time analytics
- Long-term event storage
- Heterogeneous integrations

❌ Kafka may not fit when:
- You need latency below 1 ms
- Simple task queues are enough
- Complex routing is required
- Team lacks operational expertise
- Message volume is small