Apache Kafka — Detailed Description

What Kafka Is

Apache Kafka is a distributed event streaming platform, originally developed at LinkedIn and donated to the Apache Software Foundation in 2011. Kafka is neither a traditional message queue nor a database — it belongs to a separate category of systems commonly called a distributed commit log or event log.

The core idea: all events are written into an immutable ordered log, from which consumers read at their own pace without deleting messages.
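
The core idea can be illustrated with a small in-memory toy. This is only a sketch of the concept, not how Kafka is implemented; the names `EventLog`, `append`, and `read` are invented for this example.

```python
class EventLog:
    """Toy append-only log: reading never modifies or removes records."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; the log is unchanged."""
        return self._records[offset:offset + max_records]


log = EventLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# Two independent readers, each tracking its own position.
fast_reader_offset = 0
slow_reader_offset = 0

fast_batch = log.read(fast_reader_offset)
fast_reader_offset += len(fast_batch)   # the fast reader has consumed everything

slow_batch = log.read(slow_reader_offset, max_records=1)
slow_reader_offset += len(slow_batch)   # the slow reader has consumed one record
```

Both readers see the same records in the same order; the only per-reader state is an integer position.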


Core Concepts

Topic

A logical channel into which producers write messages and consumers read them. A topic is an abstraction; physically, it is split into partitions.

Broker

An individual Kafka server. A cluster consists of multiple brokers, and each broker stores a subset of the cluster's partitions (as leader for some, as replica for others). Brokers serve reads and writes as peers, without a centralized “master” on the data path; cluster metadata is coordinated by KRaft (in modern versions) or ZooKeeper (legacy versions).

Producer

Publishes messages to a topic. Decides which partition to write to (or delegates this to Kafka through a partitioner).

Consumer

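Reads messages from a topic. Tracks its position in each partition as an offset (by convention, the committed offset is that of the next message to read, i.e. last consumed + 1) and stores it either in Kafka (the __consumer_offsets topic) or in external storage.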

Offset

A monotonically increasing integer identifier of a message within a partition. There is no global topic offset — only per-partition offsets.


Partitions — Kafka’s Key Mechanism

What a Partition Is

A partition is the physical unit of storage and parallelism. Each partition is an ordered, immutable sequence of messages stored on disk as a set of segment files.

graph LR
    T["Topic: orders"] --> P0["Partition 0\n[msg0, msg1, msg4...]"]
    T --> P1["Partition 1\n[msg2, msg5...]"]
    T --> P2["Partition 2\n[msg3, msg6...]"]
    P0 --> B1["Broker 1\n(Leader)"]
    P1 --> B2["Broker 2\n(Leader)"]
    P2 --> B3["Broker 3\n(Leader)"]
    P0 -.->|replica| B2
    P1 -.->|replica| B3
    P2 -.->|replica| B1

Partition Distribution Across Brokers

How a Producer Chooses a Partition

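  1. Explicit key: hash of the key modulo the number of partitions → partition number. The same key always goes to the same partition as long as the partition count does not change → ordering guarantee for that key
  2. Without a key (null key): round-robin or, since Kafka 2.4, sticky partitioning (fill a batch for one partition, then switch to another)
  3. Custom partitioner: arbitrary routing logic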
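
The first two rules can be sketched as follows. This is an illustration under stated assumptions: CRC32 from the standard library stands in for the murmur2 hash that Kafka's built-in partitioner applies to the serialized key, and the function names are hypothetical.

```python
import zlib
from itertools import count


def partition_for_key(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key) % num_partitions


_round_robin = count()


def partition_for_null_key(num_partitions: int) -> int:
    """Spread keyless records over the partitions in round-robin order."""
    return next(_round_robin) % num_partitions


def choose_partition(key, num_partitions: int) -> int:
    """Keyed records are hashed; keyless records are rotated."""
    if key is None:
        return partition_for_null_key(num_partitions)
    return partition_for_key(key, num_partitions)
```

Because the result depends on `num_partitions`, adding partitions to a topic can send an existing key to a different partition, which is why per-key ordering only holds while the partition count stays fixed.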

Ordering Guarantees


Consumer Groups — Parallel Consumption Mechanism

How It Works

A Consumer Group is a logical group of consumers jointly reading one topic. Kafka automatically distributes partitions among group members so that each partition is consumed by exactly one consumer within the group.

graph TD
    T["Topic: orders\n(6 partitions)"]
    subgraph "Consumer Group A — order processing"
        C1["Consumer 1\nP0, P1"]
        C2["Consumer 2\nP2, P3"]
        C3["Consumer 3\nP4, P5"]
    end
    subgraph "Consumer Group B — analytics"
        C4["Consumer 4\nP0, P1, P2"]
        C5["Consumer 5\nP3, P4, P5"]
    end
    T --> C1
    T --> C2
    T --> C3
    T --> C4
    T --> C5
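
The split shown in the diagram can be reproduced with a small range-style assignment function. This is a simplified sketch for a single topic; `assign_ranges` and the consumer names are hypothetical, and real assignors run inside the Kafka client.

```python
def assign_ranges(partitions, consumers):
    """Give each consumer a contiguous block of partitions (range-style)."""
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        # The first `extra` consumers take one additional partition.
        size = base + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment


partitions = [0, 1, 2, 3, 4, 5]

# Each group gets its own full assignment of the same six partitions.
group_a = assign_ranges(partitions, ["consumer-1", "consumer-2", "consumer-3"])
group_b = assign_ranges(partitions, ["consumer-4", "consumer-5"])
```

With more consumers than partitions, the surplus consumers receive an empty list, which matches the rule that a partition has exactly one reader per group.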

Key Consumer Group Properties

Rebalancing

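When group membership changes (a consumer joins, leaves, or is considered dead after missed heartbeats) or a subscribed topic gains partitions, Kafka performs a rebalance: partitions are redistributed among the current members.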

| Strategy | Description | When to Use |
|---|---|---|
| Range | Assign partitions by ranges | When locality matters |
| RoundRobin | Distribute evenly | Uniform consumers |
| Sticky | Minimize movement | Stateful consumers |
| CooperativeSticky | Incremental rebalance | Recommended in production |

Stop-the-world vs Incremental Rebalance

Classic (eager) rebalance pauses all consumers: every member gives up all of its partitions and waits until the new assignment arrives.

Cooperative Rebalance (Kafka 2.4+): only moved partitions pause; others continue processing.
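
The difference between the two modes comes down to which partitions stop being processed while the group moves from one assignment to the next. The sketch below compares them for a hypothetical group that grows from two to three consumers; the helper names are invented for this example.

```python
def owner_of(assignment):
    """Invert {consumer: [partitions]} into {partition: consumer}."""
    return {p: c for c, parts in assignment.items() for p in parts}


def paused_partitions(old, new, cooperative):
    """Partitions that are not processed while the rebalance is in progress."""
    old_owner, new_owner = owner_of(old), owner_of(new)
    if not cooperative:
        # Eager: every member revokes everything before the new assignment.
        return sorted(old_owner)
    # Cooperative: only partitions that change owner are revoked.
    return sorted(p for p in old_owner if new_owner.get(p) != old_owner[p])


old = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
new = {"c1": [0, 1], "c2": [3, 4], "c3": [2, 5]}   # c3 has just joined

eager = paused_partitions(old, new, cooperative=False)
coop = paused_partitions(old, new, cooperative=True)
```

Here the eager mode pauses all six partitions, while the cooperative mode pauses only the two that move to the new consumer.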

Group Coordinator

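The broker responsible for a given consumer group. It is not a separate node type: the coordinator is the broker that leads the __consumer_offsets partition to which the group ID hashes. It stores the group's committed offsets, tracks membership through heartbeats, and coordinates rebalances.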
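
A rough sketch of how a group maps to its coordinator, under two assumptions that should be checked against the Kafka version in use: the group ID is hashed the way Java's `String.hashCode()` does, and the result is taken modulo the partition count of __consumer_offsets (50 by default, set by `offsets.topic.num.partitions`). The function names are hypothetical, and the hash reproduction is only exact for IDs made of characters in the Basic Multilingual Plane.

```python
def java_string_hash(s: str) -> int:
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id: str, offsets_topic_partitions: int = 50) -> int:
    """Pick the __consumer_offsets partition that holds this group's state."""
    h = java_string_hash(group_id)
    # Guard the one value whose absolute value overflows a 32-bit int.
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % offsets_topic_partitions
```

The broker currently leading the returned partition acts as the group's coordinator, so coordinator load spreads across brokers along with the leadership of that internal topic.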


Working with Many Partitions

Choosing the Number of Partitions

This is one of the most important architectural decisions.

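| Factor | Impact |
|---|---|
| Desired throughput | More partitions = more parallelism |
| Number of consumers | Partition count should not be lower than the number of consumers in a group; surplus consumers stay idle |
| Replication factor | Every replica multiplies the number of files and the storage used |
| Rebalance latency | More partitions → slower rebalance |
| Broker memory | Each partition consumes memory |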

Rule of thumb: start with max(target_throughput / single_partition_throughput, num_consumers).
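
The rule of thumb as a small calculation. The throughput figures in the example are hypothetical; single-partition throughput has to be measured on the actual hardware and workload.

```python
import math


def suggested_partitions(target_mb_s, per_partition_mb_s, num_consumers):
    """max(target throughput / single-partition throughput, consumers)."""
    by_throughput = math.ceil(target_mb_s / per_partition_mb_s)
    return max(by_throughput, num_consumers)


# Throughput-bound case: 500 MB/s target, 25 MB/s per partition, 12 consumers.
throughput_bound = suggested_partitions(500, 25, 12)

# Consumer-bound case: 200 MB/s target, same per-partition rate, 12 consumers.
consumer_bound = suggested_partitions(200, 25, 12)
```

The division is rounded up because a fractional partition does not exist; the result is a starting point, not a final sizing.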

Hot Partitions

Uneven key distribution can overload one partition.

Solutions:
- Add random suffixes
- Custom partitioners
- Pre-aggregation
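
A sketch of the random-suffix approach (often called key salting). CRC32 stands in for the partitioner's real hash, and the key name, salt count, and partition count are made up for the example.

```python
import random
import zlib


def salted_key(key: str, num_salts: int, rng=random) -> str:
    """Append a random suffix so one hot key maps to several partitions."""
    return f"{key}#{rng.randrange(num_salts)}"


def partition_for(key: str, num_partitions: int) -> int:
    """Stable key-to-partition mapping (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


rng = random.Random(0)   # seeded only to make the example repeatable

# Without salting, every record of the hot key lands on one partition.
plain = {partition_for("big-customer", 12) for _ in range(1000)}

# With 8 salts, the same traffic is spread over up to 8 partitions.
salted = {partition_for(salted_key("big-customer", 8, rng), 12) for _ in range(1000)}
```

The trade-off is that records for the original key are no longer in one partition, so per-key ordering is lost and consumers have to strip the suffix and merge results themselves.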

Log Compaction

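With cleanup.policy=compact, Kafka eventually retains only the latest message per key: a background cleaner removes older records with the same key, and a record with a null value (a tombstone) marks the key for deletion.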
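
The effect of compaction on a partition can be sketched as a pure function. This is a simplification: it treats a `None` value as a delete marker and drops it immediately, whereas Kafka keeps such markers for a configurable period and compacts in the background, segment by segment.

```python
def compact(records):
    """Keep the latest value per key, preserving the original offsets.

    records is a list of (key, value) pairs in offset order.
    Returns (offset, key, value) tuples for the surviving records.
    """
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)   # later records overwrite earlier ones
    return sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None            # drop keys whose latest record is a delete marker
    )


log = [
    ("user-1", "a@example.com"),
    ("user-2", "b@example.com"),
    ("user-1", "c@example.com"),   # supersedes offset 0
    ("user-2", None),              # delete marker for user-2
]
compacted = compact(log)
```

Offsets of surviving records do not change, so the compacted log has gaps but stays ordered.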

Retention

Messages are stored regardless of whether they were consumed.

Deletion policies:
- Time (retention.ms)
- Size (retention.bytes)
- Combination of both
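
A simplified model of how the two limits interact, assuming that retention works on whole segment files, that the newest (active) segment is never removed, and that a segment is deleted as soon as either limit allows it. The function and its parameters are hypothetical; the parameter names only mirror retention.ms and retention.bytes.

```python
def segments_to_delete(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return indexes of segments eligible for deletion.

    segments is a list of (last_timestamp_ms, size_bytes), oldest first.
    """
    closed = segments[:-1]   # the last segment is still being written to
    doomed = set()
    if retention_ms is not None:
        for index, (last_ts, _size) in enumerate(closed):
            if now_ms - last_ts > retention_ms:
                doomed.add(index)
    if retention_bytes is not None:
        # Drop oldest segments while the partition stays at or above the limit.
        excess = sum(size for _ts, size in segments) - retention_bytes
        for index, (_ts, size) in enumerate(closed):
            if excess - size < 0:
                break
            doomed.add(index)
            excess -= size
    return sorted(doomed)


segments = [(1000, 400), (2000, 400), (3000, 400), (4000, 100)]
by_time = segments_to_delete(segments, now_ms=5000, retention_ms=2500)
by_size = segments_to_delete(segments, now_ms=5000, retention_bytes=800)
by_both = segments_to_delete(segments, now_ms=5000, retention_ms=2500, retention_bytes=800)
```

Whether a consumer has read a segment plays no role in any of these checks.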


Performance and Internal Mechanics

Why Kafka Is Fast

Sequential disk I/O: append-only writes.

Zero-copy: uses sendfile() to move data from the page cache to the network socket without copying it through user space.

Batching: processes batches instead of individual records.

Page Cache: frequently accessed data stays in RAM.

Compression: batch compression with snappy, lz4, zstd, gzip.
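
Why compressing whole batches pays off can be seen with a small experiment. It uses zlib (DEFLATE, the algorithm behind gzip) from the standard library as a stand-in for Kafka's codecs, and the record layout is invented for the example.

```python
import json
import zlib

# 200 small, similar records, as a producer might accumulate in one batch.
records = [
    json.dumps({"order_id": i, "status": "created", "currency": "USD"}).encode()
    for i in range(200)
]

# Compressing each record on its own: little redundancy to exploit.
one_by_one = sum(len(zlib.compress(r)) for r in records)

# Compressing the batch as a unit: repeated field names compress away.
batch = b"\n".join(records)
as_batch = len(zlib.compress(batch))

raw = sum(len(r) for r in records)
```

Comparing `raw`, `one_by_one`, and `as_batch` shows that small records barely shrink when compressed individually, while the batch shrinks substantially because field names and values repeat across records.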

Delivery Guarantees

| Mode | Description | Risk |
|---|---|---|
| At most once | No retries | Message loss |
| At least once | Retry on failure | Duplicates |
| Exactly once | Transactions + idempotency | More complexity |
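
On the consumer side, the first two modes differ mainly in whether the offset is committed before or after a record is processed. The toy simulation below injects one crash to show the consequence; all names are hypothetical and no Kafka client is involved.

```python
def consume(records, committed, commit_first, crash_at=None):
    """Process records from the committed offset, optionally crashing once.

    Returns (processed_records, committed_offset_after_the_run).
    """
    processed = []
    for offset in range(committed, len(records)):
        if commit_first:
            committed = offset + 1            # commit, then process
            if offset == crash_at:
                return processed, committed   # crashed before processing
            processed.append(records[offset])
        else:
            processed.append(records[offset])  # process, then commit
            if offset == crash_at:
                return processed, committed   # crashed before committing
            committed = offset + 1
    return processed, committed


def run_with_one_crash(records, commit_first):
    """Run, crash at offset 1, then restart from the committed offset."""
    first, committed = consume(records, 0, commit_first, crash_at=1)
    second, _ = consume(records, committed, commit_first)
    return first + second


at_most_once = run_with_one_crash(["a", "b", "c"], commit_first=True)
at_least_once = run_with_one_crash(["a", "b", "c"], commit_first=False)
```

Committing first loses record "b"; committing last processes it twice, which is why at-least-once consumers usually need idempotent processing.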

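Idempotent Producer (enable.idempotence=true) prevents duplicates caused by producer retries: the broker uses a producer ID and per-partition sequence numbers to discard a batch it has already written.

Transactions allow atomic writes across multiple topics and partitions: either all messages of a transaction become visible to consumers reading with isolation.level=read_committed, or none do.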
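
The deduplication idea behind the idempotent producer can be sketched as follows. This is a deliberately reduced model: it keeps one sequence number per producer and rejects anything not newer, while a real broker also tracks producer epochs and rejects out-of-order gaps. The class name is hypothetical.

```python
class PartitionLog:
    """Toy partition that ignores retried writes it has already accepted."""

    def __init__(self):
        self.records = []
        self._last_sequence = {}   # producer_id -> last accepted sequence

    def append(self, producer_id, sequence, value):
        """Return True if the record was written, False if it was a duplicate."""
        if sequence <= self._last_sequence.get(producer_id, -1):
            return False
        self._last_sequence[producer_id] = sequence
        self.records.append(value)
        return True


partition = PartitionLog()
partition.append(producer_id=7, sequence=0, value="order-created")
partition.append(producer_id=7, sequence=1, value="order-paid")

# The acknowledgement for sequence 1 was lost, so the producer retries.
retry_accepted = partition.append(producer_id=7, sequence=1, value="order-paid")
```

The retry is acknowledged as a duplicate and the log still contains each record once.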


Kafka Streams and ksqlDB

Kafka Streams is a Java client library for stream processing on top of Kafka topics; it runs inside the application rather than on a separate processing cluster.

ksqlDB is a SQL interface built on top of Kafka Streams.


Comparison with Alternatives


{
  "title": {
    "text": "Comparison of Queueing and Streaming Systems",
    "left": "center",
    "top": 20,
    "textStyle": {
      "fontSize": 20,
      "color": "#E5E7EB"
    }
  },

  "tooltip": {
    "trigger": "axis",
    "axisPointer": {
      "type": "shadow"
    }
  },

  "legend": {
    "bottom": 10,
    "textStyle": {
      "fontSize": 13,
      "color": "#CBD5E1"
    },
    "data": [
      "Kafka",
      "RabbitMQ",
      "Redpanda",
      "NATS JetStream",
      "Pulsar"
    ]
  },

  "radar": {
    "radius": "72%",
    "center": ["50%", "45%"],

    "name": {
      "textStyle": {
        "fontSize": 14,
        "color": "#D1D5DB"
      }
    },

    "axisLine": {
      "lineStyle": {
        "color": "#64748B"
      }
    },

    "splitLine": {
      "lineStyle": {
        "color": "#475569"
      }
    },

    "splitArea": {
      "show": true,
      "areaStyle": {
        "color": [
          "rgba(51,65,85,0.10)",
          "rgba(51,65,85,0.18)"
        ]
      }
    },

    "indicator": [
      { "name": "Throughput", "max": 10 },
      { "name": "Low latency", "max": 10 },
      { "name": "Operational simplicity", "max": 10 },
      { "name": "Ordering", "max": 10 },
      { "name": "Replay", "max": 10 },
      { "name": "Ecosystem", "max": 10 }
    ]
  },

  "series": [
    {
      "type": "radar",

      "lineStyle": {
        "width": 2
      },

      "data": [
        { "value": [10, 6, 4, 8, 10, 10], "name": "Kafka" },
        { "value": [6, 9, 8, 7, 3, 7], "name": "RabbitMQ" },
        { "value": [9, 5, 6, 8, 10, 6], "name": "Redpanda" },
        { "value": [7, 8, 7, 5, 2, 5], "name": "NATS JetStream" },
        { "value": [8, 7, 5, 6, 8, 4], "name": "Pulsar" }
      ]
    }
  ]
}

Kafka vs RabbitMQ

| Characteristic | Kafka | RabbitMQ |
|---|---|---|
| Model | Log-based (pull) | Message broker (push) |
| Storage | Retention-based | Deletes after ACK |
| Replay | Yes | No |
| Throughput | Millions of messages/sec | Tens of thousands of messages/sec |
| Latency | 5–15 ms | <1 ms possible |
| Routing | Topic/partition only | Advanced routing |
| Ordering | Per partition | Per queue; can be lost with competing consumers or requeues |
| Complexity | Higher | Lower |
| Best for | Analytics, event sourcing | Task queues, RPC |

Kafka vs Redpanda

| Characteristic | Kafka | Redpanda |
|---|---|---|
| Language | Java/Scala | C++ |
| Coordination | KRaft | Built-in Raft |
| Latency | 5–15 ms | 1–3 ms |
| Deployment | More complex | Simpler |
| Ecosystem | Huge | Growing |
| Maturity | Battle-tested | Younger |

Kafka vs Apache Pulsar

| Characteristic | Kafka | Pulsar |
|---|---|---|
| Storage Architecture | Broker-based | Compute/storage separation |
| Scaling | Requires rebalance | Seamless |
| Multi-tenancy | Limited | Native |
| Geo-replication | MirrorMaker | Built-in |
| Subscriptions | Consumer Groups | Multiple modes |
| Complexity | High | Very high |

Kafka vs NATS JetStream

| Characteristic | Kafka | NATS JetStream |
|---|---|---|
| Latency | 5–15 ms | <1 ms |
| Simplicity | Complex | Minimal |
| Throughput | Higher | Lower |
| Ecosystem | Rich | Smaller |
| Storage | Long-term | Less scalable |

When Kafka Is the Right Choice

✅ Kafka fits when:
- Very high throughput
- Event sourcing
- Multiple independent consumer groups
- Real-time analytics
- Long-term event storage
- Heterogeneous integrations

❌ Kafka may not fit when:
- You need latency below 1 ms
- Simple task queues are enough
- Complex routing is required
- Team lacks operational expertise
- Message volume is small