Tutorial · Data Engineering

~12 min read

Apache Kafka Architecture, explained in plain English

Kafka can feel intimidating. Brokers, partitions, offsets, replicas, consumer groups - you can read about each one separately and still not see how they connect. Every piece is there to solve one problem: moving data from the systems that produce it to the systems that need it, reliably and at scale. We'll work through the whole thing using one example: an online store processing orders.

Key terms: producer · topic · partition · offset · broker · replication · consumer group · leader / follower

01 The big picture

Imagine a busy online store. Thousands of customers place orders every second. Each order needs to trigger several things: a fraud check, an inventory update, a confirmation email, and an entry in the analytics dashboard.

In a traditional setup, the Order Service would call each of those systems directly, one after another, and wait for every reply. If the email server is slow, the whole order slows down. If one service is down, the order fails. This is called tight coupling, and it does not scale.

Kafka sits in the middle. The Order Service writes the order once and moves on. Every downstream service reads from Kafka at its own pace - they don't coordinate, and none of them wait.

[Diagram: producers → cluster → consumers]

The whole architecture in one line: producers push → cluster stores → consumers pull.

The producer doesn't know consumers exist. Consumers don't know about each other. That one design decision is why the whole system scales without everything grinding to a halt.

02 Producers - who sends the data

A producer is any app that writes data into Kafka - a mobile app, a payment service, a sensor. Each thing it sends is called an event (or a message / record).

When a customer places order ORD-8821, the Order Service creates an event like this:

{
  "event":   "order.placed",
  "orderId": "ORD-8821",
  "userId":  "U-4491",
  "total":   118.99,
  "currency": "GBP",
  "ts":      "2026-05-27T09:14:02Z"
}

Real-life analogy A producer is a passenger boarding a bus. They get on, hand over their ticket (the event), and the bus takes care of the rest. The passenger does not need to know who else is on the bus or where everyone is getting off.

✓ Pro

The producer writes the event once and moves on. It waits only for Kafka's acknowledgement, so it returns to the customer in milliseconds instead of waiting for four downstream services.

How it works

The producer library handles serialization, choosing a partition, batching, and retries automatically.

✗ Without it

Direct calls to 4 services mean 4 chances to fail and latency that stacks up on every order.
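
As a rough illustration of the serialization step, the sketch below turns the order event into the key bytes and value bytes a producer would hand to its client library. The `serialize_event` helper and the choice of `userId` as the key are assumptions made for this example; they are not part of any Kafka client.

```python
import json

def serialize_event(event: dict, key_field: str = "userId") -> tuple[bytes, bytes]:
    """Return (key_bytes, value_bytes) for one event (hypothetical helper)."""
    key = str(event[key_field]).encode("utf-8")
    # Compact, deterministic JSON so identical events produce identical bytes.
    value = json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return key, value

order_event = {
    "event": "order.placed",
    "orderId": "ORD-8821",
    "userId": "U-4491",
    "total": 118.99,
    "currency": "GBP",
    "ts": "2026-05-27T09:14:02Z",
}

key_bytes, value_bytes = serialize_event(order_event)
```

A real producer library would then pick a partition from the key, batch the bytes with other records, and retry on transient failures.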

03 Topics & partitions

A topic is a named channel for a category of events. All order events go into the orders topic; all payments go into the payments topic. Producers choose which topic to write to, and consumers choose which topic to read from.

Each topic is split into partitions. A partition is just an ordered list of messages on one machine. Splitting a topic this way lets Kafka spread the work across many servers - that's the main reason it scales.

Real-life analogy A topic is a bus route number (Route 47 → City Centre). A partition is a row of seats on that bus. More rows means more passengers can travel at once.

04 Partition keys & ordering

How does Kafka decide which partition a message goes into? Through the partition key. Before sending, the producer client hashes the key (the Java client's default partitioner uses murmur2) and takes the result modulo the number of partitions:

// key = userId "U-4491"
hash("U-4491") % 3 = partition 1

The same key always hashes to the same partition, as long as the topic's partition count stays the same. So every event for user U-4491 - order placed, then paid, then shipped - lands in the same partition, in the order it was sent. That's your ordering guarantee per user.

✓ With a key

Events for one user stay in order. "Placed" is always read before "shipped". Logic stays correct.

No key?

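Messages are spread across partitions for even load (round-robin in older clients; since Kafka 2.4 the default fills a batch for one partition, then moves to the next) - fine when order doesn't matter.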

✗ The risk

Without a key, a consumer might process "cancelled" before "placed" - broken business logic.
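
The routing rule can be sketched in a few lines. This is a simplified stand-in rather than Kafka's partitioner: CRC32 replaces the real hash function, the keyless fallback is plain round-robin, and both function names are invented for the example.

```python
import itertools
import zlib

def partition_for_key(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

_round_robin = itertools.count()

def partition_without_key(num_partitions: int) -> int:
    """Simplified keyless fallback: rotate through the partitions in turn."""
    return next(_round_robin) % num_partitions

# Three lifecycle events for one user all land in the same partition.
events = ["order.placed", "order.paid", "order.shipped"]
partitions = {partition_for_key("U-4491", 3) for _ in events}
```

Because the partition is computed modulo the partition count, adding partitions later changes where a key lands, which is why the ordering guarantee assumes a fixed count.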

05 Brokers & the commit log

A broker is a single Kafka server. A group of them is a cluster. The broker's job: receive messages, assign each one a number, and append it to a file on disk called the commit log.

The broker has no business logic. It doesn't run fraud checks or send emails. It stores data and serves it - that's the whole job. The narrowness is intentional; it's what makes brokers predictable and fast.

Why write to disk instead of memory?

Sequential disk writes are faster than most people expect - appending to the end of a file avoids the seeks that random I/O requires, and the operating system's page cache absorbs most reads and writes. And since the data is on disk, it survives broker restarts and can be replayed at any point until retention expires.
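
A toy version of an append-only log on disk, to make the idea concrete. It assumes one record per line and uses the line index as the offset. Kafka's real on-disk format is binary and segmented, and the file name and extra order IDs here are made up.

```python
import os
import tempfile

def append_record(path: str, record: str) -> None:
    """Append one record as a line at the end of the log file."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(record + "\n")

def replay(path: str, from_offset: int = 0) -> list[str]:
    """Re-read the log from a given position; the line index plays the offset."""
    with open(path, encoding="utf-8") as log:
        return [line.rstrip("\n") for line in log][from_offset:]

log_path = os.path.join(tempfile.mkdtemp(), "orders-0.log")
for order_id in ["ORD-8819", "ORD-8820", "ORD-8821"]:
    append_record(log_path, order_id)

# A "restart" is just reopening the file: everything is still there.
replayed = replay(log_path)
```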

Real-life analogy A broker is the bus itself - or more precisely, the conductor with a permanent logbook. Every passenger gets stamped with a seat number, written in ink and never altered.

06 Offsets - the bookmark

Every message inside a partition gets a sequential number called an offset. It is not a global unique ID for the order - it is simply the message's position within that one partition, like a line number in a notebook.

Partition 1 - topic: orders
offset 0 → U-3312 → order.placed
offset 1 → U-1188 → order.placed
offset 2 → U-4491 → order.placed   ← our order
offset 3 → (next write...)

Offset 2 in Partition 1 is a completely different message from offset 2 in Partition 0 - the number is local to each partition. Once written, it never changes and is never reused. A consumer group commits the offset it has reached, so after a crash or restart it resumes from the last committed position instead of starting over.
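
The bookkeeping can be modelled with a plain list per partition. In this sketch the `Partition` class and the user ID `U-7020` are invented for illustration, and the committed value is treated as the next offset to read, which is how Kafka stores it.

```python
class Partition:
    """In-memory stand-in for one partition: an append-only list."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def append(self, message: str) -> int:
        """Store the message and return its offset (its index in this partition)."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset: int) -> list[tuple[int, str]]:
        """Return (offset, message) pairs starting at the given offset."""
        return list(enumerate(self.messages))[offset:]

partition_0, partition_1 = Partition(), Partition()
partition_0.append("U-7020 order.placed")   # offset 0 in partition 0
for msg in ["U-3312 order.placed", "U-1188 order.placed", "U-4491 order.placed"]:
    last_offset = partition_1.append(msg)   # offsets 0, 1, 2 in partition 1

committed = 2   # offsets 0 and 1 are done; 2 is the next one to read
resumed = partition_1.read_from(committed)
```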

07 Partition vs replica - two different things

Partitioning splits different messages across different machines. Replication makes copies of the same partition for safety. They sound related but solve different problems - and mixing them up is probably the most common Kafka misconception.

[Diagram legend: partition (storage lane), replica (safety copy), failure path]

Partitioning distributes different data. Replication duplicates the same data.

So when you have a topic with 3 partitions and a replication factor of 3, you end up with 3 lanes of different data, and each lane is copied onto 3 machines - 9 partition replicas in total. One is for scale, the other is for safety.
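
To see the arithmetic, here is a simplified placement routine. It rotates the starting broker for each partition; Kafka's actual assignment logic is more involved, and the function and broker names are assumptions for the example.

```python
def assign_replicas(num_partitions: int, replication_factor: int,
                    brokers: list[str]) -> dict[int, list[str]]:
    """Place each partition's replicas on distinct brokers, rotating the start."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        layout[p] = [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
    return layout

layout = assign_replicas(3, 3, ["broker-1", "broker-2", "broker-3"])
total_copies = sum(len(replicas) for replicas in layout.values())
```

With 3 partitions and a replication factor of 3, `total_copies` comes to 9, and every partition has a copy on each of the three brokers.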

08 Leader, follower & ISR replication

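Among the copies of each partition, one is the leader and the rest are followers. Producers always write to the leader, and consumers read from it by default (since Kafka 2.4 they can optionally be configured to fetch from a nearby follower). Followers otherwise serve no client traffic - they continuously copy from the leader and stay ready to take over.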

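The set of replicas caught up with the leader is called the ISR (In-Sync Replicas); it includes the leader itself. With acks=all, the leader only confirms the write once every replica currently in the ISR has stored the message. Pair this with min.insync.replicas=2 (or higher) so that a write is rejected, rather than acknowledged by the leader alone, when followers fall out of sync. That combination is what rules out silent data loss.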

// what happens when fraud writes a result
Fraud Service ──writes──▸ leader (fraud-results)
                           ├─ copies to follower 1
                           └─ copies to follower 2
                           all in sync ──▸ ACK back to Fraud Service

If the leader crashes, Kafka's controller detects it within seconds and promotes a follower from the ISR, so the new leader already holds every message acknowledged with acks=all. Producers and consumers refresh their metadata and reconnect automatically. No manual intervention needed.
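
The write and failover paths can be walked through with a toy model. It is deliberately simplified: the leader pushes copies here, whereas real followers fetch from the leader, and the class, method, and broker names are invented for illustration.

```python
class ReplicatedPartition:
    """Toy model of one partition with a leader, followers, and an ISR."""

    def __init__(self, replicas: list[str]) -> None:
        self.logs = {name: [] for name in replicas}   # one log per replica
        self.leader = replicas[0]
        self.isr = set(replicas)                      # everyone starts in sync

    def write_acks_all(self, message: str) -> bool:
        """Append to the leader, copy to every in-sync follower, then ack."""
        self.logs[self.leader].append(message)
        for name in self.isr - {self.leader}:
            self.logs[name].append(message)
        return all(self.logs[n][-1] == message for n in self.isr)

    def fail_leader(self) -> str:
        """Drop the leader and promote a remaining in-sync replica."""
        self.isr.discard(self.leader)
        del self.logs[self.leader]
        if not self.isr:
            raise RuntimeError("no in-sync replica left to promote")
        self.leader = sorted(self.isr)[0]
        return self.leader

partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
acked = partition.write_acks_all("fraud.result ORD-8821")
new_leader = partition.fail_leader()
```

After the failover, the promoted replica still holds the acknowledged message, which is the point of waiting for the whole ISR before confirming a write.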

Real-life analogy The leader is the main warehouse that handles all shipments. The followers are backup warehouses kept perfectly stocked. If the main one burns down, a backup instantly takes over - customers never notice.

09 Consumers & consumer groups

A consumer reads messages from a topic and does something with them - a Fraud Service, an Email Service, an Analytics pipeline. The partition is passive storage; the consumer is what actually processes the data.

Consumers that cooperate are organised into a consumer group. Within one group, each partition is assigned to exactly one consumer instance at a time, so the work is split and no two members of the group process the same partition in parallel.

[Diagram labels: storage, fraud group, email group]

Two layers: passive storage (top) and active consumers (bottom). The email group being behind never slows fraud.

Both groups read the same partitions, but each tracks its own offset independently. The email group sitting at offset 0 has no effect on the fraud group at offset 3 - separate processes, separate machines, no coordination required.
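
A minimal sketch of how partitions might be divided inside each group. It deals them out round-robin; Kafka ships several assignment strategies, and the function and consumer names below are assumptions for the example.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Deal partitions out to the group's consumers in round-robin order."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

# Two independent groups reading the same three partitions.
fraud_group = assign_partitions([0, 1, 2], ["fraud-a", "fraud-b"])
email_group = assign_partitions([0, 1, 2], ["email-a"])
```

Within a group every partition has exactly one owner, while across groups the same partition is read by both the fraud and the email side.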

10 The pull model - how consumers "know"

Nobody tells the Fraud Service a new order arrived. It runs a poll loop - an endless loop that asks the broker for new messages, waiting up to ~100ms per call if nothing is available. When a message arrives, the next poll returns it.

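// consumer: a subscribed KafkaConsumer<String, String> with enable.auto.commit=false
while (true) {
  ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
  for (ConsumerRecord<String, String> record : records) {
    runFraudCheck(record);   // do the work
  }
  consumer.commitSync();     // move the bookmark forward once the batch is done
}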

Kafka doesn't push messages to consumers - consumers pull. A slow consumer can't be overwhelmed because it reads only when it's ready. The offset gets committed after the work completes, not before. So if the consumer crashes mid-processing, the message gets re-read on restart rather than silently dropped. The flip side is that a message can occasionally be processed twice, so the work should be safe to repeat.
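
The effect of committing after the work can be simulated without a broker. In this sketch the log is a plain list, the crash is a raised exception placed after the work but before the commit, and all names and order IDs are invented for the example.

```python
log = ["ORD-8819", "ORD-8820", "ORD-8821"]   # one partition, offsets 0..2
state = {"committed": 0}                      # next offset this group will read
processed = []                                # every processing attempt, in order

def run_consumer(crash_before_commit_at=None):
    """Process from the committed offset; commit only after the work is done."""
    for offset in range(state["committed"], len(log)):
        processed.append(log[offset])         # do the work
        if offset == crash_before_commit_at:
            raise RuntimeError("crashed after the work, before the commit")
        state["committed"] = offset + 1       # move the bookmark forward

try:
    run_consumer(crash_before_commit_at=1)    # dies before committing offset 1
except RuntimeError:
    pass
run_consumer()                                # restart resumes at offset 1
```

Nothing is lost, but `ORD-8820` appears twice in `processed`: the restart re-reads the message whose commit never happened.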

11 Fan-out - one event, many readers

Any number of consumer groups can read the same orders topic simultaneously - Fraud, Inventory, Email, Analytics, and whatever else you add next month.

They all run in parallel, not in sequence. Fraud does not wait for Inventory.
No data is duplicated per reader. The message is stored once in the log (plus its replicas); each group simply holds its own pointer into that same log.
Add consumers freely. A new analytics or audit service can start reading from the earliest retained offset without touching the Order Service at all.
Replay any time. If Analytics was down for an hour, it resumes from its last committed offset and reads what it missed - thanks to retention (7 days by default), the data is still there.

Each consumer writes its own result wherever it needs - fraud writes a score to a fraud-results topic, inventory updates its database, email calls an SMTP server. The original message on the orders topic is never modified or deleted; it just sits in the log until retention expires.
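
Fan-out comes down to one shared log plus one pointer per group, which the sketch below models with a list and a dictionary. The `poll` function, group names, and order IDs are assumptions made for illustration.

```python
orders_log = ["ORD-8819", "ORD-8820", "ORD-8821", "ORD-8822"]  # stored once

# Each consumer group keeps only its own position in the shared log.
group_offsets = {"fraud": 0, "inventory": 0, "email": 0}

def poll(group: str, max_records: int) -> list[str]:
    """Return the next records for a group and advance only that group's offset."""
    start = group_offsets[group]
    batch = orders_log[start:start + max_records]
    group_offsets[group] = start + len(batch)
    return batch

poll("fraud", 3)                 # fraud races ahead
poll("email", 1)                 # email is slow
group_offsets["analytics"] = 0   # a new group joins later and starts from offset 0
analytics_batch = poll("analytics", 10)
```

The log itself never changes while the groups read it; only the per-group offsets move, and each one moves independently.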

12 The four Kafka APIs

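Kafka exposes four main client APIs for moving and processing data (a fifth, the Admin API, is used to manage topics, brokers and other cluster objects):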

Producer API - publish (write) streams of records to topics. Handles serialization, partitioning, batching and retries.
Consumer API - subscribe to topics and read records in order. Manages group membership and offset commits.
Streams API - a library to read from topics, transform / aggregate / join data in real time, and write results back, with optional exactly-once processing.
Connect API (Kafka Connect) - ready-made connectors that move data in and out of Kafka through configuration rather than custom code (databases, S3, Elasticsearch).

One more piece worth naming: the cluster needs a "control office" that tracks which brokers are alive and runs leader elections. Older Kafka used ZooKeeper for this; modern Kafka uses a built-in Raft-based mechanism called KRaft, and Kafka 4.0 dropped ZooKeeper support entirely, removing the external dependency.

Key takeaways

A producer fires an event and forgets - it never waits for consumers.
A partition key routes one user's events to the same lane, always in order.
A broker only stores and serves data - no business logic, no processing.
An offset is a bookmark (position in a partition), not a global unique ID.
Partitioning spreads different data for scale; replication copies the same data for safety.
Producers write only to the leader, and consumers read from it by default; followers are a hot standby.
Consumer groups are the active processing layer - they pull data at their own pace; the broker never pushes it to them.
Many groups read the same partition independently with no interference - that's fan-out.

Credits & further reading Based on the official Apache Kafka documentation and GeeksforGeeks' Kafka architecture guide. For the full technical reference:

▸ Apache Kafka Documentation - kafka.apache.org/documentation
▸ GeeksforGeeks - Kafka Architecture - geeksforgeeks.org/apache-kafka/kafka-architecture

Apache Kafka® is a registered trademark of the Apache Software Foundation. This article is an independent educational explainer.
