Kafka can feel intimidating. Brokers, partitions, offsets, replicas, consumer groups - you can read about each one separately and still not see how they connect. Every piece is there to solve one problem: moving data from the systems that produce it to the systems that need it, reliably and at scale. We'll work through the whole thing using one example: an online store processing orders.
Imagine a busy online store. Thousands of customers place orders every second. Each order needs to trigger several things: a fraud check, an inventory update, a confirmation email, and an entry in the analytics dashboard.
In a traditional setup, the Order Service would call each of those systems directly, one after another, and wait for every reply. If the email server is slow, the whole order slows down. If one service is down, the order fails. This is called tight coupling, and it does not scale.
Kafka sits in the middle. The Order Service writes the order once and moves on. Every downstream service reads from Kafka at its own pace - they don't coordinate, and none of them wait.
The producer doesn't know consumers exist. Consumers don't know about each other. That one design decision is why the whole system scales without everything grinding to a halt.
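The decoupling described above can be sketched with a toy in-memory log. This is an illustration only, not the Kafka client API: `ToyLog`, the reader names, and the extra order IDs are hypothetical.

```python
class ToyLog:
    """A hypothetical in-memory stand-in for a Kafka topic (not the real client API)."""

    def __init__(self):
        self.records = []    # append-only list of events
        self.positions = {}  # reader name -> index of the next record to read

    def write(self, event):
        # The producer appends once and returns; it never references any reader.
        self.records.append(event)

    def read(self, reader, max_records=10):
        # Each reader advances its own position, independent of the others.
        start = self.positions.get(reader, 0)
        batch = self.records[start:start + max_records]
        self.positions[reader] = start + len(batch)
        return batch


log = ToyLog()
for order_id in ("ORD-8821", "ORD-8822", "ORD-8823"):
    log.write({"event": "order.placed", "orderId": order_id})

fraud_batch = log.read("fraud")                 # a fast reader takes everything
email_batch = log.read("email", max_records=1)  # a slow reader takes one record
```

The writer never touches `positions`, and a slow reader does not hold back a fast one. That is the property the rest of the design builds on.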
A producer is any app that writes data into Kafka - a mobile app, a payment service, a sensor. Each thing it sends is called an event (or a message / record).
When a customer places order ORD-8821, the Order Service creates an event like this:
{ "event": "order.placed", "orderId": "ORD-8821", "userId": "U-4491", "total": 118.99, "currency": "GBP", "ts": "2026-05-27T09:14:02Z" }
The Order Service writes the event once and moves on. It can respond to the customer in milliseconds instead of waiting for all four downstream systems.
The producer library handles serialization, choosing a partition, batching, and retries automatically.
Direct calls to four services mean four chances to fail, and latency that stacks up on every order.
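To put rough numbers on that, here is a small worked calculation for the downstream systems named earlier. Every latency and availability figure below is made up for illustration; only the arithmetic matters.

```python
import math

# Hypothetical per-call latency (ms) and availability for each downstream system.
downstream = {
    "fraud": (40, 0.999),
    "inventory": (25, 0.999),
    "email": (120, 0.999),
    "analytics": (15, 0.999),
}

# Direct, sequential calls: latencies add up and every call must succeed.
direct_latency_ms = sum(latency for latency, _ in downstream.values())
direct_success = math.prod(avail for _, avail in downstream.values())

# With a log in the middle, the order path pays for a single write.
log_write_latency_ms = 5  # assumed figure, for illustration only

print(f"direct: {direct_latency_ms} ms, success probability {direct_success:.4f}")
print(f"via log: {log_write_latency_ms} ms on the order path")
```

Chaining the calls multiplies the individual availabilities, so the order path ends up less reliable than any single service on it.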
A topic is a named channel for a category of events. All order events go into the orders topic; all payments go into the payments topic. Producers choose which topic to write to, and consumers choose which topic to read from.
Each topic is split into partitions. A partition is just an ordered list of messages on one machine. Splitting a topic this way lets Kafka spread the work across many servers - that's the main reason it scales.
How does Kafka decide which partition a message goes into? Through the partition key. Before sending, the producer client takes the key, runs it through a hash function, and takes the result modulo the number of partitions to pick one:
    // key = userId "U-4491"
    hash("U-4491") % 3 = partition 1
The same key always hashes to the same partition, as long as the number of partitions stays the same. So every event for user U-4491 - order placed, then paid, then shipped - lands in the same partition, in the exact order it was written. That's your ordering guarantee per user.
With a key: events for one user stay in order. "Placed" is always read before "shipped". Logic stays correct.
Without a key: messages are spread across partitions for even load (classically round-robin) - fine when order doesn't matter.
The risk without a key: events for the same order can land in different partitions, so a consumer might process "cancelled" before "placed" - broken business logic.
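A minimal sketch of both cases follows. It assumes `zlib.crc32` as a stand-in for a stable hash (the Java client's default partitioner uses murmur2 instead), so the partition it picks need not match the "partition 1" in the example above. The function names are hypothetical.

```python
import itertools
import zlib

NUM_PARTITIONS = 3


def partition_for_key(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stable hash of the key bytes, reduced modulo the partition count.
    # crc32 is a stand-in here, not the hash a real Kafka client uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


_counter = itertools.count()


def partition_without_key(num_partitions: int = NUM_PARTITIONS) -> int:
    # No key: rotate through the partitions to spread the load.
    return next(_counter) % num_partitions


events = ["order.placed", "order.paid", "order.shipped"]
keyed = [partition_for_key("U-4491") for _ in events]
unkeyed = [partition_without_key() for _ in events]

print("keyed:  ", keyed)    # the same partition three times
print("unkeyed:", unkeyed)  # three different partitions
```

The modulo depends on the partition count, so adding partitions later moves keys to new partitions.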
A broker is a single Kafka server. A group of them is a cluster. The broker's job: receive messages, assign each one a number, and append it to a file on disk called the commit log.
The broker has no business logic. It doesn't run fraud checks or send emails. It stores data and serves it - that's the whole job. The narrowness is intentional; it's what makes brokers predictable and fast.
Sequential disk writes are actually faster than most people expect - appending to the end of a file skips the slow seek time that random I/O requires. And since the data is on disk, it survives broker restarts and can be replayed at any point.
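The append-and-replay idea can be sketched with an ordinary file. This is an assumption-laden toy: real brokers store binary segment files with indexes, while this uses one JSON record per line, and the file name `orders-0.log` is hypothetical.

```python
import json
import os
import tempfile


def append_record(path, record):
    # Append mode: every write goes to the end of the file, never the middle.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def replay(path):
    # Read the whole log back from the start, in the order it was written.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


log_path = os.path.join(tempfile.mkdtemp(), "orders-0.log")
append_record(log_path, {"orderId": "ORD-8821"})
append_record(log_path, {"orderId": "ORD-8822"})

# A "restart" is just reading the same file again: the data is still there.
replayed = replay(log_path)
```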
Every message inside a partition gets a sequential number called an offset. It is not a global unique ID for the order - it is simply the message's position within that one partition, like a line number in a notebook.
    Partition 1 - topic: orders
    offset 0 → U-3312 → order.placed
    offset 1 → U-1188 → order.placed
    offset 2 → U-4491 → order.placed   ← our order
    offset 3 → (next write...)
Offset 2 in Partition 1 is a completely different message from offset 2 in Partition 0 - the number is local to each partition. Once written, it never changes and is never reused. A consumer tracks its last committed offset, so after a crash or restart it picks up exactly where it left off.
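Here is a toy model of per-partition offsets and of resuming after a restart. `ToyPartition` and the record for user U-7001 are hypothetical. The sketch stores the position of the next message to read as the committed offset, which is the convention Kafka consumers follow.

```python
class ToyPartition:
    """Hypothetical in-memory partition: offsets are just list positions."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message

    def read_from(self, offset):
        return self.messages[offset:]


partition_0, partition_1 = ToyPartition(), ToyPartition()
partition_0.append("U-7001 order.placed")       # offset 0 in partition 0
partition_1.append("U-3312 order.placed")       # offset 0 in partition 1
partition_1.append("U-1188 order.placed")       # offset 1
our_offset = partition_1.append("U-4491 order.placed")

# The consumer had handled offsets 0 and 1, then crashed.
committed_next = 2
resumed = partition_1.read_from(committed_next)  # picks up at our order
```

Offset 0 exists in both partitions and refers to two unrelated messages, which is why an offset only means something together with its partition.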
Partitioning splits different messages across different machines. Replication makes copies of the same partition for safety. They sound related but solve different problems - and mixing them up is probably the most common Kafka misconception.
So when you have a topic with 3 partitions and a replication factor of 3, you end up with 3 lanes of different data, and each lane is copied onto 3 machines. One is for scale, the other is for safety.
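The 3 × 3 layout can be written out explicitly. The placement rule below (shift the starting broker by one for each partition) is a simplifying assumption for illustration, not Kafka's exact assignment algorithm, and the broker names are hypothetical.

```python
def replica_layout(num_partitions, replication_factor, brokers):
    # Every copy of a partition must live on a different broker.
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for p in range(num_partitions):
        # Start at a different broker for each partition, then wrap around.
        layout[p] = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
    return layout


layout = replica_layout(3, 3, ["broker-1", "broker-2", "broker-3"])
total_copies = sum(len(replicas) for replicas in layout.values())

for partition, replicas in layout.items():
    print(f"partition {partition}: {replicas}")
```

Three partitions give three lanes of different data; a replication factor of 3 turns each lane into three copies, so the cluster holds nine partition replicas in total.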
Among the copies of each partition, one is the leader and the rest are followers. By default, producers and consumers only talk to the leader. Followers serve no client traffic in that setup - they silently copy from the leader and stay ready.
The set of replicas fully caught up with the leader is called the ISR (In-Sync Replicas). With acks=all, the broker only confirms the write once the leader and every in-sync follower have stored the message. No silent data loss.
    // what happens when fraud writes a result
    Fraud Service ──writes──▸ leader (fraud-results)
                              ├─ copies to follower 1
                              └─ copies to follower 2
    all in sync ──▸ ACK back to Fraud Service
If the leader crashes, Kafka's controller detects it within seconds and promotes one of the in-sync followers to leader. Producers and consumers reconnect automatically. No manual intervention needed.
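The interplay of ISR, acks=all, and failover can be sketched as a toy model. It is heavily simplified and the class and method names are hypothetical: real followers pull from the leader over the network, and the real election is run by the controller. The sketch only tracks which replica has which messages.

```python
class ToyReplicatedPartition:
    """Hypothetical model of one partition's replicas (not Kafka's real protocol)."""

    def __init__(self, replicas):
        self.logs = {name: [] for name in replicas}  # replica name -> stored messages
        self.leader = replicas[0]
        self.isr = set(replicas)                     # replicas caught up with the leader

    def write_acks_all(self, message):
        # The leader and every in-sync follower store the message...
        for name in self.isr:
            self.logs[name].append(message)
        # ...and only then is the write acknowledged to the producer.
        return "ACK"

    def fall_behind(self, follower):
        # A follower that stops keeping up is dropped from the ISR.
        if follower != self.leader:
            self.isr.discard(follower)

    def fail_leader(self):
        # The crashed leader leaves; a surviving in-sync replica takes over.
        self.isr.discard(self.leader)
        if not self.isr:
            raise RuntimeError("no in-sync replica left to promote")
        self.leader = sorted(self.isr)[0]  # deterministic pick, for the sketch only
        return self.leader


partition = ToyReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write_acks_all("fraud-score-1")
partition.fall_behind("broker-3")          # broker-3 misses the next write
partition.write_acks_all("fraud-score-2")
new_leader = partition.fail_leader()       # broker-1 crashes
```

The new leader holds every acknowledged message because only in-sync replicas are eligible. The lagging replica, which missed `fraud-score-2`, is never promoted.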
A consumer reads messages from a topic and does something with them - a Fraud Service, an Email Service, an Analytics pipeline. The partition is passive storage; the consumer is what actually processes the data.
Consumers that cooperate are organised into a consumer group. Within one group, each partition is assigned to exactly one consumer instance at a time, so the work is split and no two instances in the group process the same message.
Different groups - say a fraud group and an email group - read the same partitions, but each tracks its own offset independently. The email group sitting at offset 0 has no effect on the fraud group at offset 3 - separate processes, separate machines, no coordination required.
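A minimal sketch of both ideas follows: splitting partitions inside one group, and keeping offsets separate between groups. The round-robin assignment is an assumption for illustration (real clients offer several assignment strategies), and the instance names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    # Hand out partitions one at a time; each partition gets exactly one owner.
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


# One group, two instances, three partitions.
fraud_assignment = assign_partitions([0, 1, 2], ["fraud-a", "fraud-b"])

# Offsets are stored per group and per partition, so groups never interfere.
group_offsets = {
    "fraud": {0: 3, 1: 3, 2: 3},
    "email": {0: 0, 1: 0, 2: 0},
}
group_offsets["fraud"][1] += 1  # the fraud group advances on partition 1

print(fraud_assignment)
print(group_offsets["email"][1])  # the email group's position is unchanged
```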
Nobody tells the Fraud Service a new order arrived. It runs a poll loop - an endless loop that keeps asking the broker whether there's anything new, waiting up to ~100ms each time for data to show up. When there is, the poll returns it.
    while (true) {
        messages = consumer.poll(100ms)
        for each message:
            runFraudCheck(message)   // do the work
            commitOffset()           // move the bookmark forward
    }
Kafka doesn't push messages to consumers - consumers pull. A slow consumer can't be overwhelmed because it reads only when it's ready. The offset gets committed after the work completes, not before. So if the consumer crashes mid-processing, the message gets re-read on restart rather than silently dropped.
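The effect of committing after the work can be simulated without a broker. In this sketch the log is a plain list, the committed offset lives in a dict that survives the simulated crash, and `consume` and its `crash_before_commit_at` parameter are hypothetical names.

```python
def consume(log, state, process, crash_before_commit_at=None):
    # Keep reading from the committed position until the log is exhausted.
    while state["committed"] < len(log):
        offset = state["committed"]
        process(log[offset])                # do the work first
        if offset == crash_before_commit_at:
            raise RuntimeError("crashed after processing, before committing")
        state["committed"] = offset + 1     # then move the bookmark forward


log = ["order-0", "order-1", "order-2"]
state = {"committed": 0}   # survives the crash, like an offset stored in Kafka
processed = []

try:
    consume(log, state, processed.append, crash_before_commit_at=1)
except RuntimeError:
    pass                   # the consumer process dies here

consume(log, state, processed.append)  # restart: resume from the committed offset

print(processed)  # ['order-0', 'order-1', 'order-1', 'order-2']
```

Nothing is dropped, but `order-1` is handled twice. Committing after the work gives at-least-once processing, so the work itself should be safe to repeat.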
Any number of consumer groups can read the same orders topic simultaneously - Fraud, Inventory, Email, Analytics, and whatever else you add next month.
Each consumer writes its own result wherever it needs - fraud writes a score to a fraud-results topic, inventory updates its database, email calls an SMTP server. The original message on the orders topic is never modified or deleted; it just sits in the log until retention expires.
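Kafka exposes four client APIs: the Producer API for writing events, the Consumer API for reading them, the Streams API for processing data as it flows between topics, and the Connect API for linking Kafka to external systems such as databases.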
One more piece worth naming: the cluster needs a "control office" that tracks which brokers are alive and runs leader elections. Older Kafka used ZooKeeper for this; modern Kafka uses a built-in mechanism called KRaft, removing the external dependency.