Kafka - Core Concepts, Use Cases, and Scaling Strategies
Kafka is a popular event streaming platform that can be used either as a message queue or as a stream-processing system.
Key Terminology (Bottom Up)
Offset
A unique identifier for a record within a Kafka partition. Offsets track a consumer's position in the log and ensure records are processed in order. Consumers periodically commit offsets to Kafka so that, after a failure or restart, they can resume from where they left off.
Partition
A partition is a single physical log on disk that stores a portion of the data in a Kafka topic. Each partition is ordered, and messages within a partition are assigned a unique sequential ID called an offset. It is an immutable, append-only log file, which makes writes fast (sequential disk I/O).
Topic
A topic is a logical grouping of partitions that hold the same type of events.
Producer
An application that writes or publishes records (events/messages) to one or more Kafka topics.
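For illustration, a minimal producer sketch using the official Java client (kafka-clients); the topic name "orders", the key, and the broker address are assumptions, not from the original:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") determines which partition the record lands on,
            // which is what preserves per-key ordering.
            producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
            producer.flush();
        }
    }
}
```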
Consumer
An application that reads or subscribes to records (events/messages) from one or more Kafka topics.
Consumer Group
A group of consumers that work together to consume records from a Kafka topic. Each consumer in the group is assigned one or more partitions at startup (or during a rebalance). No two consumers within the same group read from the same partition at the same time.
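A minimal consumer-group sketch with the Java client, which also shows the manual offset commits described under Offset; the group name "order-workers", the topic, and the broker address are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "order-workers");            // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");           // commit offsets manually after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // committed offsets let a restarted consumer resume where it left off
            }
        }
    }
}
```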
Broker
A Kafka server that stores data and serves client requests. Brokers receive records from producers, assign offsets to them, and store them in the correct partition on disk. They also serve records to consumers.
ZooKeeper
A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Kafka uses ZooKeeper to manage and coordinate the Kafka brokers. (Newer Kafka versions can run without ZooKeeper using the built-in KRaft mode.)
Kafka Cluster
A Kafka cluster is a collection of brokers that work together to manage the distribution, storage, and processing of data. Each broker handles a portion of the data (organized into partitions), and together they provide high availability, fault tolerance, and scalability: the load is spread across multiple servers, so the system remains resilient even if individual brokers fail. The cluster is the backbone of the Kafka ecosystem, enabling data streaming and processing at large volumes in distributed environments.
Replication
To ensure durability and availability, Kafka replicates partitions using a leader-follower mechanism: each partition has one leader and a set of followers (how many depends on the replication factor). Reads and writes go through the leader; followers replicate it and act as backups, taking over only if the leader fails. Kafka spreads a topic's partition replicas across different brokers, so that when a given broker is down, a replica on another broker can still serve the events. (See below image) Image credit: Hello Interview
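As a sketch, creating a replicated topic with the Java AdminClient; the topic name, partition count, and broker address are assumptions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, each replicated to 3 brokers (1 leader + 2 followers per partition)
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```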
Usage
1. As Message Queue:
Kafka can be configured as a message queue when you need to process tasks asynchronously. In this setup, you create a single consumer group that distributes tasks across multiple worker nodes (consumers), each assigned its own partitions, so every task is handled by exactly one worker in the group. Note that this gives at-least-once processing by default; exactly-once processing additionally requires idempotent handling or Kafka transactions.
2. As Stream:
When events need to be processed by several consumers, each with a distinct purpose, Kafka functions as a stream processing (publish/subscribe) system. Here, you define multiple consumer groups, each tailored to a specific processing task; every group receives its own copy of each event (see the sketch below).
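A rough sketch of the fan-out behaviour: two hypothetical consumer groups ("analytics" and "notifications") subscribe to the same topic and each receives its own full copy of every event; the group and topic names are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FanOutExample {
    // Builds a consumer for the given group; each group gets its own copy of the topic.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaConsumer<String, String> analytics = consumerFor("analytics");
             KafkaConsumer<String, String> notifications = consumerFor("notifications")) {
            analytics.subscribe(List.of("orders"));
            notifications.subscribe(List.of("orders"));
            ConsumerRecords<String, String> a = analytics.poll(Duration.ofSeconds(1));
            ConsumerRecords<String, String> n = notifications.poll(Duration.ofSeconds(1));
            // Both groups see every record; offsets are tracked per (group, topic, partition),
            // so one group's progress never affects the other's.
            System.out.printf("analytics saw %d records, notifications saw %d records%n",
                    a.count(), n.count());
        }
    }
}
```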
Scalability
1. Aim for messages under 1 MB and do not put blob data in the message. Instead, upload the blob to object storage such as S3 and put the S3 URL in the message (see the sketch after this list).
2. A single broker can handle up to roughly 1 TB of data and about 10k messages per second. (If your use case is below this, you don't really need to scale.)
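A sketch of the "pointer instead of payload" approach from item 1; the uploadBlobToS3 helper, bucket path, and topic name are hypothetical placeholders, not a real upload:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClaimCheckProducer {
    // Hypothetical helper: in a real system this would upload the blob to S3
    // (e.g., with the AWS SDK) and return the object's URL.
    static String uploadBlobToS3(byte[] blob) {
        return "s3://media-bucket/videos/abc123.mp4"; // placeholder URL for illustration
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        byte[] videoBytes = new byte[50 * 1024 * 1024]; // a 50 MB blob: far too big for a Kafka message

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send only the small pointer; consumers fetch the blob from S3 themselves.
            String url = uploadBlobToS3(videoBytes);
            producer.send(new ProducerRecord<>("video-uploaded", "user-42", url));
        }
    }
}
```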
How to scale?
1. Add more brokers.
2. Choose a good partition key (one that spreads load evenly across partitions).
3. Use managed services like Confluent Cloud Kafka, AWS MSK.
How to handle hot partitions?
1. Remove the key: if you don't need ordering, omit the key and Kafka will assign partitions round-robin.
2. Compound key: append a random value within a small range (e.g., key:1 through key:10) or a userID to the key so that writes spread across several partitions (again, ordering for the original key is lost if you go this route; see the sketch after this list).
3. Backpressure - Slow down the producer.
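A sketch of the compound-key approach from item 2; the topic, the hot key, and the suffix range are assumptions:

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompoundKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        String hotKey = "popular-video-123"; // a key receiving a disproportionate share of traffic

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Appending a random suffix in [1, 10] spreads the hot key across up to 10 partitions.
            // Trade-off: events for "popular-video-123" are no longer guaranteed to be ordered.
            int suffix = ThreadLocalRandom.current().nextInt(1, 11);
            String compoundKey = hotKey + ":" + suffix;
            producer.send(new ProducerRecord<>("view-events", compoundKey, "view"));
        }
    }
}
```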
Fault Tolerance and Durability (See above image)
Relevant Settings:
1. Replication factor (typically 3)
2. Acks: acks=all gives maximum durability (the leader waits for all in-sync replicas before acknowledging), at the cost of latency; acks=1 gives higher throughput but lower durability (only the leader has the record when it acknowledges). See the sketch below.
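A sketch of a durability-focused producer configuration, assuming acks=all together with idempotence; in practice this is usually paired with a topic- or broker-level min.insync.replicas setting:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // acks=all: the leader waits for all in-sync replicas before acknowledging (maximum durability).
        // acks=1 would acknowledge as soon as the leader has the record (faster, less durable).
        props.put("acks", "all");
        props.put("enable.idempotence", "true"); // prevents duplicates introduced by producer retries

        return new KafkaProducer<>(props);
    }
}
```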
What happens if a consumer goes down?
Rebalancing: Kafka rebalances the partitions across the remaining active consumers in the group, so the failed consumer's partitions are reassigned to the others.
When the consumer comes back up, rebalancing occurs again; the consumer looks up the last committed offset of each newly assigned partition and starts reading from there.
Note: Kafka does not guarantee that a consumer is assigned the same partitions when it rejoins after a rebalance. Regardless of which partitions it receives after restarting, it will begin consuming from the last committed offset for each assigned partition.
Errors and Retries
Producer Retries
The Kafka producer API supports retries: you can configure how many times a failed write to a topic is retried and how long to wait between attempts.
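A sketch of the relevant producer settings (retries, retry.backoff.ms, delivery.timeout.ms); the specific values are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        props.put("retries", "5");                 // how many times a failed send is retried
        props.put("retry.backoff.ms", "200");      // wait between retry attempts
        props.put("delivery.timeout.ms", "30000"); // upper bound on the whole send, including retries
        props.put("acks", "all");                  // required for idempotence
        props.put("enable.idempotence", "true");   // retries do not create duplicates

        return new KafkaProducer<>(props);
    }
}
```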
Consumer Retries
1. Create a Retry Topic: Establish a separate topic, known as the retry topic, to handle failed events during consumer processing. When an event fails, publish it to this retry topic along with the current retry count.
2. Process Failed Events: Set up a dedicated consumer to process events from the retry topic. If the retry count is within the defined limit, attempt to reprocess the event. If the event fails again, increment the retry count and republish it to the retry topic. If the retry count exceeds the limit, move the event to a Dead Letter Queue (DLQ) topic for further analysis and troubleshooting. Usually, DLQ topics do not have any consumers.
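A rough sketch of the retry-topic flow described above, tracking the retry count in a record header; the topic names (orders-retry, orders-dlq), the header name, and the process() method are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryTopicWorker {
    static final int MAX_RETRIES = 3;

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        consumerProps.put("group.id", "orders-retry-worker");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        consumerProps.put("enable.auto.commit", "false");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("orders-retry"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    int attempts = retryCount(record);
                    try {
                        process(record.value()); // hypothetical business logic that may throw
                    } catch (Exception e) {
                        // Retry within the limit; otherwise park the event in the DLQ topic.
                        String target = attempts + 1 > MAX_RETRIES ? "orders-dlq" : "orders-retry";
                        ProducerRecord<String, String> out =
                                new ProducerRecord<>(target, record.key(), record.value());
                        out.headers().add("retry-count",
                                String.valueOf(attempts + 1).getBytes(StandardCharsets.UTF_8));
                        producer.send(out);
                    }
                }
                consumer.commitSync();
            }
        }
    }

    // Reads the retry counter carried in the record header (0 if absent).
    static int retryCount(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("retry-count");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }

    static void process(String value) { /* hypothetical processing */ }
}
```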
Performance Optimizations
1. Batch messages in the producer: send a single request containing multiple messages.
2. Compress messages in the producer (e.g., with GZIP) to reduce payload size on the network and on disk.
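A sketch of the producer settings that control batching and compression (linger.ms, batch.size, compression.type); the values shown are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerConfig {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        props.put("linger.ms", "10");          // wait up to 10 ms so more records can join a batch
        props.put("batch.size", "65536");      // up to 64 KB of records per partition batch
        props.put("compression.type", "gzip"); // compress each batch before it is sent

        return new KafkaProducer<>(props);
    }
}
```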
Retention Policy
1. Time-based deletion (retention.ms, default 7 days): if the time limit set by retention.ms is reached before the log size exceeds retention.bytes, Kafka starts deleting the oldest messages based on their age.
2. Size-based deletion (retention.bytes, default -1, i.e., no size limit): if the log size reaches the limit set by retention.bytes before the messages reach the age specified by retention.ms, Kafka begins purging the oldest messages to keep the log within the configured size.
In practice, Kafka will delete messages when either of these conditions is met. This means that messages can be removed either because they are too old or because the log size is too large.
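A sketch of setting per-topic retention with the Java AdminClient; the topic name and the 3-day / 10 GB limits are illustrative assumptions:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            Collection<AlterConfigOp> ops = List.of(
                    // keep messages for 3 days...
                    new AlterConfigOp(new ConfigEntry("retention.ms", "259200000"),
                            AlterConfigOp.OpType.SET),
                    // ...or until a partition's log reaches ~10 GB, whichever comes first
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "10737418240"),
                            AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```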
References:
- Kafka Deep Dive by Evan King