Skip to main content

Kafka consumers


 

Today i watched one video on kafka consumers presented by Igor Buzatoic at kafka summit. He demonstrated the working of kafka consumers, group re-balancing and multi-threading environment. The complete video should become available on youtube soon.

So the presentation was divided basically in 2 sections

- Single threaded mechanism

- Multi threaded mechanism

Single Thread

The idea here is that a single thread is responsible to poll the records from kafka and do the processing of those records. Now here the offset commit step can either be executed manually after the processing is done or periodically if we are sure than our delay in processing is less than commit interval. Usually people go with manual commit only as this ensures that we are committing what we have already processed. Now this is simple architecture and doesn't require any synchronization.

Multi Thread

In multi threaded environment, there are 2 possibilities

- Single thread responsible for polling the records, process the records using multiple threads, wait for them to be processed and then commit those offsets.

- Single thread responsible for polling the records, submit the records to multiple threads, commit the offsets which have been processed and do the poll again. Here the main thread is not waiting for the processing to complete before making another poll() call.

Second approach is more robust and has advantages w.r.t. reacting to group re-balancing. In the first approach, if the group re-balances, the consumer won't know until the processing is done, which can take long. But in the second approach, consumer will be notified instantly as it makes poll() call. Reacting to group re-balancing event is crucial to preserve order in a partition.

Igor also told about pausing a partition while its records are being processed so that we don't fetch new records if old ones are still under process. I will have to go to the source code to understand that. But if you have ever used spark streaming, you will notice that spark has implemented the second approach of multi-threaded architecture where driver (main thread) fetches the offsets and is responsible for offset committing and executors (processing threads) just fetch the records (using different group name than the driver) and process those records.

This is it for today. I worked on my project also today and it involved moving my namenode servers from godaddy to aws route 53 because for some reason, the TXT records in godaddy weren't being updated. I will write about it tomorrow.


Comments

Popular posts from this blog

Why applications are moving to decentralized network ?

Fig 1. Decentralized Network Recently, I started studying blockchain in order to get certified from blockchain-council. I am in the middle of my training and have started looking into some real-world applications of blockchain. Some problems the applications are trying to solve are decentralized storage, decentralized voting/consensus, decentralized money and many more. The ultimate goal is to take whatever we have built so far and put it over a decentralized network. To achieve this developers are building  custom applications on top of bitcoin blockchain. When I thought about this goal, I couldn't help but think that is this not a waste of time. We spent last 30 years to build the centralized applications and now people are building something from the scratch to put the centralized applications over a decentralized network. Are people doing it just for fun or just because they can so they are. I will share what have I learnt from my 20 days of training, blogs, articles and whit...

Decentralized Applications

DApp: Decentralized Application In the previous post, I discussed why people should move to decentralized applications. In this post, it would be better to understand what a decentralized application is and what is the backbone of all major DApps in the market. To understand a decentralized application, you will have to understand the decentralized network. In a decentralized network, there is no single entity governing the operations of the network. All machines/nodes participating in the network are equally responsible for the state and security of the network. In the above diagram, as you can see, nodes A, B and C are the ones keeping the network running. Those nodes are running the application and no single node can modify the behavior of the application for the connected clients. This means that even when node A stops working and its connected clients connect to node B or node C, they don't notice a difference. When we talk about decentralized network, then we implicitly mean ...