Today i watched one video on kafka consumers presented by Igor Buzatoic at kafka summit. He demonstrated the working of kafka consumers, group re-balancing and multi-threading environment. The complete video should become available on youtube soon.
So the presentation was divided basically in 2 sections
- Single threaded mechanism
- Multi threaded mechanism
Single Thread
The idea here is that a single thread is responsible to poll the records from kafka and do the processing of those records. Now here the offset commit step can either be executed manually after the processing is done or periodically if we are sure than our delay in processing is less than commit interval. Usually people go with manual commit only as this ensures that we are committing what we have already processed. Now this is simple architecture and doesn't require any synchronization.
Multi Thread
In multi threaded environment, there are 2 possibilities
- Single thread responsible for polling the records, process the records using multiple threads, wait for them to be processed and then commit those offsets.
- Single thread responsible for polling the records, submit the records to multiple threads, commit the offsets which have been processed and do the poll again. Here the main thread is not waiting for the processing to complete before making another poll() call.
Second approach is more robust and has advantages w.r.t. reacting to group re-balancing. In the first approach, if the group re-balances, the consumer won't know until the processing is done, which can take long. But in the second approach, consumer will be notified instantly as it makes poll() call. Reacting to group re-balancing event is crucial to preserve order in a partition.
Igor also told about pausing a partition while its records are being processed so that we don't fetch new records if old ones are still under process. I will have to go to the source code to understand that. But if you have ever used spark streaming, you will notice that spark has implemented the second approach of multi-threaded architecture where driver (main thread) fetches the offsets and is responsible for offset committing and executors (processing threads) just fetch the records (using different group name than the driver) and process those records.
This is it for today. I worked on my project also today and it involved moving my namenode servers from godaddy to aws route 53 because for some reason, the TXT records in godaddy weren't being updated. I will write about it tomorrow.

Comments
Post a Comment