Apache Pulsar Architecture Overview


Pulsar was originally developed at Yahoo in 2013 and was open-sourced in 2016. Since then, it has gained considerable popularity and has been adopted by many organizations. Some of its key features include:

  • It offers low publish latency and low end-to-end latency.
  • It stores messages durably with the help of Apache BookKeeper, so storage capacity can grow independently of the brokers.
  • It supports geo-replication, so messages can be replicated across clusters in different locations.
  • It can scale to a very large number of topics (the Pulsar project cites over a million).

When we discuss the performance of Apache Pulsar, we mostly talk about its low latency and high throughput. It is the architecture and configuration of Pulsar that make this performance possible. In this blog, we will walk through that architecture in detail.

Apache Pulsar Architecture Overview

At the highest level, a Pulsar instance is composed of one or more Pulsar clusters, and the clusters within an instance can replicate data among themselves.

Let's learn about it further:

  • Thanks to its segment-based architecture, a topic can keep receiving messages without being limited by the storage space of a single node. When more capacity is needed, you can add more bookies to the BookKeeper cluster, and data is replicated across them. This keeps the system running smoothly and reduces the risk of data loss.
  • A ZooKeeper cluster handles coordination tasks between Pulsar clusters.

Furthermore, some features, such as geo-replication of messages, involve multiple clusters working together.

Brokers

A major reason for Pulsar's popularity is its stateless brokers. A broker is called "stateless" because it does not store any messaging data, which means new brokers can be started quickly to handle higher demand. As mentioned above, messages are stored in Apache BookKeeper, which we discuss further below. Pulsar assigns each topic partition to a broker. The broker to which a topic partition is assigned is called the owner broker of that topic partition. Producers and consumers connect to the owner broker of a topic partition to produce and consume messages.

If a broker fails, Pulsar automatically moves the topic partitions it owned to the remaining brokers available in the cluster. One point needs to be made clear: when a topic partition moves to a different broker, only its ownership is transferred. No data is copied during this process, because brokers do not store message data.
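
The ownership handover described above can be sketched as a small toy model. This is only an illustration, not Pulsar's actual load-balancing code: the `Cluster` class, its methods, and the least-loaded placement rule are all assumptions made up for this example, and a plain dictionary stands in for BookKeeper.

```python
class Cluster:
    """Toy model: brokers own topic partitions but hold no message data."""

    def __init__(self, brokers):
        self.brokers = set(brokers)
        self.owner = {}   # topic partition -> owner broker
        self.stored = {}  # topic partition -> messages (stands in for BookKeeper)

    def _least_loaded(self):
        # Hypothetical placement rule: pick the live broker that owns the
        # fewest partitions, breaking ties by broker name.
        load = {b: 0 for b in self.brokers}
        for b in self.owner.values():
            load[b] += 1
        return min(sorted(load), key=load.get)

    def assign(self, partition):
        self.owner[partition] = self._least_loaded()
        self.stored.setdefault(partition, [])  # existing data is left untouched

    def publish(self, partition, message):
        # The owner broker serves the request; the data lands in the storage layer.
        self.stored[partition].append(message)
        return self.owner[partition]

    def fail(self, broker):
        # Only ownership records change here; self.stored is never copied or moved.
        self.brokers.discard(broker)
        orphaned = [p for p, b in self.owner.items() if b == broker]
        for p in orphaned:
            del self.owner[p]
        for p in orphaned:
            self.assign(p)
        return orphaned


cluster = Cluster(["broker-1", "broker-2", "broker-3"])
for name in ["orders-0", "orders-1", "orders-2"]:
    cluster.assign(name)
cluster.publish("orders-0", "m1")

failed = cluster.owner["orders-0"]
moved = cluster.fail(failed)
print(moved, "now owned by", cluster.owner["orders-0"])
print("messages still available:", cluster.stored["orders-0"])
```

Because the message list lives outside the broker in this model, failing a broker changes only the `owner` mapping, which mirrors why no data needs to be replicated during a handover.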

Metadata store

The metadata store holds the metadata of a cluster, such as topic metadata, schemas, and broker load data. For metadata storage, cluster configuration, and coordination, Pulsar uses ZooKeeper. Each cluster has its own ZooKeeper ensemble that stores cluster-specific configuration and coordination data, such as ownership metadata, BookKeeper ledger metadata, and more.

Apache BookKeeper

Pulsar uses a system called Apache BookKeeper to store and manage the messages. BookKeeper is a distributed system that provides a number of significant benefits: 

  • BookKeeper allows Pulsar to use many independent logs, called ledgers. Multiple ledgers can be created for a topic over time.
  • It offers efficient storage for sequential data and handles the replication of entries.
  • It distributes I/O across bookies.
  • It is horizontally scalable in both capacity and throughput. You can increase capacity by adding more bookies to a cluster.
  • Bookies are designed to handle thousands of ledgers with concurrent reads and writes. By using multiple disk devices, one for the journal and another for general storage, bookies isolate the effects of read operations from the latency of ongoing write operations.

Ledgers

A ledger is an append-only data structure with a single writer that is assigned to multiple BookKeeper storage nodes (bookies). Ledger entries are replicated to multiple bookies. A Pulsar broker can create a ledger, append entries to it, and close it. After a ledger has been closed, either explicitly or because the writer process crashed, it can be opened only in read-only mode. Finally, when its entries are no longer needed, the whole ledger can be deleted.
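
The ledger lifecycle described above (create, append, close, read, delete) can be sketched as follows. This is an in-memory illustration only: `Ledger`, `LedgerClosedError`, and the `ledgers` dictionary are hypothetical names invented for this example and are not the BookKeeper client API. Replication to bookies is deliberately left out.

```python
class LedgerClosedError(Exception):
    """Raised when something tries to append to a closed ledger."""


class Ledger:
    """Toy append-only log with a single writer."""

    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self._entries = []
        self.closed = False

    def append(self, data: bytes) -> int:
        if self.closed:
            raise LedgerClosedError(f"ledger {self.ledger_id} is read-only")
        self._entries.append(data)
        return len(self._entries) - 1  # entry id of the appended entry

    def close(self):
        # After this point the ledger can only be read.
        self.closed = True

    def read(self, first: int, last: int):
        # Return entries first..last, inclusive.
        return self._entries[first:last + 1]


ledgers = {}                      # ledger id -> Ledger
ledgers[1] = Ledger(1)            # create
ledgers[1].append(b"entry-0")     # append
ledgers[1].append(b"entry-1")
ledgers[1].close()                # close: read-only from now on
print(ledgers[1].read(0, 1))      # read
del ledgers[1]                    # delete the whole ledger when no longer needed
```

Entries are never modified in place in this sketch: the only write operation is `append`, and it is rejected once the ledger is closed.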

Segment-centric Storage

Segment-centric storage is one of Pulsar's most distinctive features, and it resolves many storage problems. A layered architecture and segment-centric storage are the two key design principles of Pulsar. Together, they provide the following benefits:

  • Unbounded topic partition storage: because topic partitions are broken down into segments that are stored across a distributed Apache BookKeeper cluster, the capacity of a topic partition is not limited by the disk capacity of any single node.
  • Instant scaling without data rebalancing, which provides:
    1. Swift cluster expansion
    2. Instant bookie failure recovery
    3. Seamless broker failure recovery

These are some of the key benefits of segment-centric storage.
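
To make the "no data rebalancing" point concrete, here is a minimal sketch of a topic partition stored as a sequence of segments. Everything in it is an assumption for illustration: the `SegmentedPartition` class, the fixed `segment_size`, and the "fewest segments first" placement rule are invented here and do not reflect BookKeeper's real placement policy.

```python
class SegmentedPartition:
    """Toy topic partition split into segments spread over bookies."""

    def __init__(self, bookies, replicas=2, segment_size=2):
        self.bookies = list(bookies)
        self.replicas = replicas
        self.segment_size = segment_size
        self.segments = []  # each segment: {"bookies": [...], "entries": [...]}

    def _pick_bookies(self):
        # Hypothetical rule: prefer bookies that currently hold the fewest segments.
        load = {b: 0 for b in self.bookies}
        for seg in self.segments:
            for b in seg["bookies"]:
                load[b] += 1
        return sorted(self.bookies, key=lambda b: (load[b], b))[: self.replicas]

    def append(self, entry):
        # Roll over to a new segment when the current one is full.
        if not self.segments or len(self.segments[-1]["entries"]) >= self.segment_size:
            self.segments.append({"bookies": self._pick_bookies(), "entries": []})
        self.segments[-1]["entries"].append(entry)

    def add_bookie(self, bookie):
        # Expanding the cluster does not touch any existing segment.
        self.bookies.append(bookie)


partition = SegmentedPartition(["bookie-1", "bookie-2"])
for i in range(4):
    partition.append(f"msg-{i}")

partition.add_bookie("bookie-3")
for i in range(4, 6):
    partition.append(f"msg-{i}")

for number, seg in enumerate(partition.segments):
    print(number, seg["bookies"], seg["entries"])
```

In this model, the segments written before `bookie-3` joined stay exactly where they were, while the next segment is free to land on the new node. The partition's total size is the sum of its segments, so it is not capped by any one bookie's disk.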

Configuration store

All the configuration of a Pulsar instance, such as clusters, tenants, namespaces, and partitioned topics, is stored in the configuration store. A Pulsar instance can consist of a single local cluster, multiple local clusters, or multiple cross-region clusters. The configuration store shares this configuration across all the clusters in the Pulsar instance. It can be deployed on a separate ZooKeeper cluster or on an existing ZooKeeper cluster.

Persistent storage

One of the core benefits of Pulsar's architecture is guaranteed message delivery: if a message successfully reaches a broker, it will be delivered to its intended target.

This guarantee requires that non-acknowledged messages are stored durably until they can be delivered to and acknowledged by consumers. This mode of messaging is what we call persistent storage.
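
The rule "keep a message until the consumer acknowledges it" can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the `Subscription` class and its method names are hypothetical and are not the Pulsar client API, and real durability (writing to BookKeeper) is not modeled.

```python
class Subscription:
    """Toy backlog that retains every message until it is acknowledged."""

    def __init__(self):
        self._backlog = {}  # message id -> payload, kept until acknowledged
        self._next_id = 0

    def store(self, payload):
        message_id = self._next_id
        self._backlog[message_id] = payload
        self._next_id += 1
        return message_id

    def deliver(self):
        # Everything still unacknowledged is eligible for (re)delivery.
        return sorted(self._backlog.items())

    def acknowledge(self, message_id):
        # Only an acknowledgement allows the message to be dropped.
        self._backlog.pop(message_id, None)


sub = Subscription()
first = sub.store("payment received")
second = sub.store("order shipped")

print(sub.deliver())    # both messages are pending
sub.acknowledge(first)
print(sub.deliver())    # only the unacknowledged message remains
```

If a consumer crashes before acknowledging, the message simply stays in the backlog and shows up again on the next `deliver` call, which is the behavior the delivery guarantee depends on.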

Conclusion

By now, you should have a good understanding of the overall architecture of Apache Pulsar. Keep your own requirements in mind before adopting Pulsar, and try out the other good options that are available, because practical experience is better than theoretical knowledge. In our opinion, Pulsar is the best option in the market right now, and its many benefits can make your work considerably easier.

Last updated: 21 November 2022
About Author
Madhuri Yerukala

Madhuri is a Senior Content Creator at MindMajix. She has written about a range of topics on various technologies, including Splunk, TensorFlow, Selenium, and CEH. She spends most of her time researching technology and startups.
