Apache Cassandra Architecture Overview

The Apache Cassandra architecture is designed to store enormous volumes of data while remaining scalable, highly available, and reliable. This blog post walks through the main components of Cassandra's architecture; by the end, you should have a working understanding of each of them.

Cassandra's design goal is to handle big data workloads across multiple nodes without any single point of failure. Cassandra uses a peer-to-peer distributed system across its nodes, and data is distributed among all the nodes in a cluster.

  • All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
  • Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
  • When a node goes down, read/write requests can be served from other nodes in the network (see the short driver sketch below).
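
To make the "any node can coordinate" behavior concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver). The contact points, keyspace, and table are hypothetical; the point is only that the driver can send reads and writes to any live node.

```python
from cassandra.cluster import Cluster

# Hypothetical contact points: any of these nodes can coordinate a request,
# and the driver fails over to the others if one is down.
cluster = Cluster(contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Writes and reads can be sent to any node, regardless of where the data
# actually lives; the coordinator node forwards them to the right replicas.
session.execute(
    "INSERT INTO shop.users (user_id, name) VALUES (%s, %s)",
    ("u-100", "Ada"),
)
row = session.execute(
    "SELECT name FROM shop.users WHERE user_id = %s", ("u-100",)
).one()
print(row.name)

cluster.shutdown()
```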

Introduction to Apache Cassandra Architecture

Cassandra's architecture is a major reason it scales and performs well while remaining continuously available. Rather than using a legacy RDBMS master-slave design or a manual, difficult-to-maintain sharded design, Cassandra has a masterless “ring” distributed architecture that is elegant and easy to set up and maintain.

Cassandra sports a masterless “ring” architecture.

In Cassandra, all nodes play the same role; there is no concept of a master node, and all nodes communicate with each other via a gossip protocol.

Cassandra’s built-for-scale architecture means that it is capable of handling large amounts of data and thousands of concurrent users/operations per second, across multiple data centers, as easily as it can manage much smaller amounts of data and user traffic. To add more capacity, you simply add new nodes in an online fashion to an existing cluster.

Cassandra’s architecture also means that, unlike other master-slave or sharded systems, it has no single point of failure and therefore offers true continuous availability and uptime.


Architecture layers

  • Core layer: messaging service, gossip, failure detection, cluster state, partitioner, replication
  • Middle layer: commit log, memtable, SSTable, indexes, compaction
  • Top layer: tombstones, hinted handoff, read repair, bootstrap, monitoring, admin tools

Key structures

Node

A node is where you store your data; it is the basic infrastructure component of Cassandra.

Data center

A collection of related nodes. A data center can be a physical data center or virtual data center. Different workloads should use separate data centers, either physical or virtual. Replication is set by data center. Using separate data centers prevents Cassandra transactions from being impacted by other workloads and keeps requests close to each other for lower latency. Depending on the replication factor, data can be written to multiple data centers. However, data centers should never span physical locations.


Cluster

A cluster contains one or more data centers. It can span physical locations.

Commit log

All data is written first to the commit log for durability. Once all the data in a commit log segment has been flushed to SSTables, that segment can be archived, deleted, or recycled.

Table

A collection of ordered columns fetched by row. A row consists of columns and has a primary key; the first part of the primary key is the partition key, which determines which node stores the row.
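
As an illustrative sketch of that primary-key layout (assuming the hypothetical shop keyspace already exists), the table below uses user_id as the partition key and created_at as a clustering column:

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# The first part of the primary key (user_id) is the partition key: it is
# hashed to decide which node(s) store the row. created_at is a clustering
# column that orders the rows within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders (
        user_id    text,
        created_at timestamp,
        amount     decimal,
        PRIMARY KEY ((user_id), created_at)
    )
""")
```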

SSTable

A sorted string table (SSTable) is an immutable data file to which Cassandra periodically flushes memtables. SSTables are append-only, stored sequentially on disk, and maintained per Cassandra table.

Writing and Reading Data

One of Cassandra’s hallmarks is its fast I/O operation capability for both writing and reading data.

Data is written to Cassandra in a way that provides both full data durability and high performance. From a high-level perspective, data written to a Cassandra node is first recorded in a commit log and then written to a memory-based structure called a memtable. When a memtable’s size exceeds a configurable threshold, the data is flushed to disk and written to an SSTable (sorted strings table), which is immutable.
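
The flow can be pictured with a deliberately tiny, conceptual sketch (this is not Cassandra's actual implementation, just the commit log, then memtable, then SSTable sequence in miniature):

```python
import json

FLUSH_THRESHOLD = 3   # flush after this many writes (unrealistically small, for illustration)

commit_log = []       # durable, append-only record of every write
memtable = {}         # recent writes held in memory
sstables = []         # immutable "files" flushed to disk (modelled here as dicts)

def write(key, value):
    commit_log.append(json.dumps({"key": key, "value": value}))   # 1. record for durability
    memtable[key] = value                                         # 2. fast in-memory write
    if len(memtable) >= FLUSH_THRESHOLD:                          # 3. flush when the memtable is "full"
        flush_memtable()

def flush_memtable():
    global memtable
    # Keys are written out in sorted order, hence "sorted strings table".
    sstables.append(dict(sorted(memtable.items())))
    memtable = {}        # start a fresh memtable
    commit_log.clear()   # flushed data no longer needs its log entries

for i in range(5):
    write(f"user-{i}", {"visits": i})

print(len(sstables), "SSTable(s),", len(memtable), "row(s) still in the memtable")
```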

The Cassandra write path.

Because of the way Cassandra writes data, many SSTables can exist for a single Cassandra table/column family. A periodic per-node process called compaction coalesces multiple SSTables into one for faster read access.
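
A toy version of that compaction step might look like the following, where each SSTable is modelled as a mapping from key to (write timestamp, value) and the newest write wins:

```python
def compact(sstables):
    """Merge several SSTables into one, keeping only the newest value per key."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # The result is again sorted by key, like a real SSTable.
    return dict(sorted(merged.items()))

old_sstables = [
    {"a": (1, "v1"), "b": (1, "v1")},
    {"a": (2, "v2"), "c": (2, "v1")},   # the newer write to "a" wins
]
print(compact(old_sstables))   # {'a': (2, 'v2'), 'b': (1, 'v1'), 'c': (2, 'v1')}
```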

Reading data from Cassandra involves a number of processes that can include various memory caches and other mechanisms designed to produce fast read response times.


For a read request, Cassandra consults a Bloom filter that estimates the probability of an SSTable containing the needed data. If the probability is good, Cassandra checks an in-memory cache of row keys; it either finds the needed key in the cache and fetches the compressed data from disk, or locates the needed key and data on disk, and then returns the required result set.
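
The same flow can be sketched conceptually. Here a Python set stands in for the Bloom filter (the real structure is compact and probabilistic), and the key cache is modelled as the set of keys whose disk positions are already known:

```python
def read(key, memtable, sstables):
    """Toy read path. Each SSTable is modelled as:
       {"bloom": set_of_keys, "key_cache": set_of_keys, "data": dict}."""
    if key in memtable:                      # freshest data lives in memory
        return memtable[key]
    for sst in reversed(sstables):           # check the newest SSTable first
        if key not in sst["bloom"]:          # Bloom filter: "definitely not in this file"
            continue                         # so skip the disk read entirely
        if key in sst["key_cache"]:          # cached row key: seek straight to the data
            return sst["data"][key]
        if key in sst["data"]:               # otherwise consult the on-disk index
            return sst["data"][key]
    return None                              # not found anywhere

sstables = [{"bloom": {"a"}, "key_cache": {"a"}, "data": {"a": "v1"}}]
print(read("a", memtable={}, sstables=sstables))   # v1
```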

The Cassandra read path.

Data Distribution and Replication

In Cassandra, data distribution and replication go together. Data is organized by table and identified by a primary key, which determines which node the data is stored on. Replicas are copies of rows. When data is first written, it is also referred to as a replica.

Factors influencing replication include:

  • Virtual nodes: assign data ownership to physical machines (a toy partitioner sketch follows this list).
  • Partitioner: partitions the data across the cluster.
  • Replication strategy: determines the replicas for each row of data.
  • Snitch: defines the topology information that the replication strategy uses to place replicas.
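
To make the partitioner and virtual-node ideas concrete, here is a toy sketch; it uses an MD5 hash in place of Cassandra's Murmur3Partitioner, and the node names and vnode count are made up:

```python
import hashlib
from bisect import bisect_right

def token(partition_key: str) -> int:
    """Toy partitioner: hash a partition key onto a numeric ring."""
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 2**64

# With virtual nodes, every physical machine owns many small token ranges.
ring = sorted(
    (token(f"{node}-vnode-{v}"), node)
    for node in ["node1", "node2", "node3"]
    for v in range(8)            # 8 vnodes per node; the real default is much higher
)

def owner(partition_key: str) -> str:
    """The first vnode at or after the key's token owns the data (wrapping around)."""
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, token(partition_key)) % len(ring)
    return ring[i][1]

print(owner("user-42"))          # e.g. 'node2'
```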

Automatic Data Distribution

Cassandra provides automatic data distribution across all nodes that participate in a ring or database cluster. There is nothing a developer or administrator needs to code or configure to distribute data across a cluster, because data is transparently partitioned across all nodes.

Replication Basics

Cassandra also replicates data according to the chosen replication strategy. The replication strategy determines the placement of the replicated data. There are two main replication strategies used by Cassandra: SimpleStrategy and NetworkTopologyStrategy. The first replica for the data is determined by the partitioner; the placement of subsequent replicas is determined by the replication strategy. SimpleStrategy places subsequent replicas on the next nodes moving clockwise around the ring. NetworkTopologyStrategy works well when Cassandra is deployed across data centers: it is data-center aware and makes sure that replicas are not stored on the same rack. Cassandra uses snitches to discover the overall network topology, and this information is used to efficiently route inter-node requests within the bounds of the replica placement strategy.
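
As a rough sketch of SimpleStrategy's clockwise placement (a hand-made ring of tokens and made-up node names; NetworkTopologyStrategy would additionally consult the snitch so replicas land on different racks and data centers):

```python
from bisect import bisect_right

# A hand-made ring: (token, node) pairs sorted by token.
ring = [(10, "node1"), (35, "node2"), (60, "node3"),
        (85, "node1"), (110, "node2"), (140, "node3")]

def simple_strategy_replicas(key_token, replication_factor):
    """First replica on the token's owner, the rest on the next distinct
    nodes found walking the ring clockwise."""
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, key_token) % len(ring)
    replicas = []
    while len(replicas) < replication_factor:
        node = ring[i][1]
        if node not in replicas:
            replicas.append(node)
        i = (i + 1) % len(ring)
    return replicas

print(simple_strategy_replicas(key_token=70, replication_factor=2))   # ['node1', 'node2']
```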

The replication option is used to specify the replica placement strategy and the number of replicas wanted. The following table lists the replica placement strategies.

  • SimpleStrategy: specifies a single replication factor for the whole cluster.
  • NetworkTopologyStrategy: lets you set the replication factor for each data center independently.
  • OldNetworkTopologyStrategy: a legacy replication strategy.

A related keyspace option, durable_writes, instructs Cassandra whether to use the commit log for updates on the current keyspace. This option is not mandatory and defaults to true.
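
Putting the pieces together, a keyspace definition might look like the following, executed here with the DataStax Python driver; the keyspace name and contact point are hypothetical:

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# SimpleStrategy: one replication factor for the whole cluster.
# durable_writes is the keyspace option described above: optional,
# defaults to true, and controls whether writes go through the commit log.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    AND durable_writes = true
""")
```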

Multi-Data Center and Cloud Support

A very popular aspect of Cassandra’s replication is its support for multiple data centers and cloud availability zones. Many users deploy Cassandra in a multi-data center and cloud availability zone manner to ensure constant uptime for their applications and to supply fast read/write data access in localized regions.

You can easily set up replication so that data is replicated across many data centers with users being able to read and write to any data center they choose and the data being automatically synchronized across all centers.

You can also choose how many copies of your data exist in each data center (e.g., 2 copies in data center 1 and 3 copies in data center 2). Hybrid deployments that are part on-premises data centers and part cloud are also supported.
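
For example, the "2 copies in data center 1, 3 copies in data center 2" setup maps directly to a NetworkTopologyStrategy keyspace definition. The data center names must match what the snitch reports; the keyspace name, data center names, and contact point here are hypothetical:

```python
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1"]).connect()

# One replication factor per data center.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop_multi_dc
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'DC1': 2,
        'DC2': 3
    }
""")
```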

Cassandra supports multi-data center and cloud deployments.

 

Last updated: 04 Apr 2023
About Author

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
