This tutorial gives an overview of the fundamentals of Apache Cassandra.
Cassandra is a fully distributed, masterless database that offers greater scalability and fault tolerance than traditional single-master databases. Compared with other popular distributed databases like Riak, HBase, and Voldemort, Cassandra offers a uniquely robust and expressive interface for modeling and querying data. What follows is an overview of several desirable database capabilities, with an accompanying discussion of what Cassandra offers in each category.
Horizontal scalability refers to the ability to expand the storage and processing capacity of a database by adding more servers to a database cluster. A traditional single-master database’s storage capacity is limited by the capacity of the server that hosts the master instance. If the data set outgrows this capacity, and a more powerful server isn’t available, the data set must be sharded among multiple independent database instances that know nothing of each other. Your application bears responsibility for knowing to which instance a given piece of data belongs.
Cassandra, on the other hand, is deployed as a cluster of instances that are all aware of each other. From the client application’s standpoint, the cluster is a single entity; the application need not know, nor care, which machine a piece of data belongs to. Instead, data can be read from or written to any instance in the cluster, referred to as a node; that node forwards the request to the instance where the data actually belongs.
The result is that Cassandra deployments have an almost limitless capacity to store and process data; when additional capacity is required, more machines can simply be added to the cluster. When new machines join the cluster, Cassandra takes care of rebalancing the existing data so that each node in the expanded cluster has a roughly equal share.
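The rebalancing idea can be sketched with a toy token ring. This is only an illustration of the general consistent-hashing technique from the Dynamo family, not Cassandra's actual partitioner: the `Ring` class, the use of MD5, the node names, and the 64 tokens per node are all assumptions made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: each node owns several tokens, and a key belongs to the
    node that owns the first token at or after the key's token (wrapping around)."""

    def __init__(self, tokens_per_node: int = 64):
        self.tokens_per_node = tokens_per_node
        self._tokens = []  # sorted token positions
        self._owner = {}   # token position -> node name

    def add_node(self, node: str) -> None:
        for i in range(self.tokens_per_node):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def node_for(self, key: str) -> str:
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owner[self._tokens[i % len(self._tokens)]]


ring = Ring()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"user:{n}" for n in range(2000)]
before = {k: ring.node_for(k) for k in keys}

# Expand the cluster and see which keys change owner.
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

In this sketch, the only keys that change owner are the ones the new node takes over; data never shuffles between the existing nodes, which is what makes adding capacity cheap.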
Note: Cassandra is one of several popular distributed databases inspired by the Dynamo architecture, originally published in a paper by Amazon. Other widely used implementations of Dynamo include Riak and Voldemort. You can read the original paper at https://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf.
The simplest database deployments are run as a single instance on a single server. This sort of configuration is highly vulnerable to interruption: if the server is affected by a hardware failure or network connection outage, the application’s ability to read and write data is completely lost until the server is restored. If the failure is catastrophic, the data on that server might be lost completely.
A master-follower architecture improves this picture a bit. The master instance receives all write operations, and then these operations are replicated to follower instances. The application can read data from the master or any of the follower instances, so a single host becoming unavailable will not prevent the application from continuing to read data. A failure of the master, however, will still prevent the application from performing any write operations, so while this configuration provides high read availability, it doesn’t completely provide high availability.
Cassandra, on the other hand, has no single point of failure for reading or writing data. Each piece of data is replicated to multiple nodes, but none of these nodes holds the authoritative master copy. If a machine becomes unavailable, Cassandra will continue writing data to the other nodes that share data with that machine, and will queue the operations and update the failed node when it rejoins the cluster. This means in a typical configuration, two nodes must fail simultaneously for there to be any application-visible interruption in Cassandra’s availability.
When you create a keyspace—Cassandra’s version of a database—you specify how many copies of each piece of data should be stored; this is called the replication factor. A replication factor of 3 is a common and good choice for many use cases.
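To make the replication behavior concrete, here is a minimal simulation of writing with a replication factor of 3 while one replica is down, then replaying the queued writes when it returns. The `Node` and `Cluster` classes, the byte-sum placement function, and the node names are hypothetical; real Cassandra placement and recovery are considerably more involved.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Cluster:
    """Toy cluster: every key is stored on `rf` consecutive nodes."""

    def __init__(self, names, rf=3):
        self.nodes = [Node(n) for n in names]
        self.rf = rf
        self.hints = []  # (target node, key, value) queued for unavailable replicas

    def replicas(self, key):
        # Stand-in for a real partitioner: a deterministic start position.
        start = sum(key.encode("utf-8")) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def write(self, key, value):
        """Write to every live replica; queue the write for replicas that are down."""
        acks = 0
        for node in self.replicas(key):
            if node.up:
                node.data[key] = value
                acks += 1
            else:
                self.hints.append((node, key, value))
        return acks

    def recover(self, node):
        """Bring a node back and replay the writes it missed."""
        node.up = True
        remaining = []
        for target, key, value in self.hints:
            if target is node:
                node.data[key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining


cluster = Cluster(["n1", "n2", "n3", "n4", "n5"], rf=3)
down = cluster.replicas("photo:42")[0]
down.up = False

acks = cluster.write("photo:42", "sunset.jpg")   # two replicas still accept the write
missed = "photo:42" not in down.data
cluster.recover(down)                            # the queued write is replayed
```

With three replicas, the write still lands on two nodes while one is unavailable, and the returning node catches up without any action from the application.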
The Cassandra read process can be briefly illustrated by the diagram below.
Traditional relational and document databases are optimized for read performance. Writing data to a relational database will typically involve making in-place updates to complicated data structures on disk, in order to maintain a data structure that can be read efficiently and flexibly. Updating these data structures is a very expensive operation from a standpoint of disk I/O, which is often the limiting factor for database performance. Since writes are more expensive than reads, you’ll typically avoid any unnecessary updates to a relational database, even at the expense of extra read operations.
Cassandra, on the other hand, is highly optimized for write throughput, and in fact never modifies data on disk; it only appends to existing files or creates new ones. This is much easier on disk I/O and means that Cassandra can provide astonishingly high write throughput. Since both writing data to Cassandra and storing data in Cassandra are inexpensive, denormalization carries little cost and is a good way to ensure that data can be efficiently read in various access scenarios.
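The append-only idea can be illustrated with a very small in-memory model. This is a sketch of the general technique rather than Cassandra's storage engine; the `AppendOnlyTable` class and its sequence numbers are assumptions made for the example.

```python
import itertools


class AppendOnlyTable:
    """Toy write path: an update never touches earlier entries; it only appends."""

    def __init__(self):
        self._log = []                  # (sequence number, key, value), oldest first
        self._seq = itertools.count(1)

    def write(self, key, value):
        # No read and no in-place modification: just add a newer entry.
        self._log.append((next(self._seq), key, value))

    def read(self, key):
        # The newest entry for a key wins, so scan from the end.
        for _, k, v in reversed(self._log):
            if k == key:
                return v
        return None

    def entry_count(self):
        return len(self._log)


table = AppendOnlyTable()
table.write("user:1:email", "old@example.com")
table.write("user:1:email", "new@example.com")   # supersedes, does not overwrite
current = table.read("user:1:email")
```

Both writes are retained in the log, and the read resolves to the most recent one. The cost of resolving versions is paid on the read side, which is the tradeoff the paragraph above describes.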
Note: Because Cassandra is optimized for write volume, you shouldn’t shy away from writing data to the database. In fact, it’s most efficient to write without reading whenever possible, even if doing so might result in redundant updates.
Just because Cassandra is optimized for writes doesn’t make it bad at reads; in fact, a well-designed Cassandra database can handle very heavy read loads with no problem. We’ll cover the topic of efficient data modeling in great depth in the next few chapters.
The first three database features we looked at are commonly found in distributed data stores. However, databases like Riak and Voldemort are purely key-value stores; these databases have no knowledge of the internal structure of a record that’s stored at a particular key. This means useful functions like updating only part of a record, reading only certain fields from a record, or retrieving records that contain a particular value in a given field are not possible.
Relational databases like PostgreSQL, document stores like MongoDB, and, to a limited extent, newer key-value stores like Redis do have a concept of the internal structure of their records, and most application developers are accustomed to taking advantage of the possibilities this allows. None of these databases, however, offer the advantages of a masterless distributed architecture.
In Cassandra, records are structured much in the same way as they are in a relational database—using tables, rows, and columns. Thus, applications using Cassandra can enjoy all the benefits of masterless distributed storage while also getting all the advanced data modeling and access features associated with structured records.
A secondary index, commonly referred to as an index in the context of a relational database, is a structure allowing efficient lookup of records by some attribute other than their primary key. This is a widely useful capability: for instance, when developing a blog application, you would want to be able to easily retrieve all of the posts written by a particular author. Cassandra supports secondary indexes; while Cassandra’s version is not as versatile as indexes in a typical relational database, it’s a powerful feature in the right circumstances.
It’s quite common to want to retrieve a record set ordered by a particular field; for instance, a photo sharing service will want to retrieve the most recent photographs in descending order of creation. Since sorting data on the fly is a fundamentally expensive operation, databases must keep information about record ordering persisted on disk in order to efficiently return results in order. In a relational database, this is one of the jobs of a secondary index.
In Cassandra, secondary indexes can’t be used for result ordering, but tables can be structured such that rows are always kept sorted by a given column or columns, called clustering columns. Sorting by arbitrary columns at read time is not possible, but the capacity to efficiently order records in any way, and to retrieve ranges of records based on this ordering, is an unusually powerful capability for a distributed database.
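As a rough illustration of clustering, the sketch below keeps each user's photos sorted by creation time as they are inserted, so that "most recent first" and range queries need no sorting at read time. The `PhotoTable` class, its field names, and the integer timestamps are hypothetical.

```python
import bisect
from collections import defaultdict


class PhotoTable:
    """Rows are grouped by a partition key (user) and kept sorted by a
    clustering value (created_at) at write time."""

    def __init__(self):
        self._partitions = defaultdict(list)  # user -> sorted [(created_at, photo_id)]

    def insert(self, user, created_at, photo_id):
        bisect.insort(self._partitions[user], (created_at, photo_id))

    def latest(self, user, limit):
        """Most recent photos first; no sort is needed because order is maintained."""
        rows = self._partitions.get(user, [])
        return [pid for _, pid in reversed(rows[max(0, len(rows) - limit):])]

    def between(self, user, start, end):
        """Photos with start <= created_at < end, in ascending order."""
        rows = self._partitions.get(user, [])
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return [pid for _, pid in rows[lo:hi]]


photos = PhotoTable()
photos.insert("alice", 300, "p3")
photos.insert("alice", 100, "p1")
photos.insert("alice", 200, "p2")
photos.insert("bob", 150, "p9")

newest_two = photos.latest("alice", 2)
early_range = photos.between("alice", 100, 300)
```

The ordering is fixed by how the table is structured, which mirrors the limitation noted above: you get one efficient ordering per table, not arbitrary sorts at read time.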
When we write a piece of data to a database, it is our hope that that data is immediately available to any other process that may wish to read it. From another point of view, when we read some data from a database, we would like to be guaranteed that the data we retrieve is the most recently updated version. This guarantee is called immediate consistency, and it’s a property of most common single-master databases like MySQL and PostgreSQL.
Distributed systems like Cassandra typically do not provide an immediate consistency guarantee. Instead, developers must be willing to accept eventual consistency, which means when data is updated, the system will reflect that update at some point in the future. Developers are willing to give up immediate consistency precisely because it is a direct tradeoff with high availability.
In the case of Cassandra, that tradeoff is made explicit through tunable consistency. Each time you design a write or read path for data, you have the option of immediate consistency with less resilient availability, or eventual consistency with extremely resilient availability. We’ll cover consistency tuning in great detail in Chapter 10, How Cassandra Distributes Data.
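The tradeoff can be expressed as simple arithmetic. A common rule of thumb is that a read is guaranteed to see the latest write when the number of replicas that acknowledge the write plus the number consulted on read exceeds the replication factor. The helper names below are hypothetical and the sketch ignores many real-world details.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_replicas: int) -> bool:
    """True when every set of read replicas must overlap every set of written replicas."""
    return write_acks + read_replicas > replication_factor


def tolerated_failures(replication_factor: int, required_replicas: int) -> int:
    """How many replicas can be down while the operation can still succeed."""
    return replication_factor - required_replicas


rf = 3
# Majority writes and majority reads: consistent, survives one replica failure.
strong = reads_see_latest_write(rf, quorum(rf), quorum(rf))
strong_failures = tolerated_failures(rf, quorum(rf))
# Single-replica writes and reads: eventually consistent, survives two failures.
weak = reads_see_latest_write(rf, 1, 1)
weak_failures = tolerated_failures(rf, 1)
```

With a replication factor of 3, majority reads and writes overlap on at least one replica but tolerate only one failure, while single-replica operations tolerate two failures and give up the overlap guarantee.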
While it’s useful for records to be internally structured into discrete fields, a given property of a record isn’t always a single value like a string or an integer. One simple way to handle fields that contain collections of values is to serialize them using a format like JSON, and then save the serialized collection into a text field. However, in order to update collections stored in this way, the serialized data must be read from the database, decoded, modified, and then written back to the database in its entirety. If two clients try to perform this kind of modification to the same record concurrently, one of the updates will be overwritten by the other.
For this reason, many databases offer built-in collection structures that can be discretely updated: values can be added to, and removed from, collections without reading and rewriting the entire collection. Cassandra is no exception, offering list, set, and map collections, and supporting operations like “append the number 3 to the end of this list”. Neither the client nor Cassandra itself needs to read the current state of the collection in order to update it, meaning collection updates are also blazingly efficient.
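The difference between the two approaches can be shown with a small simulation. The first half reproduces the lost update that occurs when two clients read, modify, and rewrite a JSON-serialized list; the second half models discrete updates as operations that are applied without reading the current state. The `row`, `operations`, and `apply_operations` names are hypothetical.

```python
import json

# Approach 1: the collection is stored as serialized JSON text.
row = {"tags": json.dumps(["a"])}

client1_view = json.loads(row["tags"])   # both clients read the same starting state
client2_view = json.loads(row["tags"])
client1_view.append("b")
client2_view.append("c")
row["tags"] = json.dumps(client1_view)   # client 1 writes the whole list back
row["tags"] = json.dumps(client2_view)   # client 2 overwrites it; "b" is lost
serialized_result = json.loads(row["tags"])

# Approach 2: each client sends only the change it wants, never the whole list.
operations = []
operations.append(("append", "b"))       # client 1
operations.append(("append", "c"))       # client 2


def apply_operations(initial, ops):
    """Apply discrete collection operations in order."""
    result = list(initial)
    for op, value in ops:
        if op == "append":
            result.append(value)
        elif op == "remove":
            result = [item for item in result if item != value]
    return result


discrete_result = apply_operations(["a"], operations)
```

In the serialized version one client's change silently disappears; with discrete operations both changes survive because neither client ever rewrites the other's data.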
In real-world applications, different pieces of data relate to each other in a variety of ways. Relational databases allow us to perform queries that make these relationships explicit, for instance, to retrieve a set of events whose location is in the state of New York (this is assuming events and locations are different record types). Cassandra, however, is not a relational database, and does not support anything like joins. Instead, applications using Cassandra typically denormalize data and make clever use of clustering in order to perform the sorts of data access that would use a join in a relational database.
For data sets that aren’t already denormalized, applications can also perform client-side joins, which mimic the behavior of a relational database by performing multiple queries and joining the results at the application level. Client-side joins are less efficient than reading data that has been denormalized in advance, but offer more flexibility. We’ll cover both of these approaches in Chapter 6, Denormalizing Data for Maximum Performance.
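The two strategies can be compared with plain in-memory data standing in for tables. The sample locations and events, and the function names, are invented for illustration; in a real application each lookup would be a separate query against the cluster.

```python
# Normalized data: events refer to locations by id.
locations = {
    1: {"name": "Madison Square Garden", "state": "NY"},
    2: {"name": "Fenway Park", "state": "MA"},
    3: {"name": "Carnegie Hall", "state": "NY"},
}
events = [
    {"title": "Concert", "location_id": 1},
    {"title": "Ball game", "location_id": 2},
    {"title": "Recital", "location_id": 3},
]


def events_in_state_client_join(state):
    """Client-side join: one lookup for the locations, a second for the events."""
    location_ids = {lid for lid, loc in locations.items() if loc["state"] == state}
    return [e["title"] for e in events if e["location_id"] in location_ids]


# Denormalized data: the state is copied onto each event at write time,
# and events are grouped by state so a single lookup answers the question.
events_by_state = {}
for event in events:
    state = locations[event["location_id"]]["state"]
    events_by_state.setdefault(state, []).append(event["title"])


def events_in_state_denormalized(state):
    return events_by_state.get(state, [])


joined = events_in_state_client_join("NY")
denormalized = events_in_state_denormalized("NY")
```

Both functions return the same titles. The denormalized version does more work on write and stores the state redundantly, but answers the read with a single lookup.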
Below is a diagram that illustrates the Cassandra write process.
MapReduce is a technique for performing aggregate processing on large amounts of data in parallel; it’s a particularly common technique in data analytics applications. Cassandra does not offer built-in MapReduce capabilities, but it can be integrated with Hadoop in order to perform MapReduce operations across Cassandra data sets, or Spark for real-time data analysis. The DataStax Enterprise product provides integration with both of these tools out-of-the-box.
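For readers new to the technique, the shape of a MapReduce computation can be shown in a few lines of plain Python. This is only a single-process sketch of the idea with made-up data; it does not use Hadoop, Spark, or any Cassandra integration, where the chunks would be processed in parallel on different machines.

```python
from functools import reduce


def map_phase(chunk):
    """Produce partial counts of events per state for one chunk of rows."""
    counts = {}
    for row in chunk:
        counts[row["state"]] = counts.get(row["state"], 0) + 1
    return counts


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = dict(left)
    for state, count in right.items():
        merged[state] = merged.get(state, 0) + count
    return merged


# Each chunk stands in for the rows held by one node.
chunks = [
    [{"state": "NY"}, {"state": "MA"}, {"state": "NY"}],
    [{"state": "CA"}, {"state": "NY"}],
    [{"state": "MA"}],
]

partials = [map_phase(chunk) for chunk in chunks]   # independent, parallelizable
totals = reduce(merge, partials, {})
```

Because each chunk is processed independently and the merge step is associative, the work can be spread across as many machines as hold the data.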
Now that you’ve got an in-depth understanding of the feature set that Cassandra offers, it’s time to figure out which features are most important to you, and which database is the best fit. The following table lists a handful of commonly used databases, and key features that they do or don’t have:
|Discretely writable collections||Yes||Yes||Yes||Yes||No|
Another key aspect of your DBA job is to ensure the databases you manage are always available for the applications that use them. One thing you will like about Cassandra is that, compared to an RDBMS, ensuring constant uptime is much easier. There is no need for specialized, add-on log shipping software such as Oracle Data Guard.
Further, distributing data to multiple geographies and across various cloud providers is much simpler and more straightforward with Cassandra than with a typical RDBMS.
As previously discussed, Cassandra has a masterless architecture in which all nodes are the same, and it has been built from the ground up with the understanding that outages and hardware failures will occur. To withstand them, Cassandra provides redundancy in both data and function across the cluster.
Where data operations are concerned, any node in a cluster may be the target for both reads and writes. Should a particular node go down, there is no hiccup in the cluster at all, as any other node may be written to, with reads served from other nodes holding copies of the downed node’s data.
To ensure constant access to data, you should configure Cassandra’s replication to keep multiple copies of data on the nodes that comprise a database cluster. The number of data copies is completely up to you, with three being the most commonly used in production Cassandra environments.
Should a node go down, new or updated information is simply written to another node that keeps a copy of that data. When the downed node is brought back online, it automatically re-syncs with other nodes holding its data so that it is brought back up to date in a transparent fashion.
Cassandra is well suited to multi-data center and cloud deployments. Many production Cassandra systems consist of a database cluster that spans multiple physical data centers, cloud availability zones, or a combination of both. Should a large outage occur in a particular geographical region, the database cluster continues to operate as normal, with the other data centers assuming the operations previously directed at the now downed data center or cloud zone. Once the downed data center comes back online, it syncs with the other data centers and makes itself current.
Multi Data Center – The ring of a Cassandra cluster can spread across multiple data centers (DCs). Cassandra supports both virtual and physical DCs. For example, suppose we have two data centers at two different geographical locations: Douglas County (US) and Rotterdam (EU). The data is stored across both DCs, but from the point of view of the cluster’s ring, all nodes give the same answer to any query, while the distinction between the local DC and the distant DC is preserved. Queries can be performed in the local DC only, or across all DCs in the ring.
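The difference between querying the local DC only and querying across all DCs can be sketched as arithmetic over replica counts. The data center names and the three replicas per DC are assumptions for the example; the calculation mirrors how a majority is computed either within one data center or over all replicas in the cluster.

```python
# Hypothetical replica counts per data center.
REPLICAS_PER_DC = {"us": 3, "eu": 3}


def quorum(replica_count: int) -> int:
    """Smallest majority of the given number of replicas."""
    return replica_count // 2 + 1


def local_quorum_ok(dc: str, live: dict) -> bool:
    """Can a request be served by a majority of replicas in one data center?"""
    return live[dc] >= quorum(REPLICAS_PER_DC[dc])


def global_quorum_ok(live: dict) -> bool:
    """Can a request be served by a majority of all replicas in every data center?"""
    return sum(live.values()) >= quorum(sum(REPLICAS_PER_DC.values()))


all_up = {"us": 3, "eu": 3}      # live replicas per data center
eu_outage = {"us": 3, "eu": 0}   # the whole EU data center is unreachable

us_still_serves_locally = local_quorum_ok("us", eu_outage)
cluster_wide_majority = global_quorum_ok(eu_outage)
```

During the simulated EU outage, requests that only need a majority in the US data center keep working, while requests that demand a majority across the whole cluster (4 of 6 replicas) cannot be satisfied by the 3 that remain.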
A single Cassandra database cluster can span multiple data centers and the cloud.
An additional benefit of having a single cluster that spans multiple data centers and geographies is that data can be read from and written to nodes in each location, keeping latency low for the customers served from those locations.
Ravindra Savaram is a Content Lead at Mindmajix.com.