Apache Cassandra Data Model Best Practices
Picking the right data model is the hardest part of using Cassandra. If you have a relational background, CQL will look familiar, but the way you use it can be very different.
DATA AND OBJECT MANAGEMENT
This section looks at Cassandra’s data model, the objects used to manage data, CQL (Cassandra Query Language), and how transactions are handled in the database.
Data Model Overview
Achieving success with Cassandra almost always comes down to getting two things right:
1. The data model
2. The selected hardware, especially the storage subsystem
Cassandra is a wide-row store that uses a highly denormalized model designed to capture and query data quickly. Cassandra has no concept of foreign keys, referential integrity, or joins, which is also true of most other NoSQL databases.
Although Cassandra has objects that resemble those of an RDBMS (e.g., tables, primary keys, indexes), data should not be modeled in the entity-relationship-attribute fashion used with a relational database. Modeling data in Cassandra starts with understanding, up front, what questions you will need to ask the database. In an RDBMS, by contrast, you typically do not address such questions until after all entities, relationships, and attributes are documented.
Unlike an RDBMS, which penalizes the use of many columns in a table, Cassandra performs well with tables that have hundreds of columns. As a DBA, you are used to highly normalized, third normal form models that you translate into a set of physical tables and their accompanying indexes. With Cassandra, you will often instead have wide-row tables with some data duplication between tables.
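To make the query-first idea concrete, here is a minimal in-memory sketch in Python. It does not talk to Cassandra; each dictionary simply stands in for one table keyed by its partition key, and the table and field names (users_by_id, users_by_email, add_user) are hypothetical. It shows how the same data is written to one table per question, so that each question becomes a single key lookup.

```python
# Minimal in-memory sketch of query-first, denormalized modeling.
# Each dict stands in for one Cassandra table keyed by its partition key.
# Table and field names are hypothetical, chosen only for illustration.

users_by_id = {}      # answers: "fetch a user by user_id"
users_by_email = {}   # answers: "fetch a user by email"


def add_user(user_id, email, name):
    """Write the same user data to every table that serves a query."""
    row = {"user_id": user_id, "email": email, "name": name}
    users_by_id[user_id] = dict(row)    # copy kept per table (duplication)
    users_by_email[email] = dict(row)


add_user(1, "ada@example.com", "Ada")
add_user(2, "lin@example.com", "Lin")

# Each question is answered by one key lookup; no join is involved.
print(users_by_id[2]["name"])
print(users_by_email["ada@example.com"]["user_id"])
```

The cost of this design is that every write touches more than one table, which is the data duplication described above.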
Creating your physical objects, however, still looks very much like what you carry out in an RDBMS. For example, a new table defining users for an application is created with a familiar CREATE TABLE statement that names its columns and primary key.
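A hedged sketch of what a users table definition could look like is given here, held in a Python string so that it could be handed to a driver or pasted into the CQL shell. The column names and types are assumptions made for illustration, not a schema taken from any real application.

```python
# A hypothetical CQL definition for a users table, held as a Python string.
# Column names and types are illustrative assumptions.
CREATE_USERS = """
CREATE TABLE users (
    user_id    uuid,
    first_name text,
    last_name  text,
    email      text,
    created_at timestamp,
    PRIMARY KEY (user_id)
);
"""

print(CREATE_USERS.strip())
```

Apart from the type names, the statement reads much like its SQL counterpart. The difference lies in how the primary key is chosen, since it also controls how rows are spread across the cluster.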
The basic objects you will use in Cassandra include:
• Keyspace – a container for data tables and indexes; analogous to a database in many RDBMSs. It is also the level at which replication is defined.
• Column Family / Table – somewhat like an RDBMS table, but more flexible and able to handle a wide range of data types. A table also provides very fast row inserts and column-level reads.
• Primary key – used to uniquely identify a row in a table and also to distribute a table’s rows across the nodes in a cluster.
• Index – similar to an RDBMS index, in that it speeds up read operations that are able to use it.
• User – a login account used to access data objects.
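The role of the primary key in distributing rows can be sketched as follows. This is a simplified illustration, not Cassandra's actual placement logic: Cassandra's default partitioner hashes the partition key with Murmur3 and places the resulting token on a ring with replication, whereas this sketch uses MD5 from the Python standard library and plain modulo placement. The node names and the functions token_for and node_for are hypothetical.

```python
import hashlib

# Hypothetical four-node cluster; node names are illustrative.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def token_for(partition_key: str) -> int:
    """Hash the partition key to an integer token (MD5 as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def node_for(partition_key: str) -> str:
    """Pick the owning node from the token using simple modulo placement."""
    return NODES[token_for(partition_key) % len(NODES)]


# The same key always maps to the same node; different keys spread out.
for key in ["alice", "bob", "carol"]:
    print(key, "->", node_for(key))
```

Because placement depends only on the key, any node can work out where a row lives without consulting a central directory.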
Cassandra Query Language
Cassandra Query Language, or CQL, is a declarative language that enables users to query Cassandra using a language similar to SQL. CQL was introduced in Cassandra version 0.8 and is now the preferred way of retrieving data from Cassandra. Prior to the introduction of CQL, Thrift, an RPC-based API, was the preferred way of retrieving data from Cassandra. A major benefit of CQL is its similarity to SQL, which helps lower the Cassandra learning curve. CQL is SQL minus the complicated bits. Think of CQL as a simple API over Cassandra’s internal storage structures.
While Cassandra does not offer complex/nested transactions in the same way that your legacy RDBMSs offer ACID transactions, it does offer the “AID” portion of ACID, in that data written is atomic, isolated, and durable. The “C” of ACID does not apply to Cassandra, as there is no concept of referential integrity or foreign keys.
With respect to data consistency, Cassandra offers tunable data consistency across a database cluster. This means you can decide exactly how strong (e.g., all nodes must respond) or eventual (e.g., just one node responds, with others being updated eventually) you want data consistency to be for a particular transaction, including transactions that are batched together. This tunable data consistency is supported across single or multiple data centers, and you have a number of different consistency options from which to choose.
Moreover, consistency can be handled on a per-operation basis, meaning you can decide how strong or eventual consistency should be for each SELECT, INSERT, UPDATE, and DELETE operation. For example, if you need a particular transaction available on all nodes throughout the world, you can specify that all nodes must respond before a transaction is marked complete. On the other hand, a less critical piece of data (e.g., a social media update) may only need to be propagated eventually, so in that case, the consistency requirement can be greatly relaxed.
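The trade-off between strong and eventual consistency can be expressed with simple arithmetic. A commonly used rule of thumb is that a read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledge the write plus the number consulted on the read exceeds the replication factor. The Python sketch below illustrates that rule; the function names are hypothetical, and the replication factor of 3 is an assumption.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def is_strongly_consistent(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    """Read and write replica sets must overlap: W + R > RF."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # assumed replication factor
q = quorum(rf)
print(q)                                   # 2
print(is_strongly_consistent(q, q, rf))    # True: quorum writes, quorum reads
print(is_strongly_consistent(1, 1, rf))    # False: one replica each way
print(is_strongly_consistent(rf, 1, rf))   # True: all replicas on write
```

Relaxing either side lowers latency and raises availability, at the price of possibly reading a value that has not yet been updated on the replica that answers.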
Cassandra also supplies “lightweight transactions” (or compare-and-set). Using and extending the Paxos consensus protocol (which allows a distributed system to agree on proposed data modifications without the need for any one ‘master’ database or two-phase commit), Cassandra offers a way to ensure a transaction isolation level similar to the serializable level offered by RDBMSs for situations that need it.
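The compare-and-set behavior can be sketched in a few lines of Python. This is an in-memory, single-process stand-in that uses a lock, so it shows only the semantics a client observes and is not an implementation of Paxos. In CQL the same intent is expressed with conditions such as IF NOT EXISTS on an INSERT or IF on an UPDATE. The class and method names below are hypothetical.

```python
import threading


class CasStore:
    """In-memory stand-in for conditional writes; not a Paxos implementation."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, key, value) -> bool:
        """Insert only when the key is absent; report whether it was applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def update_if(self, key, expected, new_value) -> bool:
        """Update only when the current value matches the expected one."""
        with self._lock:
            if key not in self._rows or self._rows[key] != expected:
                return False
            self._rows[key] = new_value
            return True


store = CasStore()
print(store.insert_if_not_exists("jsmith", "first@example.com"))            # True
print(store.insert_if_not_exists("jsmith", "second@example.com"))           # False
print(store.update_if("jsmith", "first@example.com", "new@example.com"))    # True
print(store.update_if("jsmith", "first@example.com", "other@example.com"))  # False
```

The second insert and the second update are rejected because the condition no longer holds, which is how two clients racing for the same username are kept from both succeeding.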
DBA Query and Management Tools
As a DBA coming from the RDBMS world, you likely use many command line and visual tools for interacting with the databases you manage. The same kinds of tools are available to you with Cassandra.
Various command line utilities are provided for handling administration functions (e.g., the nodetool utility), loading data, and using CQL to create and query database objects (the CQL shell, which is much like Oracle’s SQL*Plus or the MySQL shell).
In addition, graphical tools are provided for running CQL commands against database clusters (e.g. DataStax DevCenter) and visually creating/managing/monitoring your clusters (DataStax OpsCenter).
DataStax OpsCenter, used for visual database administration.
DataStax DevCenter, used for visually querying databases.