Data Analytics and Searching Data in Cassandra
ANALYZING AND SEARCHING DATA
Many applications require their underlying transactional database to service analytic and search operations as well. As a DBA, you are likely familiar with the analytic capabilities available through SQL and the full-text search options in RDBMSs, and you might wonder how the same things are handled in Cassandra.
Because Cassandra has a distributed, shared-nothing architecture, the framework for running analytics on it differs from that of a centralized RDBMS.
There are three options in DataStax Enterprise that allow you to run analytic operations on Cassandra data. You can run both real-time and batch (i.e., longer-running) analytics using the platform’s built-in components: Apache Spark for real-time analytics, and Hadoop components such as MapReduce, Hive, Pig, and Mahout for longer-running batch analytics.
The analytics capability in the platform provides a number of the SQL functions and abilities that you are used to in the RDBMS world (e.g., joins and aggregate functions). In addition, analytics can be run across multiple data centers and cloud availability zones. The platform also includes built-in continuous availability options.
External Hadoop Support
You can also connect DataStax Enterprise to an external Hadoop cluster and run analytic queries that combine the operational data in Cassandra with historical data stored in a Hadoop deployment such as Cloudera or Hortonworks (e.g., a single query can join a Cassandra table with a Hadoop Hive table). If you have used RDBMS connection options such as Oracle’s database links or Microsoft SQL Server’s linked servers to integrate external database systems, the concept is somewhat similar.
DataStax Enterprise Search is a high-performance, real-time live indexing engine with powerful search capabilities that is tightly integrated into DataStax Enterprise (DSE). Because DSE Search is a core capability of the DataStax platform, you don’t need to maintain or replicate data in third-party systems. DSE Search allows you to quickly find data using complex, sub-string, fuzzy, and full-text search queries. It also provides full support for real-time aggregations, faceting, and filtering, making it easy to locate the data that is important to your users.
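As a rough illustration of the query types mentioned above, the sketch below builds CQL statements that pass a Solr query expression through the solr_query column that DSE Search exposes on indexed tables. The keyspace, table, and field names (shop, products, name, description) and the helper function solr_query_cql are hypothetical, and the sketch assumes a search index already exists on the table. It only assembles statement text; it does not connect to a cluster.

```python
def solr_query_cql(keyspace, table, solr_expression):
    """Wrap a Solr query expression in a CQL SELECT that filters on solr_query."""
    # CQL string literals escape a single quote by doubling it.
    escaped = solr_expression.replace("'", "''")
    return f"SELECT * FROM {keyspace}.{table} WHERE solr_query = '{escaped}';"


# Hypothetical keyspace, table, and field names, for illustration only.
# Sub-string search: wildcards on both sides of the term.
substring_search = solr_query_cql("shop", "products", "name:*phone*")
# Fuzzy search: tolerate one edit in the (misspelled) term.
fuzzy_search = solr_query_cql("shop", "products", "name:hedphones~1")
# Full-text phrase search on a text field.
full_text_search = solr_query_cql(
    "shop", "products", 'description:"noise cancelling"'
)

for statement in (substring_search, fuzzy_search, full_text_search):
    print(statement)
```

The resulting statements could then be run from cqlsh or a driver session. Leading-wildcard queries such as the first one can be expensive on large indexes, so they are worth testing against realistic data volumes.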
Workload Management for Analytics and Search
DataStax Enterprise solves the mixed workload problem enterprises have struggled with for years. DataStax Enterprise provides full workload management so that real-time, analytic, and search workloads do not compete with each other for either compute or data resources. Further, DataStax Enterprise puts an end to complex and time-consuming ETL operations as data is transparently replicated between all Cassandra, Hadoop, and Solr nodes in a DataStax Enterprise cluster.
When enabling analytics and search on a database cluster, you have a number of configuration options available. If you choose, you can run transactional (OLTP), analytics, and search operations on all nodes in a database cluster.
Another deployment approach is to separate OLTP, analytics, and search workloads so that each runs on its own set of nodes. This strategy ensures that differing workloads do not compete with each other for either compute or data resources. Replication can be set up between all nodes so that data is transparently replicated to each set of nodes without manual intervention.
As a result, you do not have to worry about complex ETL jobs that transfer data between different systems, as you might with your RDBMSs.
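One common way to express this kind of separation is to place each workload's nodes in its own data center and let a keyspace replicate to all of them through Cassandra's NetworkTopologyStrategy. The sketch below builds such a CREATE KEYSPACE statement. The keyspace name (sales), the data center names (oltp_dc, analytics_dc, search_dc), the replica counts, and the helper function create_keyspace_cql are assumptions made for illustration; the real data center names depend on how the cluster's snitch is configured.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to the number of replicas
    that data center should hold.
    """
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")

    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc_name, replica_count in replicas_per_dc.items():
        options.append(f"'{dc_name}': {int(replica_count)}")

    replication_map = "{" + ", ".join(options) + "}"
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {replication_map};"
    )


# Hypothetical data center names, one per workload type.
statement = create_keyspace_cql(
    "sales", {"oltp_dc": 3, "analytics_dc": 2, "search_dc": 2}
)
print(statement)
```

With a keyspace defined this way, a write accepted in the OLTP data center is replicated to the analytics and search data centers by the database itself, which is what removes the need for a separate ETL step.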
Figure: Specifying certain workloads for certain nodes in a cluster.