Big Data In Practice With Cassandra – Data Science
Big Data is a fast-growing trend in enterprise applications that comes with a novel promise compared to past technological revolutions: being able to retrieve and manipulate as much data as necessary to enable new use cases or to improve the user experience.
Time to meet Apache Cassandra, the open-source NoSQL distributed database management system. Cassandra is an essential tool for data scientists. Many of the companies concerned with large, active data sets use Cassandra. The short list? Netflix, Twitter, Reddit, Cisco, OpenX, Digg, CloudKick, and Ooyala.
Cassandra offers all-important linear scalability and reliable fault tolerance, the two key attributes of any platform required to manage mission-critical data. Per Apache, the platform offers optimal support for replicating across multiple data centers, which allows lower latency and protection against regional outages. In short, Cassandra is a brilliantly efficient, non-traditional database that’s been designed to scale up easily to massive data sets.
This free distribution from the Apache Software Foundation offers column indexes, log-structured updates, support for materialized views, and elegant built-in caching.
The database is built for fault tolerance. Cassandra automatically replicates data to multiple nodes or, if you prefer, to multiple data centers. Failing nodes can be replaced with no downtime. This is, of course, a decentralized platform: every node in a cluster is identical, so there are no network bottlenecks and no single points of failure.
At the same time, Cassandra is elastic. When new machines are added, read and write throughput both increase linearly. There’s no downtime, no interruption.
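To make the replication and elasticity ideas above more concrete, here is a minimal sketch of a token ring in Python. It is a toy model, not Cassandra's actual implementation: the `Ring` class, the `node-a` style names, and the use of MD5 for placement are all assumptions made for illustration, and real clusters use their own partitioners and replica placement strategies. The sketch shows two things: each key is stored on several peer nodes, and adding a node only moves keys onto the new node rather than reshuffling everything.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 only for an even spread)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: every node is a peer, there is no master node."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sorted list of (token, node name) pairs, one entry per node.
        self._entries = sorted((token(n), n) for n in nodes)

    def add_node(self, node):
        """Insert a new node at its token position; existing nodes stay put."""
        bisect.insort(self._entries, (token(node), node))

    def replicas(self, key):
        """Return the nodes holding `key`, walking clockwise from its token."""
        tokens = [t for t, _ in self._entries]
        start = bisect.bisect_left(tokens, token(key)) % len(self._entries)
        count = min(self.replication_factor, len(self._entries))
        return [
            self._entries[(start + i) % len(self._entries)][1]
            for i in range(count)
        ]


# Hypothetical four-node cluster keeping three copies of every key.
ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
keys = [f"user:{i}" for i in range(1000)]

# Record the first replica of each key before and after growing the cluster.
before = {k: ring.replicas(k)[0] for k in keys}
ring.add_node("node-e")
after = {k: ring.replicas(k)[0] for k in keys}

# Keys whose first replica changed; each of them now lives on the new node.
moved = [k for k in keys if before[k] != after[k]]

# If one node fails, every key still has at least two live replicas.
failed = "node-b"
live_copies = {k: [n for n in ring.replicas(k) if n != failed] for k in keys}
```

Because each key has three distinct replicas in this sketch, losing any single node leaves at least two copies readable, which is the intuition behind replacing a failed node without downtime.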
As of Release 1.0 in October of 2011, the system’s interface has been greatly simplified from that of previous beta releases. “We’re consciously signaling that Cassandra is ready for mere mortals,” said Jonathan Ellis, who is Apache’s vice president in charge of the Cassandra project, jokingly referring to the amount of administrative expertise needed to deploy previous versions of the software. “Dealing with very large amounts of data in real-time is a must for most businesses today. Cassandra accommodates high query volumes, provides enterprise-grade reliability, and scales easily to meet future growth requirements, while using fewer resources than traditional solutions.”
Ellis says the difference between traditional databases like MySQL and Cassandra is the difference between analytic big data and real-time big data. Hadoop itself is strictly an analytical system rather than a real-time or transaction-oriented system (à la Cassandra). Ellis: “On the real-time side, Cassandra’s strongest competitors are probably Riak and HBase. Riak is backed by Basho, and Cloudera supports HBase although it’s not their focus. For analytics, everyone is standardizing on Hadoop, and there are a number of companies pushing that. …”
Users find Cassandra indispensable.
“As the most-widely deployed mobile rich media advertising platform, Medialets uses Apache Cassandra for handling time series based logging from our production operations infrastructure,” says Joe Stein, Chief Architect of Medialets. “We store contiguous counts for data points for each second, minute, hour, day, month so we can review trends over time as well as the current real time set of information for tens of thousands of data points. Cassandra makes it possible for us to manage this intensive data set …”
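The per-second, per-minute, per-hour, per-day, and per-month counts described in the quote can be sketched roughly as follows. This is not Medialets' actual schema, which the quote does not describe: the key layout, the `requests` metric name, and the `record` helper are hypothetical, and an in-memory `Counter` stands in for whatever counter storage a real deployment would use in Cassandra.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical bucket formats, one per granularity named in the quote.
BUCKET_FORMATS = {
    "second": "%Y-%m-%dT%H:%M:%S",
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
}


def bucket_keys(metric, when):
    """Return one counter key per granularity for an event at `when`."""
    return [
        f"{metric}:{name}:{when.strftime(fmt)}"
        for name, fmt in BUCKET_FORMATS.items()
    ]


def record(counters, metric, when, amount=1):
    """Increment every time bucket that the event falls into."""
    for key in bucket_keys(metric, when):
        counters[key] += amount


# In-memory stand-in for the counter store (an assumption for this sketch).
counters = Counter()
record(counters, "requests", datetime(2011, 10, 18, 9, 30, 15, tzinfo=timezone.utc))
record(counters, "requests", datetime(2011, 10, 18, 9, 30, 45, tzinfo=timezone.utc))
record(counters, "requests", datetime(2011, 10, 18, 10, 5, 0, tzinfo=timezone.utc))
```

Writing every event into all five buckets costs five small increments per event, which is the kind of write-heavy pattern the quote describes. In exchange, a trend query over days or months reads a handful of pre-aggregated counters instead of scanning raw events.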
Matthew Conway, CTO of Backupify, notes: “Apache Cassandra makes it possible for us to build a business around really high write loads in a scalable fashion without having to build and operate our own sharding layer. The [latest] release of Cassandra … is an exciting milestone for the project and we look forward to exploring the new features and performance enhancements.”