Using Hadoop for Data Science
High performance data analysis is a required competitive component, providing valuable insight into the behavior of customers, market trends, scientific data, business partners, and internal users. Explosive growth in the amount of data businesses must track has challenged legacy database platforms. New unstructured, text-centric, data sources, such as feeds from Facebook and Twitter do not fit into the structured data model. These unstructured datasets tend to be very big and difficult to work with. They demand distributed (aka parallelized) processing.
Hadoop, an open source software product, has emerged as the preferred solution for Big Data analytics. Because of its scalability, flexibility, and low cost, it has become the default choice for Web giants that are dealing with large-scale clickstream analysis and ad targeting scenarios. For these reasons and more, many industries who have been struggling with the limitations of traditional database platforms are now deploying Hadoop solutions in their data centers. (These industries are also looking for the economy. According to some recent research from Infineta Systems, a WAN optimization startup, traditional data storage runs $5 per gigabyte, but storing the same data costs about 25 cents per gigabyte using Hadoop.)
Businesses are finding they need faster insight and deeper analysis of their data – slow performance equates to lost revenue. A Hadoop – available in customized, proprietary versions from a range of vendors – provides a solid answer to this dilemma.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop was originally conceived on the basis of Google’s MapReduce, in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
A distributed file system (DFS) facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. The risk of catastrophic system failure is low, even if a significant number of nodes become inoperative.
The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X. (A bit of trivia: the name Hadoop was inspired by the name of a stuffed toy elephant belonging to a child of the framework’s creator, Doug Cutting.)
Hadoop lies, invisibly, at the heart of many Internet services accessed daily by millions users around the world.
“Facebook uses Hadoop … extensively to process large data sets,” says Ashish Thusoo, Engineering Manager at Facebook. “This infrastructure is used for a variety of different jobs – including adhoc analysis, reporting, index generation and many others. We have one of the largest clusters with a total storage disk capacity of more than 20PB and with more than 23000 cores. We also use Hadoop and Scribe for log collection, bringing in more than 50TB of raw data per day. Hadoop has helped us scale with these tremendous data volumes.”
“Hadoop is a key ingredient in allowing LinkedIn to build many of our most computationally difficult features, allowing us to harness our incredible data about the professional world for our users,” comments Jay Kreps, LinkedIn’s Principal Engineer.
Let’s not forget Twitter. “Twitter’s rapid growth means our users are generating more and more data each day. Hadoop enables us to store, process, and derive insights from our data in ways that wouldn’t otherwise be possible. We are excited about the rate of progress that Hadoop is achieving, and will continue our contributions to its thriving open source community,” notes Kevin Weil, Twitter’s Analytics Lead.
Then we have eBay. During 2010, eBay erected a Hadoop cluster spanning 530 servers. By December of 2011, the cluster was five times that large, helping with everything from analyzing inventory data to building customer profiles using real-time online behavior. “We got tremendous value – tremendous value – out of it, so we’ve expanded to 2,500 nodes,” says Bob Page, eBay’s vice president of analytics. “Hadoop is an amazing technology stacks. We now depend on it to run eBay.”
“Hadoop has been called the next-generation platform for data processing because it offers low cost and the ultimate in scalability. But Hadoop is still immature and will need serious work by the community … ” writes InformationWeek’s Doug Henschen.
“Hadoop is at the center of this decade’s Big Data revolution. This Java-based framework is actually a collection of software and subprojects for distributed processing of huge volumes of data. The core approach is MapReduce, a technique used to boil down tens or even hundreds of terabytes of Internet clickstream data, log-file data, network traffic streams, or masses of text from social network feeds.”
Henschen continues: “The clearest sign that Hadoop is headed mainstream is that fact that it was embraced by five major database and data management vendors in 2011, with EMC, IBM, Informatica, Microsoft, and Oracle all throwing their hats into the Hadoop ring. IBM and EMC released their own distributions last year, the latter in partnership with MapR. Microsoft and Oracle have partnered with Hortonworks and Cloudera, respectively. Both EMC and Oracle have delivered purpose-built appliances that are ready to run Hadoop. Informatica has extended its data-integration platform to support Hadoop, and it’s also bringing its parsing and data-transformation code directly into the environment.”
Still, says Henschen, Hadoop remains “downright crude compared to SQL [Structured Query Language, traditionally used to parse structured data]. Pioneers, most of whom started working on the framework at Internet giants such as Yahoo, have already put at least six years into developing Hadoop. But success has brought mainstream demand for stability, robust administrative and management capabilities, and the kind of rich functionality available in the SQL world. … Data processing is one thing, but what most Hadoop users ultimately want to do is analyze the data. Enter Hadoop-specialized data access, business intelligence, and analytics vendors such as Datameer, Hadapt, and Karmasphere.” (All three of these firms are to be discussed in later chapters.) (Note that the most popular approaches to parsing unstructured data are often referred to as NoSQL approaches/ tools.)
Created from work done in Apache by Hortonworks, Yahoo! and the rest of the vibrant Apache community, Hadoop 0.23 – the first major update to Hadoop in three years – provides the critical foundation for the next wave of Apache Hadoop innovation. Featuring the next-generation MapReduce architecture, HDFS Federation and High Availability advancements, Hadoop 0.23 was released November 11, 2012.
Additional Hadoop-related projects at Apache include: Avro, a data serialization system; Cassandra, a scalable multi-master database with no single points of failure; Chukwa, data collection system for managing large distributed systems; HBase, a scalable, distributed database that supports structured data storage for large tables; Hive, a data warehouse infrastructure that provides a data summarization and ad hoc querying; Mahout, a scalable machine learning and data mining library; Pig, high-level data-flow language and execution framework for parallel computation; and ZooKeeper, a high-performance coordination service for distributed applications.
Note: Hadoop works best in collaboration with several specially designed tools. For example, Apache Pig is a high-level procedural language for querying large semi-structured data sets using Hadoop and the MapReduce Platform. Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed data set. Then we have Apache Hbase. Use HBase when you need random, real-time read/write access to your Big Data. Hbase enables the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
We need to note two more items: Sqoop and Hive. Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Hadoop Hive, meanwhile, constitutes a robust data warehouse infrastructure, providing powerful data summarization and ad hoc querying capabilities.
Note: Although clearly dominant, a Hadoop – it should be mentioned – is not the only technology that sits under the Big Data umbrella. Other methodologies include columnar databases, which organize data by columns instead of rows and lend themselves to analytical data warehousing and compression.