Big Data solutions for SQL Server
Big Data is large amount of the data which is difficult or impossible for traditional relational database. Big data, the term has seen increasing use since the past few years. In this field, we review the various ways that big data is described and how Hadoop which is developed as a technology, is commonly used to process big data. In addition, we introduce Microsoft HDInsight, an implementation of Hadoop available as a Windows Azure service. Then we explore Microsoft PolyBase, an on-premises solution that integrates relational data stored in Microsoft SQL Server Parallel Data Warehouse (PDW) with non-relational data stored in a Hadoop Distributed File System (HDFS).
For several decades, many organizations have been analyzing data generated by transactional systems. This data has usually been stored in a relational database management systems. A common step in the development of a business-intelligence solution is weighing the cost of transforming, cleansing, and storing this data in preparation for analysis against the perceived value that insights derived from the analysis of the data could deliver. As a consequence, decisions are made about what data to keep and what data to ignore. Meanwhile, the data available for analysis continues to proliferate from a broad assortment of sources, such as server log files, social media, or instrument data from scientific research. At the same time, the cost to store high volumes of data on commodity hardware has been decreasing, and the processing power necessary for complex analysis of all this data has been increasing. This confluence of events has given rise to new technologies that support the management and analysis of big data.
Describing big data
The point at which data becomes big data is still the subject of much debate among data-management professionals. One approach of describing big data is known as the 3Vs: volume, velocity, and variety. This model, introduced by Gartner analyst Doug Laney in 2001, has been extended with a fourth V, variability. However, disagreement continues, with some people considering the fourth V to be veracity.
Although it seems reasonable to associate volume with big data, how is a large volume different from the very large databases (VLDBs) and extreme workloads that some industries routinely manage? Examples of data sources that fall into this category include airline reservation systems, point of sale terminals, financial trading, and cellular-phone networks. As machine-generated data outpaces human-generated data, the volume of data available for analysis is proliferating rapidly. Many techniques, as well as software and hardware solutions such as PDW, exist to address high volumes of data. Therefore, many people argue that some other characteristic must distinguish big data from other classes of data that are routinely managed.
Some people suggest that this additional characteristic is velocity or the speed at which the data is generated. As an example, consider the data generated by the Large Hadron Collider experiments, which is produced at a rate of 1 gigabyte (GB) per second. This data must be subsequently processed and filtered to provide 30 petabytes (PB) of data to physicists around the world. Most organizations are not generating data at this volume or pace, but data sources such as manufacturing sensors, scientific instruments, and web-application servers are nonetheless generating data so fast that complex event-processing applications are required to handle high-volume and high-speed throughputs. Microsoft StreamInsight is a platform that supports this type of data management and analysis.
Data does not necessarily require volume and velocity to be categorized as big. Instead, a high volume of data with a lot of variety can constitute big data. Variety refers to the different ways that data might be stored: structured, semistructured, or unstructured. On the one hand, data-warehousing techniques exist to integrate structured data (often in relational form) with semistructured data (such as XML documents). On the other hand, unstructured data is more challenging, if not impossible, to analyze by using traditional methods. This type of data includes documents in PDF or Word format, images, and audio or video files, to name a few examples. Not only the unstructured data problematic for analytical solutions, but it is also growing more quickly than file systems on a single server that it can usually accommodate.
Big data as a branch of data management is still difficult to define with precision, given that many competing views exist and that no clear standards or methodologies have been established. Data that looks big to one organization by any of the definitions we’ve described might look small to another organization that has evolved solutions for managing specific types of data. Perhaps the best definition of big data at present is also the most general. For the purpose of this chapter, we take the position that big data describes a class of data that requires a different architectural approach than the currently available relational database systems it can effectively support, such as append-only workloads instead of updates.
Exploring the history of Hadoop
Hadoop is an open source technology that today, it is the data management platform, most commonly associated with big data applications.
Any research into potential architectural solutions for managing big data inevitably leads to a mention of Hadoop, a technology whose first iteration, dubbed Nutch, was developed by Doug Cutting and Mike Cafarella in 2002, as a way to crawl and index webpages across the Internet. Nutch, however, had limited scalability and was prone to fail. After Google released the Google File System paper in 2003 and the MapReduce paper in 2004, Cutting and Cafarella were inspired to design a distributed file system and MapReduce processing framework that gave more scalability and reliability to Nutch.
When Cutting joined Yahoo in 2006, the storage and processing components of Nutch were separated and contributed to the Apache Software Foundation as an open-source software project named Hadoop (after a toy elephant belonging to Cutter’s son). Even then, the focus of Hadoop was to facilitate web searches, but its scalability and reliability remained limited. Ongoing investment by Yahoo increased scalability from dozens of nodes to thousands over a period of several years. Meanwhile, Yahoo began storing more and more data in Hadoop and allowed data scientists to research and analyze this data, which provided the feedback necessary to develop new applications and attain a greater level of maturity in the platform. Externally, the open-source status of Hadoop attracted attention from academics and investors as a general-purpose computing platform more than for its origins as a web search engine.
Hadoop is attractive for general use because of its scale-out architecture on commodity hardware and its support for parallel processing on a large scale. As an Apache open-source project, Hadoop is not a software application but rather it is a framework consisting of multiple modules:
Hadoop Common package It consists of Java libraries and utilities necessary to run Hadoop modules, source code, and documentation.
Hadoop Distributed File System (HDFS) The distributed filesystem is that far-flung array of storage clusters i.e., the Hadoop component that holds the actual data. By default, Hadoop uses the cleverly named Hadoop Distributed File System (HDFS), although it can use other file systems as well. HDFS is like the bucket of the Hadoop system.
It is a distributed file system that replicates large files across multiple nodes in case of potential hardware failure. A file is split into blocks, and then each block is copied to multiple machines. A centralized metadata store called the name node contains the locations for each part of a file.
MapReduce engine A programming framework that supports distributed processing of jobs in parallel, within a cluster of server nodes. A MapReduce program requires a Map() procedure to perform a specific task across multiple nodes, such as a word count, and a Reduce() procedure to consolidate the results from these nodes in summary form. The engine automatically manages distributed processing by partitioning the data to be processed, scheduling the job across nodes, assigning an alternate node if an assigned node fails, and then aggregating the results from each node into a single result.
Hadoop YARN A resource-management platform that controls resources and schedules user applications for Hadoop 2.0 distributions and higher.
From a data-management perspective, one of the distinguishing features of Hadoop as it relates to large data volumes is the movement of processing logic to the data, rather than moving the data to a location where it is processed. Consider a data-warehouse solution in which data is extracted from one or more sources and moved to a new location after transformations have been applied to cleanse and restructure the data. Moving 1 PB of data across a 1-GB network into a data warehouse takes time and consumes network bandwidth. In a Hadoop implementation, after data is added to HDFS, the data remains in place, the processing moves to the data, and only the results of the processing are sent back to the user.
Several Apache open-source projects related to Hadoop are useful for data warehousing and BI applications that rely on a Hadoop implementation:
Pig An infrastructure that uses a scripting language called Pig Latin to generate MapReduce jobs that read data from HDFS, transform data (through filtering, grouping, merging, or splitting the data, for example), and outputs the results to the console or to HDFS. Learn more at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-pig.
Hive A data-warehouse solution made up of a metastore that facilitates access to data stored in HDFS through a scripting language called HiveQL. HiveQL is syntactically similar to SQL, with which database professionals are already familiar. Additional information is available at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-hive.
Sqoop A set of utilities that uses MapReduce and parallelized operations to move data in either direction between Hadoop and a relational database. Read details about Sqoop at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-upload-data.
Oozie A workflow manager that allows you to schedule MapReduce, Pig, Hive, and Sqoop jobs by using hPDL, an XML Process Definition Language is used to define workflows. You can read about it at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-oozie.