How to get started with PolyBase in SQL Server 2016
One of the more intriguing features included with SQL Server is PolyBase, a transparent access layer that facilitates connectivity between the database engine and external data sources containing unstructured or semi-structured data. PolyBase is optimized for data warehouse workloads and analytical query processing, making it easier than ever to merge big data into the SQL Server universe. It is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
PolyBase was developed at the Microsoft Jim Gray Systems Lab at the University of Wisconsin-Madison under the direction of Dr. David DeWitt, a Microsoft Technical Fellow. It provides an interface that allows you to work with data stored in HDFS by using SQL syntax in PDW queries—in a manner similar to querying a linked server from SQL Server—rather than MapReduce jobs. You can even use PolyBase to join relational data in PDW with data in HDFS, as shown in Figure 6.7. In addition, you can use PolyBase to move data from PDW to HDFS or vice versa. Furthermore, you can use Power Query or Power Pivot to connect to PDW and use PolyBase to import data from HDFS into Excel.
FIGURE 6.7 : PolyBase as HDFS bridge between Hadoop and PDW.
Exploring the benefits of PolyBase
The most obvious benefit of the availability of PolyBase in PDW is the ability to combine both relational and nonrelational data into a single result set, but there are several others. In particular, database professionals already familiar with developing SQL queries to retrieve data from PDW for reporting and analytical applications have nothing new to learn when they need to query nonrelational data. There is no need to learn MapReduce, nor is there any need to learn how to use the other tools in the Hadoop ecosystem, such as HiveQL, Pig, or Sqoop. Existing SQL skills are sufficient.
Another benefit is faster results from queries to HDFS. PolyBase is able to perform read and write operations in parallel much faster by taking advantage of the massively parallel processing (MPP) of PDW. Whereas using Sqoop is effective for moving data in and out of a relational database, it processes data serially and interfaces with the PDW control node. By contrast, PolyBase not only parallelizes data transfers, but also moves data directly from Hadoop data nodes to PDW compute nodes, as shown in Figure 6.8.
FIGURE 6.8 : Parallel data transfer between PDW compute nodes and Hadoop data nodes.
Lastly, PolyBase is highly flexible. It is not limited to a single operating system or Hadoop distribution. In addition, it supports any type of HDFS file format. This means that you can use PolyBase to deliver data from all types of unstructured sources across the entire Microsoft BI stack. You can connect to PDW with ad hoc analysis tools like Excel and Power BI or distribute standard reports by using Reporting Services. You can even use data from PDW in Analysis Services multidimensional or tabular models to enrich it with business logic and other BI features.
Accessing HDFS data from PDW
The work that PolyBase performs to retrieve data from HDFS is transparent during queries. The only requirement is that you create an external table to define the schema that PDW can then query. You can then interact with data in HDFS files in multiple ways, either by moving data between systems or by querying both systems and joining the results.
Let’s say you want to export data from a PDW table called FactInventory and store the results in a text file on your on-premises Hadoop cluster. You use Create Table As Select syntax to create an external table and transfer data from PDW into a file in HDFS, like this:
CREATE EXTERNAL TABLE FactInventory_Export
(LOCATION = hdfs://10.10.10.100:8020/data/FactInventory_Export.txt, FORMAT OPTIONS (FIELD_TERMINATOR = ‘|’)
AS SELECT * FROM FactInventory;
Another option is to create an external table that references data which already stored in HDFS, as in this example:
CREATE EXTERNAL TABLE ServerLogs( machineName varchar(50), eventDate date,
LOCATION = hdfs://10.10.10.100:8020/data/logs.txt’,
FORMAT_OPTIONS( FIELD_TERMINATOR=’|’, DATE_FORMAT = ‘MM/dd/yyyy’
You can then write queries that reference both PDW and external tables pointing to HDFS, as shown here:
SELECT sl.machineName, m.machineDescription, m.machineStartDate, sl.eventDat FROM ServerLogs sl
JOIN DimMachine m
ON sl.machineName = m.machineName