Introduction to Microsoft Power Query for Excel
Power Pivot is a great tool for gathering together data from disparate sources and combining it into a single model for analysis, but it presumes that you know the data exists, where to find it, and how to use DAX to create calculated columns for simple restructuring or cleansing of your data. For discovering data and to use more advanced techniques for transforming data, you can use Power Query. Power Query is a separate downloadable add-in for the following Excel versions: Office Professional Plus 2013, Office 365 Professional Plus, and the standalone edition of Excel 2013.
Note You can download Power Query for Excel (32-bit or 64-bit) from http://www.microsoft.com/en-us/download/details.aspx?id=39379 for the January 2014 release, although a newer version might be available. Search for Power Query in the Microsoft Download Center (http://www.microsoft.com/en-us/download) to locate the latest version by release date. You can also install this add-in for Excel 2010 if you are using Microsoft Office Professional Plus 2010 with Software Assurance through Volume Licensing.
At a minimum, you use Power Query to find the data that you need and then view it in a table after filtering it and shaping it to meet your requirements. You can also load the data retrieved by Power Query into the Data Model so that you can build PivotTables, PivotCharts, or Power View reports. Of course, after the data is in the Data Model, you can further enhance it by using Power Pivot. If you have an Office 365 subscription with Power BI enabled, you can share the queries that you develop, find and use existing shared queries, and monitor the use of shared queries.
Searching for data
One of the distinguishing features of Power Query is its ability to help you find not only internal data sources (if your organization uses Power BI) but also data from public data sources. Internal data sources are curated queries that designated users publish for others to use. To use this feature, you must click Sign In on the Power Query tab on the ribbon and enter your Office 365 logon credentials. Currently, public data sources include only sources based in the United States. The current collection that you can search (subject to change) includes the following data sources:
- Dun & Bradstreet financial data (sample data only)
- MCH Strategic Data (sample data only)
- Open Government data (Data.gov)
- Windows Azure Marketplace
- The World Bank
You start a search by clicking Online Search on the Power Query tab. You then type one or more keywords in the search box. If you used the Sign In button to log on to Office 365, a drop-down list appears to the right of the search box that allows you to restrict your search by choosing one of the following options:
All This is the default selection, with no restriction on which data sources to search.
My Shared This selection forces Power Query to search only the shared queries that you have published to Power BI.
Public When you select this option, Power Query returns only a set of public data sources.
Organization With this selection, Power Query includes all shared queries that you have permission to see, whether created by you or someone else.
A list of results is displayed as a series of pages in the Online Search pane, as shown in Figure 4.4. When you point to one of the results, a preview flyout screen displays a sample of the selected data source, in addition to a list of the columns the data source contains, the last modified date, the name of the source, and a link to the source. Your search keywords are highlighted in yellow.
At the bottom of the preview flyout screen, you can click Add To Worksheet to import the data unchanged into a new worksheet. If you need to modify the data first, click Edit instead. Then you can perform one or more of the steps described in the “Shaping data” section later in this chapter.
FIGURE 4.4 : Power Query online search results.
If you already know where to find the data you want, you can import it directly without performing a search. To do this, click the applicable option for your data source type in the Get External Data group on the Power Query tab (shown in Figure 4.5), and then select the data source.
FIGURE 4.5 : Get External Data options on the Power Query tab.
You can import from the following types of data sources:
Web Provides the URL for a web page containing data in a table format that Power Query can scrape or for a supported file type stored on a web server.
File Select a file type and then browse to the file location to select it. You can choose from the following types of files: Excel, CSV, XML, and text. You can even import metadata about files stored in a folder, such as file name, file name extension, date modified, and path.
Database Select a database type and provide a server name and optionally a database name. You can even provide a SQL statement if you prefer not to work with all the columns or rows of a selected table. You can choose from the following types of relational data sources: SQL Server, Windows Azure SQL Database, Access, Oracle, IBM DB2, MySQL, PostgreSQL, Sybase, and Teradata.
Other Sources Select a source type and then respond to the prompt to provide location information about the source. You can choose from the following types of other sources: SharePoint list, O Data feed, Windows Azure Marketplace, Hadoop, Windows Azure HDInsight, Windows Azure Blob Storage, Windows Azure Table Storage, Active Directory, Exchange, and Facebook. You can also choose Blank Query and then type a Power Query formula that extracts and manipulates data to meet your requirements more specifically.
Table Select a table in the workbook before clicking From Table on the ribbon. As an alternative, you can select a range of cells in a worksheet and then click From Table. Power Query first converts the range of cells to a table and then imports the data into a query.
Note You can learn more about how to work with a specific type of data source at http://office.microsoft.com/en-us/excel-help/import-data-from-external-data-sources HA104003952.aspx.
When working with certain types of data sources, such as relational tables, you have the option to import multiple items from the same source, as shown in Figure 4.6, by selecting the Select Multiple Items check box at the top of the Navigator pane. You can then select the check box for each item to import. When you point to a single item, a preview flyout screen displays a sampling of the data. A separate query is added to the workbook for each item you choose to load.
FIGURE 4.6 : Select multiple items from a source in the Navigator pane and preview data in a selected item.
Note If you import multiple related tables from a relational source at the same time, Power Query detects the existing relationships and automatically adds them to the Data Model. However, if you import more related tables later, any existing relationships between the previously imported tables and the currently selected tables are ignored and not imported into the Data Model.
Loading the worksheet or Data Model
When you finish working with a query, you have the options to load the data either in a worksheet, or into the Excel Data Model, or to both locations. If you want to use the data with Power View or Power Map, you must load the data into the Data Model. This option is not enabled until the download of data from the data source is complete. You use the check boxes at the bottom of the Query Settings pane to make your selection, as shown in Figure 4.7.
FIGURE 4.7 : Load settings in the Query Settings pane.
After using the online search option or importing data, you can manipulate the data in a variety of ways, reshaping the data by applying a series of transformations, filtering it, and splitting columns, to name just a few operations. If you import multiple data sets that share common data columns, you can combine these data sets into a single table to make analyzing easier.
To shape data for a particular query, you must open the Query Editor. You can do this by clicking Workbook on the Manage Queries group on the Power Query tab and then double-clicking the query in the Workbook Queries pane. Another way to open the Query Editor is to open the worksheet containing the table of data associated with the query, click the Query tab below the Table Tools tab, and then click Edit Query on the ribbon. You can then use the options on the Query Editor ribbon, shown in Figure 4.8, to reduce the number of rows or columns in the query results, sort data, apply a transformation such as splitting a column, create a new column, or combine data from multiple queries by merging or appending data. You can also launch these operations by right-clicking a row or column to display a submenu of commands.
FIGURE 4.8 : The Query Editor ribbon.
The process of shaping data can involve one or more steps. As a simple example, let’s look at the data from an online search, shown in Figure 4.9. In this example, the 2010 Land Area column contains data for both square miles and square kilometers, which can better be used in reports if the data is split into separate columns and restricted to the numeric portion of the data. Another column (not visible) contains the latitude and longitude information that should likewise be separated.
FIGURE 4.9 : Data in the Query Editor before transformation.
To edit the query, select the column in the grid and then use the ribbon or the submenu that is displayed when you right-click the column to apply a new step. For example, to separate the data in the 2010 Land Area column, you can use the Split Column By Delimiter transformation. When you specify this type of transformation, a dialog box prompts you for more instructions. In this case, you must choose Custom in the Select Or Enter Delimiter list, type sq mi in the next box, and then specify the position of the delimiter to use for splitting, as shown in Figure 4.10.
FIGURE 4.10 : The Split A Column By Delimiter dialog box displaying a custom delimiter definition.
When you close the dialog box, Power Query creates two columns with the same name and a numeric value appended to uniquely identify each column, as shown in Figure 4.11. At this point, you can rename the columns to more clearly distinguish between the two and continue to apply transformations to produce query results that are more useful in reports.
FIGURE 4.11 Query results after applying the Split Column By Delimiter transformation
Power Query includes the following types of transformations for cleansing and restructuring your data:
Filter Click the arrow icon in a column to display a list of distinct values in the column, and then select the values to keep in the query results. You also have access to text, number, or date filters, just as you do when you use the Excel filter feature on a regular worksheet column. Another way to filter is to right-click a cell containing a value that you want to keep in or exclude from the results, point to Text Filters (or Number Filters or Date Filters), and then select a comparison operator such as Equals or Does Not Equal, among others. The comparison operator is applied to the selected cell value, and the rows are filtered according to the criteria you set.
Sort Select a column, and then click Sort Ascending or Sort Descending on the Query Editor ribbon. If you continue by applying a sort direction to additional columns, the initial sort remains intact, and the additional columns are sorted in the order selected. Click the arrow icon in the column and select Clear Sort to remove the sort transformation from the query steps.
Group rows Click Group By on the Query Editor ribbon, and then select one or more fields to use for grouping rows. You must provide a name for a new column that’s created to hold the aggregated value for the grouped rows, select the aggregate function to use (such as Sum or Count Rows), and specify the column to be aggregated.
Expand column Certain operations return a column of complex values, which is analogous to associating a table of columns and rows with each row in the data grid. You can click the expand icon in the column header and then select the columns to add to the data grid. The expand icon looks like this:
Aggregate When you have a column of complex values, click the expand icon in the column header, select the Aggregate option button in the column drop-down list, point to one of the aggregate functions (such as Sum of SalesAmount), and then select one or more of the available aggregation functions: Sum, Average, Minimum, Maximum, Count (All), and Count (Not Blank).
Insert index or custom column You can add an index column by clicking the Insert Index Column on the Query Editor ribbon. Each row is numbered consecutively, beginning with zero for the first row. Another option is to create a column to contain values calculated from a query formula that you define. Click Insert Custom Column to open a query formula box for the new column and then provide a query formula to calculate a result for each row in the column.
Remove column(s) You can right-click a column you want to delete and then select Remove on the submenu to eliminate the column from the query results. As an alternative, you can right-click a column you want to retain and then select Remove Other Columns to reduce the query results to the single column you selected.
Remove error rows When you want to eliminate rows containing errors from the query results, right-click a column containing error rows and select Remove Errors on the submenu.
Promote row to column headers If the first row of the data grid contains column headers, click the table icon in the upper-left corner of the grid, and then select Use First Row As Headers in the submenu.
Split column Separate a single column into two or more columns by using delimiters or a fixed number of characters.
Merge column The merge operation requires all columns to have the text data type, so you might need to change the data type of some columns first. Select the column, and then select a data type in the Data Type list on the Query Editor ribbon. While pressing the Ctrl key, select two or more columns to merge, and then select a separator to insert between column values, such as a comma or a space. A new column replaces the selected columns in the query results. If you prefer to create a new column with the merged column values, click Insert Custom Column on the Query Editor ribbon and use a formula to concatenate columns, like this: = [City] & “ “ & [State]
As you perform each operation on the data, notice that new steps appear in the Query Settings pane. Figure 4.12 shows the set of steps required to reshape the 2010 Land Area and Location columns. You can click any of the steps to see the shape of the data after the selected step was applied. If you try a transformation that fails to produce the result you want, you can delete the step by clicking the X icon that is displayed to the left of the step when you point to the step’s name. If you need to change a step setting, such as a delimiter value for the Split Column By Delimiter transformation, click the gear icon to the right of the step’s name.
FIGURE 4.12 : The Query Settings pane displaying a series of transformation steps applied to a query.
As you select a step in the Query Settings pane, the formula bar above the data grid displays the query formula, as shown in Figure 4.13. After you learn the Power Query Formula Language, you can build more complex expressions and customize a transformation to better meet your needs.
FIGURE 4-13 The formula bar displaying an example of a query formula for a Split Column By Delimiter transformation.
Note More information about the Power Query Formula Language is available at http://office.microsoft.com/en-us/excel-help/learn-about-power-query-formulas-HA104003958.aspx.
When you finish editing a query, confirm that you have selected the correct Load Settings and then click Apply & Close on the Query Editor ribbon. The data loads as a table in a worksheet, as a table in the Data Model, or both, depending on your Load Settings selections.
If your workbook contains two or more queries that have a column in common—such as when one query returns Sales Header data and another query returns Sales Detail with a common sales order number column in each query—you can use an inline merge to merge data as a step in a single query. You do this by clicking the table icon in the upper-left corner of the grid and then selecting Merge. Your other option is to use an intermediate merge to create a separate query for each merge, which you launch by opening the query to use as the primary table and then clicking Merge on the Query Editor ribbon. In the latter case, your workbook contains multiple queries.
Whichever option you use, you select the second table for the merge operation in the Merge dialog box and then select the matching column(s) for the primary and secondary tables, as shown in Figure 4.14. Power Query compares the data in both tables to determine how many rows will match, which in turn helps you judge the quality of the merge operation. By default, the Only Include Matching Rows check box is not selected, which means the query results might contain rows for which the columns from the secondary table are null when no match exists between the tables, similar to a left outer join in a relational query. However, if you select this check box, the query results contain only rows that match, which could be fewer than the number of rows in the primary table, similar to an inner join in a relational query.
FIGURE 4-14 The Merge dialog box displaying a selection of common columns for primary and secondary tables.
When you click OK in the Merge dialog box as a step in an intermediate merge operation, a new Query Editor window opens to display the query results for the new query. You must then expand the table link column—the last column in the data grid—to add the new columns in the grid alongside the columns from the primary table.
Instead of expanding an existing query by adding columns from a second query, you can expand the existing query by adding rows from a second query, which you can achieve by performing an append operation. As for a merge operation, you can choose an inline append or an intermediate append. Launch an inline append operation by clicking the table icon in the upper-left corner of the data grid and then selecting Append. To launch an intermediate append operation, click Append on the Power Query ribbon. Either way, launching the append operation opens the Append dialog box, in which you must specify the secondary table to append to the primary table. When you click OK, the query results are displayed in the data grid.