When Power Query was first introduced early in 2013 it was known as the Data Explorer. In some ways, Data Explorer was a better name. The primary job of Power Query is to enable Excel users to examine data, decide what values need to be imported into Excel, and then complete the import process. If one wishes to create queries for data in SQL Server or the Analysis Services, there are many far better business intelligence tools than Power Query. Where Power Query shines is its ability to retrieve data from a wide variety of different sources and then consolidate this data into a workbook or an Excel 2013 data model.
Power Query is table-oriented, meaning that it is designed to grab rows and columns from different data sources. If you would like to retrieve the results of a multiple-table join of SQL Server data, it would be both easier and faster to craft a SELECT statement using the Management Studio and then use it to import the data of interest. But what if you wanted to retrieve some data from a Hadoop database? The classic methods of importing data into Excel simply have no provision for accomplishing this task.
Let’s take a quick look at how we can bring Hadoop ”big data” into an Excel 2013 data model.
Like Power Pivot, Power Query is a VSTO (Visual Studio Tools for Office) Add-In. However, unlike Power Pivot you must download it independently and tell Excel to load it as a COM Add-In. You can download Power Query here. As with all COM Add-Ins, it can be loaded from the Options | Add-Ins.
Once Power Query is loaded, we see there are several choices for importing data on the Power Query ribbon menu. We will choose ”From Other Sources”.
After choosing from Hadoop filesystem, we will be presented with a dialog asking for the network address of the Hadoop name node. Either a name or an IP address should do, but myself and others have encountered a problem here. More will be said about this shortly. On the virtual machine I set up testing, the network name of the Hadoop name node worked fine, so I will illustrate the IP address here.
We see the the list of files on the Hadoop server is large, and most are not of interest to us. Fortunately, we can filter this list, just like we can filter any list in Excel itself. In this case, we will limit our list to files with the ”csv” extension.
Notice that the second column in the resulting list contains a link labeled ”binary”. Clicking on ”binary” should load some of the data into the query editor. Rather than load a potentially enormous data set, Power Query only loads enough data to display, behavior consistent with its original name ”Data Explorer”. If you attempt to scroll down, you will see Power Query loading more rows as you go.
In this example, however, clicking on ”Binary” fails to load the data.
Sometimes Power Query doesn’t get it quite right. Internally, Power Query constructs an internal url ”get” request to obtain the HDFS file data via the web interface. In this example, Power Query failed to get the url correct when the IP address was used. Other individuals have reported similar problems with the network name failing. This can be fixed by adding an entry in the ”hosts” file, that is, system32\drivers\etc\hosts. You will need to make this change using administrator privilege.
Now it should work just fine.
Once we have successfully retrieved row data by clicking on ”binary”, we may decide to import the data. Here we must be careful. By default, Power Query will attempt to import the data into an Excel worksheet, and is therefore limited to one million rows.
Chances are you will want to import your data into Power Pivot rather than into a worksheet. Clicking on the Load and Close button in the Power Query editor will give you the option of accepting the default loading or specifying where you want the data.
We do not want the data in a worksheet, so we choose create connection only. We do want the data in Power Pivot, so we choose ”Add to Data Model”. After the data has loaded, we can confirm a successful load by opening the Power Pivot window. We see not only that the data is there, but we also observe that the name of the query in Power Query becomes the table name in Power Pivot, in this case the insipid ”Query3”.
Although it does not have an intuitive interface as do many Microsoft products, Microsoft Power Query provides a practical means for integrating workbooks with data from a Hadoop ”Big Data” store.
To learn more about Power Query & other business intelligence tools, consider attending one of the following courses –