How to Use Power Query to Import Hadoop Data into Excel

When Power Query was first introduced early in 2013 it was known as the Data Explorer. In some ways, Data Explorer was a better name. The primary job of Power Query is to enable Excel users to examine data, decide what values need to be imported into Excel, and then complete the import process. If one wishes to create queries for data in SQL Server or the Analysis Services, there are many far better business intelligence tools than Power Query. Where Power Query shines is its ability to retrieve data from a wide variety of different sources and then consolidate this data into a workbook or an Excel 2013 data model.

What is Power Query

Power Query is table-oriented, meaning that it is designed to grab rows and columns from different data sources. If you would like to retrieve the results of a multiple-table join of SQL Server data, it would be both easier and faster to craft a SELECT statement using the Management Studio and then use it to import the data of interest. But what if you wanted to retrieve some data from a Hadoop database? The classic methods of importing data into Excel simply have no provision for accomplishing this task.

How to Import Data into Excel – A Power Query Tutorial

Let’s take a quick look at how we can bring Hadoop “big data” into an Excel 2013 data model.

Like Power Pivot, Power Query is a VSTO (Visual Studio Tools for Office) Add-In. However, unlike Power Pivot you must download it independently and tell Excel to load it as a COM Add-In. You can download Power Query here.  As with all COM Add-Ins, it can be loaded from the Options | Add-Ins.

Once Power Query is loaded, we see there are several choices for importing data on the Power Query ribbon menu. We will choose “From Other Sources”.

Access data with Power Query
Power Query can access data from a wide variety of sources, including the Hadoop file system.

After choosing from Hadoop filesystem, we will be presented with a dialog asking for the network address of the Hadoop name node. Either a name or an IP address should do, but myself and others have encountered a problem here. More will be said about this shortly. On the virtual machine I set up testing, the network name of the Hadoop name node worked fine, so I will illustrate the IP address here.

Connecting to the Hadoop Name Node
Connecting to the Hadoop Name Node

We see the the list of files on the Hadoop server is large, and most are not of interest to us. Fortunately, we can filter this list, just like we can filter any list in Excel itself. In this case, we will limit our list to files with the “csv” extension.

Files in the Hadoop file system
Files in the Hadoop file system
Power Query Interface
The user interface for filtering data is identical to the familiar Excel interface
Managing the file list in Power Query
The file list is more manageable now that we are only looking at the csv files of movie data

Notice that the second column in the resulting list contains a link labeled “binary”. Clicking on “binary” should load some of the data into the query editor. Rather than load a potentially enormous data set, Power Query only loads enough data to display, behavior consistent with its original name “Data Explorer”. If you attempt to scroll down, you will see Power Query loading more rows as you go.

When Power Query Doesn’t Get it Right!

In this example, however, clicking on “Binary” fails to load the data.

Why "Unable To Connect" Error
The “Unable To Connect” error message is not particularly informative

Sometimes Power Query doesn’t get it quite right. Internally, Power Query constructs an internal url “get” request to obtain the HDFS file data via the web interface. In this example, Power Query failed to get the url correct when the IP address was used. Other individuals have reported similar problems with the network name failing. This can be fixed by adding an entry in the “hosts” file, that is, system32\drivers\etc\hosts. You will need to make this change using administrator privilege.

New line added to hosts file.
New line added to hosts file.

Now it should work just fine.

Query Editor With Movie Data
Now we can see the data we want!

Importing the Data

Once we have successfully retrieved row data by clicking on “binary”, we may decide to import the data. Here we must be careful. By default, Power Query will attempt to import the data into an Excel worksheet, and is therefore limited to one million rows.

If your dataset is small, why is it in Hadoop?

Chances are you will want to import your data into Power Pivot rather than into a worksheet. Clicking on the Load and Close button in the Power Query editor will give you the option of accepting the default loading or specifying where you want the data.

Menu for loading data from Power Query into Excel
Menu for loading data from Power Query into Excel
Loading the data into a worksheet or into Power Pivot
Where should the data go? Loading the data into a worksheet or into Power Pivot.

We do not want the data in a worksheet, so we choose create connection only. We do want the data in Power Pivot, so we choose “Add to Data Model”. After the data has loaded, we can confirm a successful load by opening the Power Pivot window. We see not only that the data is there, but we also observe that the name of the query in Power Query becomes the table name in Power Pivot, in this case the insipid “Query3”.

Query Results In Power Pivot
Query Results In Power Pivot

Conclusion

Although it does not have an intuitive interface as do many Microsoft products, Microsoft Power Query provides a practical means for integrating workbooks with data from a Hadoop “Big Data” store.

To learn more about Power Query & other business intelligence tools, consider attending one of the following courses –

Type to search blog.learningtree.com

Do you mean "" ?

Sorry, no results were found for your query.

Please check your spelling and try your search again.