Among SQL Server’s many new features introduced in 2016 is Polybase, a bridge between relational data stored in SQL Server and bulk data stored in the Hadoop ecosystem.
Hadoop is an excellent store for what many folks call “semistructured” data. That is, data which cannot be neatly broken up into the row-and-column structure of a conventional relational database. But there are times when the SQL language might be the best way to query data even in Hadoop. Polybase is a set of services that permits analysts using SQL Server 2016 to write queries on Hadoop data. Indeed, Hadoop data can be joined to SQL Server relational data in a single Transact-SQL query. This functionality is not integrated into SQL Server itself; it must be installed separately.
The installation program for SQL Server 2016 offers the Polybase services as an installation option. In the screenshot below, we are also installing the Advanced Analytics Extensions. These are the extensions required for the R services and, while useful with Polybase, are not required. For more information about the Advanced Analytics Extensions read here.
Hadoop itself, as well as the majority of closely allied services that are collectively referred to as the “Hadoop ecosystem” are largely written in Java. It is not surprising, therefore, that Polybase requires Java to communicate with and query Hadoop. If you do not already have an appropriate Java Runtime Engine (RTE) installed, the installation of Polybase will be unable to proceed.
While there are alternate open-source Java implementations available, Oracle’s Java is the gold standard. In my opinion, however, Oracle’s website does not make it particularly easy to find the java installer you want. You must be sure to install a 64-bit Java Runtime, version 7 update 51 or greater. I was successful using the installer file jre-8u66-windows-x64.exe.
Once you have installed Java 7 update 51 or higher, the Polybase installation should succeed. If you are running more than one instance of SQL Server, only one instance can have Polybase installed.
For testing purposes, I installed SQL Server as a standalone Polybase-enabled instance and accepted the default ports. Since this is a community technology preview, the effort of installing a scale-out group would have to be expended again when SQL Server 2016 is released to manufacturing and may not be worth the time.
Once the installation is complete, you should see two polybase services in the services applet. These services will shut down if you shut down the parent SQL Server, but the polybase services will not automatically restart when SQL Server is started up again. You must remember to restart them manually.This is important to remember, since the necessary server configuration changes to use Hadoop data will require restarting the server.
Even with Polybase installed, is not configured to permit Hadoop connectivity by default. Curiously, Hadoop connectivity and polybase mode are not considered advanced configuration options, and the current settings can be viewed like any other server configuration setting.
EXEC sp_configure ‘hadoop connectivity’;
As of CTP 3.2, there are only a few supported systems, and the target hadoop version must be specified in the configuration. I was unsuccessful getting Cloudera 5.5 (not on the list) to work. I suspect that Polybase makes calls to a Java API that is not universally supported. I was able to successfully query Hortonworks 2.2.4 on Linux.
EXEC sp_configure ‘hadoop connectivity’, 7;
reconfigure WITH OVERRIDE
Option 0: Disable Hadoop connectivity
Option 1: Hortonworks HDP 1.3 on Windows Server
Option 1: Azure blob storage (WASB[S])
Option 2: Hortonworks HDP 1.3 on Linux
Option 3: Cloudera CDH 4.3 on Linux
Option 4: Hortonworks HDP 2.0 on Windows Server
Option 4: Azure blob storage (WASB[S])
Option 5: Hortonworks HDP 2.0 on Linux
Option 6: Cloudera 5.1 on Linux
Option 7: Hortonworks 2.1 and 2.2 on Linux
Option 7: Hortonworks 2.2 on Windows Server
Option 7: Azure blob storage (WASB[S])
You will have to restart the server for the new configuration value to take effect. RECONFIGURE WITH OVERRIDE is not sufficient.
Of course, if you choose the wrong configuration value you will not get an error at this point; the errors will come later when you attempt to define an external table in SQL Server.You can download a free Hadoop “sandbox” virtual machine from Hortonworks if you are anxious to give Polybase a try.
Installing Polybase is straightforward, and there is little likelihood of any unpleasant surprises. However, you must have a 64-bit Java runtime engine installed at least as recent as version 7, update 51. I would encourage you to download the Hortonworks sandbox, even if you have your own Hadoop system. It can be a time saver to experiment with a Hadoop implementation that is known to work with Polybase before beginning trials with a different system.