Three tasks are necessary if we are to query Hadoop data with Polybase. We must define the connection information, we must define how the semi-structured Hadoop data is to be parsed, and we must specify the row-and-column format for the parsed data that we will use to write SQL queries. In Polybase, these critical definitions correspond to a data source, a file format, and an external table.
The data source definition is fairly straightforward. At present, only the hdfs protocol is supported. 8020 is Hadoop’s default port for this protocol.
IF EXISTS (
WHERE name = N’TestDataSource_01′
DROP EXTERNAL DATA SOURCE TestDataSource_01
CREATE EXTERNAL DATA SOURCE TestDataSource_01 WITH
TYPE = Hadoop
— webhdfs not supported, http not supported
, LOCATION = ’hdfs://192.168.169.132:8020’
–, RESOURCE_MANAGER_LOCATION = ’192.168.169.128:50050’ — optional
–, CREDENTIAL = <credential_name, sysname, sample_credential>
The SQL language was defined for data with a row-and-column structure. This structure must exist even in Hadoop data if we are to query it using Polybase. As of SQL CTP 3.2 the end of rows in the Hadoop files must be marked with a carriage return. No other character will do. There is more flexibility with the column separator. However, if you are just beginning to experiment with Polybase it would be wise to stick with a comma-separated value (CSV) format. Once again, this is standard and works; we can leave experimentation with less common formats until after our new system has been successfully tested.
It is probably also wise to include the double-quote character as the text delimiter. Specifying the double-quote as the text-delimiter does not require every row to have this delimiter, but it does avoid some problems with other special characters simply by putting the text value for that column between quotes.
IF EXISTS (
WHERE name = N’csv’
DROP EXTERNAL FILE FORMAT csv
CREATE EXTERNAL FILE FORMAT csv WITH
FORMAT_TYPE = DELIMITEDTEXT
–,SERDE_METHOD = N'<serde_method, nvarchar(255), sample_serde_method>’
FIELD_TERMINATOR = N’,’
,STRING_DELIMITER = N'”’ — not all sting columns must be enclosed in quotes. Only those that contain special characters, like commas
–,DATE_FORMAT = N'<date_format, nvarchar(50), sample_date_format>’
,USE_TYPE_DEFAULT = TRUE
–, DATA_COMPRESSION = N'<compression_method, nvarchar(255), sample_compression_method>’
–, DATA_COMPRESSION = ’org.apache.hadoop.io.compress.DefaultCodec’
In this example, we use the movies.csv file from the movie lens example data set. There are rows in this data, that will cause our parser to choke. This will not become evident when the table is defined. Only when the table is queried will SQL Server attempt to read the Hadoop file and throw a run time error. The original data set is available here: http://grouplens.org/datasets/movielens/
In this example, the first column of the csv file consists of numerals representing integers. However, SQL Server will not successfully convert these numerals into integers, so the first column must be declared as some character datatype.
CREATE EXTERNAL TABLE dbo.Movies (
movieId varchar(200), — cannot set to INT
title varchar(200) NULL ,
genres varchar(200) NULL
LOCATION=’/user/movielens/movies.csv’ — hdp sandbox
, DATA_SOURCE = TestDataSource_01
, FILE_FORMAT = csv
— DROP EXTERNAL TABLE dbo.movies
— DROP TABLE dbo.movies — will not work, you must use DROP EXTERNAL TABLE
The objects we have just created are available in the SSMS Object Explorer.
Once the table is created, we are ready to write a SELECT query. It is only here that problems with the declared text format or the format of the data itself will be discovered. A variant of the movies.csv file in which I have fixed the offending is available here. In my opinion, rows that cannot be successfully parsed should be copied to a log file, or perhaps returned as messages to the client, but should not cause the query to fail. Unfortunately, this is not the case, placing fairly stringent constraints on the Hadoop data that can be queried using T-SQL.
Here are some examples of working queries:
SELECT TOP 10 * FROM dbo.movies
SELECT MovieId, Title, Genres FROM dbo.movies –not case sensitive
SELECT movieId, title FROM dbo.movies WHERE title LIKE ’%1998%’
SELECT COUNT(*) FROM dbo.movies WHERE title LIKE ’%1998%’
It is interesting to note that the TOP 10 query above returns the first 10 rows, as it should, but that the entire table is read from Hadoop.
By making Hadoop data available to T-SQL queries, Polybase promises to greatly expand the power of SQL Server. However, the stringent requirements placed on the Hadoop data store can greatly limit some practical applications of this technology.