In Part 1, we discussed the installation of Revolution R, and confirmed that the SQL Server integration worked with a simple test.
@language = N’R’,
@script = N’R_data_frame <- InputDataSet’,
@input_data_1 = N’SELECT ”Hello, World!”’,
@output_data_1_name = N’R_data_frame’
WITH RESULT SETS((MyColumn char (13)))
Now let’s take a closer look at this example. The R script simply takes a string and assigns it into a variable, in this R_data_frame. The argument @output_data_1_name is the name of the variable in our script (in this case there is only one) that we would like returned as a conventional SQL Server rowset.
The WITH RESULT SETS clause defines the name and datatype of the T-SQL resultset that will be generated; in this case a single column of type CHAR(13).
Now let’s take a look at a more interesting example. We will use the ContosoRetailDW sample database and will create a view simply to keep the R script as simple as possible. The view simply provides a list of SalesAmount values from the FactSales table.
CREATE VIEW dbo.AsianSalesAmounts
FROM FactSales fs
JOIN DimStore ds
JOIN DimGeography dg
Now we will confirm the usefulness of having the R GUI on the server. In this case, we would like to use the convenient describe( ) function to obtain summary statistics. The describe( ) function is defined in the psych library, which is not installed with R by default. After choosing a mirror website to download this library, we choose Packages | Install pac age(s) … from the menu and find “psych: on the rather long list of available libraries.
Now we are ready to go. Once again we call sp_execute_external_script.
@language = N’R’,
@script = N’
require(psych) # psych library is not installed by default
df <- InputDataSet
@input_data_1 = N’SELECT * FROM AsianSalesAmounts’,
@output_data_1_name = N’results’
WITH RESULT SETS ((vars INT, n INT, mean FLOAT, sd FLOAT, median FLOAT, trimmed FLOAT, mad FLOAT, min FLOAT, max FLOAT, range FLOAT, skew FLOAT, kurtosis FLOAT, se FLOAT))
The script places the query results into the dataframe called “df”, which is then provided as an argument to the describe( ) function. The output of describe( ) is captured in the variable “results” which is named as the @output_data_1_name parameter for the stored procedure. Here we see yet another reason why we need the R GUI. We must describe the columns and their datatypes for the WITH RESULT SETS clause, and the most direct of determing what we need is simply running the script in the GUI and noting the column names. Of course, we can use whatever names for the output columns we want; we are not restricted to the names applied by the R functions. We are also free to take liberties with the datatype; in this case FLOAT was chosen over DECIMAL to avoid hard-coding the necessary precision (precision decisions?).
Running the batch in the Management Studio, we get the following results:
We can use this opportunity to take a closer look at what actually happens when you run an R script in T-SQL.
Here is a look at the Launchpad service in Mark Russinovich’s valuable Process Explorer tool:
If we watch the process explorer while running the T-SQL script we can see processes launch; these processes will terminate when the embedded R script completes.
We have seen that it is surprisingly easy to run R scripts within a T-SQL batch. Of course, in the real world, we would not wish to write specific batches for each task, but would rather embed the R scripts in stored procedures which have the flexibility of being called with different arguments. In the next installment, we will see how to set up a data science workstation so analysts can interactively use the SQL Server R service with SQL Server data.