SQL Server 2016 – Analytic Revolution with R: Part 3

Implementing a Data Science Workstation


SQL Server Architecture for R and Data Science

We have previously seen how to integrate R with SQL Server. While executing R scripts in stored procedures is tremendously valuable, the method lacks the interactivity required by most analysts. To appreciate how SQL Server’s R services can provide this interactivity, we must consider the client/server architecture of an analytical system.

In Part 1, we saw that SQL Server requires a Revolution Enterprise node to execute R code. In a client/server architecture the workstation must be capable of launching R code on a different machine. Of course, that machine should be the one running SQL Server if we wish to analyze large query result sets with the best performance. To achieve this communication, the workstation must also have a Revolution Enterprise node installed.

The SQL Server node can, of course, run any R code, but that code must be installed on the server. Running it from a workstation is no more difficult than calling a stored procedure, as we have seen. The interactivity demanded by analysts is a bit trickier to achieve. Revolution Enterprise nodes on physically separate machines can communicate with Microsoft’s implementation of MPI, the Message Passing Interface used in many of the world’s fastest massively parallel supercomputers. Special R functions provided by Revolution Analytics take advantage of this interface and can execute code on remote machines. “Regular” R code, that is, R code that does not use these special functions, executes on the local machine only. As we shall see, the R functions created to run remote code in the R service are collected in their own library, called RevoScaleR.

Installing the R Client Node in the Data Science Workstation

The installation package for a workstation is different from (and larger than) the server installation package. It can be downloaded from https://www.microsoft.com/en-us/download/details.aspx?id=50351. As with the server node, you should first install Revolution R Open 3.2.2. After installing this conventional R GUI provided by Revolution, you can step through the installation process for the enterprise node. The only real decision during the install is whether to have R automatically load the valuable rpart and lattice packages. There seems to be little reason not to. The enterprise node includes a second R IDE, called “Revolution R Enterprise,” based on the Visual Studio shell and designed especially for remote R execution.

Creating a RevoScaleR Script

A script for remote execution differs from a regular R script in two important ways. First, a “compute context” must be defined; this gives R the necessary information to identify and connect to the remote node where code is to be executed. Second, only functions in the RevoScaleR library are capable of remote execution, and all RevoScaleR functions begin with the prefix “rx.” Normal R functions will execute in the memory of the workstation regardless of whether a remote compute context is available. In some cases, there are remote-enabled “rx” versions of standard R functions.
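As a minimal sketch of this convention (assuming a hypothetical local data frame named localSales; inDataSource is the remote data source defined later in this article), the standard lm() function always fits its model in workstation memory, while its RevoScaleR counterpart rxLinMod() runs wherever the active compute context dictates:

fit <- lm(SalesAmount ~ 1, data = localSales) # standard R: always runs on the workstation

fit <- rxLinMod(SalesAmount ~ 1, data = inDataSource) # "rx" function: honors the active compute context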

We begin by requiring the RevoScaleR library:

require(RevoScaleR)

An ODBC connection string is defined, which can be used both for establishing the compute context and for querying data. The string used for querying the data need not be the same as the one used for the compute context.

Note: In these tests, the remote SQL Server and the data science workstation were both VMware virtual machines running on the same host. This puts some constraints on the connection parameters. On a regular network, Windows authentication is preferable, and the parameter Network=dbnmpntw, which specifies the use of named pipes, should not be necessary.

sqlConnString <- "Driver=SQL Server;Server=192.168.206.133;Database=ContosoRetailDW;Network=dbnmpntw;Uid=sa;Pwd=sapw"
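On a regular network, the same string using Windows authentication would drop the SQL login and the named-pipes parameter. Here is a sketch (the variable name is ours; Trusted_Connection is the standard ODBC keyword, and the server address is the one from this article's test setup):

sqlConnStringTrusted <- "Driver=SQL Server;Server=192.168.206.133;Database=ContosoRetailDW;Trusted_Connection=Yes"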

The compute context requires a directory for file storage of temporary data structures, should that be necessary.

sqlShareDir <- paste("c:\\Temp\\", Sys.getenv("USERNAME"), sep="")
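If the directory does not already exist on the workstation, a line like the following (our addition, not part of the original script) creates it:

dir.create(sqlShareDir, recursive = TRUE, showWarnings = FALSE) # silently succeeds if the directory is already there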

Finishing up the definition of the compute context:

sqlWait <- TRUE

sqlConsoleOutput <- FALSE

sqlCompute <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir, wait = sqlWait, consoleOutput = sqlConsoleOutput)

Once the required information is defined and the context object is built, we make it the active compute context:

rxSetComputeContext(sqlCompute)

The newly defined compute context doesn’t do anything yet; it comes into play later, when functions that operate on remote data are actually invoked.

If need be, the compute context can be changed. The context can be reset to the default using

rxSetComputeContext("local")
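RevoScaleR also provides rxGetComputeContext(), which returns the currently active context object; this makes a handy scripted check to complement the IDE’s status bar:

rxGetComputeContext()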

The currently active compute context is shown in the status bar:

[Screenshot: the Revolution R Enterprise status bar showing the active compute context, ready]

Here’s our plain vanilla R in preparation for querying data:

queryString <- "SELECT * FROM AsianSalesAmounts"

“AsianSalesAmounts” is a view created to keep the actual SELECT statement from cluttering the script. The view returns a single column, SalesAmount, from the Contoso FactSales table. We can now define a data source:

inDataSource <- RxSqlServerData(sqlQuery = queryString, connectionString = sqlConnString)

The rxGetInfo and rxSummary calls were added to the script simply to illustrate remote execution:
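rxGetInfo(inDataSource, getVarInfo = TRUE)

rxSummary(formula = ~SalesAmount, data = inDataSource)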

The status bar of Revolution R Enterprise tells us that the remote server is busy:

[Screenshot: the status bar indicating that the remote server is busy]

We can use the Windows Task Manager for crude confirmation that the R code is executing remotely.

[Screenshot: Task Manager on both virtual machines; the SQL Server R Services VM on the left shows activity from the R code execution.]

The last line of the script illustrates a very important point about the interactivity of Revolution R Enterprise. Building a histogram can be an intensive task for larger datasets; we can build the histogram on the server, but the workstation is happy to display the plot for us.

rxHistogram(~SalesAmount, rowSelection = SalesAmount <= 30000, data = inDataSource)

[Plot: histogram of SalesAmount produced by rxHistogram]
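rxHistogram accepts further tuning arguments; for instance (an assumption based on the RevoScaleR documentation rather than anything in the original script), numBreaks controls the number of bins:

rxHistogram(~SalesAmount, rowSelection = SalesAmount <= 30000, data = inDataSource, numBreaks = 50)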

Reprise: The Entire R Script

require(RevoScaleR)

sqlConnString <- "Driver=SQL Server;Server=192.168.206.133;Database=ContosoRetailDW;Network=dbnmpntw;Uid=sa;Pwd=sapw"

sqlShareDir <- paste("c:\\Temp\\", Sys.getenv("USERNAME"), sep="")

sqlWait <- TRUE

sqlConsoleOutput <- FALSE

sqlCompute <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir, wait = sqlWait, consoleOutput = sqlConsoleOutput)

rxSetComputeContext(sqlCompute)

queryString <- "SELECT * FROM AsianSalesAmounts"

inDataSource <- RxSqlServerData(sqlQuery = queryString, connectionString = sqlConnString)

rxGetInfo(inDataSource, getVarInfo = TRUE)

rxSummary(formula = ~SalesAmount, data = inDataSource)

rxHistogram(~SalesAmount, rowSelection = SalesAmount <= 30000, data = inDataSource)

Conclusion

The ability to embed R scripts in T-SQL benefits the dashboard style of business intelligence: a fixed set of procedures can execute against changing data and provide current results to a tool like Microsoft Excel or Reporting Services. The client/server approach is better suited to EDA, or exploratory data analysis. In this approach, analysts need the flexibility to dissect data in a variety of ways and rapidly change direction based on what they find. The pairing of the Revolution R Enterprise workstation with SQL Server seems an excellent platform for accomplishing the goals of EDA.
