Enthusiastic data miners are excited about the new possibilities which have opened up with Microsoft R. Not everyone, however, has realized that the so-called “in-database” R is not the best choice for every application of R to SQL Server data. One database administrator expressed concern about R competing with SQL Server for precious RAM. A valid concern, but in a sense it misses the point. Since in-database R will certainly compete for RAM, it is important to use this architecture only where the pros clearly outweigh the cons. Let’s take a look at the different fundamental architectures that can be applied to an SQL Server/Microsoft R system.
Calling R scripts from a stored procedure, or any T-SQL for that matter, has caught many people’s attention, but this technique is suitable only for specialized tasks. This would not be a wise choice for any sort of exploratory data analysis using R Rather, this technique is appropriate for tasks where the rapid analysis of new data is critical and always with a clearly defined algorithm. One common example is association analysis, often called “market basket” analysis. Using T-SQL to pass new order information to an R analysis script could rapidly provide shopping recommendations as well as add the new data to the recommendation model.
Another example where new data would benefit from in-database R is the detection of potentially fraudulent transactions. Outliers that indicate unusual transactions can be rapidly identified by in-database R.
It seems reasonable to assume that a great many applications of R to SQL Server data will involve analysts extracting summaries from large sets of data. In this case, an R node installed on the analyst’s workstation would submit commands to the in-database R. No stored procedures need call R scripts in such a system. The primary advantage lies in the fact that there would be no necessity for moving large datasets across the network. At the workstation’s command, large datasets could be processed on the server side and only the reduced or summarized data need be returned to the workstation.
Ironically, in the enthusiasm for in-database R with SQL Server, many have forgotten that the oldest techniques are often still the best. If the data volumes are not large, an analyst can simply query the data from R just as any client application queries data. Although ODBC is ancient technology by today’s standard, for the core task of what it was designed to do, query relational data, ODBC remains a terrific library. JDBC is a viable alternative to ODBC and has one especially attractive benefit. Many data scientists have come to appreciate the dplyr library, which uses JDBC, for its ease of use.
Of course, classic client-server style access of SQL Server data could be done with any implementation of R and would not require Microsoft R Services. However, Microsoft R Services provides a very interesting and enormously useful option. Rather than copy rowsets of data into workstation memory in preparation for analysis, the workstation asks the server to copy the data into an XDF file. Thanks to compression algorithms, which can be selected to balance time versus size, XDF files are substantially smaller than the corresponding query resultset. And, since XDF files need not be read into RAM in their entirety, this method enables Microsoft R to work on very large datasets one manageable chunk at a time, avoiding the necessity of maintaining the entire dataset in RAM all at once. The XDF format as well as the specialized algorithms that analyze xdf data was designed at Revolution Analytics, now part of Microsoft. XDF files are not available to users of R implementations other than Microsoft R Services.
For analysts who prefer dplyr, the Revolution Analytics folks have prepared an additional layer, dplyrXdf, which integrates the use of dplyr with XDF files.
As is the case with any software system, choosing an architecture appropriate to the problem is an important factor for success. For example, “recommender” systems using R will benefit from Microsoft R in ways quite different than analysis applying ad-hoc data mining techniques and exploratory data analysis.