Earlier this year Microsoft released Microsoft R Server. This is essentially a rebranding of Revolution R Enterprise, which Microsoft gained through its acquisition of Revolution Analytics in April 2015.
However, the fact that Microsoft is backing the product makes a big difference to many potential corporate users. And with Microsoft embracing R across the company, more investment in Microsoft R Server is surely in the cards.
Microsoft describes it as "R for the enterprise". It’s basically a suite of services and products, comprising
As you’d expect from Microsoft, the installation is fairly straightforward. You have to install Microsoft R Open before installing Microsoft R Server. Both are standard MSI installers (on Windows).
Running Revolution R Enterprise 8.x (64) launches the R Productivity Environment (RPE). This IDE is similar to the excellent RStudio. One major difference is that result panes (such as plots) are displayed in floating windows. R Tools for Visual Studio, which is under development, may become the primary Microsoft IDE for R.
RPE can be used to run all the standard R commands/packages. One of the benefits of using Microsoft R Server (or Microsoft R Open) for doing basic R work is that Microsoft has replaced some of the core libraries with high performance ones. This means that R functions that utilize basic core calculations, such as matrix multiplication, will run faster on Microsoft R Open than in open source R.
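One rough way to see the effect for yourself is to time a dense matrix multiplication in each R distribution. The following is only a sketch using base R; the matrix size and seed are arbitrary choices for illustration, and the timings will depend on your hardware and on the math library your R build is linked against.

```r
# Time a dense matrix multiplication. How fast this runs depends largely on
# the linear algebra library the R installation uses.
set.seed(42)                           # arbitrary seed, for repeatability
n <- 500                               # hypothetical matrix dimension
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

# system.time() evaluates the expression and reports the time taken
timing <- system.time(product <- a %*% b)
print(timing[["elapsed"]])             # wall-clock seconds
```

Running the same script in open source R and in Microsoft R Open on the same machine gives a like-for-like comparison.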
The first thing to note about working with big data in Microsoft R Server is that you can’t just run your standard R scripts and expect them to be magically mapped to a cluster. ScaleR provides a set of R functions designed to operate on a cluster. Most of the common statistical and machine learning techniques have been implemented, and more functions will be added over time.
ScaleR on Hadoop (probably the most common big data framework) includes:
rxSummary: basic summary statistics
rxLinMod: fits a linear model
rxLogit: fits a logistic regression model
rxGlm: fits a generalized linear model
rxKmeans: performs k-means clustering
rxDTree: fits a classification or regression tree (using an algorithm developed by Ben-Haim and Yom-Tov)
rxDForest: fits a classification or regression decision forest
rxBTrees: fits a classification or regression decision forest using a ...
rxPredict: calculates predictions for any fitted model
These functions are designed specifically to work with (in this case) Hadoop clusters. Using them is as simple as calling an R function.
There are also general data manipulation functions, as well as functions for controlling Hadoop jobs and interacting with the HDFS file system.
Analyses using Microsoft R Server and Hadoop generally proceed as follows:
All the steps are covered in the RevoScaleR Hadoop Getting Started Guide. In this article we’ll briefly cover creating a data source and analysing the data.
Start with an HDFS object
hdfs <- RxHdfsFileSystem()
Then create the data source
dataSource <- RxTextData(file="/data/sales", missingValueString="?", fileSystem=hdfs)
To summarize the sales and profit figures use
rxSummary(~sales+profit, data=dataSource)
Fit a linear model of profit against sales
salesProfitLinearModel <- rxLinMod(profit~sales, data=dataSource)
Get the data you want to make predictions for
predictionDataSource <- RxTextData(file="/data/newSales", missingValueString="?", fileSystem=hdfs)
Predict the profit levels using the linear model and the new sales data
rxPredict(salesProfitLinearModel, data=predictionDataSource)
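For comparison, here is roughly what the same summarize, fit and predict workflow looks like on a workstation with base R. This is only an illustrative sketch: the data frame, its values and the variable names are made up, and it uses summary, lm and predict rather than the ScaleR functions.

```r
# Hypothetical in-memory data standing in for the HDFS sales file
sales_data <- data.frame(
  sales  = c(100, 150, 200, 250, 300),
  profit = c(12, 17, 25, 29, 36)
)

# Summarize the sales and profit figures
print(summary(sales_data[, c("sales", "profit")]))

# Fit a linear model of profit against sales
profit_model <- lm(profit ~ sales, data = sales_data)

# Hypothetical new sales figures to predict profit for
new_sales <- data.frame(sales = c(180, 320))
predicted_profit <- predict(profit_model, newdata = new_sales)
print(predicted_profit)
```

The shape of the code is the same in both cases. The main differences are the data source object and the rx-prefixed function names.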
As you can see from this brief introduction, if you’re comfortable using R, Microsoft R Server gives you a direct route to big data analytics. If you’ve been doing statistical analysis and machine learning in R at the workstation level, the functions in ScaleR shouldn’t contain any surprises.