Mar 11, 2016

Earlier this year Microsoft released Microsoft R Server. This is essentially a rebranding of Revolution R Enterprise, which Microsoft gained through its acquisition of Revolution Analytics in April 2015.

However, the fact that Microsoft is backing the product makes a big difference to many potential corporate users. And with Microsoft embracing R across the company, more investment in Microsoft R Server is surely in the cards.

Microsoft describes it as "R for the enterprise". It is essentially a suite of products and services, comprising:

- Microsoft R Open: a stable version of open source R with some high performance math libraries
- DevelopR: a Windows IDE for developing R applications
- DistributedR: a cluster computing framework for big data analytics
- ScaleR: a package of R functions that support statistical analysis and machine learning on big data
- ConnectR: facilities for connecting to a range of big data sources
- DeployR: a web services SDK for integrating R with other services

As you’d expect from Microsoft, the installation is *fairly* straightforward. You have to install Microsoft R Open before installing Microsoft R Server. Both are standard MSI installers (on Windows).

Running Revolution R Enterprise 8.x (64) launches the R Productivity Environment (RPE). This IDE is similar to the excellent RStudio. One major difference is that result panes (such as plots) are displayed in floating windows. R Tools for Visual Studio is under development and may become the primary Microsoft IDE for R.

RPE can be used to run all the standard R commands/packages. One of the benefits of using Microsoft R Server (or Microsoft R Open) for doing basic R work is that Microsoft has replaced some of the core libraries with high performance ones. This means that R functions that utilize basic core calculations, such as matrix multiplication, will run faster on Microsoft R Open than in open source R.
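A rough way to see this effect is to time a dense matrix product, which R hands off to its underlying math library. The sketch below uses only base R, so it runs on any R distribution; the matrix size is an arbitrary choice, and the timing depends entirely on the machine and on which math library the R build links against.

```r
# Time a dense matrix multiplication, which R delegates to its BLAS library.
set.seed(42)                         # reproducible random input
n <- 500                             # arbitrary size; increase to magnify differences
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

# system.time evaluates the expression in the calling environment,
# so `product` is available afterwards.
timing <- system.time(product <- a %*% b)

print(dim(product))                  # dimensions of the result
print(timing[["elapsed"]])           # wall-clock seconds; varies by machine
```

Running the same script under open source R and under Microsoft R Open on the same machine gives a like-for-like comparison.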

The first thing to note about working with big data in Microsoft R Server is that you can’t just run your standard R scripts and expect them to be magically mapped to a cluster. ScaleR provides a set of R functions designed to operate on a cluster. Most of the common statistical and machine learning techniques have been implemented, and more functions will be added over time.

**ScaleR on Hadoop (probably the most common big data framework) includes:**

- `rxSummary`: basic summary statistics
- `rxLinMod`: fits a linear model
- `rxLogit`: fits a logistic regression model
- `rxGlm`: fits a generalized linear model
- `rxKmeans`: performs k-means clustering
- `rxDtree`: fits a classification or regression tree (using an algorithm developed by Ben-Haim and Yom-Tov)
- `rxDForest`: fits a classification or regression decision forest
- `rxBTrees`: fits a classification or regression decision forest using a stochastic gradient boosting algorithm
- `rxPredict`: calculates predictions for any fitted model

These functions are designed specifically to work with (in this case) Hadoop clusters. Using them is as simple as calling an R function.

There are also general data manipulation functions, as well as functions for controlling Hadoop jobs and interacting with the HDFS file system.
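As an illustration of the HDFS helpers, the sketch below wraps three RevoScaleR functions (`rxHadoopMakeDir`, `rxHadoopCopyFromLocal` and `rxHadoopListFiles`) in a small function. The wrapper name `stageSalesData` and its arguments are hypothetical, and the calls only work on a machine that has RevoScaleR installed and can reach a Hadoop cluster.

```r
# Hypothetical helper: the name and arguments are illustrative only.
# The rxHadoop* functions come from RevoScaleR, which ships with
# Microsoft R Server and is not available on CRAN.
stageSalesData <- function(localFile, hdfsDir) {
  if (!requireNamespace("RevoScaleR", quietly = TRUE)) {
    stop("RevoScaleR is required; it is installed with Microsoft R Server")
  }
  RevoScaleR::rxHadoopMakeDir(hdfsDir)                   # create the target directory
  RevoScaleR::rxHadoopCopyFromLocal(localFile, hdfsDir)  # upload the local file
  RevoScaleR::rxHadoopListFiles(hdfsDir)                 # list the directory contents
}
```

On a configured client this might be called as `stageSalesData("sales.csv", "/data/sales")`, where both paths are placeholders.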

**Analyses using Microsoft R Server and Hadoop generally proceed as follows:**

- Start Microsoft R Services
- Specify the Hadoop NameNode
- Create a compute context for Hadoop
- Create a data source
- Summarize your data
- Fit a model to your data
- Make predictions using the model
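The NameNode and compute context steps above might be sketched as follows. This assumes RevoScaleR is installed and a Hadoop cluster is reachable; the helper name `useHadoopContext` is hypothetical, and the default values `"default"` and `0` are placeholders to be replaced with your own NameNode host and port.

```r
# Hypothetical helper: the function name and defaults are illustrative.
# RxHadoopMR, rxSetComputeContext and RxHdfsFileSystem come from RevoScaleR,
# which ships with Microsoft R Server and is not available on CRAN.
useHadoopContext <- function(nameNode = "default", port = 0) {
  if (!requireNamespace("RevoScaleR", quietly = TRUE)) {
    stop("RevoScaleR is required; it is installed with Microsoft R Server")
  }
  # Describe the cluster, then make it the active compute context so that
  # later rx* calls are submitted as Hadoop jobs rather than run locally.
  cluster <- RevoScaleR::RxHadoopMR(nameNode = nameNode, port = port)
  RevoScaleR::rxSetComputeContext(cluster)
  # Return an HDFS handle that data sources can refer to.
  RevoScaleR::RxHdfsFileSystem(hostName = nameNode, port = port)
}
```

The returned object plays the same role as the `hdfs` value passed as `fileSystem` when a data source is created.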

All the steps are covered in the RevoScaleR Hadoop Getting Started Guide. In this article we’ll briefly cover creating a data source and analysing the data.

Start with an HDFS object

`hdfs <- RxHdfsFileSystem()`

Then create the data source

`dataSource <- RxTextData(file="/data/sales", missingValueString="?", fileSystem=hdfs)`

To summarize the sales and profit figures use

`rxSummary(~sales+profit, data=dataSource)`

Fit a linear model of profit as a function of sales

`salesProfitLinearModel <- rxLinMod(profit~sales, data=dataSource)`

Get the data you want to make predictions for

`predictionDataSource <- RxTextData(file="/data/newSales", missingValueString="?", fileSystem=hdfs)`

Predict the profit levels using the linear model and the new sales data

`rxPredict(salesProfitLinearModel, data=predictionDataSource)`
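To try the same summarize, fit and predict sequence on a workstation before pointing it at a cluster, base R offers close analogues in `summary`, `lm` and `predict`. The sketch below is not ScaleR code, and the data frame is invented purely for illustration (profit is constructed as 2 + 0.1 × sales).

```r
# Synthetic stand-in for the sales file; the values are invented for illustration.
salesData <- data.frame(
  sales  = c(100, 150, 200, 250, 300),
  profit = c(12, 17, 22, 27, 32)        # constructed as 2 + 0.1 * sales
)

# Base R analogue of rxSummary(~sales+profit, ...).
print(summary(salesData[, c("sales", "profit")]))

# Base R analogue of rxLinMod: model profit as a linear function of sales.
localModel <- lm(profit ~ sales, data = salesData)

# Base R analogue of rxPredict: score new sales figures.
newSales <- data.frame(sales = c(120, 400))
predictedProfit <- predict(localModel, newdata = newSales)
print(predictedProfit)
```

The formula syntax is the same in both cases; what changes with ScaleR is where the data lives and where the computation runs.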

As you can see from this brief introduction, if you’re comfortable using R, Microsoft R Server gives you a direct route to big data analytics. If you’ve been doing statistical analysis and machine learning in R at the workstation level, the functions in ScaleR shouldn’t contain any surprises.

If you’re interested in big data analytics or R you may wish to consider the following Learning Tree courses: