The RevoScaleR functions provided by Microsoft in Machine Learning Server and Microsoft R Client are many and varied, as are the parameters passed to control their behavior. While this might seem a bit overwhelming at first, it is really quite simple. Except for the fact that ETL generally reads and writes to multiple locations, many of the RevoScaleR function parameters are directly analogous to the ETL tasks with which you are already familiar.
Fortunately, many parameters are common to a large number of RevoScaleR functions; we need only learn them once and can apply them to many functions. Others, of course, are more specific. In our quick look, we will use rxDataStep as our example; it is an invaluable intermediate processing step that can sit between the original data and its final application.
Data can come from a wide variety of sources: commonly files and database sources. When data is read, there are several important questions to be answered: Do you want all the rows or a subset? Do you want all the columns? Should rows with missing data be skipped?
Input rows can be filtered with logical expressions passed to the rowSelection parameter.
There are two options for rows with missing data. If removeMissingsOnRead is set to TRUE, rows with missing data will not be read. If removeMissings is TRUE, rows with missing data are read, and are therefore available for some summary data, but are not included in the output.
If you only want a few columns out of many, you can specify a list of “varsToKeep”. Conversely, if you want most of the columns, you can specify a list of “varsToDrop”. Using one or the other of these parameters is very important, since we always wish to use our available memory as efficiently as practical.
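As a minimal sketch of these reading options, the call below combines a row filter, a column list, and missing-value handling in one rxDataStep call. It assumes RevoScaleR is installed and uses the AirlineDemoSmall.xdf sample file that ships with it; the file and its column names (ArrDelay, CRSDepTime, DayOfWeek) are illustrative choices, not part of the discussion above.

```r
library(RevoScaleR)

# Sample XDF file shipped with RevoScaleR (assumed to be available).
inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")

# Keep two of the three columns, keep only weekend flights, and leave
# rows with a missing value out of the result.
weekend <- rxDataStep(
  inData         = inFile,
  varsToKeep     = c("ArrDelay", "DayOfWeek"),
  rowSelection   = DayOfWeek %in% c("Saturday", "Sunday"),
  removeMissings = TRUE
)

# With no outFile specified, the result comes back as a data frame.
head(weekend)
```

Because no output file is given, the filtered result must fit in memory; for larger results an outFile would normally be supplied.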
Data frames have been the natural structure for data in R since R first began. By definition, a data frame is a structure that must fit entirely in RAM. In Microsoft ML Server, this limitation can be circumvented by the use of XDF files. The key feature of XDF files that makes them so valuable in R is that the internal structure is organized in blocks that can be read and processed individually. The totality of the data may be too much for RAM, but each of 10 smaller blocks of data may be perfectly fine.
Microsoft uses the word “chunking” to refer to the processing of individual blocks in RAM.
Microsoft R Client is also capable of reading and writing XDF files, but cannot use blocking and chunking with XDF files to divide data into manageably sized chunks for analysis.
Assigning a number to rowsPerRead will create the output in blocks, which can be processed one at a time in preference to reading the entire dataset into RAM. Remember that this is only possible with the server version. R Client can only deal with data frames loaded entirely into RAM.
If a positive value is assigned to rowsPerRead, then blocksPerRead is ignored. blocksPerRead provides a way of “re-blocking” an output XDF file. For example, if blocksPerRead is set to 2 (assuming we don’t do anything else), then the output file will have half as many blocks as the input, each twice as large.
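A short sketch of controlling block size with rowsPerRead might look like the following. The sample file, the temporary output path, and the figure of 50,000 rows are all assumptions chosen for illustration; on R Client, where chunking is not available, the resulting block layout may differ.

```r
library(RevoScaleR)

# Sample XDF file shipped with RevoScaleR (assumed to be available).
inFile  <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")

# Hypothetical scratch location for the re-blocked copy.
outFile <- tempfile(fileext = ".xdf")

# Read 50,000 rows at a time; each read becomes one block in the output.
# Because rowsPerRead is positive, blocksPerRead is ignored.
rxDataStep(
  inData      = inFile,
  outFile     = outFile,
  rowsPerRead = 50000,
  overwrite   = TRUE
)

# Report the number of rows, variables, and blocks, plus block sizes.
rxGetInfo(outFile, getBlockSizes = TRUE)
```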
With standard ETL techniques, data transformation is a distinct step. In RevoScaleR functions, however, the transformation can take place within the function itself. One of the most important reasons for this is that, with large data sets, performance may improve substantially if the RevoScaleR function can perform many operations on a single pass through the data.
The transformation step can add columns and redefine existing columns. Standard R functions can be used or custom functions can be applied. We shall consider the details of RevoScaleR transformation in an upcoming blog.
Right now, the most important thing is to emphasize that we want to apply all of the transformations in a single function call, adhering to an important principle of performance:
Perform multiple transformations on a single pass through the data rather than read the same data multiple times.
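To make the principle concrete, here is a hedged sketch in which one rxDataStep call filters rows and derives two new columns while the data is read once. The sample file, the new column names (ArrDelayHours, IsLate), and the 15-minute threshold are illustrative assumptions.

```r
library(RevoScaleR)

# Sample XDF file shipped with RevoScaleR (assumed to be available).
inFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")

# One pass through the data: select Monday flights and add two derived
# columns, rather than running a separate step for each change.
monday <- rxDataStep(
  inData       = inFile,
  rowSelection = DayOfWeek == "Monday",
  transforms   = list(
    ArrDelayHours = ArrDelay / 60,   # delay expressed in hours
    IsLate        = ArrDelay > 15    # TRUE when more than 15 minutes late
  )
)

head(monday)
```

The same result could be produced with three separate calls, but each call would read the data again.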
For most RevoScaleR functions, the output of the reading and transformation will be passed directly to some data analysis step, perhaps a multiple linear regression. However, rxImport and rxDataStep can save their output. Often, the preferred output format will be an XDF file, but this is not a requirement. Virtually any of the data sources from which we can read data can also serve as a destination for the processed results.
When we are saving the results of processing, there are two parameters we might set:
Generally, you may wish to save the output to an XDF file if the filtering and transformation would otherwise need to be done often. If you are frequently applying the same filters and the same transformations in preparation for, say, a neural net or a decision tree, then you might wish to prepare a new XDF file in which those changes are stored and the results reused.
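The following sketch shows one way such a reusable file might be prepared, using the rxDataStep arguments outFile (where to write) and overwrite (whether to replace an existing file). The sample input, the temporary output path, the derived Late column, and the follow-up model are assumptions made for illustration only.

```r
library(RevoScaleR)

# Sample XDF file shipped with RevoScaleR (assumed to be available).
inFile   <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.xdf")

# Hypothetical location for the prepared data; a real project would use
# a permanent path rather than a temporary file.
prepared <- tempfile(fileext = ".xdf")

# Filter and transform once, then store the result for later reuse.
rxDataStep(
  inData       = inFile,
  outFile      = prepared,
  varsToDrop   = "CRSDepTime",
  rowSelection = !is.na(ArrDelay),
  transforms   = list(Late = ArrDelay > 15),
  overwrite    = TRUE
)

# Later analyses read the prepared file directly instead of repeating
# the filtering and transformation.
rxSummary(~ ArrDelay, data = prepared)
fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = prepared)
```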
The rxImport function is a close cousin to rxDataStep. Indeed, there is a great deal of overlap in their functionality. In general, Microsoft suggests using rxImport to import data from heterogeneous sources and rxDataStep to manipulate data frames and XDF files. A feature of rxImport that might prove valuable when importing data is the ability to generate composite sets of XDF output files, in which the data is spread across multiple XDF files rather than concentrated in one. Clearly, this can be useful for architectures using distributed file systems.
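As a rough illustration of that division of labor, the sketch below uses rxImport to bring a delimited text file into a single XDF file, which rxDataStep could then process. The AirlineDemoSmall.csv sample file, its "M" missing-value code, and the temporary output path are assumptions. Composite sets are requested through rxImport's createCompositeSet argument, which is left at its default here.

```r
library(RevoScaleR)

# Sample CSV file shipped with RevoScaleR (assumed to be available).
csvFile <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv")

# Hypothetical scratch location for the imported data.
xdfFile <- tempfile(fileext = ".xdf")

# Import the text file into XDF. In this sample file, missing values are
# coded as "M", and string columns are converted to factors.
airXdf <- rxImport(
  inData             = csvFile,
  outFile            = xdfFile,
  missingValueString = "M",
  stringsAsFactors   = TRUE,
  overwrite          = TRUE
)

# Confirm what was imported: row count, column names, and types.
rxGetInfo(airXdf, getVarInfo = TRUE)
```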
RevoScaleR functions must support many parameters to achieve the power and flexibility required in data analysis. However, when we focus on the core essentials, we see that the parameters for reading, transforming, and writing data are simple, flexible, and easy to apply.