A Problem with R

Data Batching - R Programming


The value of R lies in the enormous quantity of code contributed by analysts and academic researchers over many years, providing packaged solutions not only for common analytical techniques but also for the esoteric and the obscure. The problem with R, and one that concerns many analysts dealing with large data volumes, is that it still relies on a traditional batch processing mode, with a single thread and all the data in RAM. While this suits the needs of many statisticians and researchers just fine, today’s ever-increasing data loads can easily exceed the resources of even the largest computing hardware.

It is not too difficult to find a dataset that will halt R with an out-of-memory error. Even if fast hardware and plenty of RAM avoid an error, loading an entire dataset into memory all at once can reduce performance to unacceptable levels. One approach, exemplified by the Hadoop ecosystem, is to distribute the computation across multiple machines, and indeed Microsoft R Services has several options for accomplishing this. A second approach, and the one we’ll describe here, is to break the data into chunks and then develop algorithms that work on only one chunk at a time, eliminating the need to read all the data into memory at once.
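
To make the idea concrete, here is a minimal sketch of chunk-at-a-time processing in plain base R, before any XDF machinery enters the picture. It totals one numeric column of a large csv file while holding only 100,000 rows in memory at any moment; the file name and column position are hypothetical.

# Chunk-at-a-time total of a numeric column; only one chunk is ever in RAM.
con <- file("C:/Temp/big.csv", open = "r")
header <- readLines(con, n = 1)   # consume the header row
total <- 0
repeat {
  chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 100000),
                    error = function(e) NULL)   # NULL when no rows remain
  if (is.null(chunk)) break
  total <- total + sum(chunk[[3]])              # accumulate column 3
  if (nrow(chunk) < 100000) break               # final, partial chunk
}
close(con)
total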

The XDF File

Revolution Analytics, now wholly owned by Microsoft, attacked this problem by developing a flexible new file format for data that they dubbed “XDF”. Much like a conventional database file, the structure of an XDF file permits subsets of the data to be retrieved from disk without having to read the entire file into RAM as a single unit. XDF files are not available in Microsoft R Open; they are implemented in the RevoScaleR libraries and are only available in Microsoft R Services.

RevoScaleR provides functions to create XDF files from a variety of inputs. Most published examples focus on csv input, but SPSS and SAS files, among others, can also serve as sources. Inputs are not restricted to files: query results can be read directly from relational data sources into XDF format.
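
As a sketch of what a non-csv import might look like, RevoScaleR’s RxSasData data source can be handed to rxImport in place of a file path; the file names here are hypothetical.

require(RevoScaleR)

# Import a SAS dataset into XDF via the RxSasData data source.
# Both file paths are hypothetical.
sasData <- RxSasData("C:/Temp/fs.sas7bdat")
temp <- rxImport(inData = sasData, outFile = "C:/Temp/fsFromSAS.xdf")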

XDF Files from CSV

We will take a brief look at creating XDF files from csv input, but will not dwell on it because many examples are provided in the Microsoft R documentation. We start with a file fs.csv, which includes all the rows and a subset of columns from the FactSales table in the ContosoRetailDW sample database. The following R command builds an XDF file and returns a reference that can be used for further examination and processing of the file if desired.

temp <- rxImport(inData = "C:/Temp/fs.csv", outFile = "C:/Temp/fsFromCSV.xdf", rowsPerRead = 100000)

We can see a substantial reduction in the file size resulting from the default compression.

(Figure: CSV and XDF file sizes)

It is important to remember that the XDF advantage goes well beyond simple compression when using functions that process data from XDF files in chunks rather than working on the entire dataset at once.
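
For example, the RevoScaleR analysis functions stream an XDF file chunk by chunk, so a summary over the full sales data never requires loading it wholesale. A minimal sketch, assuming SalesAmount is one of the imported FactSales columns:

# Summarize a column straight from the XDF file; RevoScaleR reads the
# data in chunks rather than pulling the whole table into RAM.
# SalesAmount is an assumed column name from the FactSales import.
rxSummary(~ SalesAmount, data = "C:/Temp/fsFromCSV.xdf")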

XDF Files from Queries

Clearly, in many cases it will be easier and faster to read data into XDF files directly from a relational data source. Curiously, in my experience the default compression achieves less reduction this way, even though the row data is identical.

require(RevoScaleR)
require(RODBC)

# Open an ODBC connection to the ContosoRetailDW database
ch01 <- odbcDriverConnect("DRIVER={SQL Server Native Client 11.0};SERVER=localhost\\RTest;Database=ContosoRetailDW;Trusted_Connection=Yes")

# Time the import of the query results into an XDF file
ptm <- proc.time()
test <- rxImport(sqlQuery(ch01, "SELECT * FROM FactSalesByDate('2007','2010')"),
                 outFile = "C:\\Temp\\fsFromQuery.xdf", append = "none", overwrite = TRUE)
proc.time() - ptm

For this set of experiments I created a table-valued function to keep the inline query in the R code small, but that step is purely a convenience; any valid SQL query is fair game. On my test VM, this query took about 30 seconds to run in Management Studio and about 75 seconds to run while creating the XDF file in R.

A Quick Aside

Notice that the csv example and the query example used two different methods to specify the file path. The forward slash works as-is; if you wish to use the backslash, you must ‘escape’ it with a second backslash.
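
In other words, these two literals name the same file:

# Two equivalent ways to write the same Windows path in R
path1 <- "C:/Temp/fs.csv"     # forward slashes work as-is
path2 <- "C:\\Temp\\fs.csv"   # each backslash must be escaped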

The Resulting XDF File

(Figure: file sizes for XDF files prepared from a csv file and from a SQL query)

The XDF file created from the query is significantly larger; the cause of this is not entirely clear. The rxGetInfo function can be used to examine the characteristics of the XDF file and the variables within:

rxGetInfo("C:/Temp/fsFromCSV.xdf", getVarInfo = TRUE)

When converting from a csv file or from a query, there is a good chance that datatype conversion decisions will be made differently. In this case, the numeric (as opposed to integer) data was converted to float32 when read from the csv and remained numeric when read from the query results. This difference, however, does not account for the entire disparity between the XDF files obtained from the two methods.
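
If the conversion matters, it can be pinned down explicitly: rxImport accepts a colClasses argument that forces a column to a particular storage type on import. A brief sketch, again assuming a SalesAmount column:

# Force SalesAmount to full-precision numeric instead of float32.
# The column name is an assumption about the FactSales schema.
temp <- rxImport(inData = "C:/Temp/fs.csv",
                 outFile = "C:/Temp/fsTyped.xdf",
                 colClasses = c(SalesAmount = "numeric"),
                 overwrite = TRUE)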

Appending XDF Files

Once an XDF file is created, additional resultsets with the same column structure can be appended to it, whether the results come from another csv file, another query, or another XDF file. This is accomplished by setting the append parameter to “rows” rather than the default “none”. Note that the target XDF file must already exist; rxImport will not create a new file for appending.

temp <- rxImport("C:/Temp/fs2009.xdf", outFile = "C:/Temp/fs.xdf", append = "rows")
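
This makes a year-at-a-time load straightforward. The following sketch consolidates several per-year XDF files, creating the target on the first pass and appending thereafter; the per-year file names are hypothetical.

# Consolidate yearly XDF files into one; create on the first pass,
# append rows on every subsequent pass. File names are hypothetical.
for (yr in 2007:2010) {
  yearFile <- sprintf("C:/Temp/fs%d.xdf", yr)
  temp <- rxImport(yearFile, outFile = "C:/Temp/fsAll.xdf",
                   append = if (yr == 2007) "none" else "rows",
                   overwrite = (yr == 2007))
}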

Conclusions

XDF files are an effective weapon in the ongoing battle between data and available RAM, and familiarity with them will be a necessity for any data scientist adopting the Microsoft Enterprise R platform. True geeks who wish to learn more about how disk files can be used as extended memory for statistical algorithms can read “Algorithms and Data Structures for External Memory” by Jeffrey Scott Vitter.
