Better Performance for R Functions with Vectorization

R-language-logo-224x136

One of the best things about R is that so many talented people have contributed a multitude of  valuable packages. In fact there are so many, that for most tasks you don’t have to do any programming at all. However, there are always going to be problems that require custom treatment and that will demand some custom programming. R is rarely anyone’s first programming language, so many people bring to R experience they have gained in programming languages such as C# and Java. Unfortunately, some very fundamental technique that are fine for C# and Java can be inferior performers in the R environment. Most notable among these is the use of loops. In R, loops will, in general, perform more poorly than the same logic expressed using vectors.

To illustrate this, we will choose some simple problems from mathematics rather than statistics, so we can focus on the loop logic itself. We will start by doing a simple sum. We’ll use the harmonic series for our example, but the performance principles apply equally to any sum.

Of course, we will need to measure performance. We could do this ourselves using system.time, but our recurring theme is that someone else has already done it in R (and usually better!) For code performance measurements, many R developers prefer the package microbenchmark.

Here are two versions of a function that sums the terms of the harmonic series up to the value given by the parameter “limit” The first function does this with a conventional for loop and the second accomplishes the same task by creating a vector with a number of terms given by “limit”.

MyLoop01 <- function(limit)

{

p = 0.0

for (n in limit:1)

{

p = p + 1/n

}

return(p)

}

MyVector01 <- function(limit)

{

n=seq(limit,1, by=-1)

p=sum(1/n)

return(p)

}

While it is not relevant to our present concern with performance, it is worth noting that when summing a series in which the values are decreasing, it is numerically more accurate to sum starting with the smallest values first.

Now we can compare the performance of the two functions using microbenchmark:

> microbenchmark(MyLoop01(5000),MyVector01(5000))

Unit: microseconds

expr                         min            lq               mean         median         uq              max      neval

MyLoop01(5000) 1928.236 2042.922 2162.0970 2069.4740 2166.087 2868.48 100

MyVector01(5000) 167.790     171.136    185.4027    176.9375     194.787   262.84 100

Note that the default for microbenchmark is to evaluate each expression 100 times.

Clearly, in this comparison the vectorized version is the hands-down winner.

Of course, it is easy to see how to apply vectorization to a simple summation operation. In more complex cases, however, it is often difficult to see the vector solution or even see if such a solution exists. We’ll illustrate this with a calculation of pi using the Gregory-Leibniz series. Mathematicians will be quick to point out that this is a poor way to calculate pi, since the series converges very slowly. But our goal is not calculating pi, our goal is examining the performance benefit that be be achieved using vectorization.

Here is a formula for the Gregory-Leibniz series:
Gregory

Here is the Gregory-Leibniz series in summation notation:

GregorySummation
The straightforward implementation using an R loop would look like this:

MyLoop02=function (limit)

{

p = 0

for (n in limit:0) {

p = (-1)^n/(2 * n + 1) + p

}

return(p)

}

We see that this formula is easily translated into an R function using a loop, but what about vectorization?

We immediately notice the terms involve 3, 5, 7 and so forth, so our vector cannot be simply the counting numbers. More problematic is the reversal of sign with each term. How can we manage that in a vector style? In the previous example, we saw that an R sequence can be defined using whatever increment we wish. Choosing an increment of two would provide values of 1,3,5, etc, but what we actually need is an increment of four. If the values in the R vector increment by four, then the formula must contain two terms, not one. One term can then be positive and the other negative.

MyVector02=function (limit)

{

n = seq(1, limit, by = 4)

p = sum(1/n – 1/(n + 2))

return(p)

}

 

When n is 1, the formula calculates 1-1/3. When n is 5, the formula calculates 1/5-1/7, giving us a sequence that sums to the Gregory-Leibniz series.

Once again, using microbenchmark, we can compare the performance of the classic loop style function with the vector form.

> microbenchmark(MyLoop02(5000),MyVector02(5000))

Unit: microseconds

expr                          min            lq            mean           median     uq                max       neval

MyLoop02(5000) 4996.636 5177.142 5457.8816 5346.047 5612.4575 7294.811   100

MyVector02(5000)    61.136    66.268     81.2932       73.631     91.7045   275.781   100

In this case, the vectorized version proves to be an even bigger winner when compared with the simple loop version.

Conclusion

In R programming, code vectorization can often provide an enormous performance benefit over the classic “for” loop. This benefit is sufficiently great that it is worth seeking vectorized solutions in cases where such a solution is not immediately obvious.

Type to search blog.learningtree.com

Do you mean "" ?

Sorry, no results were found for your query.

Please check your spelling and try your search again.