Jul 19,
2016

One of the best things about R is that so many talented people have contributed a multitude of valuable packages. In fact there are so many, that for most tasks you don’t have to do any programming at all. However, there are always going to be problems that require custom treatment and that will demand some custom programming. R is rarely anyone’s first programming language, so many people bring to R experience they have gained in programming languages such as C# and Java. Unfortunately, some very fundamental technique that are fine for C# and Java can be inferior performers in the R environment. Most notable among these is the use of loops. In R, loops will, in general, perform more poorly than the same logic expressed using vectors.

To illustrate this, we will choose some simple problems from mathematics rather than statistics, so we can focus on the loop logic itself. We will start by doing a simple sum. We’ll use the harmonic series for our example, but the performance principles apply equally to any sum.

Of course, we will need to measure performance. We could do this ourselves using system.time, but our recurring theme is that someone else has already done it in R (and usually better!) For code performance measurements, many R developers prefer the package **microbenchmark**.

Here are two versions of a function that sums the terms of the harmonic series up to the value given by the parameter “limit” The first function does this with a conventional for loop and the second accomplishes the same task by creating a vector with a number of terms given by “limit”.

MyLoop01 <- function(limit)

{

p = 0.0

for (n in limit:1)

{

p = p + 1/n

}

return(p)

}

MyVector01 <- function(limit)

{

n=seq(limit,1, by=-1)

p=sum(1/n)

return(p)

}

While it is not relevant to our present concern with performance, it is worth noting that when summing a series in which the values are decreasing, it is numerically more accurate to sum starting with the smallest values first.

Now we can compare the performance of the two functions using microbenchmark:

> microbenchmark(MyLoop01(5000),MyVector01(5000))

Unit: microseconds

expr min lq mean median uq max neval

MyLoop01(5000) 1928.236 2042.922 2162.0970 2069.4740 2166.087 2868.48 100

MyVector01(5000) 167.790 171.136 185.4027 176.9375 194.787 262.84 100

Note that the default for microbenchmark is to evaluate each expression 100 times.

Clearly, in this comparison the vectorized version is the hands-down winner.

Of course, it is easy to see how to apply vectorization to a simple summation operation. In more complex cases, however, it is often difficult to see the vector solution or even see if such a solution exists. We’ll illustrate this with a calculation of pi using the Gregory-Leibniz series. Mathematicians will be quick to point out that this is a poor way to calculate pi, since the series converges very slowly. But our goal is not calculating pi, our goal is examining the performance benefit that be be achieved using vectorization.

Here is a formula for the Gregory-Leibniz series:

Here is the Gregory-Leibniz series in summation notation:

The straightforward implementation using an R loop would look like this:

MyLoop02=function (limit)

{

p = 0

for (n in limit:0) {

p = (-1)^n/(2 * n + 1) + p

}

return(p)

}

We see that this formula is easily translated into an R function using a loop, but what about vectorization?

We immediately notice the terms involve 3, 5, 7 and so forth, so our vector cannot be simply the counting numbers. More problematic is the reversal of sign with each term. How can we manage that in a vector style? In the previous example, we saw that an R sequence can be defined using whatever increment we wish. Choosing an increment of two would provide values of 1,3,5, etc, but what we actually need is an increment of four. If the values in the R vector increment by four, then the formula must contain two terms, not one. One term can then be positive and the other negative.

MyVector02=function (limit)

{

n = seq(1, limit, by = 4)

p = sum(1/n – 1/(n + 2))

return(p)

}

When n is 1, the formula calculates 1-1/3. When n is 5, the formula calculates 1/5-1/7, giving us a sequence that sums to the Gregory-Leibniz series.

Once again, using microbenchmark, we can compare the performance of the classic loop style function with the vector form.

> microbenchmark(MyLoop02(5000),MyVector02(5000))

Unit: microseconds

expr min lq mean median uq max neval

MyLoop02(5000) 4996.636 5177.142 5457.8816 5346.047 5612.4575 7294.811 100

MyVector02(5000) 61.136 66.268 81.2932 73.631 91.7045 275.781 100

In this case, the vectorized version proves to be an even bigger winner when compared with the simple loop version.

**Conclusion**

In R programming, code vectorization can often provide an enormous performance benefit over the classic “for” loop. This benefit is sufficiently great that it is worth seeking vectorized solutions in cases where such a solution is not immediately obvious.