One of the strengths of R is its comprehensive ecosystem of packages. If you want to do something, chances are that someone’s been there first and written the package.
Of course, with so many packages out there, quality varies considerably—and some packages are so specific that it’s difficult to imagine them having a wide audience.
There are, however, packages that should be part of any data scientist’s arsenal. I’ll introduce five such packages in this article. I’ve chosen them as they are general purpose and should be applicable to a wide range of tasks.
dplyr is a slick package for manipulating data frames. Data frames are the main workhorse in R, so you’ll probably spend a lot of time manipulating them. With dplyr this becomes much easier.
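The package aims to provide a function for each verb of data manipulation: filter() to pick rows, select() to pick columns, arrange() to reorder rows, mutate() to add new columns and summarise() to collapse many values into a summary.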
All of these things can be done using base R, but dplyr makes it a lot easier. For example, to identify cars that have eight cylinders and more than 100 horsepower we use:
library(dplyr)
filter(mtcars, cyl == 8, hp > 100)
Couldn’t be easier. Spend a little time becoming familiar with dplyr—it’s an investment that will keep paying off.
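The verbs are most useful when chained together. The following is a minimal sketch, assuming dplyr is installed and using the built-in mtcars data set; the column names mean_mpg and n_cars are my own choices for illustration.

```r
library(dplyr)

# Keep cars with more than 100 horsepower, then summarise fuel
# economy for each cylinder count and sort from best to worst.
result <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n()) %>%
  arrange(desc(mean_mpg))

print(result)
```

Each step takes a data frame and returns a data frame, which is why the pipe operator %>% reads naturally from top to bottom.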
ggplot2 is an incredibly powerful package for creating charts and plots. Its design builds on Leland Wilkinson’s grammar of graphics, which results in a flexible foundation for producing quantitative graphics.
Hadley Wickham, the author of ggplot2 (and other packages mentioned in this article), has written an entire book on this package. If you regularly create graphics in R, you have to get up to speed with ggplot2.
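To give a flavour of the grammar, here is a small sketch, assuming ggplot2 is installed. It maps columns of the built-in mtcars data set to aesthetics (position and colour) and then adds a layer of points; the choice of variables and axis labels is mine, purely for illustration.

```r
library(ggplot2)

# Map horsepower and fuel economy to position, cylinder count to colour,
# then draw one point per car.
p <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Horsepower", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

Swapping geom_point() for another geom, or adding further layers with +, changes the chart without touching the data mapping.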
When I’m pulling data from databases to be analyzed in R I occasionally need to convert between long and wide formats.
Long format data looks like:
station  type         measurement
1        temperature  25.0
1        sunshine     10.0
1        rainfall      0.0
2        temperature  27.0
2        sunshine     10.5
2        rainfall      0.0
3        temperature  20.0
3        sunshine      8.5
3        rainfall      0.0
4        temperature  19.0
4        sunshine      3.0
4        rainfall      2.0
5        temperature  22.0
5        sunshine      7.0
5        rainfall      0.0
6        temperature  23.0
6        sunshine      5.0
6        rainfall      0.0
The same data set in wide format would be:
station  temperature  sunshine  rainfall
1        25           10.0      0
2        27           10.5      0
3        20            8.5      0
4        19            3.0      2
5        22            7.0      0
6        23            5.0      0
reshape2 is a package that makes conversion between long and wide formats easy. To convert from long to wide format using reshape2 we use:
library(reshape2)
weather_wide <- dcast(weather_long, station ~ type, value.var = "measurement")
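For a self-contained version, the sketch below rebuilds the weather data shown above as a data frame, casts it to wide format and then melts it back to long format. It assumes reshape2 is installed; the name weather_long2 is hypothetical. Note that dcast() orders the new columns by the sorted values of type, so the column order may differ from the table above.

```r
library(reshape2)

# Rebuild the long-format weather data from the table above.
weather_long <- data.frame(
  station = rep(1:6, each = 3),
  type = rep(c("temperature", "sunshine", "rainfall"), times = 6),
  measurement = c(25, 10, 0, 27, 10.5, 0, 20, 8.5, 0,
                  19, 3, 2, 22, 7, 0, 23, 5, 0)
)

# Long to wide: one row per station, one column per measurement type.
weather_wide <- dcast(weather_long, station ~ type, value.var = "measurement")

# Wide to long: gather the measurement columns back into type/measurement pairs.
weather_long2 <- melt(weather_wide, id.vars = "station",
                      variable.name = "type", value.name = "measurement")

print(weather_wide)
```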
Manipulating data frames can be complicated for those new to R—especially those who use it infrequently.
sqldf allows you to query your R data frames using SQL. So, if you already have strong SQL skills, you can exploit them while you get used to the native R syntax. Some things are just easier to express in SQL.
The package supports more than just simple select queries. Grouping, unions and joins are all supported.
For example, to get the average MPG broken down by number of cylinders using the mtcars dataset we can use:
library(sqldf)
sqldf("SELECT cyl, AVG(mpg) AS average_mpg
FROM mtcars GROUP BY cyl")
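Joins work the same way. The sketch below assumes sqldf is installed and joins mtcars to a small lookup table; cyl_labels and its label values are hypothetical, made up here only to have something to join against.

```r
library(sqldf)

# Hypothetical lookup table describing each cylinder count.
cyl_labels <- data.frame(
  cyl = c(4, 6, 8),
  label = c("small", "medium", "large")
)

# Join the two data frames, then group and sort in a single query.
avg_by_label <- sqldf("
  SELECT l.label, COUNT(*) AS n_cars, AVG(m.mpg) AS average_mpg
  FROM mtcars AS m
  JOIN cyl_labels AS l ON m.cyl = l.cyl
  GROUP BY l.label
  ORDER BY average_mpg DESC
")

print(avg_by_label)
```

sqldf finds the data frames by name in the calling environment, so no explicit database setup is needed for a query like this.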
String manipulation in base R is pretty weak: it’s only recently that trimws() was added to base R to support stripping of leading and trailing whitespace. Munging real-world data almost always requires some string manipulation, so having a powerful set of functions for doing this is essential.
stringr contains a whole bunch of handy features, including numerous pattern matching functions. For example, we can extract all the integers from a piece of text using:
library(stringr)
str_extract_all("2, 3 and 5 are the first 3 prime numbers", "\\d+")
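A few more stringr functions cover the typical clean-up steps. This is a rough sketch, assuming stringr is installed; the messy input vector raw is invented for illustration.

```r
library(stringr)

# Invented example of untidy input with stray whitespace and mixed case.
raw <- c("  Temperature: 25.0C ", "sunshine: 10.5h", " Rainfall: 0.0mm")

# Strip leading/trailing whitespace and normalise to lower case.
clean <- str_to_lower(str_trim(raw))

# Flag the entries that start with "rain".
has_rain <- str_detect(clean, "^rain")

# Pull out the first decimal number in each entry and convert it.
values <- as.numeric(str_extract(clean, "\\d+\\.\\d+"))

print(data.frame(clean, has_rain, values))
```

All of these functions take the string as the first argument and are vectorised over it, which keeps them easy to combine.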
We’ve only given a cursory introduction to each of the packages in this article, so you are encouraged to follow the links supplied to learn more about each of them.
If you want to learn more about R, Learning Tree has a couple of courses that might be of interest