Five R Packages Data Scientists Should Know About


One of the strengths of R is its comprehensive ecosystem of packages. If you want to do something, chances are that someone’s been there first and written the package.

Of course, with so many packages out there, quality varies considerably—and some packages are so specific that it’s difficult to imagine them having a wide audience.

There are, however, packages that should be part of any data scientist’s arsenal. I’ll introduce five such packages in this article. I’ve chosen them as they are general purpose and should be applicable to a wide range of tasks.

dplyr

dplyr is a slick package for manipulating data frames. Data frames are the main workhorse in R, so you’ll probably spend a lot of time manipulating them. With dplyr this becomes much easier.

The package aims to provide a function for each verb of data manipulation, specifically allowing you to

  • Filter rows
  • Arrange rows in a given order
  • Select columns
  • Extract distinct (unique) rows
  • Mutate data frames by adding new columns
  • Summarize data via aggregation
  • Sample a subset of rows

All of these things can be done using base R, but dplyr makes it a lot easier. For example, to identify cars that have eight cylinders and more than 100 horsepower we use

library(dplyr)  # without this, filter() resolves to stats::filter()
filter(mtcars, cyl == 8, hp > 100)

Couldn’t be easier. Spend a little time becoming familiar with dplyr—it’s an investment that will keep paying off.
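The verbs also combine naturally. As a rough sketch using the built-in mtcars data (the names eight_cyl, power_to_weight and mpg_by_cyl are just illustrative choices, not part of dplyr), several of them can be chained with the pipe operator:

```r
library(dplyr)

# Eight-cylinder cars, most powerful first, keeping three columns
# and adding a derived power-to-weight column.
eight_cyl <- mtcars %>%
  filter(cyl == 8) %>%
  arrange(desc(hp)) %>%
  select(mpg, hp, wt) %>%
  mutate(power_to_weight = hp / wt)

# Average MPG and number of cars for each cylinder count.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(average_mpg = mean(mpg), cars = n())

print(eight_cyl)
print(mpg_by_cyl)
```

Each step takes a data frame and returns a data frame, which is what makes this kind of chaining possible.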

ggplot2

ggplot2 is an incredibly powerful package for creating charts and plots. Its design builds on Leland Wilkinson’s grammar of graphics, which results in a flexible foundation for producing quantitative graphics.

Hadley Wickham, the author of ggplot2 (and other packages mentioned in this article), has written an entire book on this package. If you regularly create graphics in R, you have to get up to speed with ggplot2.
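To give a flavour of the grammar, here is a minimal sketch of a scatter plot built from the mtcars data. The choice of variables and axis labels is mine, purely for illustration: data and aesthetic mappings go into ggplot(), and layers are added with +.

```r
library(ggplot2)

# Map weight to x, MPG to y and cylinder count to colour,
# then add a layer of points and some readable labels.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(p)
```

Swapping geom_point() for another geom, or adding a second layer, changes the chart without touching the data or the mappings.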

reshape2

When I’m pulling data from databases to be analyzed in R, I occasionally need to convert between long and wide formats.

Long format data looks like

station        type measurement
      1 temperature        25.0
      1    sunshine        10.0
      1    rainfall         0.0
      2 temperature        27.0
      2    sunshine        10.5
      2    rainfall         0.0
      3 temperature        20.0
      3    sunshine         8.5
      3    rainfall         0.0
      4 temperature        19.0
      4    sunshine         3.0
      4    rainfall         2.0
      5 temperature        22.0
      5    sunshine         7.0
      5    rainfall         0.0
      6 temperature        23.0
      6    sunshine         5.0
      6    rainfall         0.0

The same data set in wide format would be

station temperature sunshine rainfall
      1          25     10.0        0
      2          27     10.5        0
      3          20      8.5        0
      4          19      3.0        2
      5          22      7.0        0
      6          23      5.0        0

reshape2 is a package that makes converting between long and wide formats easy. To convert from long to wide format using reshape2 we use

library(reshape2)
weather_wide <- dcast(weather_long, station ~ type, value.var = "measurement")
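The call above assumes a weather_long data frame already exists. As a self-contained sketch, the following builds it from the long-format table shown earlier, converts it to wide format, and then goes back again with melt(). Making type a factor is my own choice so that the wide columns come out in the same order as the table above; the name weather_long_again is likewise just illustrative.

```r
library(reshape2)

# Rebuild the long-format table from the article.
weather_long <- data.frame(
  station = rep(1:6, each = 3),
  type = factor(
    rep(c("temperature", "sunshine", "rainfall"), times = 6),
    levels = c("temperature", "sunshine", "rainfall")
  ),
  measurement = c(25, 10, 0, 27, 10.5, 0, 20, 8.5, 0,
                  19, 3, 2, 22, 7, 0, 23, 5, 0)
)

# Long to wide: one row per station, one column per measurement type.
weather_wide <- dcast(weather_long, station ~ type, value.var = "measurement")

# Wide to long: melt() is the inverse operation.
weather_long_again <- melt(weather_wide, id.vars = "station",
                           variable.name = "type", value.name = "measurement")

print(weather_wide)
```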

sqldf

Manipulating data frames can be complicated for those new to R—especially those who use it infrequently.

sqldf allows you to query your R data frames using SQL. So, if you already have strong SQL skills, you can exploit them while you get used to the native R syntax. Some things are just easier to express in SQL.

The package supports more than just simple select queries. Grouping, unions and joins are all supported.

For example, to get the average MPG broken down by number of cylinders using the mtcars dataset we can use

library(sqldf)
sqldf("SELECT cyl, AVG(mpg) AS average_mpg
       FROM mtcars GROUP BY cyl")
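Joins work in much the same way. In the sketch below, cyl_labels is a hypothetical lookup table I have made up for illustration; it is joined to mtcars and the result is grouped and sorted entirely in SQL.

```r
library(sqldf)

# Hypothetical lookup table giving a text label for each cylinder count.
cyl_labels <- data.frame(
  cyl = c(4, 6, 8),
  label = c("four", "six", "eight")
)

# Join the two data frames, then aggregate and sort in SQL.
mpg_by_label <- sqldf("
  SELECT l.label, COUNT(*) AS cars, AVG(m.mpg) AS average_mpg
  FROM mtcars AS m
  JOIN cyl_labels AS l ON m.cyl = l.cyl
  GROUP BY l.label
  ORDER BY average_mpg DESC
")

print(mpg_by_label)
```

sqldf looks up the data frames named in the query, so no explicit loading step is needed.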

stringr

String manipulation in base R is pretty weak: it’s only recently that trimws() was added to base R to support stripping of leading and trailing whitespace. Munging real-world data almost always requires some string manipulation, so having a powerful set of functions for doing this is essential.

stringr contains a whole bunch of handy features, including numerous pattern matching functions. For example, we can extract all the integers from a piece of text using

library(stringr)
str_extract_all("2, 3 and 5 are the first 3 prime numbers", "\\d+")
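A few of the other functions follow the same pattern of taking the string first and the pattern second. Here is a small sketch on the same sentence with some padding added; the variable names are mine, chosen only for illustration.

```r
library(stringr)

text <- "  2, 3 and 5 are the first 3 prime numbers  "

# Strip leading and trailing whitespace.
trimmed <- str_trim(text)

# Does the text contain any digit at all?
has_digit <- str_detect(trimmed, "\\d")

# str_extract_all() returns a list with one element per input string,
# so take the first element and convert the matches to integers.
numbers <- as.integer(str_extract_all(trimmed, "\\d+")[[1]])

# Replace every run of digits with a placeholder.
masked <- str_replace_all(trimmed, "\\d+", "#")

print(numbers)
print(masked)
```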

We’ve only given a cursory introduction to each of the packages in this article, so you are encouraged to follow the links supplied to learn more about each of them.

If you want to learn more about R, Learning Tree has a couple of courses that might be of interest
