Version control for data scientists using Git and RStudio


Look at the working directory of the average data science project and you’ll see things like this:

cust-churn.csv
cust-churn.R
cust-churn-good.zip
cust-churn-old.csv
cust-churn-progess-meeting.R
cust-churn-working.R
cust-churn-working2.R
cust-churn-20160217.R
test.R
test-bk.R

Every time a change needs to be made, files are copied to preserve the working code. As changes will often be made to multiple files, it’s common to zip the folder to preserve the “working set.” If something goes wrong you delete the working files and revert to the backup.

This is how software developers used to work many years ago, until they realized it’s (usually) a poor way to progress. Modern software development makes extensive use of version control.

What is a version control system?

A version control system is software that tracks and manages changes to files over their lifetime. Benefits of placing your projects under version control include the ability to:

  • revert to previous versions of a file when you’ve broken something
  • restore accidentally deleted files
  • try out new ideas without interfering with the existing project
  • review why something “broke” (i.e. what changed)
  • coordinate work within teams so that analysts don’t trip over each other
  • maintain multiple versions of a project (e.g. for different departments)

The simple approach of copying files in a folder quickly gets out of control for anything beyond the simplest of projects. Unless you have very disciplined naming conventions it becomes impossible to know which files are obsolete, which files form a working set and which files constitute the deliverable results.

Version control using RStudio

There are a number of popular version control systems. Options include: Git, Mercurial, SVN and Team Foundation Server.

RStudio has built-in support for Git and SVN. In recent years, Git has become immensely popular, so that’s what we’ll use.

First of all we need to download and install Git. Like R, it’s available for all the major platforms.

Launch RStudio and create a new empty project. I created it in a folder called version-control-demo. You’ll see the .Rproj entry in the Files pane at the bottom-right. Click on this to open the Project Options dialog and select the Git/SVN section. From the version control system dropdown, select Git. You’ll be asked if you want to initialize a new Git repository for the project. Select Yes and restart RStudio when prompted.

I found I had to manually close and restart RStudio to get it to recognize that the project was under version control.

You should now see a Git tab in the top-right pane. Select this tab and you’ll see the following files listed:

  • .gitignore
  • version-control-demo.Rproj

The question marks next to them mean they are not being managed by Git (yet).
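For readers who like to see what is happening underneath, the same state can be inspected from a command line. This is only a sketch: it builds a scratch repository in a temporary directory (the directory and the empty stand-in files are assumptions for illustration, not your real project) and asks Git for its short status, where "??" is Git's own marker for untracked files.

```shell
# Scratch repository in a temporary directory (hypothetical stand-in for the RStudio project)
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Empty stand-ins for the two files RStudio creates
touch .gitignore version-control-demo.Rproj

# Short status: each untracked file is listed with a "??" prefix
status="$(git status --short)"
echo "$status"
```

The "??" prefix in this listing corresponds to the question marks shown in the Git pane.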

Add a new R script. Let’s assume you are trying to predict whether customers will churn. In such an analysis you may wish to select a set of features to be used in the predictions, e.g.

features <- cust_data[, c(1, 3, 5)]

Save the script. I called mine cust-churn.R. You’ll see it appear in the Git pane.

Let’s put our project files under version control. To add them we select the “Staged” checkboxes to the left of the file names. This changes the question marks to “A” (added) icons. This indicates our intention to commit the files to version control.
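Ticking the “Staged” checkboxes corresponds to Git's add command. A minimal sketch, again in a throwaway repository with stand-in files (both are assumptions for illustration):

```shell
# Throwaway repository with stand-ins for the three project files
repo="$(mktemp -d)"
cd "$repo"
git init -q
touch .gitignore version-control-demo.Rproj
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R

# Stage the files: the command-line counterpart of the "Staged" checkboxes
git add .gitignore version-control-demo.Rproj cust-churn.R

# Staged new files are now listed with an "A" prefix instead of "??"
status="$(git status --short)"
echo "$status"
```

Nothing has been committed at this point. Staging only records which changes should go into the next commit.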

Before we can commit the files, Git needs to know who’s committing them. On Windows, installing Git also installs a command-line shell called “Git Bash”. Run this (or any terminal on other platforms) and change to your project directory, e.g.

cd /C/projects/version-control-demo/

Then enter the following commands, replacing the personal details with your own:

git config user.email "you@company.com"
git config user.name "Joe Bloggs"
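Without the --global flag, git config stores these values in the current repository only, so they do not carry over to other projects. If you want to check what was stored, a sketch along these lines should work (the temporary repository is an assumption so the example does not touch a real project):

```shell
# Throwaway repository so the example leaves real projects alone
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Same commands as above: stored in this repository's .git/config
git config user.email "you@company.com"
git config user.name "Joe Bloggs"

# --local restricts the lookup to this repository's settings
name="$(git config --local --get user.name)"
email="$(git config --local --get user.email)"
echo "$name <$email>"
```

Adding --global to the two git config commands would instead set the identity for every repository under your user account.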

Return to RStudio and press the Commit button at the top of the Git pane. When prompted for a commit message enter something like

 Place project under version control

Press the Commit button on the dialog to complete the operation. A message dialog will tell you that three changed files were committed. Close the dialog.
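The Commit dialog is a front end to Git's commit command. As a rough command-line equivalent, sketched in a throwaway repository with stand-in files (assumptions for illustration):

```shell
# Throwaway repository with an identity and the three project files
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "you@company.com"
git config user.name "Joe Bloggs"
touch .gitignore version-control-demo.Rproj
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R

# Stage everything, then record the commit with its message
git add .
git commit -q -m "Place project under version control"

# Count the commits on the current branch and the files Git now tracks
commits="$(git rev-list --count HEAD)"
tracked="$(git ls-files | wc -l)"
echo "commits: $commits, tracked files: $tracked"
```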

Our project is now under version control!

Let’s change the features we are using to predict churn. Let’s say that column 10 contains gender and we think that’s going to be important in our analysis. The updated line in our script would be

features <- cust_data[, c(1, 3, 5, 10)]

Save the file and you’ll see it appears in the Git pane as Modified. Let’s commit our changes to Git.

Select the modified script’s Staged checkbox and click the Commit button at the top of the pane. You’ll be presented with a dialog showing the changes to be committed. One line in our script has been changed and the “before” and “after” lines are shown.

Add a commit message in the dialog:

Add gender as a predictor

Click the Commit button to finalize the commit. Close the confirmation message. Close the “commit” dialog.
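The “before” and “after” view in the dialog is a rendering of Git's diff output. The following sketch reproduces the same edit in a throwaway repository (an assumption for illustration) and captures the diff before committing it:

```shell
# Throwaway repository holding the first version of the script
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "you@company.com"
git config user.name "Joe Bloggs"
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R
git add cust-churn.R
git commit -q -m "Place project under version control"

# Edit the script, then look at the unstaged change:
# lines starting with "-" are the "before", lines starting with "+" the "after"
echo 'features <- cust_data[, c(1, 3, 5, 10)]' > cust-churn.R
changes="$(git diff)"
echo "$changes"

# Stage and commit the change
git add cust-churn.R
git commit -q -m "Add gender as a predictor"
```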

Let’s review the history of our project. Click the History button at the top of the Git pane and you’ll be presented with a list of your commits, from most to least recent.
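The History view corresponds to Git's log command. A sketch, assuming a throwaway repository with the two commits made so far:

```shell
# Throwaway repository with the two commits from the walkthrough
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "you@company.com"
git config user.name "Joe Bloggs"
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R
git add cust-churn.R
git commit -q -m "Place project under version control"
echo 'features <- cust_data[, c(1, 3, 5, 10)]' > cust-churn.R
git add cust-churn.R
git commit -q -m "Add gender as a predictor"

# One line per commit, newest first; %s prints only the commit message
history="$(git log --format=%s)"
echo "$history"
```

Running git log --oneline instead would also show the abbreviated commit identifiers.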

Make another change to your script.

features <- cust_data[, c(1)]

Save the file.

Now, let’s say you are unhappy with that change and want to restore the previous features, but you’ve forgotten what they were! No problem. You have version control.

Display the project history and make sure the Changes tab at the top-left of the Review Changes dialog is selected. Now click the Revert button at the top of that dialog. Confirm that you wish to perform this operation and, voilà, your uncommitted edit is discarded and the last committed version of the code reappears in your script editor.
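On the command line, discarding an uncommitted edit can be done with git checkout. I am assuming this is close to what the Revert button does under the hood, so treat the following as a sketch in a throwaway repository rather than a description of RStudio's internals:

```shell
# Throwaway repository with the committed version of the script
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "you@company.com"
git config user.name "Joe Bloggs"
echo 'features <- cust_data[, c(1, 3, 5, 10)]' > cust-churn.R
git add cust-churn.R
git commit -q -m "Add gender as a predictor"

# Make the change we later regret (saved but not committed)
echo 'features <- cust_data[, c(1)]' > cust-churn.R

# Throw away the uncommitted edit and restore the last committed version
git checkout -- cust-churn.R
restored="$(cat cust-churn.R)"
echo "$restored"
```

This cannot be undone: the discarded edit was never committed, so Git has no copy of it.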

Git is a very powerful system. The RStudio integration is limited to the basic features. If you wish to know more about Git and what it can do for you there’s a great free Kindle book titled “Ry’s Git Tutorial”.

Version control is essential for managing all but the simplest of data science projects. It’s not acceptable to treat key business activities as if they were student projects.
