Look at the working directory of the average data science project and you’ll see things like this:
cust-churn.csv cust-churn.R cust-churn-good.zip cust-churn-old.csv cust-churn-progess-meeting.R cust-churn-working.R cust-churn-working2.R cust-churn-20160217.R test.R test-bk.R
Every time a change needs to be made, files are copied to preserve the working code. As changes will often be made to multiple files, it’s common to zip the folder to preserve the “working set.” If something goes wrong you delete the working files and revert to the backup.
This is how software developers used to work many years ago, until they realized it was (usually) a poor way to progress. Modern software development makes extensive use of version control.
A version control system is software that tracks and manages changes to files over their lifetime. Benefits of placing your projects under version control include the ability to:
The simple approach of copying files in a folder quickly gets out of control for anything beyond the simplest of projects. Unless you have very disciplined naming conventions it becomes impossible to know which files are obsolete, which files form a working set and which files constitute the deliverable results.
RStudio has built-in support for Git and SVN. In recent years, Git has become immensely popular, so that’s what we’ll use.
First of all we need to download and install Git. Like R, it’s available for all the major platforms.
Launch RStudio and create a new empty project. I created it in a folder called version-control-demo. You’ll see the .Rproj entry in your File pane at the bottom-right. Click on this and select the Git/SVN section. From the dropdown, select Git. You’ll be asked if you want to initialize a new Git repository for the project. Select Yes and restart RStudio when prompted.
I found I had to manually close and restart RStudio to get it to recognize that the project was under version control.
You should now see a Git tab in the top-right pane. Select this tab and you’ll see the following files listed:
The question marks next to them mean they are not being managed by Git (yet).
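If you prefer to see what is happening underneath, the same step can be sketched at the command line. This is only an illustration: the temporary folder and the empty .Rproj file are stand-ins I've made up for the real project, and I'm assuming that RStudio's "initialize a new Git repository" amounts to a plain git init.

```shell
# Throwaway folder standing in for the version-control-demo project (assumption).
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Stand-in for the project file that RStudio creates.
touch version-control-demo.Rproj

# Create an empty repository in this folder (adds a hidden .git directory).
git init --quiet

# Untracked files are listed with "??", like the question marks in the Git pane.
status_output="$(git status --short)"
echo "$status_output"
```

The "??" prefix in the output is the command-line counterpart of the question mark icons.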
Add a new R script. Let’s assume you are trying to predict whether customers will churn. In such an analysis you may wish to select a set of features to be used in the predictions, e.g.
features <- cust_data[, c(1, 3, 5)]
Save the script. I called mine cust-churn.R. You’ll see it appear in the Git pane.
Let’s put our project files under version control. To add them we select the “Staged” checkboxes to the left of the file names. This changes the question marks to “A” (added) icons. This indicates our intention to commit the files to version control.
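Ticking the "Staged" checkbox corresponds, as far as I can tell, to git add. The sketch below uses a scratch repository (a made-up temporary folder, not your real project) to show the status code changing from "??" to "A".

```shell
# Scratch repository; the folder is a hypothetical stand-in for the project.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# The script from the walkthrough.
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R

before="$(git status --short)"   # untracked: "?? cust-churn.R"
git add cust-churn.R             # stage the file for the next commit
after="$(git status --short)"    # staged as a new file: "A  cust-churn.R"

echo "before: $before"
echo "after:  $after"
```

Staging does not record anything yet. It only marks what the next commit will contain.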
Before we can commit the files, Git needs to know who’s committing them. When we installed Git, we installed a command-line shell called “Git Bash”. Run this and change to your project directory, e.g.
Then enter the following commands (replacing the personal details with your own):
git config user.email "email@example.com"
git config user.name "Joe Bloggs"
Return to RStudio and press the Commit button at the top of the Git pane. When prompted for a commit message, enter something like
Place project under version control
Press the Commit button on the dialog to complete the operation. A message dialog will tell you that three changed files were committed. Close the dialog.
Our project is now under version control!
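For comparison, here is a rough command-line version of that first commit. It runs in a scratch repository with only the script in it (the temporary folder is my own stand-in), so it commits one file rather than three, and it reuses the placeholder identity from above.

```shell
# Scratch repository standing in for the project (assumption).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Identity for this repository only; replace with your own details.
git config user.email "email@example.com"
git config user.name "Joe Bloggs"

# Create, stage and commit the script.
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R
git add cust-churn.R
git commit --quiet -m "Place project under version control"

# Count the commits reachable from the current one.
commit_count="$(git rev-list --count HEAD)"
echo "commits so far: $commit_count"
```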
Let’s change the features we are using to predict churn. Let’s say that column 10 contains gender and we think that’s going to be important in our analysis. The updated line in our script would be
features <- cust_data[, c(1, 3, 5, 10)]
Save the file and you’ll see it appears in the Git pane as Modified. Let’s commit our changes to Git.
Select the modified script’s Staged checkbox and click the commit button at the top of the pane. You’ll be presented with a dialog showing the changes to be committed. One line in our script has been changed and the “before” and “after” lines are shown.
Add a commit message in the dialog.
Add gender as a predictor
Press the Commit button to finalize the commit. Close the confirmation message. Close the “commit” dialog.
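The "before" and "after" view in the dialog is essentially what git diff prints. A minimal sketch, again in a made-up scratch repository with the placeholder identity:

```shell
# Scratch repository with the first version already committed (setup is an assumption).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.email "email@example.com"
git config user.name "Joe Bloggs"
echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R
git add cust-churn.R
git commit --quiet -m "Place project under version control"

# Change the feature selection to include column 10.
echo 'features <- cust_data[, c(1, 3, 5, 10)]' > cust-churn.R

# The old line is prefixed with "-", the new line with "+".
diff_output="$(git diff)"
echo "$diff_output"

# Stage and commit the change.
git add cust-churn.R
git commit --quiet -m "Add gender as a predictor"
```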
Let’s review the history of our project. Click the History button at the top of the Git pane and you’ll be presented with a list of your commits, from most to least recent.
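The History view has a command-line counterpart in git log. The sketch below rebuilds the two commits in a scratch repository (a hypothetical temporary folder) and lists their messages, newest first.

```shell
# Scratch repository with the two commits from the walkthrough (setup is an assumption).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.email "email@example.com"
git config user.name "Joe Bloggs"

echo 'features <- cust_data[, c(1, 3, 5)]' > cust-churn.R
git add cust-churn.R
git commit --quiet -m "Place project under version control"

echo 'features <- cust_data[, c(1, 3, 5, 10)]' > cust-churn.R
git add cust-churn.R
git commit --quiet -m "Add gender as a predictor"

# One line per commit: abbreviated hash plus message, most recent first.
git log --oneline

# Messages only, which is easier to compare.
history="$(git log --format=%s)"
```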
Make another change to your script.
features <- cust_data[, c(1)]
Save the file.
Now, let’s say you are unhappy with that change and want to restore the previous features, but you’ve forgotten what they were! No problem. You have version control.
Display the project history and make sure the Changes tab at the top-left of the Review Changes dialog is selected. Now click the Revert button at the top of that dialog. Confirm that you wish to perform this operation and, voilà, your previous code will magically appear in your script editor.
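At the command line, discarding an uncommitted change like this can be done with git checkout. I'm assuming that is roughly what the Revert button does for a modified file; the scratch repository below is only a stand-in for the real project.

```shell
# Scratch repository with the last good version committed (setup is an assumption).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.email "email@example.com"
git config user.name "Joe Bloggs"
echo 'features <- cust_data[, c(1, 3, 5, 10)]' > cust-churn.R
git add cust-churn.R
git commit --quiet -m "Add gender as a predictor"

# The unwanted, uncommitted edit.
echo 'features <- cust_data[, c(1)]' > cust-churn.R

# Throw away the uncommitted change and restore the last committed version.
git checkout -- cust-churn.R

restored="$(cat cust-churn.R)"
echo "$restored"
```

Note that the discarded edit is gone for good, since it was never committed. Recent Git versions also provide git restore for the same job.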
Git is a very powerful system. The RStudio integration is limited to the basic features. If you wish to know more about Git and what it can do for you there’s a great free Kindle book titled “Ry’s Git Tutorial”.
Version control is essential for managing all but the simplest of data science projects. It’s not acceptable to treat key business activities as if they were student projects.