Reproducibility is a critical component of data science. In fact, it’s critical to all research—witness the reproducibility crisis currently engulfing academic research.
It is equally important in software development. We all expect our applications to work the same way…time after time. Software that operates erratically is undesirable. The unyielding regularity with which computers perform tasks could be considered their defining characteristic.
Source code control systems, such as Git, help us track and manage different versions of our software as we fix bugs and add new features.
However, modern applications are collections of third-party libraries and components plumbed together. These components are as much a part of our applications as the code we commit to source control—but we don’t control these components.
Tools such as Yarn (in the JavaScript world) have been developed in an attempt to manage this process. They formally snapshot all the package dependencies associated with an application, thus documenting the exact dependency versions in place at the point of development and/or testing. This allows faithful replication of the application across systems and time.
Any R application (e.g. Shiny application) or analysis script will draw heavily on R’s package ecosystem. This rich ecosystem is what makes R so powerful. Base R is getting on in years, so it’s the packages that keep it fresh and useful.
However, traditionally there has been no formal package management system in R. Applications have a list of packages they draw upon, but version information is largely relegated to documentation…if it exists at all. Given that reproducibility would seem to be of fundamental importance in a statistical programming environment, this is a significant problem.
Many times I have installed a Shiny application, using the latest version of the packages it uses, only to find that the application won’t run—or produces unexpected results. And, if the package versions haven’t been documented, it’s not easy to recreate the original environment.
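Even without a dedicated dependency-management tool, base R can at least record which package versions were in play. A minimal sketch using `sessionInfo()`, which is part of base R and so available everywhere:

```r
# Record the R version and the versions of all attached packages.
# sessionInfo() ships with base R (the utils package), so no
# additional installation is required.
info <- sessionInfo()

# The R version string, e.g. "R version 4.3.1 (2023-06-16)"
cat(info$R.version$version.string, "\n")

# Versions of any non-base packages currently attached.
# (otherPkgs is NULL when none are attached, in which case
# the loop simply does nothing.)
for (pkg in info$otherPkgs) {
  cat(pkg$Package, pkg$Version, "\n")
}
```

Pasting this output into a README is crude compared with Packrat or Checkpoint, but it is far better than leaving package versions undocumented.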
Two R packages have been created in an attempt to solve the package dependency problem in R—Packrat and Checkpoint.
When using Packrat, packages are installed locally to your R project. This means that different projects can use different versions of the same package.
This can be invaluable when running a Shiny server that may be hosting multiple independent applications. In a traditional installation, updating a package updates it for all hosted applications. Packrat allows you to isolate changes to a single application.
As Packrat snapshots the versions of packages used in your application, you can faithfully restore it in a new location. So, if you send an analysis project to colleagues, they will be able to reproduce your environment exactly…without impacting their own scripts.
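The core Packrat workflow is only a few calls. Here is a sketch; the calls are wrapped in `interactive()` because `init()`, `snapshot()`, and `restore()` all modify the project directory on disk, so you would run them from the R console rather than as a side effect of sourcing a script:

```r
# Packrat workflow sketch -- these calls create and modify a private,
# per-project package library, so they are guarded to avoid running
# accidentally when this file is sourced non-interactively.
if (interactive()) {
  install.packages("packrat")

  packrat::init()      # turn the current project into a Packrat project
                       # with its own private package library
  # ...install and use packages as normal; they now go into
  #    the project-local library rather than the system one...
  packrat::snapshot()  # record the exact package versions in use
  packrat::restore()   # on another machine, reinstall exactly the
                       # snapshotted versions from the lockfile
}
```

Your colleagues run `packrat::restore()` after unpacking the project, and the private library is rebuilt with the same versions you snapshotted.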
Checkpoint is a package from another major R player—Revolution Analytics (now part of Microsoft).
Using Checkpoint you can replicate the dependencies that would have been in place on a particular date. As the developers say, it's "…as if you had a CRAN time machine."
This is a little less flexible than Packrat, as it requires all your packages to be up to date as of the snapshot date. If one of the packages you use is older than the version current on that date, restoring from the date will not recreate the same environment. In practice, however, your packages will usually be up to date when you create the snapshot, so it's not a major flaw.
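Using Checkpoint is even simpler than Packrat: a single call near the top of your script pins CRAN to a fixed date. A sketch (the date here is just an illustration, and the call is guarded because `checkpoint()` installs packages into a date-stamped local library):

```r
# Checkpoint sketch -- pin the CRAN repository to a fixed date.
# Guarded because checkpoint() downloads and installs packages
# into a date-stamped library under your home directory.
if (interactive()) {
  install.packages("checkpoint")
  library(checkpoint)

  # From here on, package installation and loading resolve against
  # CRAN as it stood on this (example) date.
  checkpoint("2017-01-15")

  library(ggplot2)  # loaded at the version current on 2017-01-15
}
```

Because the date lives in the script itself, anyone who runs the script later gets the same package versions you developed against.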
If we are going to maintain applications or scripts across systems or time, it's essential that we can faithfully recreate them. The way to ensure this is to make use of source code control alongside dependency management tools such as Packrat or Checkpoint. As R's role grows, developers need to start adopting tools and processes that are common in enterprise development shops.
If you are interested in R or data science, you may wish to consider the following Learning Tree courses: