Five R Packages Data Scientists Should Know About
One of the strengths of R is its comprehensive ecosystem of packages. If you want to do something, chances are that someone’s been there first and written the package. Of course, with so many packages out there, quality varies considerably—and some packages are so specific that it’s difficult to imagine them having a wide audience. […]
How Big is Big Data?
One of the questions I’m often asked is “How big does my data have to be before I need to start using big data tooling?” There’s no right answer to this—but that’s largely because it’s not the right question. Big Data is actually a pretty unhelpful term. It focuses attention exclusively on the volume of […]
How to Predict Outcomes Using Random Forests and Spark
Random forests are an ensemble, or model of models, machine learning approach. The algorithm builds multiple decision trees, based on different subsets of the features in the data. Outcomes are then predicted by running observations through all the trees and averaging the individual predictions. Think wisdom of crowds. Spark’s machine learning library, MLlib, has support […]
Paying by Numbers—Should Data Scientists Receive Performance Based Compensation?
A recent article suggests that, in the “near future”, data analysts will be compensated based on performance. They will receive commission-based payments, much like salesmen, rather than being paid purely for their time. This performance will presumably be determined by the impact that the data analyst has on the key goals of the organization, e.g. […]
How to Build a Predictive Model using R
As the availability of high-quality data continues to grow, the most successful organizations will be those that can draw value from it. This requires powerful analysis tools that can transform data into useful results. One such tool is R, a popular open-source language and environment for statistical analysis. In a recent survey, data scientists identified […]