Tackling a new dataset


With all the excitement (hype?) surrounding machine learning, there’s a tendency for decision makers (and many analysts) to believe that insights are just waiting to spring forth from the data when they are finally released by these magical algorithms.

When it’s time to sit down and actually work with a real dataset, disillusionment quickly sets in. Data science is hard. And most of it involves stumbling down blind alleys…in the dark…blindfolded.

I’m regularly asked two questions by those starting out in data science.

  1. Where do I get the data from?
  2. What should I do when I’m handed a new dataset?

The first question tends to reflect the problem stated earlier—i.e. that machine learning is the answer to all our problems. That journey will not end well.

The second question is more interesting. Ideally, data science projects should stem from a well-defined question that is important to the business. That question should define the data that needs to be collected and, consequently, the role the data plays.

However, in practice, it’s not always possible to be completely goal directed. The challenges of collecting raw data (e.g. cost, time, legislation) mean that most data science projects have a “run what ya brung” flavor to them—we have to make the best of the data we have.

So, what do I do when I find myself confronted with a large dataset for the first time?

The first thing I do is try to get decent definitions of the variables in the new dataset. Humans are terrible at working with abstract numbers. Having some mental model to work with helps a lot. The best resource for making sense of the columns is usually talking to people. Documentation tends to be sparse, out of date, and misleading.

Next I run some basic statistical summaries on the data—minimum, maximum, mean, median, quartiles, variance. This often results in me challenging the definitions of the columns I received earlier.
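
With pandas, that first pass takes only a few lines. A minimal sketch (the file name `new_dataset.csv` is just a placeholder):

```python
import pandas as pd

# Load the dataset (the file name is just a placeholder).
df = pd.read_csv("new_dataset.csv")

# Count, mean, std, min, quartiles (the median appears as "50%"), max.
print(df.describe())

# Variance isn't in describe()'s default output, so compute it separately.
print(df.select_dtypes(include="number").var())
```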

After this, I histogram the variables. I’m looking for patterns (e.g. skewing, bi-modality). If there is a time element to the data I’ll also do time series plots. Again, I’m looking for patterns.
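
A quick sketch of that step, again with a placeholder file name and a hypothetical `timestamp` column for the time series case:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("new_dataset.csv")          # placeholder file name
numeric = df.select_dtypes(include="number")

# Histogram every numeric column, looking for skewing, bi-modality, outliers.
numeric.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# If there's a time element (a hypothetical "timestamp" column here),
# plot each numeric variable over time as well.
if "timestamp" in df.columns:
    ts = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    ts = ts.set_index("timestamp").sort_index()
    ts.select_dtypes(include="number").plot(subplots=True, figsize=(12, 8))
    plt.tight_layout()
    plt.show()
```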

Once I’ve looked at the variables independently, I’ll create a correlation matrix/plot to see if any of the variables are measuring the same thing…or opposite things. You often see the same basic concept represented multiple times in a dataset. Databases grow over time to include all sorts of things.
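
Something like the following (same placeholder file name) is usually enough to spot redundant columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("new_dataset.csv")          # placeholder file name
numeric = df.select_dtypes(include="number")

# Pairwise correlations; values near +1 or -1 suggest columns measuring
# the same underlying concept (or its opposite).
corr = numeric.corr()
print(corr.round(2))

# A quick heatmap makes the redundant variables easier to spot.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.tight_layout()
plt.show()
```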

Sophisticated Approaches

At this point, I start applying more sophisticated approaches to examine the relationships between variables in the dataset. I'll use principal component analysis or multidimensional scaling to get a sense of the underlying dimensionality of the data. Hundreds of variables in an informal, real-world dataset may boil down to fewer than a dozen principal components.
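
A minimal PCA sketch with scikit-learn, standardizing first since PCA is scale-sensitive (the file name is still a placeholder):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("new_dataset.csv")          # placeholder file name
numeric = df.select_dtypes(include="number").dropna()

# PCA is scale-sensitive, so standardize the columns first.
X = StandardScaler().fit_transform(numeric)

# Fit PCA and see how much variance the leading components explain.
pca = PCA()
pca.fit(X)
cumulative = pca.explained_variance_ratio_.cumsum()
for i, total in enumerate(cumulative[:10], start=1):
    print(f"{i} components explain {total:.1%} of the variance")
```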

I’m now ready to start applying machine learning to see if I can uncover any insights. The basic idea here is to start simple and, if it works, stay simple.

If I'm conducting unsupervised learning, I'll do a k-means analysis; it's hard to think of a more straightforward machine learning technique.
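
A bare-bones k-means sketch with scikit-learn might look like this (placeholder file name again; looping over k is just a rough elbow check):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("new_dataset.csv")          # placeholder file name
X = StandardScaler().fit_transform(df.select_dtypes(include="number").dropna())

# Try a handful of cluster counts and compare inertia (within-cluster
# sum of squares) to pick a sensible k.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.0f}")
```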

For supervised learning problems, I tend to start with logistic regression. However, if I know I’ll need to explain the model, I’ll favor a decision tree analysis.
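
A sketch of that starting point, assuming numeric features and a hypothetical `target` column:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("new_dataset.csv").dropna()   # placeholder file name
# "target" is an assumed label column; everything numeric else is a feature.
X = df.select_dtypes(include="number").drop(columns=["target"], errors="ignore")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simple baseline: logistic regression.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic regression accuracy:", logreg.score(X_test, y_test))

# If the model needs explaining, a shallow decision tree is easy to walk through.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("decision tree accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(X.columns)))
```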

Only when simpler machine learning techniques fail would I consider deep learning approaches. The additional complexity involved in applying them isn't warranted if simpler techniques can solve the problem effectively. Only if I were faced with a complex image/audio/text processing problem would I consider starting with more advanced techniques.

If you or your organization are getting into data science, you may be interested in Learning Tree's data science courses.
