I, like many others, am constantly frustrated by the sensationalist use of terminology in data science. Twitter user @xaprb neatly summed up my own feelings when he tweeted:
When you’re fundraising, it’s AI [Artificial Intelligence].
When you’re hiring, it’s ML [Machine Learning].
When you’re implementing, it’s linear regression.
When you’re debugging, it’s printf().
As someone with a background in operations research, I’m also stung by the following joke.
Q: What’s the difference between an operations researcher and a data scientist?
A: About $50k per annum.
Data science does, at times, feel like it’s putting shiny new labels on dusty old concepts. And, I think, this results in some confusion for those who are new to the field.
The following terms often appear to be synonymous.
Strangely enough, I rarely hear “statistics” or “operations research” mentioned in the same context. But that’s for another blog post…
French mathematician Henri Poincaré said that mathematics was the art of giving the same name to different things. He may well have made the same observation of data science were he alive today.
Do labels matter? Andrew, Andy or Drew—it’s still just me.
In the case of data science, however, I think they do matter. The five areas mentioned above cover a huge amount of conceptual ground and organisations need to know what capabilities best serve their needs and how to recruit people with the appropriate skills.
If you recruit experts in logic-based AI and set them to work on your D3 dataviz projects, the results are probably going to be suboptimal. Clarity is always welcome.
I also find that people new to the area—e.g., those in search of training—usually crave some understanding of the different areas/terms. It’s difficult to know what to learn if you can’t identify it. I’ve seen people who wanted to learn how to create charts struggling with support vector machines because they followed the hyped terms.
Let me be clear. I can’t offer a definitive definition of these terms. I can’t even ask you to agree with my definitions of them. I can only start shouting in an already crowded bazaar. However, it’s a conversation that keeps needing to be had. So, let’s begin.
Data analytics is the most common use of data in organisations. It’s widely and regularly used in day-to-day operations. It’s used to produce things like
Data analysts will also often produce “one-off” charts for presentations or to inform particular business decisions.
Data analysis is generally about understanding what has happened, or is currently happening, in the organisation.
Data analysts should ideally be able to
Data science is something of an umbrella term. However, if we attempt to distinguish it from the other terms above, it’s largely about making inferences from data. The data scientist is often attempting to create new knowledge from existing data, for example by producing predictions.
Data scientists look to uncover patterns in the data. This usually leads to more assumptions than are required in data analytics, and data scientists have to get used to most of their explorations ending up down blind alleys.
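To make the distinction concrete, here is a minimal sketch in Python. The sales figures and the function names (describe, fit_line, predict_next) are invented purely for illustration. The first function summarises what has already happened, which is the analytics view. The second fits a straight line and extrapolates one month ahead, which is closer to the data science view, and it only works if you accept the extra assumption that the trend is linear and will continue.

```python
from statistics import mean

# Hypothetical monthly sales figures (units sold), invented for illustration.
monthly_sales = [100, 110, 120, 130, 140, 150]


def describe(values):
    """Data analytics: summarise what has already happened."""
    return {"total": sum(values), "average": mean(values), "best": max(values)}


def fit_line(values):
    """Ordinary least squares fit of the values against their index (0, 1, 2, ...)."""
    xs = range(len(values))
    x_bar = mean(xs)
    y_bar = mean(values)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_next(values):
    """Data science: infer something new, here next month's sales.

    Assumes the linear trend continues, which the data alone cannot guarantee.
    """
    slope, intercept = fit_line(values)
    return slope * len(values) + intercept


summary = describe(monthly_sales)
forecast = predict_next(monthly_sales)
print(summary)
print(forecast)
```

The summary can be checked against the raw figures. The forecast cannot, at least not until next month arrives, and that is exactly where the additional assumptions come in.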
While data science projects do make use of well-structured relational data, they commonly involve the use of disparate, messy, unstructured data—such as customer feedback comments, or third-party datasets. Quality control becomes a big issue when working with such data.
Data scientists should ideally be able to
Big data generally means working with data that is too large to be processed using standard tools (e.g., a workstation or a single server). The boundaries are constantly shifting. Apparently, it’s now possible to spin up a machine with 1 TB of RAM in Azure, so workstations can handle fairly hefty loads.
Big data is partially an enabling technology for data analytics and data science. It provides the data that those areas require to sustain them. Big data platforms may be used to manage data that isn’t destined for more detailed analysis, such as logs stored for regulatory reasons.
Spark is generally the preferred big data platform for data scientists. It has a powerful machine learning library (MLlib) that makes it easy to perform analyses on massive data sets.
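The core idea behind platforms like Spark can be sketched in a few lines: split the data into pieces, do a little work on each piece independently, then combine the partial results. The sketch below is plain Python and is not Spark's API. The readings and the function names (partition, summarise, combine) are assumptions made up for illustration, and the "partitions" here all live on one machine rather than being spread across a cluster.

```python
from functools import reduce


def partition(records, size):
    """Split records into fixed-size chunks, standing in for data spread across machines."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def summarise(chunk):
    """Work done independently on each chunk: a partial (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Merge two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])


# Hypothetical readings, invented for illustration; a real workload would be far larger.
readings = list(range(1, 101))

partials = [summarise(chunk) for chunk in partition(readings, 30)]
total, count = reduce(combine, partials)
overall_mean = total / count
print(overall_mean)
```

No single step ever needs to see the whole data set, only one chunk or a pair of small partial results. That property is what lets the work be spread over many machines when the data outgrows one.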
Big data specialists should ideally be able to
Machine learning can probably be considered a subset of the tasks undertaken by a data scientist. However, as machine learning is a large and complex area, it is likely that a general data scientist won’t have a deep knowledge of machine learning techniques and tools.
Organisations that wish to make significant use of machine learning, and have relatively novel requirements, may turn to experts in a particular branch of machine learning (e.g., deep learning).
Designing and tuning machine learning systems to get the best from them can require significant specialist experience—an experience that a more general data scientist doesn’t have the time to gain and/or maintain. The tooling around many machine learning approaches (e.g., TensorFlow for deep learning) can take time to master.
Machine learning is generally used to make predictions based on complex data sets. It’s often the case that the problem domain is poorly understood and machine learning is deployed to try to uncover patterns that can be exploited to form predictions in previously uncharted areas/scenarios.
Machine learning specialists should ideally be able to
Artificial intelligence (AI) has been around since the 1950s. Enthusiasm for it has waxed and waned, but it’s currently experiencing a renaissance—largely through its contribution to machine learning.
Technologies pioneered by AI researchers that are now being used extensively by organisations include
Cheap computer processing and storage have transformed old AI techniques into practical technologies in recent years.
AI research provides the foundation for many of the capabilities discussed previously. It is mostly performed in universities or research institutes. Successful ideas are then picked up and transferred to operational use.
The skills required by an AI researcher are completely determined by the area of their research. By its very nature, research is highly specialised and attracts deep niche experts.
At the end of the day, it doesn’t matter where you draw the boundaries between these terms—or even what terms you use. But, it does matter that you have some terms, with clear, discrete definitions—and that you use them consistently in your organisation.
As in all human endeavours, language matters. And it’s especially important to attempt to be clear in areas where there is considerable pre-existing confusion.
Take a look at this Data Science Infographic for more on the differences between the five topics discussed and why each of them matters.