One of the questions I’m often asked is “How big does my data have to be before I need to start using big data tooling?”
There’s no right answer to this—but that’s largely because it’s not the right question.
Big data is actually a pretty unhelpful term. It focuses attention exclusively on the volume of data your organization is dealing with. However, products like Spark and Hadoop are about much more than managing petabytes of data. And the big data revolution is about much more than volume.
Research firm Gartner has defined big data as:

"high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

This is known as the "three Vs" definition of big data.
Volume, as previously mentioned, has tended to receive the focus when organizations start considering whether they need a big data capability—but it’s arguably the least important of the three Vs.
Granted, retaining all the data your organization generates can be useful. Diagnostic tasks, for example, become easier as you don't have to guess what data is going to be important in solving some unspecified future problem—you just keep everything. But having a lot of data doesn't automatically deliver business insights. You need to do something with it.
Velocity is a challenge for traditional data management systems. As the Internet of Things grows, trying to capture streaming data from an array of sources can be like drinking from a fire hose. Existing RDBMSs can be dragged under by these demands. Big data streaming solutions, such as Flume, Spark Streaming, and Storm, are designed to collect such data at scale.
Clickstreams and logs are other sources that can produce data at a phenomenal rate. Recording every user interaction on a high transaction site results in a deluge of data.
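The core idea behind these streaming tools is aggregating events in windows as they arrive, rather than loading everything into a database first. Here is a minimal sketch of that pattern in plain Python, with a hypothetical simulated clickstream standing in for a live feed (the real systems mentioned above use time-based windows over distributed sources):

```python
from collections import Counter
from itertools import islice

def simulated_clickstream():
    """Yield (user, action) events as a stand-in for a live feed."""
    events = [("alice", "click"), ("bob", "view"), ("alice", "view"),
              ("bob", "click"), ("alice", "click"), ("carol", "view")]
    yield from events

def windowed_counts(stream, window_size):
    """Aggregate events in fixed-size windows, the way a streaming
    system summarizes data in time-based windows."""
    while True:
        window = list(islice(stream, window_size))
        if not window:
            break
        yield Counter(action for _, action in window)

for counts in windowed_counts(simulated_clickstream(), 3):
    print(dict(counts))  # one summary per window, no database required
```

The point is that each window is summarized and discarded; the raw firehose never needs to fit in a traditional store.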
Variety is probably the most significant of the Vs. Hundreds of terabytes of financial data in an RDBMS is not big data—it's large structured data. The promise of big data analysis is that we can draw on data that has been ignored by traditional data management and analysis tools. Log files, clickstreams, documents, e-mails, SMS messages, chats, tweets, images, videos, audio files, calendars, etc. contain a wealth of data that is important in understanding, and enhancing, your organization. Tragically, these sources have been largely overlooked.
The focus on RDBMS has restricted the creativity of organizations in utilizing their resources to drive performance. We’ve largely ignored anything that can’t be stored effectively in relational tables. Big data changed all that. Anything that is stored electronically is now a resource for data-driven decision-making.
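Much of that "variety" is semi-structured: it doesn't fit relational tables as-is, but structure can be extracted from it. As a small illustration, here is a sketch that turns a web-server log line into an analyzable record (the log format shown is a hypothetical Apache-style example, not from any particular system):

```python
import re

# Hypothetical Apache-style access log line
line = '10.0.0.1 - - [12/Mar/2015:10:15:32 +0000] "GET /pricing HTTP/1.1" 200 5123'

# Extract the fields a relational loading process would typically skip
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)

match = LOG_PATTERN.match(line)
record = match.groupdict()
record["status"] = int(record["status"])
record["size"] = int(record["size"])
print(record)  # a structured record recovered from "unstructured" text
```

Once records like this can be extracted at scale, sources that were previously discarded become inputs to data-driven decision-making.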
More "Vs" are being added to the big data definition as our understanding of the potential grows. Veracity is an increasingly important area of concern. As we expand the range of data that we can include in our analyses, there's a risk that we'll start to rely on data of questionable quality. This is one area in which we may have had to take a step back to move forward. RDBMS data sources, by virtue of the fact that decisions need to be made about what to store in them and how, tend to have some curation of the data. Of course, there are still many examples of poor-quality databases—and the fact that data is in an RDBMS can give faulty data a dangerous level of respectability.
One aspect of big data analysis that is usually overlooked is the processing side. “Big processing” can be required independently of big data. Computationally expensive algorithms can benefit from being parallelized across a compute cluster—even when the data volumes are modest. I’ve encountered this situation myself doing data envelopment analysis on Spark—hundreds of thousands of linear programs on mere gigabytes of data.
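The pattern in that example is many small, independent problems rather than one large dataset, so the work partitions by problem. Here is a minimal sketch of the idea, where a toy `solve` function stands in for an expensive computation (such as one linear program per unit in data envelopment analysis), and a thread pool stands in for the cluster—on Spark this map would instead be distributed across executors:

```python
from concurrent.futures import ThreadPoolExecutor

def solve(problem):
    """Hypothetical stand-in for an expensive solve; real code would
    call an LP solver here."""
    coeffs, bound = problem
    # Toy "optimization": largest coefficient not exceeding the bound
    return max(c for c in coeffs if c <= bound)

# Thousands of small, independent problems: big processing, modest data
problems = [([1, 5, 3], 4), ([2, 8, 6], 7), ([9, 4, 2], 9)]

# A thread pool stands in for a cluster; for genuinely CPU-bound solves
# you would use a process pool or a framework like Spark
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(solve, problems))

print(results)  # one optimum per problem, computed in parallel
```

Because each problem is independent, the speedup scales with the number of workers even though the total data volume is tiny.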
So, next time someone asks you about big data, make sure to explain to them that it’s not about how big it is.
Learning Tree has a number of courses for those interested in big data solutions.