I recently audited Learning Tree’s Hadoop Development course. That course is listed under the “Big Data” curriculum. It was a pretty good course. During the course, though, I got to thinking “What is ‘Big Data’ anyway?”
As far as I have been able to deduce, many things that come from Google have the prefix “Big” (e.g. BigTable). Since the original MapReduce came out of some work Google was doing internally back in 2004, we get the term “Big Data”. I guess if MapReduce had come out of Amazon we would now be talking about SimpleData or ElasticData instead, but I digress. Oftentimes these terms end up being hyped and confusing anyway. Anyone remember the state of Cloud Computing four or five years ago?
What is often offered as a definition, and I don’t necessarily disagree, is “data too large to fit into traditional storage”. That usually means too big for a relational database management system (RDBMS). Sometimes, too, the nature of the data (i.e. structured, semi-structured or unstructured) comes into play. So what now? Enter NoSQL.
It seems to me that NoSQL mostly means storing data as key/value pairs, although there are other alternatives as well. A collection of key/value pairs is also often referred to as a dictionary, hash table, or associative array. It doesn’t matter what you call it, the idea is the same: give me the key and I will return the value. The key or the value may be a simple or a complex data type. Often the exact physical details (e.g. indexing) of how this occurs are abstracted from the consumer. Also, some storage implementations seek to replicate the familiar SQL experience for users already accustomed to the RDBMS paradigm.
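To make the key/value idea concrete, here is a minimal sketch in Python. It assumes nothing more than an in-memory dictionary; the `put` and `get` helpers and the sample keys are hypothetical names I made up for illustration, not the API of any particular NoSQL product.

```python
# Hypothetical in-memory key/value store; all names are illustrative only.
store = {}

def put(key, value):
    """Store a value under a key, replacing any existing value."""
    store[key] = value

def get(key, default=None):
    """Return the value for a key, or a default if the key is absent."""
    return store.get(key, default)

# The value can be a complex type (a nested record)...
put("user:42", {"name": "Ada", "courses": ["Hadoop Development"]})
# ...and so can the key (here a tuple), as long as it is hashable.
put(("sensor", 7), 21.5)

print(get("user:42")["name"])
print(get(("sensor", 7)))
print(get("missing", "not found"))
```

Notice that the caller never sees how the dictionary locates a key. A real key/value store hides its indexing and physical layout in much the same way.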
In any particular problem domain you should store your data in the manner that makes the most sense for your application. You should not always be constrained to think in terms of relational tables, file systems, or anything else. Ultimately you have the choice to store nothing more meaningful than blobs of data. Should you do that? Not necessarily and not always. There are a lot of good things about structured storage in general and relational databases in particular.
Probably the most popular framework for processing Big Data is Hadoop. Hadoop is an Apache project which, among other things, implements MapReduce. Analyzing massive amounts of data also requires heavy-duty computing resources. For this reason Big Data and Cloud Computing often complement one another.
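If you have not seen MapReduce before, the classic illustration is a word count. What follows is only a rough single-machine sketch of the idea in plain Python, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own assumptions, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data in the cloud"])
print(counts)
```

The appeal of the model is that each map call and each reduce call depends only on its own input, which is what lets a framework like Hadoop run them in parallel on separate nodes.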
In the cloud you can easily, quickly and inexpensively provision massive clusters of high-powered servers to analyze vast amounts of data stored wherever and however is most appropriate. You have the choice of building your own machines from scratch or consuming one of the higher-level services provided. Amazon’s Elastic MapReduce (EMR) service, for example, is a managed Hadoop cluster available as a service.
Still, there are many organizations that build their own Hadoop clusters on-premises and will continue to do so. To do that, you can use one of a number of packaged distributions (Cloudera, Hortonworks, EMC) or download Hadoop directly from Apache. So, whether you use the cloud or not, it is pretty easy to get started with Hadoop.
To learn more about the various technologies and techniques used to process and analyze Big Data, Learning Tree currently offers four hands-on courses.
All are available in person at a Learning Tree education center and remotely via AnyWare.