When teaching Learning Tree’s Cloud Computing course, a question I am commonly asked is ‘What is Hadoop?’. There is a large and rapidly growing interest in Hadoop because many organisations have very large data sets that need analysing, and this is where Hadoop can help. Hadoop is a scalable system for data storage and processing, and its architecture is fault tolerant. A key characteristic is that Hadoop scales economically to handle data-intensive applications by making use of commodity hardware.
Example usage scenarios for Hadoop include risk analysis and market trend analysis on large financial data sets, and shopper recommendation engines for online retailers. Facebook uses Hadoop to analyse user behaviour and the effectiveness of its advertisements. To make all this work, Hadoop creates clusters of machines that can be scaled out and distributes work amongst them. Core to this is the Hadoop Distributed File System (HDFS), which enables user data to be split across many machines in the cluster. To enable the data to be processed in parallel, Hadoop uses MapReduce. MapReduce maps the compute task across the cluster and then reduces all the results back into a coherent whole for the user.
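To make the map and reduce steps concrete, here is a minimal sketch in plain Python that counts words across several chunks of text. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are made up for illustration, and everything runs in a single process, whereas on a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    mapped = []
    # Illustration only: here the chunks are mapped one after another,
    # whereas a cluster would map them in parallel on separate machines.
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: two chunks standing in for data split across machines.
chunks = ["hadoop stores data", "hadoop processes data in parallel"]
counts = word_count(chunks)
print(counts)
```

The point of the split is that map_phase only ever looks at one chunk and reduce_phase only ever looks at one key, so neither step needs to see the whole data set, which is what allows the work to be distributed.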
Hadoop with MapReduce is an incredibly powerful combination and is available, for instance, on Amazon AWS as a Cloud Computing service. Further Apache projects built around Hadoop add to its power, including Hive, a data warehousing facility that builds structure on the unstructured Hadoop data. HBase, the Hadoop database, provides real-time read/write access to Hadoop data, and Mahout is a machine learning library that can be used on Hadoop.
In summary, Hadoop is an incredibly powerful large-scale data storage and processing facility that, when combined with the supporting tools, enables businesses to analyse their data in ways that previously required expensive specialist hardware and software. With companies such as Microsoft adopting Hadoop and a large ecosystem of support companies rapidly appearing, Hadoop has a big role to play in business intelligence, particularly for medium and large enterprises.