The HDInsight service on Azure has been in preview for some time. I have been eager to start working with it, as the idea of leveraging Hadoop from my favorite .NET programming language has great appeal. Sadly, I had never managed to launch a cluster successfully. Not, that is, until today. Perhaps I had not been patient enough in previous attempts, although on most tries I waited over an hour. Today, however, I launched a cluster in the West US region that was up and running in about 15 minutes.
Once the cluster is running, it can be managed through a web-based dashboard. It appears, however, that the dashboard will be eliminated in the future and that management will be done using PowerShell. I do hope that some kind of console interface remains, but there is no guarantee of that.
Figure 1. HDInsight Web-based dashboard
To make it easy to get started, Microsoft provides some sample job flows. You can deploy any or all of these jobs to the provisioned cluster, execute them, and look at the output. All the files needed to define the job flow and its programming logic are supplied, and they can also be downloaded and examined. I wanted to use a familiar language to write my mapper and reducer, so I selected the C# sample. This is a simple word count job, which is commonly used as an easily understood application of MapReduce. In this case the mapper and reducer are just simple C# console programs that read from stdin and write to stdout; the job flow redirects those streams to files or Azure Blob storage.
Figure 2. Word count mapper and reducer C# code
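To make the stdin/stdout contract concrete, here is a minimal sketch of what such a mapper and reducer could look like. This is not the code from the Microsoft sample. The class name WordCount, the map/reduce command-line switch, and the tab-separated "word, count" line format are my own assumptions, chosen to match the usual Hadoop streaming convention in which the reducer receives its input lines already grouped by key.

```csharp
using System;
using System.IO;

// Illustrative sketch only; NOT the sample shipped with HDInsight.
// Assumed convention: records are "<key>\t<value>" lines, and the
// framework sorts mapper output by key before the reducer sees it.
public static class WordCount
{
    // Mapper: emit "<word>\t1" for every whitespace-separated token.
    public static void Map(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            // A null separator array means "split on any whitespace".
            var words = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            foreach (var word in words)
            {
                output.WriteLine(word.ToLowerInvariant() + "\t1");
            }
        }
    }

    // Reducer: sum the counts of consecutive lines that share a key.
    // Lines that do not parse as "<key>\t<integer>" are skipped.
    public static void Reduce(TextReader input, TextWriter output)
    {
        string current = null;
        long count = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            long n;
            if (parts.Length != 2 || !long.TryParse(parts[1], out n))
            {
                continue;
            }
            if (parts[0] != current)
            {
                // Key changed: flush the total for the previous key.
                if (current != null)
                {
                    output.WriteLine(current + "\t" + count);
                }
                current = parts[0];
                count = 0;
            }
            count += n;
        }
        if (current != null)
        {
            output.WriteLine(current + "\t" + count);
        }
    }

    // Hypothetical entry point: one executable, mode chosen by argument.
    public static int Main(string[] args)
    {
        if (args.Length == 1 && args[0] == "map")
        {
            Map(Console.In, Console.Out);
            return 0;
        }
        if (args.Length == 1 && args[0] == "reduce")
        {
            Reduce(Console.In, Console.Out);
            return 0;
        }
        Console.Error.WriteLine("usage: WordCount map|reduce");
        return 1;
    }
}
```

Because both halves only touch a TextReader and a TextWriter, they can be tried locally by piping a text file through the map step, sorting the result, and piping that through the reduce step, with no cluster involved.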
One nice thing about HDInsight is that its output is straightforward to work with using the Microsoft BI tools. For example, the output from the job above can be consumed in Excel using the Power Query add-in.
Figure 3. Consuming HDInsight data in Excel using Power Query
That, however, is a discussion topic for another time!