Friday, April 18, 2014

HDInsight Client Lib Versions

Updated: 04/18/2014
Windows Azure HDInsight, 04/16/2014
Microsoft .NET API for Hadoop, 04/16/2014
Microsoft .NET Map Reduce API for Hadoop, 0.12.5126.42915, 01/14/2014
Microsoft .NET API for Hadoop WebClient, 0.12.5126.42915, 01/14/2014
Microsoft Linq to Hive, 0.12.5126.42915, 01/14/2014
Microsoft .NET API Library for Avro, 04/16/2014

Saturday, February 15, 2014

Azure Scheduler: simple, but not a workflow manager like Oozie

Azure Scheduler was released about a month ago as a public preview. I worked with the Azure Scheduler team to implement our scheduler manager (worker role) for managing our HDInsight jobs a few weeks before the public preview release, at the end of 2013. A scheduler often appears simple, but it really isn't. Many years ago, I was involved in a big proposal on scheduling for a local (Seattle) biotech company, right after they had spent a very handsome sum on a scheduling system designed for the aviation industry (airlines, etc.).

Azure Scheduler has now been running for us for over two months, and so far it is pretty good for simple jobs. However, it is nowhere near Hadoop's Oozie; on the other hand, Oozie lives inside Hadoop, which makes it hard to manage. The Scheduler is not a workflow manager and never will be, so we are still developing our own workflow engine.

HDInsight 3.0 preview is NOT Hadoop 2.2

Here is the component list for Hadoop 2.2, the HDI 3.0 preview, and HDI 2.1 GA.

More detail on HDI 3.0 can be found here.

Sunday, December 8, 2013

Wednesday, November 20, 2013

The “transient” cluster pattern?

The issue came up: what if our data source is on another Azure storage account that is not attached to the HDI cluster, i.e., not listed in the core-site.xml file? The HDI client lib does not support attaching new storage once the cluster is provisioned.
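For reference, an attached storage account shows up in the cluster's core-site.xml roughly like this. The account name and key below are placeholders, and the exact property layout on HDI may differ slightly; this is the stock hadoop-azure (WASB) convention:

```xml
<!-- Sketch: how an extra WASB storage account is registered in core-site.xml. -->
<!-- "mystorageacct" and the key value are placeholders, not real credentials. -->
<property>
  <name>fs.azure.account.key.mystorageacct.blob.core.windows.net</name>
  <value>PLACEHOLDER-STORAGE-ACCOUNT-KEY</value>
</property>
```

If the account your data lives on has no such entry, jobs on the cluster cannot read from it.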

The suggestion I received as a TAP program member is that the HDI cluster is meant to follow a “transient” cluster pattern. My interpretation: provision HDI on demand. Note that HDI provisioning takes about 10 minutes; here is one breakdown from my observation:

Accepted: 2 mins
Windows Azure VM Configuration: 4 mins
HDInsight Configuration: 3 mins
Operational: 1 min
Running: ready.

The good part is that its default WASB can be reused, so there is no need to move data around. If you need to attach new storage, just delete the existing cluster and attach the storage during HDI provisioning.

I will probably go for this approach, especially since I was told my subscription cost last month was the highest on our team. Why? Because I had many HDI clusters "running". I should delete them once I am done..... that is "transient".
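The transient pattern boils down to: provision, run, always delete. A minimal sketch of that control flow, where `provision_cluster`, `submit_and_wait`, and `delete_cluster` are hypothetical stand-ins for the real HDI client lib calls (not actual API names):

```python
# Sketch of the "transient" cluster pattern: provision on demand,
# run the jobs, then delete the cluster so it stops billing.
# The three helpers are hypothetical placeholders, not the real HDI API.

def provision_cluster(name, storage_accounts):
    """Placeholder: provision an HDI cluster with the given storage attached."""
    print(f"provisioning {name} with {storage_accounts} (~10 min)...")
    return {"name": name, "state": "Running"}

def submit_and_wait(cluster, job):
    """Placeholder: submit an M/R job and block until it finishes."""
    print(f"running {job} on {cluster['name']}...")
    return "Completed"

def delete_cluster(cluster):
    """Placeholder: tear the cluster down; data in the default WASB survives."""
    cluster["state"] = "Deleted"

def run_transient(jobs):
    cluster = provision_cluster("my-hdi", ["wasb-main", "wasb-extra"])
    try:
        return [submit_and_wait(cluster, job) for job in jobs]
    finally:
        delete_cluster(cluster)  # always clean up: this is the "transient" part
```

The `try/finally` is the point: the cluster is deleted even if a job blows up, so nothing is left "running" on the bill.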

HDInsight 2.1 release

The HDInsight cluster now has a 2.1, a.k.a. GA, release, as does its client lib. You can get it from NuGet in Visual Studio.

I was having lots of problems porting my HDI job submission code (1.5) to an Azure worker role for scheduling, and then discovered that the latest release is a mix of .NET 4.0 and 4.5.

Windows Azure HDInsight targets .NET 4.0 and is used for job submission. Hadoop WebClient targets 4.5 to manage folders and files, because it depends on WebAPI (5.x), which in turn depends on 4.5. Hadoop will fail if the output folder already exists; that is why I need WebClient to manage folders and files while using Azure HDInsight to submit the job. Interestingly, I could do both in the 1.5 release, but not now (2.1).
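For illustration, the "delete the output folder before submitting" step against the raw WebHDFS REST API (the kind of call a file-management client wraps) looks roughly like this. The host, port, and path are placeholder assumptions, and HDInsight's gateway URL may differ from stock Hadoop:

```python
# Sketch: build the WebHDFS URL that removes an existing output folder,
# so a subsequent M/R job can recreate it. Host/port/path are placeholders;
# an HTTP DELETE request to this URL performs the removal.
from urllib.parse import urlencode

def webhdfs_delete_url(host, path, recursive=True):
    """WebHDFS expects op=DELETE, with recursive=true to remove a
    non-empty directory."""
    query = urlencode({"op": "DELETE", "recursive": str(recursive).lower()})
    return f"http://{host}/webhdfs/v1/{path.lstrip('/')}?{query}"

url = webhdfs_delete_url("mycluster:50070", "/user/me/output")
```

Clearing the folder first avoids the "output directory already exists" failure described above.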

Windows Azure HDInsight is .NET 4.0, but your app can be 4.0 or 4.5; I confirmed this.

One issue is that Microsoft.WindowsAzure.Management.Framework.Threading is missing. Solution: you just have to get it from the previous release. :)

The 2nd issue is that you must build for x64; Any CPU will fail you.
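For reference, pinning the build to x64 is a standard MSBuild property; a minimal project-file fragment (the surrounding PropertyGroup conditions from a real csproj are omitted here) looks like:

```xml
<!-- Minimal csproj sketch: force x64 instead of the AnyCPU default. -->
<PropertyGroup>
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
```

The same setting is reachable in Visual Studio under Project Properties → Build → Platform target.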

With all the trouble people have reported, and that I experienced, we probably all failed to pay attention to what the HDI team said in their roadmap:

Please note that the majority of their releases are “experimental”.

HDInsight Map/Reduce Job Submission

To manage HDI jobs remotely, “Job Submission” is replacing the WebHCat client. However.....

The main change is how requests are authenticated. Instead of an admin login, it now requires the subscription certificate. However, this poses a problem for the WebClient + worker role approach: I can only upload the cert to the cloud service that hosts the worker role, yet the worker role can't find the cert. I ran my console app (exe) on the worker role VM: same problem. Only after I install/import the cert manually/directly on that VM does the console app work. The worker role still does not work even after I reboot the worker role VM.

Another issue I discovered: even when the HDI/Hadoop job fails due to an existing output folder, I still get a “completed” status for the HDI M/R job I just submitted.
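A workaround sketch: don't trust the completion flag alone; check the job's exit value as well. The field names below follow stock WebHCat's job-status response (`completed`, `exitValue`), which is an assumption here; the HDI client lib may surface these differently:

```python
# Sketch: decide whether a submitted M/R job really succeeded.
# Field names follow stock WebHCat's GET /templeton/v1/jobs/:jobid
# response shape (an assumption; HDI's client lib may differ).

def job_really_succeeded(status_json):
    """Treat the job as successful only if it completed AND exited cleanly.
    A job that died on an existing output folder is still 'done',
    but carries a non-zero exitValue."""
    completed = status_json.get("completed") == "done"
    exit_value = status_json.get("exitValue", 1)
    return completed and exit_value == 0

# A failed job (e.g. output folder already exists) can still report "done":
failed_status = {"completed": "done", "exitValue": 1}
ok_status = {"completed": "done", "exitValue": 0}
```

In other words, "completed" only means the job finished running, not that it finished successfully.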

Info: HDI cluster 2.1, HDI client lib 1.0.