In this article, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use HDP 2.1 for Linux, a distribution by Hortonworks, which also provides HDP distributions for the Windows platform. We install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.

#### Contents

1 Introduction
2 Step-by-Step: Build the Infrastructure

## Step-by-Step: Install a Hadoop Distribution

Now that we have set up the infrastructure for a Hadoop cluster in Azure, it is time to get our hands dirty with installing the actual Hadoop distribution.

## 1. Install Ambari Server

We start off with installing an Ambari Server that allows for a “graphical” way of installing and deploying Hadoop.

#### 1.1. Set Up Bits

Log on to your master node (in this case oldkHDPm) as root. This node will serve as the main installation host. Download the Ambari repository file. Since we use CentOS 6 as our platform, fetch it as follows:

wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo

Next, copy the file to your yum.repos.d directory:

cp ambari.repo /etc/yum.repos.d

You can confirm that the repository is configured by running yum repolist. You then obtain a list of repo IDs and repo names as marked in blue below. The command may vary depending on the platform (see here for more information).

Now, we can install the Ambari bits by running yum install ambari-server.
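The steps of this subsection can be collected into a small script. The following is a minimal sketch, not part of the Hortonworks instructions: the `run` helper and the `DRY_RUN` switch are hypothetical names added for illustration, so that the commands are only printed unless `DRY_RUN=0` is set. The `-y` flag, which lets yum proceed without asking for confirmation, is also an addition.

```shell
#!/bin/sh
# Sketch of section 1.1 as one function; run as root on the master node.
REPO_URL="http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo"
DRY_RUN="${DRY_RUN:-1}"   # assumption: print commands by default, execute only if 0

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Download the repo file, register it with yum, verify, then install.
install_ambari_bits() {
  run wget "$REPO_URL" &&
  run cp ambari.repo /etc/yum.repos.d &&
  run yum repolist &&
  run yum install -y ambari-server
}
```

Calling `install_ambari_bits` prints the four commands for review; `DRY_RUN=0 install_ambari_bits` executes them and stops at the first failure.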

#### 1.2. Set Up Ambari

Now that the Ambari server is installed, let us set it up. Run
ambari-server setup

Here, we do not customise the user account for the ambari-server daemon since we have already changed the root password. We likewise accept the default settings. More information can be found here in the Hortonworks documentation.

#### 1.3. Start Ambari

The Ambari server is set up and installed – ready to be started:

ambari-server start
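Setup and start can also be run in one go. This is a minimal sketch, assuming the default settings are acceptable as described above: the `-s` (silent) option of `ambari-server setup` accepts the default prompt values, and `ambari-server status` reports whether the server is running. `AMBARI_CMD` is a hypothetical override that is not part of Ambari; it only allows the sequence to be rehearsed, for example with `echo`.

```shell
#!/bin/sh
# Sketch: sections 1.2 and 1.3 in one function; run as root on the master node.
# AMBARI_CMD is a hypothetical override so the sequence can be rehearsed safely.
AMBARI_CMD="${AMBARI_CMD:-ambari-server}"

setup_and_start_ambari() {
  "$AMBARI_CMD" setup -s &&   # -s: silent mode, accepts the default prompt values
  "$AMBARI_CMD" start &&
  "$AMBARI_CMD" status        # reports whether the server is running
}
```

Setting `AMBARI_CMD=echo` before calling `setup_and_start_ambari` prints the three subcommands instead of executing them.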

To have a look at the Ambari server processes, type in:

ps -ef | grep ambari

If more than one Ambari process is running, kill the other process as follows:
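A minimal sketch of how the process IDs can be read from the listing: the helper `ambari_pids` is a hypothetical name, and it assumes the standard `ps -ef` column layout in which the second column is the PID. It only prints the IDs; deciding which process to stop, and issuing the `kill` itself, is left to you.

```shell
#!/bin/sh
# Sketch: list the PIDs of Ambari processes from a `ps -ef` listing on stdin.
# Assumes the standard `ps -ef` column layout (UID PID PPID ...).
ambari_pids() {
  # Skip the grep helper itself, then print column 2 (the PID).
  awk '/ambari/ && !/grep/ { print $2 }'
}

# Usage (as root on the master node):
#   ps -ef | ambari_pids
#   kill <PID>      # <PID> is a placeholder for the stray process ID
```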

Now we are ready to install the Hadoop distribution, i.e. HDP 2.1, using Ambari. Again, we follow the Hortonworks documentation (here). Open the Ambari web interface in a browser at:

http://{ambari-server}:8080

In this case: http://oldkHDPm.oldkHDP.oliviak.com:8080
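Before opening the browser, it can help to check that the address answers. The following is a small sketch, assuming that `curl` is available on the machine you test from; `ambari_url` is a hypothetical helper that only assembles the address from the host name and port.

```shell
#!/bin/sh
# Sketch: build the Ambari web UI address for a given server host.
# Port 8080 is the one used in this article; a second argument overrides it.
ambari_url() {
  printf 'http://%s:%s\n' "$1" "${2:-8080}"
}

# Usage from any machine that can resolve the master node's name:
#   curl -sI "$(ambari_url oldkHDPm.oldkHDP.oliviak.com)" | head -n 1
```

If the server is up, the first line of the response is an HTTP status line; a connection error suggests that the server is not started or the port is not reachable.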

Name your cluster (see Hortonworks documentation), e.g. oldkHDPcluster:

Select your desired stack (see Hortonworks docs). We choose the latest for the time being, i.e. HDP 2.1:

The next window specifies the install options. Before we go into it, we take a little detour to see how to copy the SSH private key onto the DNS server.

#### Detour: How to Copy the SSH Private Key to the Local Machine

For that purpose, we install WinSCP, which enables secure file transfer between a local and a remote computer. Once installed, log in to the master node (i.e. oldkHDP.cloudapp.net, port 22) using its credentials:

Use the WinSCP client to download the private SSH key (i.e. id_rsa) of the master node onto the DNS server:

Once downloaded onto the “local” machine, i.e. our DNS server, we can browse for it in the “Install Options” window:

Additionally, type in all the target hosts of your Hadoop cluster. In this case, it includes the master node and the three worker nodes:

oldkHDPm.oldkHDP.oliviak.com
oldkHDPw[1-3].oldkHDP.oliviak.com
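The second entry is a pattern expression: the range in square brackets stands for the three worker nodes. To make the expansion explicit, here is a minimal sketch in which `expand_hosts` and `cluster_hosts` are hypothetical helpers written for this article's host names; they are not Ambari tools.

```shell
#!/bin/sh
# Sketch: expand a numeric host range such as oldkHDPw[1-3] into explicit names.
# expand_hosts PREFIX FIRST LAST DOMAIN  (hypothetical helper, not an Ambari tool)
expand_hosts() {
  i="$2"
  while [ "$i" -le "$3" ]; do
    printf '%s%s.%s\n' "$1" "$i" "$4"
    i=$((i + 1))
  done
}

# All four target hosts of this article's cluster: one master, three workers.
cluster_hosts() {
  echo "oldkHDPm.oldkHDP.oliviak.com"
  expand_hosts oldkHDPw 1 3 oldkHDP.oliviak.com
}
```

Running `cluster_hosts` lists the master node followed by oldkHDPw1 to oldkHDPw3, one fully qualified name per line.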

When registering and confirming, you will be prompted with another window containing the host name pattern expressions:

Success – the hosts are confirmed. Have a look at the Hortonworks documentation for more information.

You may or may not receive some warnings as shown in the yellowish area:

It turns out that the ntpd services are not running but are required to be. You could run the HostCleanup Python script on each host…

…or manually get the ntpd services to run, by running

chkconfig ntpd on

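on each host, i.e. the master and all three worker nodes. Since chkconfig only enables the service at boot, also run service ntpd start to start it immediately.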

To check the status of the ntpd services, run

service ntpd status
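Repeating these commands on four hosts by hand is tedious. The following sketch prints one ssh command per host so that the list can be reviewed before anything is executed. It assumes that root can ssh from the master node to every host, which may not match your setup; `ntpd_commands`, `HOSTS` and `DOMAIN` are hypothetical names.

```shell
#!/bin/sh
# Sketch: print the ssh commands that enable, start and check ntpd on each host.
# Assumes root can ssh from the master node to every host (may differ in your setup).
HOSTS="oldkHDPm oldkHDPw1 oldkHDPw2 oldkHDPw3"
DOMAIN="oldkHDP.oliviak.com"

ntpd_commands() {
  for h in $HOSTS; do
    # chkconfig enables ntpd at boot; service ... start launches it right away.
    echo "ssh root@$h.$DOMAIN 'chkconfig ntpd on && service ntpd start && service ntpd status'"
  done
}

# Review the output first, then execute it with:
#   ntpd_commands | sh
```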

Back in the browser on the DNS server, rerun the checks:

Next, you can choose the services you wish to install on your Hadoop cluster (see HDP documentation).

Next, select the hosts on which certain master components should run (see HDP doc). In this case, I choose to assign the master components of the Hive Server and the Oozie Server to the master node.

With the Ambari wizard, slave components (i.e. DataNodes, NodeManagers and RegionServers) can be appropriately assigned to certain hosts in the next window (see HDP doc).

Now you can manage the configuration settings for the Hadoop components along the tabs:

For instance, under HDFS we change the directories from

to the following:

For the remaining tabs marked with warnings, credentials are required, such as for Nagios,

...Hive...

...and Oozie:

Finally, before deploying the Hadoop cluster you obtain the usual summary of configuration settings:

It contains the following information:

Services

• HDFS
  ◦ NameNode: oldkHDPm.oldkHDP.oliviak.com
  ◦ SecondaryNameNode: oldkHDPw1.oldkHDP.oliviak.com
  ◦ DataNodes: 3 hosts
• YARN + MapReduce2
  ◦ NodeManager: 3 hosts
  ◦ ResourceManager: oldkHDPw1.oldkHDP.oliviak.com
  ◦ History Server: oldkHDPw1.oldkHDP.oliviak.com
  ◦ App Timeline Server: oldkHDPw1.oldkHDP.oliviak.com
• Tez
  ◦ Clients: 1 host
• Nagios
  ◦ Server: oldkHDPm.oldkHDP.oliviak.com
• Ganglia
  ◦ Server: oldkHDPm.oldkHDP.oliviak.com
• Hive + HCatalog
  ◦ Hive Metastore: oldkHDPm.oldkHDP.oliviak.com
  ◦ Database: MySQL (New Database)
• HBase
  ◦ Master: oldkHDPm.oldkHDP.oliviak.com
  ◦ RegionServers: 3 hosts
• Pig
  ◦ Clients: 1 host
• Sqoop
  ◦ Clients: 1 host
• Oozie
  ◦ Server: oldkHDPm.oldkHDP.oliviak.com
  ◦ Database: Derby (New Derby Database)
• Zookeeper
  ◦ Servers: 3 hosts
• Falcon
  ◦ Server: oldkHDPw1.oldkHDP.oliviak.com
• Storm
  ◦ Nimbus: oldkHDPm.oldkHDP.oliviak.com
  ◦ Storm REST API Server: oldkHDPm.oldkHDP.oliviak.com
  ◦ Storm UI Server: oldkHDPm.oldkHDP.oliviak.com
  ◦ DRPC Server: oldkHDPm.oldkHDP.oliviak.com
  ◦ Supervisor: 3 hosts

And away you deploy: