Step 1 : Check Ubuntu code name
$ cat /etc/lsb-release
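On Ubuntu 14.04 the output should look roughly like the following; the line to check is DISTRIB_CODENAME:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04 LTS"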
Step 2 : Add repository to Ubuntu Trusty
$ sudo wget 'http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/cloudera.list' \
-O /etc/apt/sources.list.d/cloudera.list
Step 3 : Additional step for Trusty
This step ensures that you get the right ZooKeeper package for the current CDH release. You need to prioritize the Cloudera repository you have just added, such that you install the CDH version of ZooKeeper rather than the version that is bundled with Ubuntu Trusty.
To do this, create a file at /etc/apt/preferences.d/cloudera.pref with the following contents:
Package: *
Pin: release o=Cloudera, l=Cloudera
Pin-Priority: 501
Package: *
Pin: release n=raring
Pin-Priority: 100
Package: *
Pin: release n=trusty-cdh5
Pin-Priority: 600
Step 4 : Optionally Add a Repository Key[Ubuntu Trusty]
$ wget http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
$ sudo apt-get update
Step 5 : Install Hadoop in pseudo-distributed mode
$ sudo apt-get install hadoop-0.20-conf-pseudo
$ dpkg -L hadoop-0.20-conf-pseudo
$ sudo -u hdfs hdfs namenode -format
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
[If the services do not start, edit the hadoop-env.sh file as shown below:
sudo gedit /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
]
$ sudo -u hdfs hadoop fs -mkdir -p /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
$ sudo -u hdfs hadoop fs -ls -R /
$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
$ sudo -u hdfs hadoop fs -mkdir -p /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
$ hadoop fs -ls
$ hadoop fs -ls output
$ hadoop fs -cat output/part-00000 | head
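The job counts occurrences of names matching dfs[a-z.]+ in the configuration files, so the exact lines depend on your configuration, but the output should look something like:
1 dfs.datanode.data.dir
1 dfs.namenode.name.dir
1 dfs.replication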
Configuring Hadoop in CDH5
Step 1 : Configuring Network Names
- Run uname -a and check that the hostname matches the output of the hostname command.
- Make sure the /etc/hosts file on each system contains the IP addresses and fully-qualified domain names (FQDN) of all the members of the cluster (see the example hosts file after this list).
- Make sure the /etc/sysconfig/network file on each system contains the hostname you have just set (or verified) for that system.
- Run /sbin/ifconfig and note the value of inet addr in the eth0 entry.
- Run host -v -t A `hostname` and make sure that hostname matches the output of the hostname command, and has the same IP address as reported by ifconfig for eth0.
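For example, a minimal /etc/hosts layout for a small cluster might look like the following (the addresses and the datanode hostname are placeholders, not values from this guide; for the single-machine pseudo-distributed setup the default 127.0.0.1 localhost entry is sufficient):
192.168.1.10 namenode-host.company.com namenode-host
192.168.1.11 datanode1-host.company.com datanode1-host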
Step 2 : Copy Hadoop Configuration
- $ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster [Copy the default configuration to your custom directory]
- $ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
- $ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster [To manually set the configuration on Ubuntu and SLES systems]
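To verify which configuration directory is currently active, you can optionally display the alternatives entry:
$ sudo update-alternatives --display hadoop-conf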
Step 3 : Configuring HDFS
3.1. core-site.xml [configuration] (sudo gedit /etc/hadoop/conf.my_cluster/core-site.xml)
i. fs.defaultFS -> Specifies the NameNode and the default file system, in the form hdfs://<namenode host>:<namenode port>/. The default value is file:///. The default file system is used to resolve relative paths; for example, if fs.default.name or fs.defaultFS is set to hdfs://mynamenode/, the relative URI /mydir/myfile resolves to hdfs://mynamenode/mydir/myfile. Note: for the cluster to function correctly, the <namenode> part of the string must be the hostname, not the IP address.
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode-host.company.com:8020</value>
</property>
[assuming the host is localhost]
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
3.2. hdfs-site.xml [configuration] (sudo gedit /etc/hadoop/conf.my_cluster/hdfs-site.xml)
i. dfs.permissions.superusergroup -> Specifies the UNIX group containing users that will be treated as superusers by HDFS. You can stick with the value of 'hadoop' or pick your own group depending on the security policies at your site.
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
ii. dfs.name.dir or dfs.namenode.name.dir [on the NameNode]
This property specifies the URIs of the directories where the NameNode stores its metadata and edit logs. Cloudera recommends that you specify at least two directories. One of these should be located on an NFS mount point.
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
iii. dfs.data.dir or dfs.datanode.data.dir [on each DataNode]
This property specifies the URIs of the directories where the DataNode stores blocks. Cloudera recommends that you configure the disks on the DataNode in a JBOD configuration, mounted at /data/1/ through /data/N, and configure dfs.data.dir or dfs.datanode.data.dir to specify file:///data/1/dfs/dn through file:///data/N/dfs/dn/.
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
After specifying these directories as shown above, you must create the directories and assign the correct file permissions to them on each node in your cluster.
In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration.
Local directories:
The dfs.name.dir or dfs.namenode.name.dir parameter is represented by the /data/1/dfs/nn and /nfsmount/dfs/nn path examples.
The dfs.data.dir or dfs.datanode.data.dir parameter is represented by the /data/1/dfs/dn, /data/2/dfs/dn, /data/3/dfs/dn, and /data/4/dfs/dn examples.
3.3. To configure local storage directories for use by HDFS:
3.3.1. On a NameNode host: create the dfs.name.dir or dfs.namenode.name.dir local directories:
$ sudo mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
3.3.2. On all DataNode hosts: create the dfs.data.dir or dfs.datanode.data.dir local directories:
$ sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
3.3.3. Configure the owner of the dfs.name.dir or dfs.namenode.name.dir directory, and of the dfs.data.dir or dfs.datanode.data.dir directory, to be the hdfs user:
$ sudo chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
Here is a summary of the correct owner and permissions of the local directories:
dfs.name.dir or dfs.namenode.name.dir -> hdfs:hdfs -> drwx------
dfs.data.dir or dfs.datanode.data.dir -> hdfs:hdfs -> drwx------
[The Hadoop daemons automatically set the correct permissions for you on dfs.data.dir or dfs.datanode.data.dir. But in the case of dfs.name.dir or dfs.namenode.name.dir, permissions are currently incorrectly set to the file-system default, usually drwxr-xr-x (755). Use the chmod command to reset permissions for these dfs.name.dir or dfs.namenode.name.dir directories to drwx------ (700); for example:
$ sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
[sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn]
or
$ sudo chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn]
3.4. Formatting the NameNode
$ sudo -u hdfs hdfs namenode -format [Before starting the NameNode for the first time, you need to format the file system.]
3.5. Configuring the Secondary NameNode
Add the name of the machine that will run the Secondary NameNode to the masters file.
Add the following property to the hdfs-site.xml file:
<property>
<name>dfs.namenode.http-address</name>
<value><namenode.host.address>:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
[assuming the host is localhost]
<property>
<name>dfs.namenode.http-address</name>
<value>localhost:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
[In most cases, you should set dfs.namenode.http-address to a routable IP address with port 50070. However, you may want to set dfs.namenode.http-address to 0.0.0.0:50070 on the NameNode machine only, and set it to a real, routable address on the Secondary NameNode machine. The different addresses are needed in this case because HDFS uses dfs.namenode.http-address for two different purposes: it defines both the address the NameNode binds to and the address the Secondary NameNode connects to for checkpointing. Using 0.0.0.0 on the NameNode allows the NameNode to bind to all its local addresses, while using the externally routable address on the Secondary NameNode provides the Secondary NameNode with a real address to connect to.]
3.6. Enabling Trash
Trash is configured with the following properties in the core-site.xml file:
fs.trash.interval -> 60
fs.trash.checkpoint.interval -> 60
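Both intervals are specified in minutes. A sketch of the corresponding core-site.xml entries, assuming deleted files should be kept in the trash for 60 minutes:
<property>
<name>fs.trash.interval</name>
<value>60</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>60</value>
</property>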
3.7. Configuring Storage-Balancing for the DataNodes[optional]
3.8. Enabling WebHDFS
Set the following property in hdfs-site.xml:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
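After restarting the NameNode you can sanity-check WebHDFS over plain HTTP; for example, assuming the NameNode web UI is listening on localhost:50070:
$ curl -i "http://localhost:50070/webhdfs/v1/tmp?op=LISTSTATUS"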
Step 4 : Deploying MapReduce v1 (MRv1) on a Cluster [i.e. Configuring JobTracker & TaskTracker]
4.1. mapred-site.xml (sudo gedit /etc/hadoop/conf.my_cluster/mapred-site.xml)
4.1.1 : Configuring Properties for MRv1 Clusters
mapred.job.tracker (on the JobTracker, i.e. on the NameNode host in this setup) [If you plan to run your cluster with MRv1 daemons, you need to specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. If the value is set to local, the default, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. Note: if you specify the host (rather than using local), this must be the hostname (for example mynamenode), not the IP address.]
<property>
<name>mapred.job.tracker</name>
<value>jobtracker-host.company.com:8021</value>
</property>
[assuming the host is localhost]
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
4.1.2: Configure Local Storage Directories for Use by MRv1 Daemons
mapred.local.dir (on each TaskTracker ) [This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specifies a directory on each of the JBOD mount points; for example, /data/1/mapred/local through /data/N/mapred/local. ]
<property>
<name>mapred.local.dir</name>
<value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
</property>
4.2. Create the mapred.local.dir local directories:
$ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
4.3. Configure the owner of the mapred.local.dir directories to be the mapred user [permissions: drwxr-xr-x]:
$ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
4.4. Configure a Health Check Script for DataNode Processes
#!/bin/bash
# Print an ERROR line if no DataNode process is running on this host.
if ! jps | grep -q DataNode ; then
echo "ERROR: datanode not up"
fi
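For the TaskTracker's health checker to run this script, it must be referenced from mapred-site.xml; the node is reported unhealthy when the script prints a line beginning with ERROR. A minimal sketch, assuming the script has been saved at the hypothetical path /usr/lib/hadoop/bin/check_datanode.sh and made executable:
<property>
<name>mapred.healthChecker.script.path</name>
<value>/usr/lib/hadoop/bin/check_datanode.sh</value>
</property>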
4.5. Enabling JobTracker Recovery
By default JobTracker recovery is off, but you can enable it by setting the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml.
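A sketch of the corresponding mapred-site.xml entry:
<property>
<name>mapreduce.jobtracker.restart.recover</name>
<value>true</value>
</property>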
4.6. If Necessary, Deploy your Custom Configuration to your Entire Cluster [already done in previous step]
To deploy your configuration to your entire cluster:
Push your custom directory (for example /etc/hadoop/conf.my_cluster) to each node in your cluster; for example:
$ scp -r /etc/hadoop/conf.my_cluster myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster
Manually set alternatives on each node to point to that directory, as follows.
To manually set the configuration on Ubuntu and SLES systems:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
4.7. If Necessary, Start HDFS on Every Node in the Cluster
Start HDFS on each node in the cluster, as follows:
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
4.8. If Necessary, Create the HDFS /tmp Directory[already created in the previous steps]
Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
4.9. Create MapReduce /var directories [already created in the previous steps]
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
4.10. Verify the HDFS File Structure
$ sudo -u hdfs hadoop fs -ls -R /
You should see:
drwxrwxrwt - hdfs supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
4.11. Create and Configure the mapred.system.dir Directory in HDFS
After you start HDFS and create /tmp, but before you start the JobTracker, you must also create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.
To create the directory in its default location:
$ sudo -u hdfs hadoop fs -mkdir -p /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
4.12. Start MapReduce
To start MapReduce, start the TaskTracker and JobTracker services:
On each TaskTracker system:
$ sudo service hadoop-0.20-mapreduce-tasktracker start
On the JobTracker system:
$ sudo service hadoop-0.20-mapreduce-jobtracker start
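To confirm that the daemons came up, you can optionally list the running Java processes (jps is part of the JDK; running it as root also shows processes owned by the hdfs and mapred users):
$ sudo jps
If a daemon is missing, its log file (typically under /var/log/hadoop-0.20-mapreduce/ for the MRv1 daemons) is the first place to look.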
4.13. Create a Home Directory for each MapReduce User[already created in the previous steps]
Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:
$ sudo -u hdfs hadoop fs -mkdir /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>
where <user> is the Linux username of each user.
Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
4.14. Set HADOOP_MAPRED_HOME
For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:
sudo gedit ~/.bashrc
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_HOME=/usr/lib/hadoop
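The new variables take effect in new shells; to apply them to the current shell, reload the file:
$ source ~/.bashrc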
5. Troubleshooting
Though the above steps are enough to run Hadoop successfully on your machine, you might still run into problems from time to time. For example, I once hit a "JAVA_HOME is not set" error even though I had set the $JAVA_HOME variable in .bashrc. So what is the issue? The issue is simple: the Hadoop runtime environment cannot find JAVA_HOME because it is not set in the hadoop-env.sh file. To resolve this, do the following:
sudo gedit /etc/hadoop/conf/hadoop-env.sh
Add the following line, adjusting the path to match your JDK installation:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
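After saving the file, restart the HDFS services so that the daemons pick up the new JAVA_HOME (this reuses the same init.d loop from Step 5):
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x restart ; done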
Hadoop Web Interfaces
Hadoop comes with several web interfaces, which are by default (see conf/hadoop-default.xml) available at these locations:
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon