
Analytics Tutorial: Learn Linear Regression in R

The R-Factor

There is often a gap in what we are taught in college and the knowledge that we need to possess to be successful in our professional lives. This is exactly what happened to me when I joined a consultancy firm as a business analyst. At that time I was a fresher coming straight from the cool college atmosphere, newly exposed to the Corporate Heat.
One day my boss called me to his office and told me that one of their clients, a big insurance company, was facing significant losses on auto insurance. They had hired us to identify and quantify the factors responsible for it. My boss emailed me the data that the company had provided and asked me to do a multivariate linear regression analysis on it. My boss told me to use R and make a presentation of the summary.
Now as a statistics student I was quite aware of the principles of a multivariate linear regression, but I had never used R. For those of you who are not aware, R is a statistical programming language. It is a very powerful tool and widely used across the world in analyzing data. Of course, I did not know this at that time.
Anyway, it took me a lot of surfing on the internet and reading books to learn how to fit my model in R, and now I want to help you save that time!
R is an open source tool easily available on the internet. I'll assume you have it installed on your computer; if not, you can easily download and install it from www.r-project.org/
I have already converted the raw data file from the client into a clean .csv (comma-separated) file. Click here to download the file.
I've saved it on the D drive of my computer in a folder called Linear_Reg_Sample. You can save it anywhere, but remember to change the path wherever a file path is mentioned.
Open the R software that you've installed. It's time to get started!

Let's Start Regression in R

The first thing to do is obviously read all our data in R. This can be easily done using the command: >LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")
Here we read all the data into an object LinRegData, using a function read.csv().
NOTE: If you observe closely, you'll see that we have used \\ instead of \. This is because the backslash is an escape character in R strings, so it has to be doubled. Whenever you enter a Windows path, make sure to use \\ (a single forward slash / also works).
Let's see if our data has been read by R. Use the following command to get a summary of the data: >summary(LinRegData)
This will give the following output:
Image 1: Summary of the input data
In the output you can see the distribution of the data: the min, max, median, and mean are shown for all the variables.

Performing the Regression Analysis

Now that the data has been loaded, we need to fit a regression model over it.
We will use the following command in R to fit the model:  >FitLinReg <- lm(Capped_Losses ~ Number_Vehicles + Average_Age + Gender_Dummy + Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, LinRegData)
In this command, we create an object FitLinReg and store the results of our regression model in it. The lm() function is used to fit the model. Inside the formula, Capped_Losses is our dependent variable, which we are trying to explain using the other variables, separated by + signs. The last argument of lm() is the data source.
If no error is displayed, it means our regression is done and the results are stored in FitLinReg. We can see the results using two commands:
 
1. >FitLinReg
This gives the output:
 
Image 2: Output of FitLinReg
 
 
2. >summary(FitLinReg)
This gives the output:
Image 3: Output of summary(FitLinReg)

The summary command gives us the coefficient estimate for each variable, its standard error, t value and significance.
The output also tells us what the significance level of each variable is. For example, *** means the variable is significant at the 0.001 level, ** at the 0.01 level, * at the 0.05 level, and a blank next to the variable indicates that it is not significant.
We can easily see that the Number_Vehicles variable is not significant and does not affect the model. We can remove this variable from the model.
If you go through what we've done till now, you will realize that it took us just two commands to fit a multivariate model in R. See how simple life has become!!!

Happy Ending!

In this way I learnt how to fit a regression model using R. I summarised my findings and presented them to the client.
My boss was rather happy with me and I received a hefty bonus that year.

Hadoop Administration Interview Questions and Answers




It is essential to prepare yourself in order to pass an interview and land your dream job. Here’s the first step to achieving this. The following are some frequently asked Hadoop Administration interview questions and answers that might be useful.

Explain checkpointing in Hadoop and why it is important.

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It's crucial for efficient Namenode recovery and restart, and is an important indicator of overall cluster health.
The Namenode persists filesystem metadata. At a high level, the Namenode's primary responsibility is to store the HDFS namespace: things like the directory tree, file permissions and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
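To make this concrete, here is a minimal Java sketch (the checkpoint directory path is just an example) of the standard HDFS properties that control how often checkpoints are taken and where the checkpointing node keeps its copy of the fsimage. These properties normally live in hdfs-site.xml; the Configuration API is used here only to show the names and value formats.

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettingsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Seconds between two consecutive checkpoints of the namespace.
        conf.set("dfs.namenode.checkpoint.period", "3600");
        // Local directory where the merged fsimage is written (example path).
        conf.set("dfs.namenode.checkpoint.dir", "/data/1/dfs/snn");
        System.out.println(conf.get("dfs.namenode.checkpoint.period"));
    }
}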

What is the default block size in HDFS and what are the benefits of having larger block sizes?

Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, and it is often set even larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data by keeping large amounts of data sequentially organized on the disk. As a result, HDFS is expected to store very large files that are read sequentially. Unlike a file system such as NTFS or EXT, which handles numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes, each.
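As a quick illustration, here is a minimal sketch using the HDFS Java API (the file path is hypothetical) that prints the block size HDFS recorded for a given file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Ask the NameNode for the metadata of one file and print its block size.
        FileStatus status = fs.getFileStatus(new Path("/user/example/bigfile")); // hypothetical path
        System.out.println("Block size in bytes: " + status.getBlockSize());
    }
}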

What are the two main modules which help you interact with HDFS, and what are they used for?

user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use, and -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are dfs and dfsadmin.
The dfs module, also known as 'FsShell', provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
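For illustration, here is a hedged sketch showing that the dfs module ('FsShell') is an ordinary Hadoop Tool, so the same file manipulation commands can also be driven from Java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class DfsModuleSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to: bin/hadoop dfs -ls /
        int exitCode = ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/"});
        System.exit(exitCode);
    }
}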

How can I setup Hadoop nodes (data nodes/namenodes) to use multiple volumes/disks?

Datanodes can store blocks in multiple directories, typically located on different local disk drives. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir/dfs.datanode.data.dir. Datanodes will attempt to place equal amounts of data in each of the directories.
The Namenode also supports multiple directories, which store the namespace image and edit logs. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir/dfs.namenode.name.dir. The namenode directories are used for namespace data replication, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
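A minimal sketch of the value format follows (the directory paths are only examples); in practice these values go into hdfs-site.xml rather than being set in code:

import org.apache.hadoop.conf.Configuration;

public class MultiVolumeConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Comma-separated lists: one entry per local disk/volume.
        conf.set("dfs.datanode.data.dir", "/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn");
        conf.set("dfs.namenode.name.dir", "/data/1/dfs/nn,/nfsmount/dfs/nn");
        System.out.println(conf.get("dfs.datanode.data.dir"));
    }
}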

How do you read a file from HDFS?

The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster config file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: Client validation is checked by username or by a strong authentication mechanism such as Kerberos.
Step 5: The client’s validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, then the NameNode responds with the first block id and provides a list of datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.
If a datanode dies while the file is being read, the client library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block-location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or the read will fail.
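The same read path looks like this through the HDFS Java API (a minimal sketch; the file path is hypothetical). The client library performs the NameNode lookup and datanode reads described in the steps above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml to find the NameNode
        FileSystem fs = FileSystem.get(conf);       // client-side handle to the cluster
        try (FSDataInputStream in = fs.open(new Path("/user/example/file.txt"))) { // hypothetical path
            // Streams block data directly from the datanodes to stdout.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}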

What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?

Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:
  • FIFO (First in First Out) Scheduler
  • Fair Scheduler
  • Capacity Scheduler
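For illustration, a hedged sketch of how the scheduler plug-in named above is selected in MRv1, assuming the mapred.jobtracker.taskScheduler property (normally set in mapred-site.xml on the JobTracker node):

import org.apache.hadoop.conf.Configuration;

public class SchedulerChoiceSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is the FIFO scheduler (org.apache.hadoop.mapred.JobQueueTaskScheduler).
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");
        System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
    }
}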

How do you decide which scheduler to use?

The Capacity Scheduler (CS) can be used under the following situations:
  • When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
  • When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
  • When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
  • When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
  • When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
  • When you have a lot of variability in the utilization between pools, the Fair Scheduler's pre-emption model enables much greater overall cluster utilization by giving away otherwise reserved resources when they're not used.
  • When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used? Where are they specified, and what happens if you don’t specify these parameters?

dfs.name.dir specifies the path of the directory in the Namenode’s local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in each Datanode’s local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the Namenode’s metadata and the Datanodes’ file block information get stored in /tmp under a hadoop-<username> directory. This is not a safe place: when nodes are restarted, data will be lost, and this is critical if the Namenode is restarted, as the formatting information will be lost.

What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?

The filesystem checking utility FSCK is used to check and display the health of the file system and of the files and blocks in it. When used with a path (bin/hadoop fsck / -files -blocks -locations -racks), it recursively shows the health of all files under that path, and when used with ‘/’ it checks the entire file system. By default FSCK ignores files still open for writing by a client; to list such files, run FSCK with the -openforwrite option.
FSCK checks the file system, prints a dot for each healthy file found, and prints a message for the ones that are less than healthy, including the ones which have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks and missing replicas.

What are the important configuration files that need to be updated/edited to setup a fully distributed mode of Hadoop cluster 1.x ( Apache distribution)?

The Configuration files that need to be updated to setup a fully distributed mode of Hadoop are:
  • Hadoop-env.sh
  • Core-site.xml
  • Hdfs-site.xml
  • Mapred-site.xml
  • Masters
  • Slaves
These files can be found in your Hadoop>conf directory. If Hadoop daemons are started individually using ‘bin/Hadoop-daemon.sh start xxxxxx’ where xxxx is the name of daemon, then masters and slaves file need not be updated and can be empty. This way of starting daemons requires command to be issued on appropriate nodes to start appropriate daemons. If Hadoop daemons are started using ‘bin/start-dfs.sh’ and ‘bin/start-mapred.sh’, then masters and slaves configurations files on namenode machine need to be updated.
Masters – IP address/hostname of the node where the secondarynamenode will run.
Slaves – IP addresses/hostnames of the nodes where the datanodes (and eventually tasktrackers) will run.

Creating Your First Map Reduce Programme

Opening the New Java Project wizard

The New Java Project wizard can be used to create a new java project. There are many ways to open this wizard:
  • By clicking on the File menu and choosing New > Java Project
  • By right clicking anywhere in the Project Explorer and selecting New > Java Project
  • By clicking on the New button ( ) in the Tool bar and selecting Java Project

Using the New Java Project wizard

The New Java Project Wizard has two pages.
On the first page:
  • Enter the Project Name
  • Select the Java Runtime Environment (JRE) or leave it at the default
  • Select the Project Layout which determines whether there would be a separate folder for the sources code and class files. The recommended option is to create separate folders for sources and class files.

    You can click on the Finish button to create the project or click on the Next button to change the java build settings.
    On the second page you can change the Java Build Settings like setting the Project dependency (if there are multiple projects) and adding additional jar files to the build path.

     Writing the Mapper Class

    To start with some basic MapReduce code, we will write a Word Count program, which simply counts the number of occurrences of each word in a file and writes out the result.

    First, we write the mapper class, WordCountMapper:


package com.hadoop.training;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into tokens and emit (word, 1) for each token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

    Writing the Reducer Class 

    Next, we write the reducer class, WordCountReducer:
     
package com.hadoop.training;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted for this word by all the mappers.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}


    Writing the MapReduce driver class


    Finally, we write the MapReduce driver class, WordCount:

package com.hadoop.training;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }

        @SuppressWarnings("deprecation")
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

    Running The Map Reduce programme


    $ hadoop jar WC.jar com.hadoop.training.WordCount hdfs://localhost:8020/user/rajeev/input hdfs://localhost:8020/user/rajeev/output

Installing Hadoop CDH5 on Ubuntu 64bit Trusty

Step 1 : Check ubuntu code name
$ cat /etc/lsb-release

Step 2 : Add repository to Ubuntu Trusty

$ sudo wget 'http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/cloudera.list' \
    -O /etc/apt/sources.list.d/cloudera.list

Step 3 : Additional step for Trusty

This step ensures that you get the right ZooKeeper package for the current CDH release. You need to prioritize the Cloudera repository you have just added, such that you install the CDH version of ZooKeeper rather than the version that is bundled with Ubuntu Trusty.
To do this, create a file at /etc/apt/preferences.d/cloudera.pref with the following contents:

Package: *
Pin: release o=Cloudera, l=Cloudera
Pin-Priority: 501

Package: *
Pin: release n=raring
Pin-Priority: 100

Package: *
Pin: release n=trusty-cdh5
Pin-Priority: 600

Step 4 : Optionally Add a Repository Key[Ubuntu Trusty]

$ wget http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
$ sudo apt-get update

Step 5 : Install Hadoop in pseudo mode 

$sudo apt-get install hadoop-0.20-conf-pseudo
$dpkg -L hadoop-0.20-conf-pseudo
$sudo -u hdfs hdfs namenode -format
$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
[If the services do not start, edit the hadoop-env.sh file as shown below:
sudo gedit /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
]
$ sudo -u hdfs hadoop fs -mkdir -p /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
$ sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
$ sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
$ sudo -u hdfs hadoop fs -ls -R /
$ for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ;
done
$ sudo -u hdfs hadoop fs -mkdir -p /user/$USER
$ sudo -u hdfs hadoop fs -chown $USER /user/$USER
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
$ hadoop fs -ls
$ hadoop fs -ls output
$ hadoop fs -cat output/part-00000 | head

Configuring Hadoop in CDH5

Step 1 : Configuring Network Names
  •  Run uname -a and check that the hostname matches the output of the hostname command.
  •  Make sure the /etc/hosts file on each system contains the IP addresses and fully-qualified domain names (FQDN) of all the members of the cluster.
  •  Make sure the /etc/sysconfig/network file on each system contains the hostname you have just set (or verified) for that system
  •  Run /sbin/ifconfig and note the value of inet addr in the eth0 entry.
  •  Run host -v -t A `hostname` and make sure that hostname matches the output of the hostname command, and has the same IP address as reported by ifconfig for eth0.


Step 2 : Copy Hadoop Configuration

  • $ sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster [Copy the default configuration to your custom directory]
  • $ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
  • $ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster [To manually set the configuration on Ubuntu and SLES systems]

Step 3 : Configuring HDFS



3.1. core-site.xml [configuration] (sudo gedit /etc/hadoop/conf.my_cluster/core-site.xml)
i. fs.defaultFS ->  Specifies the NameNode and the default file system, in the form hdfs://<namenode host>:<namenode port>/. The default value is file:///. The default file system is used to resolve relative paths; for example, if fs.default.name or fs.defaultFS is set to hdfs://mynamenode/, the relative URI /mydir/myfile resolves to hdfs://mynamenode/mydir/myfile. Note: for the cluster to function correctly, the <namenode> part of the string must be the hostname, not the IP address.

<property>
 <name>fs.defaultFS</name>
 <value>hdfs://namenode-host.company.com:8020</value>
</property>
[considering host is localhost]
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://localhost:8020</value>
</property>


3.2. hdfs-site.xml [configuration] (sudo gedit /etc/hadoop/conf.my_cluster/hdfs-site.xml)
i. dfs.permissions.superusergroup -> Specifies the UNIX group containing users that will be treated as superusers by HDFS. You can stick with the value of 'hadoop' or pick your own group depending on the security policies at your site.

<property>
 <name>dfs.permissions.superusergroup</name>
 <value>hadoop</value>
</property>
ii.  dfs.name.dir or dfs.namenode.name.dir [on the NameNode]

This property specifies the URIs of the directories where the NameNode stores its metadata and edit logs. Cloudera recommends that you specify at least two directories. One of these should be located on an NFS mount point.

<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>

iii.  dfs.data.dir or dfs.datanode.data.dir [on each DataNode]
This property specifies the URIs of the directories where the DataNode stores blocks. Cloudera recommends that you configure the disks on the DataNode in a JBOD configuration, mounted at /data/1/ through /data/N, and configure dfs.data.dir or dfs.datanode.data.dir to specify file:///data/1/dfs/dn through file:///data/N/dfs/dn/.

<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
 After specifying these directories as shown above, you must create the directories and assign the correct file permissions to them on each node in your cluster.

In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration.

Local directories:

    The dfs.name.dir or dfs.namenode.name.dir parameter is represented by the /data/1/dfs/nn and /nfsmount/dfs/nn path examples.
    The dfs.data.dir or dfs.datanode.data.dir parameter is represented by the /data/1/dfs/dn, /data/2/dfs/dn, /data/3/dfs/dn, and /data/4/dfs/dn examples.

3.3. To configure local storage directories for use by HDFS:

3.3.1. On a NameNode host: create the dfs.name.dir or dfs.namenode.name.dir local directories:

$ sudo mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn

3.3.2. On all DataNode hosts: create the dfs.data.dir or dfs.datanode.data.dir local directories:

$ sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

3.3.3. Configure the owner of the dfs.name.dir or dfs.namenode.name.dir directory, and of the dfs.data.dir or dfs.datanode.data.dir directory, to be the hdfs user:

$ sudo chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

Here is a summary of the correct owner and permissions of the local directories:

 dfs.name.dir or dfs.namenode.name.dir -> hdfs:hdfs -> drwx------
 dfs.data.dir or dfs.datanode.data.dir -> hdfs:hdfs -> drwx------
[The Hadoop daemons automatically set the correct permissions for you on dfs.data.dir or dfs.datanode.data.dir. But in the case of dfs.name.dir or dfs.namenode.name.dir, permissions are currently incorrectly set to the file-system default, usually drwxr-xr-x (755). Use the chmod command to reset permissions for these dfs.name.dir or dfs.namenode.name.dir directories to drwx------ (700); for example:

$ sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
[sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn]
or

$ sudo chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn]

3.4. Formatting the NameNode

sudo -u hdfs hdfs namenode -format [Before starting the NameNode for the first time you need to format the file system. ]

3.5. Configuring the Secondary NameNode
Add the name of the machine that will run the Secondary NameNode to the masters file.
Add the following property to the hdfs-site.xml file:
<property>
  <name>dfs.namenode.http-address</name>
  <value><namenode.host.address>:50070</value>
  <description>
    The address and the base port on which the dfs NameNode Web UI will listen.
  </description>
</property>

[considering host is localhost]
<property>
  <name>dfs.namenode.http-address</name>
  <value>localhost:50070</value>
  <description>
    The address and the base port on which the dfs NameNode Web UI will listen.
  </description>
</property>
[In most cases, you should set dfs.namenode.http-address to a routable IP address with port 50070. However, you may want to set dfs.namenode.http-address to 0.0.0.0:50070 on the NameNode machine only, and set it to a real, routable address on the Secondary NameNode machine. The different addresses are needed in this case because HDFS uses dfs.namenode.http-address for two different purposes: it defines both the address the NameNode binds to, and the address the Secondary NameNode connects to for checkpointing. Using 0.0.0.0 on the NameNode allows the NameNode to bind to all its local addresses, while using the externally-routable address on the Secondary NameNode provides the Secondary NameNode with a real address to connect to.]

3.6. Enabling Trash
Trash is configured with the following properties in the core-site.xml file:
fs.trash.interval -> 60
fs.trash.checkpoint.interval -> 60
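For illustration, a hedged Java sketch of the same feature (the file path is hypothetical): with fs.trash.interval greater than zero, shell deletes go to the user's .Trash directory, and applications can opt in explicitly through org.apache.hadoop.fs.Trash:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.trash.interval", "60");   // minutes before trashed data is permanently deleted
        FileSystem fs = FileSystem.get(conf);
        Trash trash = new Trash(fs, conf);
        // Moves the path into .Trash instead of deleting it immediately.
        boolean moved = trash.moveToTrash(new Path("/user/example/old-data")); // hypothetical path
        System.out.println("Moved to trash: " + moved);
    }
}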

3.7. Configuring Storage-Balancing for the DataNodes[optional]

3.8. Enabling WebHDFS

 Set the following property in hdfs-site.xml:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

Step 4 : Deploying MapReduce v1 (MRv1) on a Cluster [i.e. Configuring Jobtracker & Tasktracker]


4.1. mapred-site.xml (sudo gedit /etc/hadoop/conf.my_cluster/mapred-site.xml)
4.1.1 : Configuring Properties for MRv1 Clusters

mapred.job.tracker(on job tracker i.e. on namenode) [If you plan to run your cluster with MRv1 daemons you need to specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. If the value is set to local, the default, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. Note: if you specify the host (rather than using local) this must be the hostname (for example mynamenode) not the IP address. ]

<property>
 <name>mapred.job.tracker</name>
 <value>jobtracker-host.company.com:8021</value>
</property>

<property>
 <name>mapred.job.tracker</name>
 <value>localhost:8021</value>
</property>

4.1.2: Configure Local Storage Directories for Use by MRv1 Daemons
mapred.local.dir (on each TaskTracker ) [This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specifies a directory on each of the JBOD mount points; for example, /data/1/mapred/local through /data/N/mapred/local. ]

<property>
 <name>mapred.local.dir</name>
 <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>

4.2. Create the mapred.local.dir local directories:

$ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local

4.3. Configure the owner of the mapred.local.dir directory to be the mapred user:[Permissions :- drwxr-xr-x ]

$ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local

4.4. Configure a Health Check Script for DataNode Processes

#!/bin/bash
if ! jps | grep -q DataNode ; then
 echo ERROR: datanode not up
fi

4.5. Enabling JobTracker Recovery

By default JobTracker recovery is off, but you can enable it by setting the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml.
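A minimal sketch of that setting (it normally goes into mapred-site.xml; the Configuration API is used here only to show the property name and type):

import org.apache.hadoop.conf.Configuration;

public class JobTrackerRecoverySketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Off by default; true lets the JobTracker recover running jobs after a restart.
        conf.setBoolean("mapreduce.jobtracker.restart.recover", true);
        System.out.println(conf.getBoolean("mapreduce.jobtracker.restart.recover", false));
    }
}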

4.6. If Necessary, Deploy your Custom Configuration to your Entire Cluster [already done in previous step]

 To deploy your configuration to your entire cluster:

    Push your custom directory (for example /etc/hadoop/conf.my_cluster) to each node in your cluster; for example:

    $ scp -r /etc/hadoop/conf.my_cluster myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster

Manually set alternatives on each node to point to that directory, as follows.

 To manually set the configuration on Ubuntu and SLES systems:

$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

4.7. If Necessary, Start HDFS on Every Node in the Cluster

Start HDFS on each node in the cluster, as follows:

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

4.8. If Necessary, Create the HDFS /tmp Directory[already created in the previous steps]

 Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

4.9. Create MapReduce /var directories [already created in the previous steps]

sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

4.10. Verify the HDFS File Structure

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt   - hdfs     supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

4.11. Create and Configure the mapred.system.dir Directory in HDFS

After you start HDFS and create /tmp, but before you start the JobTracker, you must also create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.

 To create the directory in its default location:

$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system

4.12. Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services

On each TaskTracker system:

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start

4.13. Create a Home Directory for each MapReduce User[already created in the previous steps]

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir  /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER


4.14. Set HADOOP_MAPRED_HOME
For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:

sudo gedit .bashrc
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
export HADOOP_HOME=/usr/lib/hadoop

5. Troubleshooting

Although the above steps are enough to run Hadoop successfully on your machine, you might still run into problems. For example, I once hit a "JAVA_HOME not set or not found" error even though I had set the $JAVA_HOME variable in .bashrc. So what is the issue? The issue is simple: the Hadoop runtime environment cannot find JAVA_HOME because it is not set in the hadoop-env.sh file. To resolve this, do the following:


sudo gedit /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0 

Hadoop Web Interfaces

Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations: the NameNode web UI at http://localhost:50070/, the JobTracker web UI at http://localhost:50030/, and (typically) the TaskTracker web UI at http://localhost:50060/.
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.

Hadoop Interview Question

1) Explain how Hadoop is different from other parallel computing solutions.
2) What are the modes Hadoop can run in?
3) What is a NameNode and what is a DataNode?
4) What is Shuffling in MapReduce?
5) What is the functionality of Task Tracker and Job Tracker in Hadoop? How many instances of a Task Tracker and Job Tracker can be run on a single Hadoop Cluster?
6) How does NameNode tackle DataNode failures?
7) What is InputFormat in Hadoop?
8) What is the purpose of RecordReader in Hadoop?
9) What are the points to consider when moving from an Oracle database to Hadoop clusters? How would you decide the correct size and number of nodes in a Hadoop cluster?
10) If you want to analyze 100TB of data, what is the best architecture for that?
11) What is InputSplit in MapReduce?
12) In Hadoop, if a custom partitioner is not defined, then how is data partitioned before it is sent to the reducer?
13) What is replication factor in Hadoop and what is default replication factor level Hadoop comes with?
14) What is SequenceFile in Hadoop and Explain its importance?
15) What is Speculative execution in Hadoop?
16) If you are the user of a MapReduce framework, then what are the configuration parameters you need to specify?
17) How do you benchmark your Hadoop Cluster with Hadoop tools?
18) Explain the difference between ORDER BY and SORT BY in Hive?
19) What is WebDAV in Hadoop?
20) How many Daemon processes run on a Hadoop System?
21) Hadoop attains parallelism by isolating the tasks across various nodes; it is possible for some of the slow nodes to rate-limit the rest of the program and slows down the program. What method Hadoop provides to combat this?
22) How are HDFS blocks replicated?
23) What will a Hadoop job do if developers try to run it with an output directory that is already present?
24) What happens if the number of reducers is 0?
25) What is meant by Map-side and Reduce-side join in Hadoop?
26) How can the NameNode be restarted?
27) When doing a join in Hadoop, you notice that one reducer is running for a very long time. How will you address this problem in Pig?
28) How can you debug your Hadoop code?
29) What is distributed cache and what are its benefits?
30) Why would a Hadoop developer develop a Map Reduce by disabling the reduce step?
31) Explain the major difference between an HDFS block and an InputSplit.
32) Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In which kind of scenarios MR jobs will be more useful than PIG?
33) What is the need for having a password-less SSH in a distributed environment?
34) Give an example scenario on the usage of counters.
35) Does HDFS make block boundaries between records?
36) What is streaming access?
37) What do you mean by “Heartbeat” in HDFS?
38) If there are 10 HDFS blocks to be copied from one machine to another. However, the other machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down during the time of replication?
39) What is the significance of conf.setMapper class?
40) What are combiners and when are these used in a MapReduce job?
41) Which command is used to do a file system check in HDFS?
42) Explain about the different parameters of the mapper and reducer functions.
43) How can you set random number of mappers and reducers for a Hadoop job?
44) Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason? (Open Ended Question)
45) Explain about the functioning of Master Slave architecture in Hadoop?
46) What is fault tolerance in HDFS?
47) Give some examples of companies that are using Hadoop architecture extensively.
48) How does a DataNode know the location of the NameNode in Hadoop cluster?
49) How can you check whether the NameNode is working or not?
50) Explain about the different types of “writes” in HDFS.

Installing Oracle Java 1.8 on Ubuntu 12.4 64bit

Download Oracle Java 1.8 (64-bit) for Linux from here.

tar -xvf ~/Downloads/jdk-8u40-linux-x64.tar.gz
sudo mkdir -p /usr/lib/jvm/jdk1.8.0

sudo mv jdk1.8.0_40/* /usr/lib/jvm/jdk1.8.0/

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.8.0/bin/javaws" 1

sudo chmod a+x /usr/bin/java
sudo chmod a+x /usr/bin/javac
sudo chmod a+x /usr/bin/javaws
sudo chown -R root:root /usr/lib/jvm/jdk1.8.0
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javaws

mkdir ~/.mozilla/plugins/

ln -s /usr/lib/jvm/jdk1.8.0/jre/lib/amd64/libnpjp2.so ~/.mozilla/plugins/

Setting JAVA_HOME
================
sudo su
which java
gedit .bashrc
#add the following lines
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0
export PATH=$PATH:$JAVA_HOME/bin

echo $JAVA_HOME

java -version 

Additional step: on systems where sudo clears or restricts environment variables, you also need to add the following line to the /etc/sudoers file:

sudo gedit /etc/sudoers

Defaults env_keep+=JAVA_HOME

sudo env | grep JAVA_HOME

Versioning of Hadoop


The main versions or branches of Hadoop are
1.      Version 0.20.0–0.20.2: - The 0.20 branch of Hadoop is said to be the most stable version and is the most commonly used version in production. The first release was in April 2009. Cloudera CDH2 and CDH3 are both based off of this branch.

2.      0.20-append: - This version includes the support for file appends in HDFS which was needed for Apache HBase and was missing in version 0.20. This branch with the file append feature was called 0.20-append. No official release was ever made from the 0.20-append branch.

3.      0.20-security: - Yahoo, one of the major contributors to Apache Hadoop, invested in adding full Kerberos support to core Hadoop. It later contributed this work back to Hadoop in the form of the 0.20-security branch, a version of Hadoop 0.20 with Kerberos authentication support. This branch was later released as the 0.20.20X releases.

4.      0.20.203–0.20.205: - There was a strong desire within the community to produce an official release of Hadoop that included the 0.20-security work. The 0.20.20X releases contained not only security features from 0.20-security, but also bug fixes and improvements on the 0.20 line of development. Generally, it no longer makes sense to deploy these releases as they’re superseded by 1.0.0.


5.      0.21.0: - The 0.21 branch was cut from Hadoop trunk and released in August 2010. This was considered a developer preview or alpha quality release to highlight some of the features that were currently in development at the time. Despite the warning from the Hadoop developers, a small number of users deployed the 0.21 release anyway. This release does not include security, but does have append feature.

6.      0.22.0: - In December 2011, the Hadoop community released version 0.22, which was based on trunk, like 0.21 was. This release includes security, but only for HDFS. Also a bit strange, 0.22 was released after 0.23 with less functionality. This was due to when the 0.22 branch was cut from trunk.

7.      0.23.0: - In November 2011, version 0.23 of Hadoop was released. Also cut from trunk, 0.23 includes security, append, YARN, and HDFS federation. This release has been dubbed a developer preview or alpha-quality release. This line of development is superseded by 2.0.0.

8.      1.0.0: - Version 1.0.0 of Hadoop was released from the 0.20.205 line of development. This means that 1.0.0 does not contain all of the features and fixes found in the 0.21, 0.22, and 0.23 releases. It does include security feature.

9.      1.2.1: - The stable release of the 1.2 line was made on 1 August 2013.

10.  2.0.0-alpha: - In May 2012, version 2.0.0 was released from the 0.23.0 branch and like 0.23.0, is considered alpha-quality and is the first version in the hadoop-2.x series. This includes YARN and removes the traditional MRv1 jobtracker and tasktracker daemons. While YARN is API compatible with MRv1, the underlying implementation is different. This includes
  • YARN aka NextGen MapReduce
  • HDFS Federation
  • Performance
  • Wire-compatibility for both HDFS and YARN/MapReduce
11.  2.1.0-beta: - Hadoop 2.1.0-beta includes the following significant improvements over the previous 1.x stable releases:
  • HDFS Federation
  • MapReduce NextGen aka YARN aka MRv2
  • HDFS HA for NameNode (manual failover)
  • HDFS Snapshots
  • Support for running Hadoop on Microsoft Windows
  • YARN API stabilization
  • Binary compatibility for MapReduce applications built on hadoop-1.x
  • Substantial amount of integration testing with the rest of the projects in the ecosystem

What is Hadoop



Hadoop is a Java-based open source programming framework that supports the storage and processing of large data sets in a distributed computing environment.  Hadoop is designed to run on a large number of commodity hardware machines that don’t share any memory or disks and can scale up or down without system interruption.  It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop consists of three main functions: storage, processing and resource management. Storage is accomplished with the Hadoop Distributed File System (HDFS). HDFS is a reliable distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers. Processing, or computation, in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated nodes or machines. YARN, introduced in Hadoop 2.0, performs the resource management function and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.



Hadoop was created by Doug Cutting. The underlying technology was invented by Google so as to index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.
Although Hadoop is best known for MapReduce, HDFS and Yarn, the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. Below are some of the ASF projects included in a typical Hadoop distribution.
MapReduce: - MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. 
HDFS: - HDFS is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
Apache Hadoop YARN: - YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
Apache Tez: - Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex Directed Acyclic Graph (DAG) of tasks for near real-time big data processing.
Apache Pig: - A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
Apache HCatalog: - A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Hive: -Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
Apache HBase: - A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
Apache Mahout: - Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based collaborative filtering.
Apache Accumulo: - Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big Table design that works on top of Apache Hadoop and Apache ZooKeeper.
Apache Flume: - Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
Apache Sqoop: - Sqoop is a tool that speeds and eases movement of data in and out of Hadoop to and from RDBMS. It provides a reliable parallel load for various, popular enterprise data sources.
Apache ZooKeeper: - A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
Apache Ambari: - An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Oozie: - Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache Falcon: - Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.
Apache Knox: - The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.

Installing Hadoop in a pseudo distributed mode

Step 1: Run the following command to install hadoop from yum repository in a pseudo distributed mode

sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify if the packages are installed properly

rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode

sudo -u hdfs hdfs namenode -format

Step 4: Stop existing services (As Hadoop was already installed for you, there might be some services running)

$ for service in /etc/init.d/hadoop*
> do
> sudo $service stop
> done

Step 5: Start HDFS

$ for service in /etc/init.d/hadoop-hdfs-*
> do
> sudo $service start
> done

Step 6: Verify if HDFS has started properly (In the browser)

http://localhost:50070

Step 7: Create the /tmp directory
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
Step 8: Create mapreduce specific directories
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
Step 9: Verify the directory structure
$ sudo -u hdfs hadoop fs -ls -R /
Output should be
drwxrwxrwt - hdfs   supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs   supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
> sudo $service start
> done
Step 11: Verify if MapReduce has started properly (In Browser)
http://localhost:50030
Step 12: Verify if the installation went on well by running a program
Step 12.1: Create a home directory on HDFS for the user
sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training
Step 12.2: Make a directory in HDFS called input and copy some XML files into it by running the following commands
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/mapred-site.xml
Step 12.3: Run an example Hadoop job to grep with a regular expression
in your input data.
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
Step 12.4: After the job completes, you can find the output in the HDFS
directory named output because you specified that output directory to
Hadoop.
$ hadoop fs -ls
Found 2 items:
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output
Step 12.5: List the output files
$ hadoop fs -ls output
Found 2 items
drwxr-xr-x - joe supergroup    0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r-- 1 joe supergroup    0 2009-02-25 10:33 /user/joe/output/_SUCCESS
Step 12.6: Read the output
$ hadoop fs -cat output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.

Reducer Interface



Reducer interface reduces a set of intermediate values which share a key to a smaller set of values. Once the Map tasks have finished, data is then transferred across the network to the Reducers. Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers.

All values associated with a particular intermediate key are guaranteed to go to the same Reducer. There may be a single Reducer, or multiple Reducers corresponding to a job. Each key is assigned to a partition using a component called the partitioner. The default partitioner implementation is a hash partitioner that takes a hash of the key, modulo the number of configured reducers in the job, to get a partition number. The Hash implementation ensures the hash of the key is always the same on all machines and the same key records are guaranteed to be placed in the same partition. The intermediate data isn’t physically partitioned, only logically so.
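A minimal sketch of the hash partitioning logic described above (equivalent in behaviour to Hadoop's default HashPartitioner):

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the
        // hash of the key modulo the number of configured reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}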

Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
Reducer has 3 primary phases:

1.      Shuffle

The process of moving map outputs to the reducers using HTTP across the network is known as shuffling.

2.      Sort

The framework merge-sorts the Reducer inputs by key, so that the outputs with the same key from different Mappers are brought together.
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class). The sort order is controlled by Job.setSortComparatorClass(Class).
The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
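For illustration, a hedged sketch of how that wiring looks in the driver; the composite "primary#secondary" key format and the NaturalKeyGroupingComparator class are hypothetical, application-defined choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortWiring {

    // Hypothetical grouping comparator: groups reduce() calls by the natural
    // (primary) part of a composite "primary#secondary" Text key.
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(Text.class, true);
        }
        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            String ka = a.toString().split("#")[0];
            String kb = b.toString().split("#")[0];
            return ka.compareTo(kb);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort sketch");
        job.setMapOutputKeyClass(Text.class);        // composite "primary#secondary" key
        job.setMapOutputValueClass(IntWritable.class);
        // The full composite key controls the sort order (optionally tuned via
        // job.setSortComparatorClass(...)); the grouping comparator decides
        // which keys share one reduce() call.
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
    }
}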

3.      Reduce

For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. In the older API the Reducer also receives OutputCollector and Reporter objects as parameters, used in the same manner as in the map() method; in the newer API these are replaced by the single Context object.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The Reducer outputs zero or more final key/value pairs which are written to HDFS. In practice, the Reducer usually emits a single key/value pair for each input key.