Versioning of Hadoop


The main versions or branches of Hadoop are:
1.      Version 0.20.0–0.20.2: - The 0.20 branch of Hadoop is considered the most stable version and is the most commonly used in production. The first release was in April 2009. Cloudera CDH2 and CDH3 are both based on this branch.

2.      0.20-append: - This branch adds support for file appends in HDFS, which was needed by Apache HBase and was missing in version 0.20. The branch carrying the file append feature was called 0.20-append. No official release was ever made from the 0.20-append branch.

3.      0.20-security: - Yahoo, one of the major contributors to Apache Hadoop, invested in adding full Kerberos support to core Hadoop. It later contributed this work back to Hadoop in the form of the 0.20-security branch, a version of Hadoop 0.20 with Kerberos authentication support. This branch was later released as the 0.20.20X releases.

4.      0.20.203–0.20.205: - There was a strong desire within the community to produce an official release of Hadoop that included the 0.20-security work. The 0.20.20X releases contained not only security features from 0.20-security, but also bug fixes and improvements on the 0.20 line of development. Generally, it no longer makes sense to deploy these releases as they’re superseded by 1.0.0.


5.      0.21.0: - The 0.21 branch was cut from Hadoop trunk and released in August 2010. This was considered a developer preview or alpha-quality release to highlight some of the features that were in development at the time. Despite the warning from the Hadoop developers, a small number of users deployed the 0.21 release anyway. This release does not include security, but it does include the append feature.

6.      0.22.0: - In December 2011, the Hadoop community released version 0.22, which, like 0.21, was based on trunk. This release includes security, but only for HDFS. Somewhat unusually, 0.22 was released after 0.23 and with less functionality; this was due to when the 0.22 branch was cut from trunk.

7.      0.23.0: - In November 2011, version 0.23 of Hadoop was released. Also cut from trunk, 0.23 includes security, append, YARN, and HDFS federation. This release has been dubbed a developer preview or alpha-quality release. This line of development is superseded by 2.0.0.

8.      1.0.0: - Version 1.0.0 of Hadoop was released from the 0.20.205 line of development. This means that 1.0.0 does not contain all of the features and fixes found in the 0.21, 0.22, and 0.23 releases. It does, however, include the security features.

9.      1.2.1: - The stable release of the 1.2 line, 1.2.1, was released on 1 August 2013.

10.  2.0.0-alpha: - In May 2012, version 2.0.0 was released from the 0.23.0 branch. Like 0.23.0, it is considered alpha quality, and it is the first version in the hadoop-2.x series. It includes YARN and removes the traditional MRv1 jobtracker and tasktracker daemons. While YARN is API-compatible with MRv1, the underlying implementation is different. This release includes:
o       YARN aka NextGen MapReduce
o       HDFS Federation
o       Performance
o       Wire-compatibility for both HDFS and YARN/MapReduce.
11.  2.1.0-beta: - Hadoop 2.1.0-beta includes the following significant improvements over the previous 1.x stable releases:
·         HDFS Federation
·         MapReduce NextGen aka YARN aka MRv2
·         HDFS HA for NameNode (manual failover)
·         HDFS Snapshots
·         Support for running Hadoop on Microsoft Windows
·         YARN API stabilization
·         Binary Compatibility for MapReduce applications built on hadoop-1.x
·         Substantial integration testing with the rest of the projects in the ecosystem

What is Hadoop



Hadoop is a Java-based open source framework that supports the storage and processing of large data sets in a distributed computing environment. It is designed to run on a large number of commodity hardware machines that share no memory or disks and can scale up or down without system interruption. Hadoop is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop provides three main functions: storage, processing, and resource management. Storage is accomplished with the Hadoop Distributed File System (HDFS), a reliable distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers. Processing (computation) in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated nodes or machines. YARN, introduced in Hadoop 2.0, performs the resource management function and extends MapReduce by supporting non-MapReduce workloads associated with other programming models.
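To make the MapReduce paradigm concrete, below is a minimal word-count sketch in Java, modelled on the standard Hadoop example program (it assumes the Hadoop 2.x org.apache.hadoop.mapreduce API):

// WordCount.java - minimal MapReduce sketch: count the occurrences of each word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word after the shuffle.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a program is submitted with the hadoop jar command, much like the hadoop-examples.jar grep job used in the installation steps later in this document.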



Hadoop was created by Doug Cutting. The underlying technology was invented by Google to index all the rich textual and structural information it was collecting and then present meaningful, actionable results to users. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from it. Yahoo has played a key role in developing Hadoop for enterprise applications.
Although Hadoop is best known for MapReduce, HDFS, and YARN, the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. Below are some of the ASF projects included in a typical Hadoop distribution.
MapReduce: - MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. 
HDFS: - HDFS is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers (a short client sketch follows this list of projects).
Apache Hadoop YARN: - YARN is a next-generation framework for Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
Apache Tez: - Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex Directed Acyclic Graph (DAG) of tasks for near real-time big data processing.
Apache Pig: - A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
Apache HCatalog: - A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Hive: - Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad hoc queries via an SQL-like interface for large datasets stored in HDFS.
Apache HBase: - A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
Apache Mahout: - Mahout provides scalable machine learning algorithms for Hadoop that aid data science tasks such as clustering, classification, and batch-based collaborative filtering.
Apache Accumulo: - Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big Table design that works on top of Apache Hadoop and Apache ZooKeeper.
Apache Flume: - Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
Apache Sqoop: - Sqoop is a tool that speeds and eases the movement of data into and out of Hadoop from and to an RDBMS. It provides a reliable parallel load for various popular enterprise data sources.
Apache ZooKeeper: - A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
Apache Ambari: - An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Oozie: - Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache Falcon: - Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.
Apache Knox: - The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
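Because HDFS is exposed to client applications through a Java API (org.apache.hadoop.fs.FileSystem), the short, self-contained sketch below shows how a client writes a file into HDFS and reads it back. The path is illustrative, and the sketch assumes the cluster configuration (core-site.xml) is available on the classpath:

// HdfsExample.java - write a small file to HDFS and read it back.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // connects to the configured fs.defaultFS

    Path file = new Path("/tmp/hdfs-example.txt"); // illustrative path

    // Write a small file into HDFS (overwriting it if it already exists).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello from HDFS\n");
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}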

Installing Hadoop in pseudo-distributed mode

Step 1: Run the following command to install Hadoop in pseudo-distributed mode from the yum repository

sudo yum install hadoop-0.20-conf-pseudo

Step 2: Verify that the packages are installed properly

rpm -ql hadoop-0.20-conf-pseudo

Step 3: Format the namenode

sudo -u hdfs hdfs namenode -format

Step 4: Stop existing services (as Hadoop was already installed for you, some services might be running)

$ for service in /etc/init.d/hadoop*
> do
>   sudo $service stop
> done

Step 5: Start HDFS

$ for service in /etc/init.d/hadoop-hdfs-*
> do
>   sudo $service start
> done

Step 6: Verify that HDFS has started properly (in the browser)

http://localhost:50070

Step 7: Create the /tmp directory
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
Step 8: Create MapReduce-specific directories
sudo -u hdfs hadoop fs -mkdir /var
sudo -u hdfs hadoop fs -mkdir /var/lib
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred
sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
Step 9: Verify the directory structure
$ sudo -u hdfs hadoop fs -ls -R /
The output should be:
drwxrwxrwt - hdfs supergroup 0 2012-04-19 15:14 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
Step 10: Start MapReduce
$ for service in /etc/init.d/hadoop-0.20-mapreduce-*
> do
>   sudo $service start
> done
Step 11: Verify that MapReduce has started properly (in the browser)
http://localhost:50030
Step 12: Verify that the installation went well by running a program
Step 12.1: Create a home directory on HDFS for the user
sudo -u hdfs hadoop fs -mkdir /user/training
sudo -u hdfs hadoop fs -chown training /user/training
Step 12.2: Make a directory in HDFS called input and copy some XML files into it by running the following commands
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/mapred-site.xml
Step 12.3: Run an example Hadoop job to grep with a regular expression in your input data.
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
Step 12.4: After the job completes, you can find the output in the HDFS directory named output, because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items:
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output
Step 12.5: List the output files
$ hadoop fs -ls output
Found 3 items
drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output/_SUCCESS
Step 12.6: Read the output
$ hadoop fs -cat output/part-00000 | head
1 dfs.datanode.data.dir
1 dfs.namenode.checkpoint.dir
1 dfs.namenode.name.dir
1 dfs.replication
1 dfs.safemode.extension
1 dfs.safemode.min.

How MapReduce Works in Pentaho on Hadoop

Overview
Kettle transformations are used to manipulate data and function as the map, combine, and reduce phases of a MapReduce application. The Kettle engine is pushed down to each task node and is executed for each task. The implementation that supports the data type conversion from Hadoop data types to Kettle data types, the passing of tuples between input/output formats and the Kettle engine, and all associated configuration for the MapReduce job is collectively called Pentaho MapReduce.
Type Mapping
To pass data between Hadoop and Kettle, values must be converted between the Hadoop IO data types and the Kettle data types. Type converters are provided for the following built-in Kettle types (each is mapped to a corresponding Hadoop IO type, such as a Writable):
ValueMetaInterface.TYPE_STRING
ValueMetaInterface.TYPE_BIGNUMBER
ValueMetaInterface.TYPE_DATE
ValueMetaInterface.TYPE_INTEGER
ValueMetaInterface.TYPE_LONG
ValueMetaInterface.TYPE_BOOLEAN
ValueMetaInterface.TYPE_BINARY
Defining your own Type Converter
The Type Converter system is pluggable, so additional data types can be supported as required by custom Input/Output formats. The Type Converter SPI is a simple interface to implement: org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter. The Service Locator pattern (specifically, Java's ServiceLoader) is used to resolve the available converters at runtime. Providing your own is as easy as implementing ITypeConverter and providing a META-INF/services/org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter file with your implementation listed, both packaged into a jar placed in the plugins/pentaho-big-data-plugin/lib directory. The default implementations ship with the Big Data Plugin.
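To make the registration mechanics concrete, here is a self-contained sketch of the Java ServiceLoader pattern that the Type Converter system relies on. The interface and class names below are hypothetical stand-ins, not the Pentaho API: a real converter implements org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter (check that interface for its exact methods) and is listed in a META-INF/services/org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter file inside the jar placed in plugins/pentaho-big-data-plugin/lib.

// --- File: com/example/spi/ExampleConverter.java (hypothetical stand-in for the SPI) ---
package com.example.spi;

public interface ExampleConverter {
  boolean canConvert(Class<?> from, Class<?> to);
  Object convert(Object value);
}

// --- File: com/example/spi/UpperCaseConverter.java (one provider implementation) ---
package com.example.spi;

public class UpperCaseConverter implements ExampleConverter {
  public boolean canConvert(Class<?> from, Class<?> to) {
    return CharSequence.class.isAssignableFrom(from) && to == String.class;
  }
  public Object convert(Object value) {
    return value.toString().toUpperCase();
  }
}

// --- File: META-INF/services/com.example.spi.ExampleConverter ---
// (a single line naming the provider class)
// com.example.spi.UpperCaseConverter

// --- File: com/example/spi/ConverterLookup.java (how providers are resolved at runtime) ---
package com.example.spi;

import java.util.ServiceLoader;

public class ConverterLookup {
  public static void main(String[] args) {
    // ServiceLoader scans META-INF/services entries on the classpath and
    // instantiates each listed provider -- the same lookup the plugin performs.
    for (ExampleConverter c : ServiceLoader.load(ExampleConverter.class)) {
      System.out.println("Found converter: " + c.getClass().getName());
    }
  }
}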
Distributed Cache
Pentaho MapReduce relies on Hadoop's Distributed Cache to distribute the Kettle environment, configuration, and plugins across the cluster. By leveraging the Distributed Cache, network traffic is reduced for subsequent executions, as the Kettle environment is automatically configured on each node. This also allows you to use multiple versions of Kettle against a single cluster.
How it works
Hadoop's Distributed Cache is a mechanism to distribute files into the working directory of each map and reduce task. The origin of these files is HDFS. Pentaho MapReduce will automatically configure the job to use a Kettle environment from HDFS (configured via pmr.kettle.installation.id, see ConfigurationOptions). If the desired Kettle environment does not exist, Pentaho MapReduce will take care of "installing" it in HDFS before executing the job.
The default Kettle environment installation path within HDFS is pmr.kettle.dfs.install.dir/$id, where $id is a uniquely identifying string (the installation id); it can easily be a custom build that is tailored for a specific set of jobs.
The Kettle environment is staged to HDFS at pmr.kettle.dfs.install.dir/pmr.kettle.installation.id as follows:
1.     The contents of plugins/pentaho-big-data-plugin/pentaho-mapreduce-libraries.zip are extracted into HDFS at hdfs://{pmr.kettle.dfs.install.dir}/{pmr.kettle.installation.id}
2.     The Big Data Plugin contents are copied into pmr.kettle.installation.id/plugins/
a.     Only the active Hadoop configuration is copied, and specifically:
i.        The active Hadoop configuration's client-only libraries are not copied (config/lib/client)
ii.        The active Hadoop configuration's "pmr" specific libraries are copied into the main hdfs://{pmr.kettle.dfs.install.dir}/{pmr.kettle.installation.id}/lib/ directory of the installation. This allows the Hadoop configuration to provide libraries that are accessible within an Input or Output format (or otherwise outside of the standard transformation execution environment); this is necessary, for example, for reading directly out of HBase using the HBase TableInputFormat.
Configuration options
Pentaho MapReduce can be configured through the plugin.properties file found in the plugin's base directory, or overridden per Pentaho MapReduce job entry if the properties are defined in the User Defined properties tab.
The currently supported configuration properties are:
pmr.kettle.installation.id - Version of Kettle to use from the Kettle HDFS installation directory. If not set, a unique id is generated from the version of Kettle, the Big Data Plugin version, and the Hadoop configuration used to communicate with the cluster and submit the Pentaho MapReduce job.
pmr.kettle.dfs.install.dir - Installation path in HDFS for the Kettle environment used to execute a Pentaho MapReduce job. This can be a relative path, anchored to the user's home directory, or an absolute path if it starts with a /.
pmr.libraries.archive.file - Pentaho MapReduce Kettle environment runtime archive to be preloaded into pmr.kettle.dfs.install.dir/pmr.kettle.installation.id.
pmr.kettle.additional.plugins - Comma-separated list of additional plugins (by directory name) to be installed with the Kettle environment, e.g. "steps/DummyPlugin,my-custom-plugin".
Customizing the Kettle Environment used by Pentaho MapReduce
The installation environment used by Pentaho MapReduce will be installed to pmr.kettle.dfs.install.dir/pmr.kettle.installation.id when the Pentaho MapReduce job entry is executed. If the installation already exists, no modifications will be made and the job will use the environment as is. That means any modifications after the initial run, or any custom pre-loading of a Kettle environment, will be used as is by Pentaho MapReduce.
Customizing the libraries used in a fresh Kettle environment install into HDFS
The pmr.libraries.archive.file contents are copied into HDFS at pmr.kettle.dfs.install.dir/pmr.kettle.installation.id. To make changes for initial installations, you must edit the archive referenced by this property.
1.     Unzip pentaho-mapreduce-libraries.zip; it contains a single lib/ directory with the required Kettle dependencies
2.     Copy additional libraries to the lib/ directory
3.     Zip up the lib/ directory into pentaho-mapreduce-libraries-custom.zip so the archive contains the lib/ with all jars within it (you may create subdirectories within lib/. All jars found in lib/ and its subdirectories will be added to the classpath of the executing job.)
4.     Update plugin.properties and set the following properties:
pmr.kettle.installation.id=custom
pmr.libraries.archive.file=pentaho-mapreduce-libraries-custom.zip
The next time you execute Pentaho MapReduce the custom Kettle environment will be copied into HDFS at pmr.kettle.dfs.install.dir/custom and used when executing the job. You can switch between Kettle environments by specifying the pmr.kettle.installation.id property as a User Defined property per Pentaho MapReduce job entry or globally in the plugin.properties file*.
*Note: Only if the installation referenced by pmr.kettle.installation.id does not exist will the archive file and additional plugins currently configured be used to "install" it into HDFS.
Customizing an existing Kettle environment in HDFS
You can customize an existing Kettle environment install in HDFS by copying jars and plugins into HDFS. This can be done manually (hadoop fs -copyFromLocal <localsrc> ... <dst>) or with the Hadoop Copy Files job entry.
See Appendix B for the supported directory structure in HDFS.
Adding JDBC drivers to the Kettle environment
JDBC drivers and their required dependencies must be placed in the installation directory's lib/ directory.
Upgrading from the Pentaho Hadoop Distribution (PHD)
The PHD is no longer required and can be safely removed. If you have modified your Pentaho Hadoop Distribution installation you may wish to preserve these files so that the new Distributed Cache mechanism can take advantage of them. To do so follow the instructions above: Customizing the Kettle Environment used by Pentaho MapReduce.
If you're using a version of the Pentaho Hadoop Distribution (PHD) that allows you to configure the installation directory via mapred-site.xml, perform the following on all TaskTracker nodes:
1.     Remove the pentaho.* properties from your mapred-site.xml
2.     Remove the directories those properties referenced
3.     Restart the TaskTracker process
Appendix A: pentaho-mapreduce-libraries.zip structure
pentaho-mapreduce-libraries.zip/
  `- lib/
      +- kettle-core-{version}.jar
      +- kettle-engine-{version}.jar
      `- .. (all other required Kettle dependencies and optional jars)
Appendix B: Example Kettle environment installation directory structure within DFS
/opt/pentaho/mapreduce/
  +- 4.3.0/
  |   +- lib/
  |   |   +- kettle-core-{version}.jar
  |   |   +- kettle-engine-{version}.jar
  |   |   +- .. (Any files in the active Hadoop configuration's {{lib/pmr/}} directory)
  |   |   `- .. (all other required Kettle dependencies and optional jars - including JDBC drivers)
  |   `- plugins/
  |       +- pentaho-big-data-plugin/
  |       |  `- hadoop-configurations/
  |       |     `- hadoop-20/ (the active Hadoop configuration used to communicate with the cluster)
  |       |        +- lib/ (the {{lib/pmr/}} and {{lib/client/}} directories are omitted here)
  |       |        `- .. (all other jars)
  |       `- .. (additional optional plugins)
  `- custom/
      +- lib/
      |   +- kettle-core-{version}.jar
      |   +- kettle-engine-{version}.jar
      |   +- my-custom-code.jar
      |   `- .. (all other required Kettle dependencies and optional jars - including JDBC drivers)
      `- plugins/
          +- pentaho-big-data-plugin/
          |   ..
          `- my-custom-plugin/
              ..
NB. This documentation is maintained by the Pentaho community.