Creating Your First MapReduce Program

Opening the New Java Project wizard

The New Java Project wizard can be used to create a new Java project. There are several ways to open this wizard:
  • By clicking on the File menu and choosing New > Java Project
  • By right-clicking anywhere in the Project Explorer and selecting New > Java Project
  • By clicking on the New button in the toolbar and selecting Java Project

Using the New Java Project wizard

The New Java Project Wizard has two pages.
On the first page:
  • Enter the Project Name
  • Select the Java Runtime Environment (JRE) or leave it at the default
  • Select the Project Layout, which determines whether there will be separate folders for source files and class files. The recommended option is to create separate folders for sources and class files.

    You can click the Finish button to create the project, or click the Next button to change the Java build settings.
    On the second page you can change the Java build settings, such as setting project dependencies (if there are multiple projects) and adding additional JAR files to the build path.

    Writing the Mapper Class

    To start with some basic MapReduce code, we will write a Word Count program that counts the occurrences of each word in a file and writes the result as output.

    First, we write the mapper class, WordCountMapper:


    package com.hadoop.training;

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import java.util.StringTokenizer;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into tokens and emit (word, 1) for each token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    Writing the Reducer Class

    Next, we write the reducer class, WordCountReducer:
     
    package com.hadoop.training;

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts emitted for this word and write the total
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }


    Writing the MapReduce Driver Class

    Finally, we write the driver class, WordCount:

    package com.hadoop.training;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static void main(String[] args) throws Exception {

            if (args.length != 2) {
                System.err.println("Usage: WordCount <input path> <output path>");
                System.exit(-1);
            }

            Job job = Job.getInstance();
            job.setJarByClass(WordCount.class);
            job.setJobName("Word Count");

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
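
    Optionally, because word counting is both commutative and associative, the same WordCountReducer class can also be used as a combiner so that partial sums are computed on the map side and less data is shuffled. A minimal addition to the driver would be the following line, placed alongside the other job.set... calls:

    // Optional: run the reducer as a combiner on the map side
    job.setCombinerClass(WordCountReducer.class);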

    Running the MapReduce Program


    $ hadoop jar WC.jar com.hadoop.training.WordCount hdfs://localhost:8020/user/rajeev/input hdfs://localhost:8020/user/rajeev/output
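
    After the job completes, you can inspect the result directly from HDFS. The paths below simply mirror the output argument used above, and part-r-00000 is the default name of the first reducer's output file:

    $ hadoop fs -ls hdfs://localhost:8020/user/rajeev/output
    $ hadoop fs -cat hdfs://localhost:8020/user/rajeev/output/part-r-00000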

Eclipse Installation in Ubuntu

  1. Open a terminal (Ctrl-Alt-T) and switch it to root permissions by entering:
    $ sudo su
  2. Make sure Eclipse Indigo is NOT installed in your Ubuntu. You may need to remove both the "eclipse" and "eclipse-platform" packages to get rid of it. If it still gets in the way when trying to install Luna using this easy way, you may need to look at the "hard way" below.
    # apt-get remove eclipse eclipse-platform
  3. Install a Java 1.7 JDK:
    # apt-get install openjdk-7-jdk
  4. Install Maven:
    apt-get install maven
  5. Get rid of the root access as you won't need it anymore:
    # exit
  6. Download Eclipse. The "for Java EE Developers", "for Java Developers" and "for RCP and RAP Developers" versions all seem to work. The file that was tested to work (note that it is for the 64-bit Ubuntu version) is available at this page
  7. Extract the Eclipse installation tarball into your home directory:
    $ cd
    $ tar -xzvf <path/to/your-tar-file>
  8. Increase the memory for the Eclipse installation by modifying the ~/eclipse/eclipse.ini file.
    • Change the -Xmx setting (line 20) to at least 1 GB, recommended 2 GB (i.e. -Xmx2048m).
    • Change the -XX:MaxPermSize setting (line 18) to at most 512m. If you have the -Xmx setting set to 1 GB, then I suggest using a lower value, for example 300m.
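    For reference, after these edits the relevant lines of ~/eclipse/eclipse.ini would look roughly like this (the exact line numbers and surrounding options vary between Eclipse packages):
      -XX:MaxPermSize=300m
      -Xmx2048m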
  9. Run the Eclipse:
    $ ~/eclipse/eclipse
  10. If everything seems to work, configure Eclipse to have a desktop icon.
          Paste the command below into the terminal and hit Enter:
     gksudo gedit /usr/share/applications/eclipse.desktop
 
          The above command will create and open the launcher file for Eclipse with the gedit text editor.
          Paste the content below into the opened file and save it.
       [Desktop Entry]
       Name=Eclipse 4
       Type=Application
       Exec=/home/rajeev/eclipse/eclipse
       Terminal=false
       Icon=/home/rajeev/eclipse/icon.xpm
       Comment=Integrated Development Environment
       NoDisplay=false
       Categories=Development;IDE;
       Name[en]=Eclipse

 

Splunk Installation in Ubuntu

sudo dpkg -i Downloads/splunk-6.2.3-264376-linux-2.6-amd64.deb
sudo /opt/splunk/bin/splunk start
http://localhost:8000

Splunk Impala
Splunk Hadoop Connect
[more info]

Installing R in Ubuntu Trusty

Step 1 :- Add the latest trusty link from cran to apt. [click here for reference]
sudo gedit /etc/apt/sources.list

deb http://cran.r-project.org/bin/linux/ubuntu/ trusty/

Step 2 :- Add the secure key to authenticate the newly added repository [click here for more info]

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9





Step 3 :- Update apt with the following command

sudo apt-get update


Step 4 :- Now run the commands below to install R
sudo apt-get install r-base
sudo apt-get install r-base-dev



Step 5 :- Now type R in the shell to get into the R command prompt

R


Installing R studio [click here and get started for more info]

$ sudo apt-get install libjpeg62
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.1103-amd64.deb
$ sudo gdebi rstudio-server-0.98.1103-amd64.deb



http://localhost:8787

Installing Impala ODBC Driver in Ubuntu 64 bit

As of now, Cloudera does not provide a Debian package for the Impala ODBC driver, so I downloaded the RPM file for SUSE 11 64-bit and then converted it to a Debian package using the commands below.

sudo apt-get install alien dpkg-dev debhelper build-essential
 
sudo alien ClouderaImpalaODBC-2.5.26.1027-1.x86_64.rpm
  
Now we will install the driver using the command:-

sudo dpkg -i clouderaimpalaodbc_2.5.26.1027-2_amd64.deb

Configuring ODBC Driver:-

Step 1 :- Edit the .bashrc file and add the following entries

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/odbc
export ODBCINI=/etc/odbc.ini
export ODBCSYSINI=/etc
export CLOUDERAIMPALAINI=/opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
[Check the proper location of the odbc.ini file using the command odbcinst -j.
You can use three environment variables, ODBCINI, ODBCSYSINI, and CLOUDERAIMPALAINI, to specify
different locations for the odbc.ini, odbcinst.ini, and cloudera.impalaodbc.ini configuration files by doing the following:
  • Set ODBCINI to point to your odbc.ini file.
  • Set ODBCSYSINI to point to the directory containing the odbcinst.ini file.
  • Set CLOUDERAIMPALAINI to point to your cloudera.impalaodbc.ini file.
For example, if your odbc.ini and odbcinst.ini files are located in /etc and your
cloudera.impalaodbc.ini file is located in /opt/cloudera/impalaodbc/lib/64, then set the environment variables as in the exports above.
]
Step 2 :- ODBC driver managers use configuration files to define and configure ODBC data sources and
drivers. By default, the following configuration files residing in the user's home directory are used:
  • .odbc.ini is used to define ODBC data sources, and it is required.
  • .odbcinst.ini is used to define ODBC drivers, and it is optional.

Also, by default the Cloudera ODBC Driver for Impala is configured using the
cloudera.impalaodbc.ini file, which is located in

  • /opt/cloudera/impalaodbc/lib/64 for the 64-bit driver on Linux/AIX

Step 3 :- Configuring the odbc.ini File
ODBC Data Source Names (DSNs) are defined in the odbc.ini configuration file. The file is divided
into several sections:
  • [ODBC] is optional and used to control global ODBC configuration, such as ODBC tracing.
  • [ODBC Data Sources] is required, listing DSNs and associating DSNs with a driver.
  • A section having the same name as the data source specified in the [ODBC Data Sources] section is required to configure the data source.
The following is an example of an odbc.ini configuration file for Linux/AIX:

[ODBC Data Sources]
Sample_Cloudera_Impala_DSN_64=Cloudera Impala ODBC Driver 64-bit
[Sample_Cloudera_Impala_DSN_64]
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST=localhost
PORT=21050


To create a Data Source Name:
1. Open the .odbc.ini configuration file in a text editor.
2. In the [ODBC Data Sources] section, add a new entry by typing the Data Source Name (DSN),
then an equal sign (=), and then the driver name.
3. In the .odbc.ini file, add a new section with a name that matches the DSN you specified in
step 2, and then add configuration options to the section. Specify configuration options as
key-value pairs.
4. Save the .odbc.ini configuration file.

Step 4 :- Configuring the odbcinst.ini File
ODBC drivers are defined in the odbcinst.ini configuration file. The configuration file is optional
because drivers can be specified directly in the odbc.ini configuration file.
The odbcinst.ini file is divided into the following sections:
 [ODBC Drivers] lists the names of all the installed ODBC drivers.
 A section having the same name as the driver name specified in the [ODBC Drivers] section
lists driver attributes and values.
The following is an example of an odbcinst.ini configuration file for Linux/AIX:

[ODBC Drivers]

Cloudera Impala ODBC Driver 64-bit=Installed
[Cloudera Impala ODBC Driver 64-bit]
Description=Cloudera Impala ODBC Driver (64-bit)
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so

To define a driver:
1. Open the .odbcinst.ini configuration file in a text editor.
2. In the [ODBC Drivers] section, add a new entry by typing the driver name and then typing
=Installed
3. In the .odbcinst.ini file, add a new section with a name that matches the driver name you
typed in step 2, and then add configuration options to the section based on the sample
odbcinst.ini file provided in the Setup directory. Specify configuration options as key-value
pairs.
4. Save the .odbcinst.ini configuration file.

Step 5 :- Configuring the cloudera.impalaodbc.ini File
The cloudera.impalaodbc.ini file contains configuration settings for the Cloudera ODBC Driver for
Impala. Settings that you define in the cloudera.impalaodbc.ini file apply to all connections that use the driver.

To configure the Cloudera ODBC Driver for Impala to work with your ODBC driver manager:
1. Open the cloudera.impalaodbc.ini configuration file in a text editor.
2. Edit the DriverManagerEncoding setting. The value is usually UTF-16 or UTF-32 if you are
using Linux/Mac OS X, depending on the ODBC driver manager you use. iODBC uses UTF-32,
and unixODBC uses UTF-16.
OR
If you are using AIX and the unixODBC driver manager, then set the value to UTF-16. If you
are using AIX and the iODBC driver manager, then set the value to UTF-16 for the 32-bit
driver or UTF-32 for the 64-bit driver.
3. Edit the ODBCInstLib setting. The value is the name of the ODBCInst shared library for the
ODBC driver manager you use. To determine the correct library to specify, refer to your
ODBC driver manager documentation.
The configuration file defaults to the shared library for iODBC. In Linux/AIX, the shared
library name for iODBC is libiodbcinst.so.
4. Optionally, configure logging by editing the LogLevel and LogPath settings. For more
information, see "Configuring Logging Options" on page 28.
5. Save the cloudera.impalaodbc.ini configuration file.
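
For illustration only, a unixODBC setup on 64-bit Linux like the one above would typically end up with settings along these lines in cloudera.impalaodbc.ini (the [Driver] section name and exact keys should be verified against the sample file shipped with the driver):

[Driver]
DriverManagerEncoding=UTF-16
ODBCInstLib=libodbcinst.so
LogLevel=0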

Step 6 :- Check the entry and configuration of ODBC by typing

odbcinst  -q -s
   
 isql -v Sample_Cloudera_Impala_DSN_64
 
Troubleshooting :-
 
I got one error like:
[S1000][unixODBC][Cloudera][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function.
 
This means that the driver is not linked to libodbcinst.so.
Check this first with the command:
ldd /opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
  
Then search for libodbcinst.so:
 
find / -name "libodbcinst.so*" 
If it is not found, install it:

sudo apt-get update && sudo apt-get install unixodbc-dev libmyodbc
or
 
sudo apt-get install unixodbc unixodbc-dev
 
Then again try to search for libodbcinst.so
 
and make entry in .bashrc as
 
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
 
For more info click here and here and here and here. 

Hue Installation and Configuration in Ubuntu

Step 1 :- Install Hue


On Ubuntu or Debian systems:
  • On the Hue Server machine, install the hue package:
$ sudo apt-get install hue 
  • For MRv1: on the system that hosts the JobTracker, if different from the Hue server machine, install the hue-plugins package:
$ sudo apt-get install hue-plugins

Step 2 :- Configuring Hue

2.1. For WebHDFS only:

   2.1.1.  Add the following property in hdfs-site.xml to enable WebHDFS in the NameNode and DataNodes:

    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>

    Restart your HDFS cluster.
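
    On a CDH package installation like this one, restarting HDFS typically means restarting the NameNode and DataNode services; the service names below are the standard CDH package names:

    $ sudo service hadoop-hdfs-namenode restart
    $ sudo service hadoop-hdfs-datanode restart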

Configure Hue as a proxy user for all other users and groups, meaning it may submit a request on behalf of any other user:

2.1.2. WebHDFS: Add to core-site.xml:

<!-- Hue WebHDFS proxy user setting -->
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>

2.1.3. With root privileges, update hadoop.hdfs_clusters.default.webhdfs_url in hue.ini to point to the address of either WebHDFS or HttpFS.

[hadoop]
[[hdfs_clusters]]
[[[default]]]
# Use WebHdfs/HttpFs as the communication mechanism.

WebHDFS:

...
webhdfs_url=http://FQDN:50070/webhdfs/v1/

2.2. MRv1 Configuration

Hue communicates with the JobTracker via the Hue plugin, which is a .jar file that should be placed in your MapReduce lib directory.

2.2.1. If your JobTracker and Hue Server are located on the same host, copy the file over. If you are currently using CDH 4, your MapReduce library directory might be in /usr/lib/hadoop/lib.

$ cd /usr/lib/hue
$ cp desktop/libs/hadoop/java-lib/hue-plugins-*.jar /usr/lib/hadoop-0.20-mapreduce/lib

If your JobTracker runs on a different host, scp the Hue plugins .jar file to the JobTracker host.

2.2.2. Add the following properties to mapred-site.xml:

<property>
  <name>jobtracker.thrift.address</name>
  <value>0.0.0.0:9290</value>
</property>
<property>
  <name>mapred.jobtracker.plugins</name>
  <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
  <description>Comma-separated list of jobtracker plug-ins to be activated.</description>
</property>

You can confirm that the plugins are running correctly by tailing the daemon logs:

$ tail --lines=500 /var/log/hadoop-0.20-mapreduce/hadoop*jobtracker*.log | grep ThriftPlugin
2009-09-28 16:30:44,337 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: Starting Thrift server
2009-09-28 16:30:44,419 INFO org.apache.hadoop.thriftfs.ThriftPluginServer:
Thrift server listening on 0.0.0.0:9290

2.3. Hive Configuration

The Beeswax daemon has been replaced by HiveServer2. Hue should therefore point to a running HiveServer2. This change involved the following major updates to the [beeswax] section of the Hue configuration file, hue.ini.

[beeswax]
  # Host where Hive server Thrift daemon is running.
  # If Kerberos security is enabled, use fully-qualified domain name (FQDN).
  ## hive_server_host=<FQDN of HiveServer2>

  # Port where HiveServer2 Thrift server runs on.
  ## hive_server_port=10000

Existing Hive Installation

In the Hue configuration file hue.ini, modify hive_conf_dir to point to the directory containing hive-site.xml.

2.4. HADOOP_CLASSPATH

If you are setting $HADOOP_CLASSPATH in your hadoop-env.sh, be sure to set it in such a way that user-specified options are preserved. For example:

Correct:

# HADOOP_CLASSPATH=<your_additions>:$HADOOP_CLASSPATH

Incorrect:

# HADOOP_CLASSPATH=<your_additions>

This enables certain components of Hue to add to Hadoop's classpath using the environment variable.

2.5. hadoop.tmp.dir

If your users are likely to be submitting jobs both using Hue and from the same machine via the command line interface, they will be doing so as the hue user when they are using Hue and via their own user account when they are using the command line. This leads to some contention on the directory specified by hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}. Specifically, hadoop.tmp.dir is used to unpack JARs in /usr/lib/hadoop. One workaround for this is to set hadoop.tmp.dir to /tmp/hadoop-${user.name}-${hue.suffix} in the core-site.xml file:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}-${hue.suffix}</value>
</property>

Unfortunately, when the hue.suffix variable is unset, you'll end up with directories named /tmp/hadoop-user.name-${hue.suffix} in /tmp. Despite that, Hue will still work.


Step 3 :- The complete hue.ini configuration is available in the Cloudera documentation.

Oozie Installation in Ubuntu

Step 1 :- To install the Oozie server package on Ubuntu and other Debian systems:
$ sudo apt-get install oozie
Step 2 :- To install the Oozie client package on Ubuntu and other Debian systems:
$ sudo apt-get install oozie-client

Step 3 :- Configuring which Hadoop Version to Use

To use MRv1 (without SSL):
alternatives --set oozie-tomcat-conf /etc/oozie/tomcat-conf.http.mr1

Step 4 :- Edit /etc/oozie/conf/oozie-env.sh file and make the entry


export CATALINA_BASE=/var/lib/oozie/tomcat-deployment

Step 5 :- Start the Oozie server

$ sudo service oozie start


Step 6 :- Accessing the Oozie Server with the Oozie Client

The Oozie client is a command-line utility that interacts with the Oozie server via the Oozie web-services API.

Use the /usr/bin/oozie script to run the Oozie client.

For example, if you want to invoke the client on the same machine where the Oozie server is running:

$ oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL

To make it convenient to use this utility, set the environment variable OOZIE_URL to point to the URL of the Oozie server. Then you can skip the -oozie option.

For example, if you want to invoke the client on the same machine where the Oozie server is running, set the OOZIE_URL to http://localhost:11000/oozie.

$ export OOZIE_URL=http://localhost:11000/oozie
$ oozie admin -version
Oozie server build version: 4.0.0-cdh5.0.0


Step 7 :- Configuring MySQL for Oozie

Step 1: Create the Oozie database and Oozie MySQL user.

For example, using the MySQL mysql command-line tool:
$ mysql -u root -p
Enter password: ******

mysql> create database oozie;
Query OK, 1 row affected (0.03 sec)

mysql>  grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';
Query OK, 0 rows affected (0.03 sec)

mysql>  grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';
Query OK, 0 rows affected (0.03 sec)

mysql> exit
Bye

Step 2: Configure Oozie to use MySQL.

Edit properties in the oozie-site.xml file as follows:
...
    <property>
        <name>oozie.service.JPAService.jdbc.driver</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.url</name>
        <value>jdbc:mysql://localhost:3306/oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.username</name>
        <value>oozie</value>
    </property>
    <property>
        <name>oozie.service.JPAService.jdbc.password</name>
        <value>oozie</value>
    </property>
    ...

Step 3 : Creating the Oozie Database Schema

$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -sqlfile oozie-create.sql
 
or
 
$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run
 

Step 4 : Enabling the Oozie Web Console

To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows.

Step 4.1: Download the Library

Download the ExtJS version 2.2 library from http://archive.cloudera.com/gplextras/misc/ext-2.2.zip and place it in a convenient location.

Step 4.2: Install the Library

Extract the ext-2.2.zip file into /var/lib/oozie

$ cd Downloads/
$ sudo cp -avr ext-2.2 /var/lib/oozie/

Step 5 : Installing the Oozie Shared Library in Hadoop HDFS

The Oozie installation bundles the Oozie shared library, which contains all of the necessary JARs to enable workflow jobs to run streaming, DistCp, Pig, Hive, and Sqoop actions.
The Oozie installation bundles two shared libraries, one for MRv1 and one for YARN. Make sure you install the right one for the MapReduce version you are using:
  • The shared library file for MRv1 is oozie-sharelib-mr1.tar.gz.
  • The shared library file for YARN is oozie-sharelib-yarn.tar.gz.
 
sudo -u oozie oozie  admin -shareliblist -oozie http://localhost:11000/oozie
sudo service oozie restart

To install the Oozie shared library in Hadoop HDFS, in the oozie user's home directory:

$ sudo -u hdfs hadoop fs -mkdir /user/oozie
$ sudo -u hdfs hadoop fs -chown oozie:oozie /user/oozie
$ sudo oozie-setup sharelib create -fs hdfs://localhost:8020 -locallib /usr/lib/oozie/oozie-sharelib-mr1

Add the property below to oozie-site.xml so that the shared library functionality is recognized:

<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/etc/hadoop/conf</value>
<description>
Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of
the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is
used when there is no exact match for an authority. The HADOOP_CONF_DIR contains
the relevant Hadoop *-site.xml files. If the path is relative, it is looked up within
the Oozie configuration directory; though the path can be absolute (i.e. it can point
to Hadoop client conf/ directories in the local filesystem).
</description>
</property>

Configuring Support for Oozie Uber JARs

An uber JAR is a JAR that contains other JARs with dependencies in a lib/ folder inside the JAR. You can configure the cluster to handle uber JARs properly for the MapReduce action (as long as it does not include any streaming or pipes) by setting the following property in the oozie-site.xml file:
...
    <property>
        <name>oozie.action.mapreduce.uber.jar.enable</name>
        <value>true</value>
    </property>
    ...
When this property is set, users can use the oozie.mapreduce.uber.jar configuration property in their MapReduce workflows to notify Oozie that the specified JAR file is an uber JAR.
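
For reference, a sketch of how a workflow's map-reduce action might then reference an uber JAR (the HDFS path below is only a placeholder):

<action name="mr-uber-example">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>oozie.mapreduce.uber.jar</name>
                <value>hdfs://localhost:8020/user/rajeev/lib/my-uber.jar</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>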

Configuring Oozie to Run against a Federated Cluster

To run Oozie against a federated HDFS cluster using ViewFS, configure the oozie.service.HadoopAccessorService.supported.filesystems property in oozie-site.xml as follows:
<property>
     <name>oozie.service.HadoopAccessorService.supported.filesystems</name>
     <value>hdfs,viewfs</value>
</property>



 

Troubleshooting

If Oozie fails to start because it cannot find the MySQL JDBC driver, copy the connector JAR into the Oozie library directory:

sudo cp mysql-connector-java-5.1.35-bin.jar /var/lib/oozie/




[link]

Impala Installation in Ubuntu

Step 1 :- Install Impala

$ sudo apt-get install impala             # Binaries for daemons
$ sudo apt-get install impala-server      # Service start/stop script
$ sudo apt-get install impala-state-store # Service start/stop script
$ sudo apt-get install impala-catalog     # Service start/stop script


Step 2 :- Copy the client hive-site.xml, core-site.xml, hdfs-site.xml, and hbase-site.xml configuration files to the Impala configuration directory, which defaults to /etc/impala/conf. Create this directory if it does not already exist. 

$ sudo cp /etc/hadoop/conf/*.xml  /etc/impala/conf
$ sudo cp /etc/hive/conf/*.xml  /etc/impala/conf
$ sudo cp /etc/hbase/conf/*.xml  /etc/impala/conf

Step 3 :- Use the following command to install impala-shell on the machines from which you want to issue queries. You can install impala-shell on any supported machine that can connect to DataNodes that are running impalad.


$ sudo apt-get install impala-shell

Step 4 :- Post installation configuration

4.1. To configure DataNodes for short-circuit reads with CDH 4.2 or later:
On all Impala nodes, configure the following properties in Impala's copy of hdfs-site.xml as shown: 
[
Short-circuit reads make use of a UNIX domain socket. This is a special path in the filesystem that allows the client and the DataNodes to communicate. You will need to set a path to this socket. The DataNode needs to be able to create this path. On the other hand, it should not be possible for any user except the hdfs user or root to create this path. For this reason, paths under /var/run or /var/lib are often used.
Short-circuit local reads need to be configured on both the DataNode and the client.
]
$ sudo gedit /etc/impala/conf/hdfs-site.xml

<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>

<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>

<property>
    <name>dfs.client.file-block-storage-locations.timeout.millis</name>
    <value>10000</value>
</property>
 
 
 
[Note: The text _PORT appears just as shown; you do not need to substitute a number.
If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root or hdfs.
This is a path to a UNIX domain socket that will be used for communication between the
DataNode and local HDFS clients. If the string "_PORT" is present in this path, it will
be replaced by the TCP port of the DataNode.]
[
<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hdfs-sockets/dn</value>
</property>
this configuration also works 
]


 

To enable block location tracking:
For each DataNode, add the following to the hdfs-site.xml file:

<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property> 
 


 
4.2. Set IMPALA_CONF_DIR environment variable
 
$ sudo gedit .bashrc
 
export IMPALA_CONF_DIR=/etc/impala/conf 
 

 
4.3. Modify the hdfs-site.xml file in /etc/hadoop/conf as below
 
<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>

<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>

<property>
    <name>dfs.client.file-block-storage-locations.timeout.millis</name>
    <value>10000</value>
</property>
 
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>  

[
Mandatory: Block Location Tracking
Enabling block location metadata allows Impala to know on which disks data blocks are located, allowing better utilization of the underlying disks. Impala will not start unless this setting is enabled.
] 
 
Restart all the DataNodes.
 
 Start the statestore service using a command similar to the following:

$ sudo service impala-state-store start

Start the catalog service using a command similar to the following:

$ sudo service impala-catalog start

Start the Impala service on each data node using a command similar to the following:

$ sudo service impala-server start
 
Log in to Impala Shell
 
impala-shell -i localhost 
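
Once connected, a quick sanity check from the Impala shell (INVALIDATE METADATA refreshes Impala's view of tables created through Hive):

SHOW DATABASES;
INVALIDATE METADATA;
SHOW TABLES;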
 

Step 5 :- Configuring Impala with ODBC is covered in the "Installing Impala ODBC Driver in Ubuntu 64 bit" section above, and starting Impala is covered by the service commands in Step 4.

HBASE Installation in Ubuntu

Step 1 :- Install HBASE

$ sudo apt-get install hbase


Step 2 :- To list the installed files on Ubuntu and Debian systems:


$ dpkg -L hbase

Step 3 :- Enable Java-based client access

$ sudo gedit .bashrc
export CLASSPATH=$CLASSPATH:/usr/lib/hbase/*:.
export CLASSPATH=$CLASSPATH:/usr/lib/hbase/lib/*:.
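
With the classpath in place, a minimal sketch of a Java client is shown below. It assumes the HBase 1.x client API (older 0.9x releases use HTable and Put.add instead) and a hypothetical table named test with a column family cf, created beforehand, e.g. with create 'test','cf' in the hbase shell.

package com.hadoop.training;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientTest {

    public static void main(String[] args) throws IOException {
        // Picks up hbase-site.xml from the classpath configured in Step 3
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("test"))) {
            // Write one cell, then read it back
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("hello hbase"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))));
        }
    }
}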

Step 4 :- Setting the ulimit for the users

$ sudo gedit /etc/security/limits.conf

hdfs  -       nofile  32768
hdfs  -       nproc   2048
hbase -       nofile  32768
hbase -       nproc   2048
 
To apply the changes in /etc/security/limits.conf on Ubuntu and Debian systems, add the following line in the /etc/pam.d/common-session file:

session required pam_limits.so
 

Step 5 :- Using dfs.datanode.max.transfer.threads with HBase


A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The upper bound is controlled by the dfs.datanode.max.transfer.threads property (the property is spelled in the code exactly as shown here). Before loading, make sure you have configured the value for dfs.datanode.max.transfer.threads in the conf/hdfs-site.xml file (by default found in /etc/hadoop/conf/hdfs-site.xml) to at least 4096 as shown below:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>

Step 6 :- Installing the HBase Master

$ sudo apt-get install hbase-master

Step 7 :- Configuring HBase in Pseudo-Distributed Mode

7.1.  Modifying the HBase Configuration
To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. The hbase.cluster.distributed property directs HBase to start each process in a separate JVM. The hbase.rootdir property directs HBase to store its data in an HDFS filesystem, rather than the local filesystem. Be sure to replace myhost with the hostname of your HDFS NameNode (as specified by fs.default.name or fs.defaultFS in your conf/core-site.xml file); you may also need to change the port number from the default (8020).

<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>

7.2.  Creating the /hbase Directory in HDFS


Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase master runs as hbase:hbase so it does not have the required permissions to create a top level directory.
To create the /hbase directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase

7.3.  Starting the HBase Master

After ZooKeeper is running, you can start the HBase master in standalone mode.

$ sudo service hbase-master start


7.4. Starting an HBase RegionServer

The RegionServer is the part of HBase that actually hosts data and processes requests. The region server typically runs on all of the slave nodes in a cluster, but not the master node

To enable the HBase RegionServer on Ubuntu and Debian systems:

$ sudo apt-get install hbase-regionserver

To start the RegionServer:

$ sudo service hbase-regionserver start

[You should be able to navigate to http://localhost:60010 and verify that the local RegionServer has registered with the Master.]

Step 8 :-Installing and Starting the HBase Thrift Server

The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is multiplatform and more performant than REST in many situations. Thrift can be run collocated along with the region servers, but should not be collocated with the NameNode or the JobTracker.

To enable the HBase Thrift Server on Ubuntu and Debian systems:
$ sudo apt-get install hbase-thrift
To start the Thrift server:
$ sudo service hbase-thrift start
 
Step 9 :- Configuring for Distributed Operation

After you have decided which machines will run each process, you can edit the configuration so that the nodes can locate each other. In order to do so, you should make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends the use of a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.
The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
</property>
The hbase.zookeeper.quorum property is a comma-separated list of hosts on which ZooKeeper servers are running. If one of the ZooKeeper servers is down, HBase will use another from the list. By default, the ZooKeeper service is bound to port 2181. To change the port, add the hbase.zookeeper.property.clientPort property to hbase-site.xml and set the value to the port you want ZooKeeper to use.
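
For example, to make ZooKeeper listen on a non-default port, the addition to hbase-site.xml would look like this (2222 is just an illustrative port number):

<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2222</value>
</property>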

Trouble Shooting [https://hbase.apache.org/book.html]

Though the above steps should be enough to run HBASE successfully, if it fails with an error like JAVA_HOME not set then do the following :-

$ sudo gedit /etc/hbase/conf/hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0


PIG Installation on Ubuntu

Step 1 : Install PIG from Cloudera repository

$ sudo apt-get install pig

Step 2 : For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows [in case it is not already set]:

$ sudo gedit .bashrc
 
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce

Step 3 : To start Pig in interactive mode (MRv1)

$ pig


Step 4 :  Examples

grunt> ls
hdfs://localhost/user/joe/input <dir>
grunt> A = LOAD 'input';
grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
grunt> DUMP B; 
 
[For this example to run, you need the input directory to be created. In case you
have not already created it in the previously mentioned Hadoop installation steps,
please create it:
 
$ sudo -u hdfs hadoop fs -mkdir -p /user/$USER

$ sudo -u hdfs hadoop fs -chown $USER /user/$USER

$ hadoop fs -mkdir input

$ hadoop fs -put /etc/hadoop/conf/*.xml input

$ hadoop fs -ls input ] 

Hadoop Ecosystems

Though there are many ecosystem components around Hadoop, which ones you use depends on the purpose of your project. For my project of building and analyzing a data warehouse for banking, I needed the components below.

Hive :- A SQL-like database that works on the Hadoop MR framework, used for analyzing the raw data first.
PIG :- A component that transforms raw data from various formats into an understandable, aggregated format that HDFS can store.
Impala :- An in-memory, columnar database that works with HDFS and is much faster than Hive.
HBASE :- A NoSQL database, used mainly to handle unstructured data.
Oozie :- Used to schedule workflows in Hadoop.
Hue :- A web-based interface for Hadoop that wraps the CLI options of the ecosystem components.

Apart from that, I have used R in the ecosystem for analytics and Splunk for graphical reporting, but I have now been working with Tableau.

Hive Installation in Ubuntu

Installing Hive from Cloudera is very simple and just needs the steps below :-

1. sudo apt-get install hive hive-metastore hive-server2 hive-hbase
2. sudo apt-get install hive-jdbc
3. Add /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to your classpath.

$sudo gedit .bashrc
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/lib/hive/lib/*:.

$ cd $HIVE_HOME/conf
$ sudo cp hive-env.sh.template hive-env.sh

$sudo gedit hive-env.sh
export HADOOP_HOME=/usr/lib/hadoop

That will do enough to install Hive, but you need a bit more configuration for the metastore.

Step 1 :- First you need to install MySQL

$ sudo apt-get install mysql-server
$ sudo service mysql start
$ sudo apt-get install libmysql-java
$ sudo ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar [to be done after installing hive]
$ sudo /usr/bin/mysql_secure_installation
$ sudo apt-get install sysv-rc-conf

Step 2 :- Create metastore database in mysql and user

$ sudo sysv-rc-conf mysql on
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql;

mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'mypassword';
...
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;

Step 3 :- Configure Hive Site xml file to make Hive use the metastore

sudo gedit /usr/lib/hive/conf/hive-site.xml

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore</value>
  <description>the URL of the MySQL database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>datanucleus.autoStartMechanism</name>
  <value>SchemaTable</value>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>


<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>

<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>localhost</value>
</property>



<property>
  <name>hive.zookeeper.client.port</name>
  <value>2181</value>
  <description>
  The port at which the clients will connect.
  </description>
</property>

Step 4 :- Create the below directory in HDFS for Hive to access


sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod g+w /user/hive/warehouse
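
After this, the Hive services can be started and tested. The service names below assume the standard CDH packages installed earlier, and the Beeline connection string assumes HiveServer2 on its default port 10000:

$ sudo service hive-metastore start
$ sudo service hive-server2 start
$ beeline -u jdbc:hive2://localhost:10000 -n $USER -e "SHOW DATABASES;"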

Step 5 :- Troubleshooting Hive

Though the above steps should be enough to run Hive successfully, in case it is not running you need to check the log files in the /var/log/hive directory.
I have faced two problems even after installing it successfully.

1. A connection failure when Hive tries to use its metastore (datastore driver not found). [https://hadooptutorial.info/datastore-driver-was-not-found/]

For this, you need to download the latest version of the MySQL connector and install it in the way specified below.

$ cd Downloads/
$ tar -xzf mysql-connector-java-5.1.35.tar.gz
$ cd mysql-connector-java-5.1.35/
$ sudo cp mysql-connector-java-5.1.35-bin.jar $HIVE_HOME/lib/

2. Unknown column 'OWNER_NAME' in 'field list' [https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/CDH-upgrade-from-4-7-to-CDH-5-2-hive-metastore-issue/td-p/20626]

This happened because of the previous step where we ran SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql;. That schema is only valid for an older Hive version, but as our Hive version is 1.1.0 we have to run:

SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-1.1.0.mysql.sql;

However, this file references the txn-0.13.0 schema SQL file without its full path, so make sure you modify the file hive-schema-1.1.0.mysql.sql and give the full path as /usr/lib/hive/scripts/metastore/upgrade/mysql/txn-0.13.0.mysql.sql