In the last few years, Apache Hadoop has emerged as the technology of choice for solving Big Data problems and for improved business analytics. One example of this is how Sears Holdings moved to Hadoop from its traditional Oracle Exadata, Teradata and SAS systems. Another recent big entrant to the Hadoop bandwagon is Walmart, with its own Hadoop implementation.
An earlier Edureka blog discussed how to create a Hadoop cluster on AWS in 30 minutes. In continuation of that, this blog talks about the important Hadoop cluster configuration files.
These files are: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters and slaves. All of them are available under the ‘conf’ directory of the Hadoop installation directory.
Let’s look at the files and their usage one by one!
hadoop-env.sh
This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for a Hadoop daemon is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java installation on the system.
This file is also used for setting other parts of the Hadoop daemon execution environment, such as the heap size (HADOOP_HEAPSIZE), the Hadoop home directory (HADOOP_HOME), the log file location (HADOOP_LOG_DIR), etc.
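As an illustration, a minimal hadoop-env.sh might set just these variables (the paths and values shown here are placeholders, not recommendations):

    # Point the Hadoop daemons to the Java installation
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    # Maximum heap size for the daemons, in MB (the default is 1000)
    export HADOOP_HEAPSIZE=2000
    # Directory where the daemon log files are written
    export HADOOP_LOG_DIR=/var/log/hadoop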
Note: To keep the cluster setup easy to understand, we have configured only the parameters necessary to start a cluster.
The following three files are the important configuration files for the runtime environment settings of a Hadoop cluster.
core-site.xml
This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
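The key property here is fs.default.name, which holds the NameNode’s URI. A minimal core-site.xml might look like this (the hostname below is a placeholder for your NameNode machine):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- URI of the NameNode, in the form hdfs://hostname:port -->
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:8020</value>
      </property>
    </configuration>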
Here, hostname and port are the machine and port on which the NameNode daemon runs and listens. They also inform the NameNode as to which IP address and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.
hdfs-site.xml
This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes. You can also configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.
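For instance, the replication factor can be overridden at create time, or changed later, from the command line (the file paths here are illustrative):

    # Create a file with a replication factor of 2 instead of the default
    hadoop fs -D dfs.replication=2 -put localfile.txt /user/hadoop/localfile.txt
    # Change the replication factor of an existing file to 2
    hadoop fs -setrep -w 2 /user/hadoop/localfile.txt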
The value “true” for the property ‘dfs.permissions’ enables permission checking in HDFS, and the value “false” turns permission checking off. Switching from one value to the other does not change the mode, owner or group of files or directories.
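Putting these two properties together, a minimal hdfs-site.xml could look like the following (the values are only examples):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- Default number of block replicas for each file -->
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <!-- "true" enables permission checking; "false" turns it off -->
        <name>dfs.permissions</name>
        <value>true</value>
      </property>
    </configuration>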
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.
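A minimal mapred-site.xml would then contain just this one property (the hostname below is a placeholder for your JobTracker machine; 8021 is a commonly used port):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- hostname:port pair on which the JobTracker listens for RPC -->
        <name>mapred.job.tracker</name>
        <value>jobtracker-host:8021</value>
      </property>
    </configuration>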
You can replicate all four files explained above to all the DataNodes and the Secondary NameNode. These files can then be tailored for any node-specific configuration, e.g. in the case of a different JAVA_HOME on one of the DataNodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in the Hadoop cluster.
Masters
This file informs the Hadoop daemon about the location of the Secondary NameNode. The ‘masters’ file on the master server contains the hostname of the Secondary NameNode server. The ‘masters’ file on the slave nodes is blank.
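For example, if the Secondary NameNode were to run on a host named snn-host (a hypothetical name), the ‘masters’ file on the master server would contain just that one line:

    snn-host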
Slaves
The ‘slaves’ file at the master node contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers. The ‘slaves’ file on a slave server contains the IP address of that slave node. Notice that the ‘slaves’ file on a slave node contains only its own IP address and not that of any other DataNodes in the cluster.
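For example, with three slave machines (the hostnames here are hypothetical), the ‘slaves’ file on the master node would read:

    slave1
    slave2
    slave3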