Open Data Link

UK Transport Data :- https://tfl.gov.uk/info-for/open-data-users/

Flume Installation and Streaming Twitter Data Using Flume


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:
  • Stream data from multiple sources into Hadoop for analysis
  • Collect high-volume Web logs in real time
  • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
  • Guarantee data delivery
  • Scale horizontally to handle additional data volume
Flume’s high-level architecture is focused on delivering a streamlined codebase that is easy-to-use and easy-to-extend. The project team has designed Flume with the following components:
  • Event – a singular unit of data that is transported by Flume (typically a single log entry)
  • Source – the entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
  • Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
  • Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
  • Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
  • Client – produces and transmits the Event to the Source operating within the Agent


A flow in Flume starts from the Client (Web Server). The Client transmits the event to a Source operating within the Agent. The Source receiving this event then delivers it to one or more Channels. These Channels are drained by one or more Sinks operating within the same Agent. Channels allow decoupling of ingestion rate from drain rate using the familiar producer-consumer model of data exchange. When spikes in client side activity cause data to be generated faster than what the provisioned capacity on the destination can handle, the channel size increases. This allows sources to continue normal operation for the duration of the spike. Flume agents can be chained together by connecting the sink of one agent to the source of another agent. This enables the creation of complex dataflow topologies.
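To make the source-channel-sink wiring concrete, here is a minimal, hypothetical agent definition in Flume's properties format (a netcat source feeding a memory channel drained by a logger sink; the names agent, netcatSrc, memCh and logSink are illustrative and not part of this tutorial's files):

agent.sources = netcatSrc
agent.channels = memCh
agent.sinks = logSink

agent.sources.netcatSrc.type = netcat
agent.sources.netcatSrc.bind = localhost
agent.sources.netcatSrc.port = 44444
agent.sources.netcatSrc.channels = memCh

agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 1000

agent.sinks.logSink.type = logger
agent.sinks.logSink.channel = memCh

Note that a source lists its channels (plural), while a sink binds to a single channel.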
Now we will install Apache Flume on our virtual machine.

STEP 1:

Download flume:
Command: wget http://archive.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz



Command: ls



STEP 2:

Extract the files from the Flume tar archive.
Command: tar -xvf apache-flume-1.4.0-bin.tar.gz
Command: ls



STEP 3:

Move the apache-flume-1.4.0-bin directory into the /usr/lib/ directory.

Command: sudo mv apache-flume-1.4.0-bin /usr/lib/



STEP 4:

We need to remove protobuf-java-2.4.1.jar and guava-10.0.1.jar from the lib directory of apache-flume-1.4.0-bin (when using hadoop-2.x), as they conflict with the versions bundled with Hadoop 2.

Command: sudo rm /usr/lib/apache-flume-1.4.0-bin/lib/protobuf-java-2.4.1.jar /usr/lib/apache-flume-1.4.0-bin/lib/guava-10.0.1.jar



STEP 5:

Use the link below to download flume-sources-1.0-SNAPSHOT.jar:
https://drive.google.com/file/d/0B-Cl0IfLnRozUHcyNDBJWnNxdHc/view?usp=sharing



Save the file.


STEP 6:

Move the flume-sources-1.0-SNAPSHOT.jar file from Downloads directory to lib directory of apache flume:

Command: sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /usr/lib/apache-flume-1.4.0-bin/lib/



STEP 7:

Check whether the flume-sources SNAPSHOT jar has been moved to the lib folder of Apache Flume:

Command: ls /usr/lib/apache-flume-1.4.0-bin/lib/flume*


STEP 8:

Copy the contents of flume-env.sh.template to flume-env.sh:

Command: cd /usr/lib/apache-flume-1.4.0-bin/

Command: sudo cp conf/flume-env.sh.template conf/flume-env.sh



STEP 9:

Edit flume-env.sh as shown below.

command: sudo gedit conf/flume-env.sh



Set JAVA_HOME and FLUME_CLASSPATH; a minimal sketch of the relevant lines follows.
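A minimal flume-env.sh sketch (the JAVA_HOME path is an assumption for an OpenJDK install; point it at your own JDK, and point FLUME_CLASSPATH at the flume-sources jar copied in Step 6):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64    # assumed JDK location, adjust for your machine
FLUME_CLASSPATH="/usr/lib/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"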


Now we have installed Flume on our machine. Let's run Flume to stream Twitter data onto HDFS.
We need to create an application in Twitter and use its credentials to fetch data.

STEP 10:

Open a Browser and go to the below URL:

URL:https://twitter.com/



STEP 11:

Enter your Twitter account credentials and sign in:



STEP 12:

Your twitter home page will open:



STEP 13:

Change the URL to https://apps.twitter.com


STEP 14:

Click on Create New App to create a new application and enter all the details in the application:


STEP 15:

Check Yes, I agree and click on Create your Twitter application:


STEP 16:

Your Application will be created:


STEP 17:

Click on Keys and Access Tokens, you will get Consumer Key and Consumer Secret.


STEP 18:

Scroll down and Click on Create my access token:


Your access token has been created:

Consumer Key (API Key) 4AtbrP50QnfyXE2NlYwROBpTm
Consumer Secret (API Secret) jUpeHEZr5Df4q3dzhT2C0aR2N2vBidmV6SNlEELTBnWBMGAwp3
Access Token 1434925639-p3Q2i3l2WLx5DvmdnFZWlYNvGdAOdf5BrErpGKk
Access Token Secret AghOILIp9JJEDVFiRehJ2N7dZedB1y4cHh0MvMJN5DQu7

STEP 19:

Use the link below to download the flume.conf file:
https://drive.google.com/file/d/0B-Cl0IfLnRozdlRuN3pPWEJ1RHc/view?usp=sharing

Save the file.



STEP 20:

Put the flume.conf in the conf directory of apache-flume-1.4.0-bin
Command: sudo cp /home/centos/Downloads/flume.conf /usr/lib/apache-flume-1.4.0-bin/conf/


STEP 21:

Edit flume.conf

Command: sudo gedit conf/flume.conf

Very carefully replace the highlighted credentials in flume.conf with the credentials (Consumer Key, Consumer Secret, Access Token, Access Token Secret) you received after creating the application. Everything else remains the same; save the file and close it.
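For reference, the downloaded flume.conf typically looks like the sketch below (the exact file may differ; the keywords, HDFS path and roll settings here are illustrative assumptions, and the four credential lines are the ones you must replace):

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

The agent name TwitterAgent matches the -n option used when starting Flume in Step 23.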





STEP 22:

Change permissions for flume directory.

Command: sudo chmod -R 755 /usr/lib/apache-flume-1.4.0-bin/


STEP 23:

Start fetching the data from twitter:

Command: ./bin/flume-ng agent -n TwitterAgent -c conf -f /usr/lib/apache-flume-1.4.0-bin/conf/flume.conf




Now wait for 20-30 seconds and let Flume stream the data onto HDFS. After that, press Ctrl + C to break the command and stop the streaming. (Since you are stopping the process, you may see a few exceptions; ignore them.)

STEP 24:

Open the Mozilla browser in your VM, and go to /user/flume/tweets in HDFS

Click on the FlumeData file that was created:
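Alternatively, you can verify from the command line (FlumeData is the HDFS sink's usual file prefix, so the wildcard below is an assumption):

Command: hadoop fs -ls /user/flume/tweets
Command: hadoop fs -cat /user/flume/tweets/FlumeData.* | head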


If you can see data similar to that shown in the snapshot below, then the unstructured data has been streamed from Twitter onto HDFS successfully. Now you can do analytics on this Twitter data using Hive.



Analytics Tutorial: Learn Linear Regression in R

The R-Factor

There is often a gap in what we are taught in college and the knowledge that we need to possess to be successful in our professional lives. This is exactly what happened to me when I joined a consultancy firm as a business analyst. At that time I was a fresher coming straight from the cool college atmosphere, newly exposed to the Corporate Heat.
One day my boss called me to his office and told me that one of their clients, a big insurance company, was facing significant losses on auto insurance. They had hired us to identify and quantify the factors responsible for it. My boss emailed me the data that the company had provided and asked me to do a multivariate linear regression analysis on it. My boss told me to use R and make a presentation of the summary.
Now as a statistics student I was quite aware of the principles of a multivariate linear regression, but I had never used R. For those of you who are not aware, R is a statistical programming language. It is a very powerful tool and widely used across the world in analyzing data. Of course, I did not know this at that time.
Anyways, it took me a lot of surfing on the internet and reading books to learn how to fit my model in R. And now I want to help you guys save that time!
R is an open source tool easily available on the internet. I'll assume you have it installed on your computer. Otherwise, you can easily download and install it from www.r-project.org/
I have already converted the raw data file from the client into a clean .csv (comma-separated) file. Click here to download the file.
I've saved it on the D drive of my computer in a folder called Linear_Reg_Sample. You can save it anywhere, but remember to change the path wherever a file path is mentioned.
Open the R software that you've installed. It's time to get started!

Let's Start Regression in R

The first thing to do is obviously read all our data in R. This can be easily done using the command: >LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")
Here we read all the data into an object LinRegData, using a function read.csv().
NOTE: If you observe closely, you'll see that we have used \\ instead of a \. This is because of the construct of the language. Whenever you enter a path, make sure to use \\
Let's see if our data has been read by R. Use the following command to get a summary of the data: >summary(LinRegData)
This will give the output shown below.
[Image 1: Summary of the input data]
In the output you can see the distribution of data. The min, max, median, mean are shown for all the variables.

Performing the Regression Analysis

Now that the data has been loaded, we need to fit a regression model over it.
We will use the following command in R to fit the model:  >FitLinReg <- lm(Capped_Losses ~ Number_Vehicles + Average_Age + Gender_Dummy + Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, LinRegData)
In this command, we create an object FitLinReg and store the results of our regression model in it. The lm() function is used to fit the model. Inside the model, Capped_Losses is our dependent variable which we are trying to explain using the other variables that are separated by a + sign. The last parameter of the formula is the source of the data.
If no error is displayed, it means our regression is done and the results are stored in FitLinReg. We can see the results using two commands:
 
1. >FitLinReg
This gives the output:
 
[Image: printed output of FitLinReg]
 
 
2. >summary(FitLinReg)
This gives the output:
[Image: output of summary(FitLinReg)]

The summary command gives us the intercepts of each variable, its standard error, t value and significance.
The output also tells us the significance level of each variable. For example, a variable marked *** is highly significant, one marked ** is significant at the 99.9% level, and a blank next to a variable indicates that it is not significant.
We can easily see that the Number_Vehicles variable is not significant and does not affect the model. We can remove this variable from the model.
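If you want to refit without it, here is a hedged sketch using base R's update() function (the variable name is taken from the formula above):

>FitLinReg2 <- update(FitLinReg, . ~ . - Number_Vehicles)
>summary(FitLinReg2)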
If you go through what we've done till now, you will realize that it took us just two commands to fit a multivariate model in R. See how simple life has become!!!

Happy Ending!

In this way I learnt how to fit a regression model using R. I made a summary of my findings and made a presentation to the clients.
My boss was rather happy with me and I received a hefty bonus that year.

Hadoop Administration Interview Questions and Answers


https://www.dezyre.com/hadoop-tutorial/hadoop-multinode-cluster-setup


It is essential to prepare yourself in order to pass an interview and land your dream job. Here’s the first step to achieving this. The following are some frequently asked Hadoop Administration interview questions and answers that might be useful.

Explain check pointing in Hadoop and why is it important?

Check pointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient Namenode recovery and restart and is an important indicator of overall cluster health.
The Namenode persists filesystem metadata. At a high level, the namenode's primary responsibility is to store the HDFS namespace: things like the directory tree, file permissions and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
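In Hadoop 2.x the checkpoint cadence is controlled from hdfs-site.xml; a hedged sketch (standard property names, with illustrative values matching the usual defaults):

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint at least once an hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or after this many uncheckpointed edit-log transactions -->
</property>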

What is the default block size in HDFS, and what are the benefits compared to having smaller block sizes?

Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, and it is often configured even larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data, by keeping large amounts of data sequentially organized on the disk. As a result, HDFS is expected to store very large files that are read sequentially. Unlike a file system such as NTFS or EXT, which may hold numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes, each.
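The block size can be overridden in hdfs-site.xml; a hedged sketch using the Hadoop 2.x property name (dfs.block.size is the older 1.x equivalent):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB, specified in bytes -->
</property>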

What are two main modules which help you interact with HDFS and what are they used for?

user@machine:hadoop$ bin/hadoop moduleName -cmd args…
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are dfs and dfsadmin.
The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
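A few hedged examples of each module (standard Hadoop CLI commands; the paths are placeholders):

bin/hadoop dfs -ls /user/hadoop
bin/hadoop dfs -put localfile.txt /user/hadoop/
bin/hadoop dfsadmin -report
bin/hadoop dfsadmin -safemode get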

How can I setup Hadoop nodes (data nodes/namenodes) to use multiple volumes/disks?

Datanodes can store blocks in multiple directories, typically located on different local disk drives. In order to set up multiple directories one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir (dfs.datanode.data.dir in Hadoop 2.x). Datanodes will attempt to place an equal amount of data in each of the directories.
The Namenode also supports multiple directories, which store the namespace image and edit logs. In order to set up multiple directories one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir (dfs.namenode.name.dir in Hadoop 2.x). The namenode directories are used for namespace data replication, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
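A hedged hdfs-site.xml sketch for multiple storage directories (Hadoop 2.x property names; the paths are placeholders):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/disk1/hdfs/name,/remote/backup/hdfs/name</value>
</property>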

How do you read a file from HDFS?

The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster config file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: Client validation is checked by username or by strong authentication mechanism like Kerberos.
Step 5: The client’s validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, then the NameNode responds with the first block id and provides a list of datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.
If, while reading the file, the datanode dies, the client library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the information returned by the NameNode about block locations is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or the read will fail.

What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?

Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:
  • FIFO (First in First Out) Scheduler
  • Fair Scheduler
  • Capacity Scheduler

How do you decide which scheduler to use?

The CS scheduler can be used under the following situations:
  • When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
  • When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
  • When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
  • When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
  • When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
  • When you have a lot of variability in the utilization between pools, the Fair Scheduler’s pre-emption model helps achieve much greater overall cluster utilization by giving away otherwise reserved resources when they’re not used.
  • When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used ? Where are they specified and what happens if you don’t specify these parameters?

dfs.name.dir specifies the path of the directory in the Namenode’s local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in a Datanode’s local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the Namenode’s metadata and the Datanodes’ file block information get stored in /tmp under a HADOOP-USERNAME directory. This is not a safe place: when nodes are restarted, the data will be lost, which is critical if the Namenode is restarted, as the formatting information will be lost.

What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?

The filesystem checking utility FSCK is used to check and display the health of the file system and of the files and blocks in it. When used with a path (bin/hadoop fsck /path -files -blocks -locations -racks) it recursively shows the health of all files under that path, and when used with ‘/’ it checks the entire file system. By default FSCK ignores files still open for writing by a client; to list such files, run FSCK with the -openforwrite option.
FSCK checks the file system, prints a dot for each healthy file found, and prints a message for the ones that are less than healthy, including those which have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks and missing replicas.

What are the important configuration files that need to be updated/edited to setup a fully distributed mode of Hadoop cluster 1.x ( Apache distribution)?

The Configuration files that need to be updated to setup a fully distributed mode of Hadoop are:
  • Hadoop-env.sh
  • Core-site.xml
  • Hdfs-site.xml
  • Mapred-site.xml
  • Masters
  • Slaves
These files can be found in your Hadoop ‘conf’ directory. If Hadoop daemons are started individually using ‘bin/hadoop-daemon.sh start xxxx’, where xxxx is the name of a daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If Hadoop daemons are started using ‘bin/start-dfs.sh’ and ‘bin/start-mapred.sh’, then the masters and slaves configuration files on the namenode machine need to be updated.
Masters – IP address/hostname of the node where the secondarynamenode will run.
Slaves – IP addresses/hostnames of the nodes where datanodes (and eventually tasktrackers) will run.

Installing Cassandra in Ubuntu

cd /tmp
wget http://www.us.apache.org/dist/cassandra/2.1.6/apache-cassandra-2.1.6-bin.tar.gz
tar -xvzf apache-cassandra-2.1.6-bin.tar.gz
mv apache-cassandra-2.1.6 ~/cassandra

sudo mkdir /var/lib/cassandra
sudo mkdir /var/log/cassandra
sudo chown -R $USER:$GROUP /var/lib/cassandra
sudo chown -R $USER:$GROUP /var/log/cassandra

sudo gedit .bashrc

export CASSANDRA_HOME=~/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin


sudo sh ~/cassandra/bin/cassandra
sudo sh ~/cassandra/bin/cassandra-cli
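After editing .bashrc, reload it with source ~/.bashrc (or open a new terminal). A couple of optional, hedged sanity checks (paths assume the layout created above):

~/cassandra/bin/nodetool status      # the local node should show as UN (Up/Normal)
~/cassandra/bin/cqlsh                # CQL shell, an alternative to the legacy cassandra-cli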

Installing Python in Ubuntu



$ sudo apt-get install python2.7

$ sudo apt-get install python2.7-dev
 
You need to install pip first:

$ sudo apt-get install python-pip


Check whether pip is working:

$ pip

$ sudo pip install numpy

$ sudo pip install ipython

$ sudo pip install pandas

Type ipython for an interactive shell, or
ipython notebook for the browser-based notebook.
 
 

Hive Interview Question

Wipro :-

1. Write the syntax for creating a table in Hive and explain each part.
2. What does LOCATION stand for in that syntax?
3. What does the STORED AS clause do? How many types of file formats are there? What are their differences?
4. What is a SerDe? Why do you use it? What are the different SerDe formats?
5. How do you process an unbounded XML file with a schema defined in Hive?
6. What are UDF and UDTF? What are the differences between them?
7. What are RC and ORC files, and what are they used for?

Common :-
1. How do you load bulk data into a Hive partition?
2. What are the drawbacks of Hive?
3. Which Hive and Hadoop versions have you worked on?
4. How do you do updates and deletes in Hive?
5. How do you do incremental updates in Hive?

Hadoop Cluster Configuration Files



In the last few years, Apache Hadoop has emerged as the technology for solving Big Data problems and for improved business analytics. One example of this is how Sears Holdings moved to Hadoop from its traditional Oracle Exadata, Teradata and SAS systems. Another recent big entrant to the Hadoop bandwagon is Walmart’s Hadoop implementation.
An edureka blog has discussed how to create a Hadoop cluster on AWS in 30 minutes.
In continuation of that, this blog talks about the important Hadoop cluster configuration files. The following list summarizes them:
  • hadoop-env.sh – environment variables used by the Hadoop daemons (e.g. JAVA_HOME)
  • core-site.xml – settings for Hadoop Core, such as I/O settings common to HDFS and MapReduce
  • hdfs-site.xml – settings for the HDFS daemons (NameNode, Secondary NameNode, DataNodes)
  • mapred-site.xml – settings for the MapReduce daemons (JobTracker and TaskTrackers)
  • masters – host(s) on which the Secondary NameNode runs
  • slaves – hosts that run DataNodes and TaskTrackers
All these files are available under the ‘conf’ directory of the Hadoop installation directory.

Let’s look at the files and their usage one by one!

hadoop-env.sh

This file specifies environment variables that affect the JDK used by Hadoop Daemon (bin/hadoop).
As Hadoop framework is written in Java and uses Java Runtime environment, one of the important environment variables for Hadoop daemon is $JAVA_HOME in hadoop-env.sh. This variable directs Hadoop daemon to the Java path in the system.
This file is also used for setting other parts of the Hadoop daemon execution environment, such as the heap size (HADOOP_HEAPSIZE), the Hadoop home directory (HADOOP_HOME), the log file location (HADOOP_LOG_DIR), etc.
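For example, a minimal hadoop-env.sh entry might look like this (the JDK path is an assumption; point it at your own installation):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64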
Note: For the simplicity of understanding the cluster setup, we have configured only necessary parameters to start a cluster.
The following three files are the important configuration files for the runtime environment settings of a Hadoop cluster.

core-site.xml

This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
The hostname and port are the machine and port on which the NameNode daemon runs and listens; this also tells the NameNode which IP and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.
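A minimal core-site.xml sketch (Hadoop 1.x property name; the hostname is a placeholder):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>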

hdfs-site.xml

This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.
You can also configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.
The value “true” for property ‘dfs.permissions’ enables permission checking in HDFS and the value “false” turns off the permission checking. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
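A minimal hdfs-site.xml sketch (illustrative values):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
</configuration>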

mapred-site.xml

This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.
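A minimal mapred-site.xml sketch (the hostname is a placeholder; 8021 is a commonly used JobTracker port):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:8021</value>
  </property>
</configuration>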
You can replicate all of the four files explained above to all the DataNodes and the Secondary NameNode. These files can then be adjusted for any node-specific configuration, e.g. in case of a different JAVA_HOME on one of the Datanodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in the Hadoop cluster.

Masters

This file informs the hadoop daemon about the Secondary Namenode location. The ‘masters’ file on the Master server contains the hostname of the Secondary Name Node server.
The ‘masters’ file on Slave Nodes is blank.

Slaves

The ‘slaves’ file at Master node contains a list of hosts, one per line, that are to host Data Node and Task Tracker servers.
The ‘slaves’ file on Slave server contains the IP address of the slave node. Notice that the ‘slaves’ file at Slave node contains only its own IP address and not of any other Data Nodes in the cluster.
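For reference, hedged examples of what these files might contain on the master node (the hostnames are placeholders):

masters:
secondary-namenode-host

slaves:
slave1-host
slave2-host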

Banking Case Study

Workflow :-
1. Group the data of loan by loan id
2. Group the client data by client id
3. Generate Age from client data and store
4. Get the transaction data for last year
5. Sum up the data based on transaction type and amount
6. Group the card data by disposition id
7. Group the district data by district id
8. Filter out the unemployment data for the years 95 & 96 from the district data
9. Generate the difference between the unemployment data for every district for those two years
10. Group the disposition data
11. Joining :-
join loan,transaction,Account,Disposition,on ac_id as ac_id_join
join ac_id_join,district_info,client on district_id as include_district
join include_district,card on disposition_id as join_done
select loan_amount,loan_duration,loan_status,type,transaction_amount,date,owner_type,district_name,region,avg_salary,unemployment_rate_95,unemployment_rate_96,no_of_enterpreneur/1000,card type,birthday

12. Algorithm used to predict excellent, good and risky customers (pseudocode):

within 1 year {

    12.1. if transaction_amount > 10 lac and avg_sal > 10k and loan_status == 'A' and (age > 25 and age <= 65)
              write to a file called "good": more loans can be granted, card can be upgraded

    12.2. if transaction_amount > 10 lac and avg_sal > 6k and (loan_status == 'A' or loan_status == 'C') and (age > 25 and age <= 55) and unemployment_rate < 0.80
              write to a file called "ok": more loans can be granted after completion of the current loan, card can be upgraded after completion of the loan

    12.3. if avg_sal > 6k and (loan_status == 'B' or loan_status == 'D') and age > 35 and no_of_entrepreneur > 100
              write to a file called "risky": no more loans, card must be downgraded

}

Banking Domain Case Study in Hadoop and R

In this blog and the next few ones that will follow, we will analyze a banking domain dataset, which contains several files with details of its customers. This database was prepared by Petr Berka and Marta Sochorova.
The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.
By the time you finish reading this blog, you will have learned:
  • How to analyze a bank’s data to predict a customer’s quality
  • Using this analysis we can categorize a customer into three categories:
  1. Excellent: Customers whose record is good with the bank
  2. Good: Customers who have average earning with a good record till now
  3. Risky: Customers who are in debt to the bank or who have not paid their loans on time
  • How to write PIG UDF
  • How to connect Hadoop with R
  • How to load data from Hadoop to R
How to analyze a bank’s data to predict the customer’s quality
Prerequisites

Software:
  • Java installed
  • Hadoop installed
  • Pig installed
  • R-base
  • Rstudio
  • Ubuntu OS

Technology (concepts):
  • Hadoop concepts
  • Java concepts
  • Pig concepts


View the detail case study here.

Creating Your First Map Reduce Programme

Opening the New Java Project wizard

The New Java Project wizard can be used to create a new java project. There are many ways to open this wizard:
  • By clicking on the File menu and choosing New > Java Project
  • By right clicking anywhere in the Project Explorer and selecting New > Java Project
  • By clicking on the New button in the tool bar and selecting Java Project

Using the New Java Project wizard

The New Java Project Wizard has two pages.
On the first page:
  • Enter the Project Name
  • Select the Java Runtime Environment (JRE) or leave it at the default
  • Select the Project Layout which determines whether there would be a separate folder for the sources code and class files. The recommended option is to create separate folders for sources and class files.

    You can click on the Finish button to create the project or click on the Next button to change the java build settings.
    On the second page you can change the Java Build Settings like setting the Project dependency (if there are multiple projects) and adding additional jar files to the build path.

Writing the Mapper Class

To start with some basic MapReduce code, we will write a Word Count program, which simply counts the occurrences of each word in a file and writes the counts as output.

In the mapper class we write WordCountMapper:


package com.hadoop.training;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Each input value is one line of text; emit (word, 1) for every token in the line.
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Writing the Reducer Class

In the reducer class we write WordCountReducer:
     
package com.hadoop.training;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // For each word, sum all the counts emitted by the mappers and write (word, total).
    public void reduce(Text key, Iterable<IntWritable> value, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : value) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}


Writing the MapReduce Driver Class

The MapReduce driver class is WordCount:

package com.hadoop.training;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }

        @SuppressWarnings("deprecation")
        Job job = new Job();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running the Map Reduce Programme

$ hadoop jar WC.jar com.hadoop.training.WordCount hdfs://localhost:8020/user/rajeev/input hdfs://localhost:8020/user/rajeev/output
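Once the job finishes, you can inspect the result; a hedged check (part-r-00000 is the usual output file name for a single-reducer job with the new API):

$ hadoop fs -ls /user/rajeev/output
$ hadoop fs -cat /user/rajeev/output/part-r-00000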

Eclipse Installation in Ubuntu

  1. Open a terminal (Ctrl-Alt-T) and switch it to root permissions by entering:
    $ sudo su
  2. Make sure Eclipse Indigo is NOT installed in your Ubuntu. You may need to remove both "eclipse" and "eclipse-platform" packages to get rid of it. If it still gets in the way when trying to install Luna using this easy way, you may need to look at the "hard way" below.
    # apt-get remove eclipse eclipse-platform
  3. Install a Java 1.7 JDK:
    # apt-get install openjdk-7-jdk
  4. Install Maven:
    apt-get install maven
  5. Get rid of the root access as you won't need it anymore:
    # exit
  6. Download Eclipse. The "for Java EE Developers", "for Java Developers" and "for RCP and RAP Developers" versions all seem to work. The file that was tested to work (note that it is for the 64-bit Ubuntu version) is available at this page
  7. Extract the Eclipse installation tarball into your home directory:
    $ cd
    $ tar -xzvf <path/to/your-tar-file>
  8. Increase the memory for the Eclipse installation by modifying the ~/eclipse/eclipse.ini file.
    • Change the -Xmx setting (line 20) to be at least 1 GB, recommended 2 GB (i.e. -Xmx2048m).
    • Change the -XX:MaxPermSize setting (line 18) to at most 512m. If you have the -Xmx setting set to 1 GB, then I suggest using a lower value, for example 300m.
  9. Run the Eclipse:
    $ ~/eclipse/eclipse
  10. If everything seems to work, then configure it to have a desktop icon. Paste the command below into the terminal and hit enter:
     gksudo gedit /usr/share/applications/eclipse.desktop

     The above command will create and open the launcher file for Eclipse with the gedit text editor. Paste the content below into the opened file and save it:
     [Desktop Entry]
     Name=Eclipse 4
     Type=Application
     Exec=/home/rajeev/eclipse/eclipse
     Terminal=false
     Icon=/home/rajeev/eclipse/icon.xpm
     Comment=Integrated Development Environment
     NoDisplay=false
     Categories=Development;IDE;
     Name[en]=Eclipse

 

Splunk Installation in Ubuntu

sudo dpkg -i Downloads/splunk-6.2.3-264376-linux-2.6-amd64.deb
sudo /opt/splunk/bin/splunk start
http://localhost:8000

Splunk Impala
Splunk Hadoop Connect
[more info]

Installing R in Ubuntu Trusty

Step 1 :- Add the latest trusty link from cran to apt. [click here for reference]
sudo gedit /etc/apt/sources.list

deb http://cran.r-project.org/bin/linux/ubuntu/ trusty/

Step 2 :- Add secure key to check the new added link [click here for more info]

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9





Step 3 :- Check the apt by the following command

sudo apt-get update


Step 4 :- Now run the below command to install R
sudo apt-get install r-base
sudo apt-get install r-base-dev



Step 5 :- Now type R in the shell to get into R command prompt

R


Installing R studio [click here and get started for more info]

$ sudo apt-get install libjpeg62
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.1103-amd64.deb
$ sudo gdebi rstudio-server-0.98.1103-amd64.deb



http://localhost:8787

Installing Impala ODBC Driver in Ubuntu 64 bit

As of now, you should know that Cloudera still does not provide a Debian package for the Impala ODBC driver, so I downloaded the RPM file for SUSE 11 64-bit and then converted it to a Debian package using the commands below.

sudo apt-get install alien dpkg-dev debhelper build-essential
 
sudo alien ClouderaImpalaODBC-2.5.26.1027-1.x86_64.rpm
  
Now we will install the driver using the command:-

sudo dpkg -i clouderaimpalaodbc_2.5.26.1027-2_amd64.deb

Configuring ODBC Driver:-

Step 1 :- Edit .bashrc file and make the following entry

export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/odbc
export ODBCINI=/etc/odbc.ini
export ODBCSYSINI=/etc
export CLOUDERAIMPALAINI=/opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
[Check the proper location of the odbc.ini file using the command: odbcinst -j
You can use three environment variables (ODBCINI, ODBCSYSINI, and CLOUDERAIMPALAINI) to specify different locations for the odbc.ini, odbcinst.ini, and cloudera.impalaodbc.ini configuration files by doing the following:
  • Set ODBCINI to point to your odbc.ini file.
  • Set ODBCSYSINI to point to the directory containing the odbcinst.ini file.
  • Set CLOUDERAIMPALAINI to point to your cloudera.impalaodbc.ini file.
For example, if your odbc.ini and odbcinst.ini files are located in /etc and your cloudera.impalaodbc.ini file is located in /opt/cloudera/impalaodbc/lib/64, then set the environment variables as in the exports above.]
Step 2 :- ODBC driver managers use configuration files to define and configure ODBC data sources and drivers. By default, the following configuration files residing in the user’s home directory are used:
  • .odbc.ini is used to define ODBC data sources, and it is required.
  • .odbcinst.ini is used to define ODBC drivers, and it is optional.

Also, by default the Cloudera ODBC Driver for Impala is configured using the cloudera.impalaodbc.ini file, which is located in /opt/cloudera/impalaodbc/lib/64 for the 64-bit driver on Linux/AIX.

Step 3 :- Configuring the odbc.ini File
ODBC Data Source Names (DSNs) are defined in the odbc.ini configuration file. The file is divided into several sections:
  • [ODBC] is optional and used to control global ODBC configuration, such as ODBC tracing.
  • [ODBC Data Sources] is required, listing DSNs and associating DSNs with a driver.
  • A section having the same name as the data source specified in the [ODBC Data Sources] section is required to configure the data source.
The following is an example of an odbc.ini configuration file for Linux/AIX:

[ODBC Data Sources]
Sample_Cloudera_Impala_DSN_64=Cloudera Impala ODBC Driver 64-bit
[Sample_Cloudera_Impala_DSN_64]
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST=localhost
PORT=21050


To create a Data Source Name:
1. Open the .odbc.ini configuration file in a text editor.
2. In the [ODBC Data Sources] section, add a new entry by typing the Data Source Name (DSN),
then an equal sign (=), and then the driver name.
3. In the .odbc.ini file, add a new section with a name that matches the DSN you specified in
step 2, and then add configuration options to the section. Specify configuration options as
key-value pairs.
4. Save the .odbc.ini configuration file.

Step 4 :- Configuring the odbcinst.ini File
ODBC drivers are defined in the odbcinst.ini configuration file. The configuration file is optional because drivers can be specified directly in the odbc.ini configuration file.
The odbcinst.ini file is divided into the following sections:
  • [ODBC Drivers] lists the names of all the installed ODBC drivers.
  • A section having the same name as the driver name specified in the [ODBC Drivers] section lists driver attributes and values.
The following is an example of an odbcinst.ini configuration file for Linux/AIX:

[ODBC Drivers]

Cloudera Impala ODBC Driver 64-bit=Installed
[Cloudera Impala ODBC Driver 64-bit]
Description=Cloudera Impala ODBC Driver (64-bit)
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so

To define a driver:
1. Open the .odbcinst.ini configuration file in a text editor.
2. In the [ODBC Drivers] section, add a new entry by typing the driver name and then typing
=Installed
3. In the .odbcinst.ini file, add a new section with a name that matches the driver name you
typed in step 2, and then add configuration options to the section based on the sample
odbcinst.ini file provided in the Setup directory. Specify configuration options as key-value
pairs.
4. Save the .odbcinst.ini configuration file.

Step 5 :- Configuring the cloudera.impalaodbc.ini File
The cloudera.impalaodbc.ini file contains configuration settings for the Cloudera ODBC Driver for
Impala. Settings that you define in the cloudera.impalaodbc.ini file apply to all connections that use the driver.

To configure the Cloudera ODBC Driver for Impala to work with your ODBC driver manager:
1. Open the cloudera.impalaodbc.ini configuration file in a text editor.
2. Edit the DriverManagerEncoding setting. The value is usually UTF-16 or UTF-32 if you are
using Linux/Mac OS X, depending on the ODBC driver manager you use. iODBC uses UTF-
32, and unixODBC uses UTF-16.
OR
If you are using AIX and the unixODBC driver manager, then set the value to UTF-16. If you
are using AIX and the iODBC driver manager, then set the value to UTF-16 for the 32-bit
driver or UTF-32 for the 64-bit driver.
3. Edit the ODBCInstLib setting. The value is the name of the ODBCInst shared library for the
ODBC driver manager you use. To determine the correct library to specify, refer to your
ODBC driver manager documentation.
The configuration file defaults to the shared library for iODBC. In Linux/AIX, the shared
library name for iODBC is libiodbcinst.so.
4. Optionally, configure logging by editing the LogLevel and LogPath settings. For more
information, see "Configuring Logging Options" on page 28.
5. Save the cloudera.impalaodbc.ini configuration file.

Step 6 :- Check the entry and configuration of ODBC by typing

odbcinst  -q -s
   
 isql -v Sample_Cloudera_Impala_DSN_64
 
Trouble Shooting :-
 
Well I have got one error like 
[S1000][unixODBC][Cloudera][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function.
 
which means that the driver is not linked to libodbcinst.so
Please check it first with the command
ldd /opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
  
then search for libodbcinst.so
 
find / -name "libodbcinst.so*" 
If not found then install it 

sudo apt-get update && sudo apt-get install unixodbc-dev libmyodbc
or
 
sudo apt-get install unixODBC unixODBC-dev
 
Then again try to search for libodbcinst.so
 
and make entry in .bashrc as
 
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
 
For more info click here and here and here and here.