Flume Installation and Streaming Twitter Data Using Flume
Flume is a distributed, reliable, and available
service for efficiently collecting, aggregating, and moving large
amounts of log data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. It
uses a simple extensible data model that allows for online analytic
application.
Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:
- Stream data from multiple sources into Hadoop for analysis
- Collect high-volume Web logs in real time
- Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
- Guarantee data delivery
- Scale horizontally to handle additional data volume
The core building blocks of a Flume flow are:
- Event – a singular unit of data that is transported by Flume (typically a single log entry)
- Source – the entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
- Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
- Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
- Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
- Client – produces and transmits the Event to the Source operating within the Agent
A flow in Flume starts from the Client (Web Server).
The Client transmits the event to a Source operating within the Agent.
The Source receiving this event then delivers it to one or more
Channels. These Channels are drained by one or more Sinks operating
within the same Agent. Channels allow decoupling of ingestion rate from
drain rate using the familiar producer-consumer model of data exchange.
When spikes in client side activity cause data to be generated faster
than what the provisioned capacity on the destination can handle, the
channel size increases. This allows sources to continue normal operation
for the duration of the spike. Flume agents can be chained together by
connecting the sink of one agent to the source of another agent. This
enables the creation of complex dataflow topologies.
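To make the Source/Channel/Sink wiring concrete, here is a minimal single-agent configuration sketch (the agent, source, channel and sink names are illustrative; the flume.conf used later in this post defines a TwitterAgent instead):
# name the components of agent1
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# source: tail a local log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/syslog
agent1.sources.src1.channels = ch1
# channel: in-memory buffer that decouples the ingest rate from the drain rate
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
# sink: write events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/flume/logs
agent1.sinks.sink1.channel = ch1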
Now we will install Apache Flume on our virtual machine.
STEP 1:
Download flume:
Command: wget http://archive.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz
Command: ls
STEP 2:
Extract file from flume tar file.
Command: tar -xvf apache-flume-1.4.0-bin.tar.gz
Command: ls
STEP 3:
Put apache-flume-1.4.0-bin directory inside /usr/lib/ directory.
Command: sudo mv apache-flume-1.4.0-bin /usr/lib/
STEP 4:
We need to remove protobuf-java-2.4.1.jar and guava-10.0.1.jar from the lib directory of apache-flume-1.4.0-bin (when using Hadoop 2.x), as they conflict with the newer versions that ship with Hadoop 2.x.
Command: sudo rm /usr/lib/apache-flume-1.4.0-bin/lib/protobuf-java-2.4.1.jar /usr/lib/apache-flume-1.4.0-bin/lib/guava-10.0.1.jar
STEP 5:
Use the link below to download flume-sources-1.0-SNAPSHOT.jar:
https://drive.google.com/file/d/0B-Cl0IfLnRozUHcyNDBJWnNxdHc/view?usp=sharing
Save the file.
STEP 6:
Move the flume-sources-1.0-SNAPSHOT.jar file from Downloads directory to lib directory of apache flume:
Command: sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /usr/lib/apache-flume-1.4.0-bin/lib/
STEP 7:
Check whether flume SNAPSHOT has moved to the lib folder of apache flume:
Command: ls /usr/lib/apache-flume-1.4.0-bin/lib/flume*
STEP 8:
Copy flume-env.sh.template content to flume-env.sh
Command: cd /usr/lib/apache-flume-1.4.0-bin/
Command: sudo cp conf/flume-env.sh.template conf/flume-env.sh
STEP 9:
Edit flume-env.sh as described below.
Command: sudo gedit conf/flume-env.sh
Set JAVA_HOME and FLUME_CLASSPATH in this file.
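The entries should look something like this (paths are illustrative; point JAVA_HOME at the JDK installed on your machine):
# flume-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export FLUME_CLASSPATH="/usr/lib/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"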
Now we have installed Flume on our machine. Let's run Flume to stream Twitter data onto HDFS.
We need to create an application in twitter and use its credentials to fetch data.
STEP 10:
Open a Browser and go to the below URL:
URL:https://twitter.com/
STEP 11:
Enter your Twitter account credentials and sign in:
STEP 12:
Your twitter home page will open:
STEP 13:
Change the URL to https://apps.twitter.com
STEP 14:
Click on Create New App to create a new application and enter all the details in the application:
STEP 15:
Check Yes, I agree and click on Create your Twitter application:
STEP 16:
Your Application will be created:
STEP 17:
Click on Keys and Access Tokens, you will get Consumer Key and Consumer Secret.
STEP 18:
Scroll down and Click on Create my access token:
Your Access token got created:
Consumer Key (API Key) 4AtbrP50QnfyXE2NlYwROBpTm
Consumer Secret (API Secret) jUpeHEZr5Df4q3dzhT2C0aR2N2vBidmV6SNlEELTBnWBMGAwp3
Access Token 1434925639-p3Q2i3l2WLx5DvmdnFZWlYNvGdAOdf5BrErpGKk
Access Token Secret AghOILIp9JJEDVFiRehJ2N7dZedB1y4cHh0MvMJN5DQu7
STEP 19:
Use below link to download flume.conf file
https://drive.google.com/file/d/0B-Cl0IfLnRozdlRuN3pPWEJ1RHc/view?usp=sharing
Save the file.
STEP 20:
Put the flume.conf in the conf directory of apache-flume-1.4.0-bin
Command: sudo cp /home/centos/Downloads/flume.conf /usr/lib/apache-flume-1.4.0-bin/conf/
STEP 21:
Edit flume.conf
Command: sudo gedit conf/flume.conf
Replace the highlighted credentials in flume.conf with the credentials (Consumer Key, Consumer Secret, Access Token, Access Token Secret) you received after creating the application. Copy them very carefully; everything else remains the same. Then save the file and close it.
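The credential lines in the downloaded flume.conf should look something like this (property names follow the Cloudera TwitterSource convention; the keywords and HDFS path in your copy may differ):
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>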
STEP 22:
Change permissions for flume directory.
Command: sudo chmod -R 755 /usr/lib/apache-flume-1.4.0-bin/
STEP 23:
Start fetching the data from twitter:
Command: ./bin/flume-ng agent -n TwitterAgent -c conf -f /usr/lib/apache-flume-1.4.0-bin/conf/flume.conf
Now wait for 20-30 seconds and let Flume stream the data onto HDFS. After that, press Ctrl + C to break the command and stop the streaming. (Since you are stopping the process, you may get a few exceptions; you can ignore them.)
STEP 24:
Open the Mozilla browser in your VM, and go to /user/flume/tweets in HDFS
Click on FlumeData file which got created:
If you can see data similar to that shown in the snapshot below, then the unstructured data has been streamed from Twitter onto HDFS successfully. Now you can run analytics on this Twitter data using Hive.
Analytics Tutorial: Learn Linear Regression in R
The R-Factor
There is often a gap between what we are taught in college and the knowledge that we need to possess to be successful in our professional lives. This is exactly what happened to me when I joined a consultancy firm as a business analyst. At that time I was a fresher coming straight from the cool college atmosphere, newly exposed to the Corporate Heat.
One day my boss called me to his office and told me that one of their clients, a big insurance company, was facing significant losses on auto insurance. They had hired us to identify and quantify the factors responsible for it. My boss emailed me the data that the company had provided and asked me to do a multivariate linear regression analysis on it. He told me to use R and make a presentation of the summary.
Now as a statistics student I was quite aware of the principles of a multivariate linear regression, but I had never used R. For those of you who are not aware, R is a statistical programming language. It is a very powerful tool and widely used across the world in analyzing data. Of course, I did not know this at that time.
Anyway, it took me a lot of surfing on the internet and reading books to learn how to fit my model in R, and now I want to help you save that time!
R is an open source tool easily available on the internet. I'll assume you have it installed on your computer; if not, you can easily download and install it from www.r-project.org/
I have already converted the raw data file from the client into a clean .csv (comma-separated) file. Click here to download the file.
I've saved it on the D drive of my computer in a folder called Linear_Reg_Sample. You can save it anywhere, but remember to change the path wherever a file path is mentioned.
Open the R software that you've installed. It's time to get started!
Let's Start Regression in R
The first thing to do is obviously read all our data into R. This can be easily done using the command:
>LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")
Here we read all the data into an object LinRegData, using the function read.csv().
NOTE: If you observe closely, you'll see that we have used \\ instead of \. This is because the backslash is an escape character in R strings, so it has to be doubled. Whenever you enter a path, make sure to use \\.
Let's see if our data has been read by R. Use the following command to get a summary of the data:
>summary(LinRegData)
This will give output
Image 1: Summary of input data
In the output you can see the distribution of data. The min, max, median, mean are shown for all the variables.
Performing the Regression Analysis
Now that the data has been loaded, we need to fit a regression model over it. We will use the following command in R to fit the model:
>FitLinReg <- lm(Capped_Losses ~ Number_Vehicles + Average_Age + Gender_Dummy + Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, LinRegData)
In this command, we create an object FitLinReg and store the results of our regression model in it. The lm() function is used to fit the model. Inside the model, Capped_Losses is our dependent variable which we are trying to explain using the other variables that are separated by a + sign. The last parameter of the formula is the source of the data.
If no error is displayed, it means our regression is done and the results are stored in FitLinReg. We can see the results using two commands:
1. >FitLinReg
This gives the output:
2. >summary(FitLinReg)
This gives the output:
The summary command gives us the intercepts of each variable, its standard error, t value and significance.
The output also tells us the significance level of each variable. For example, '***' means the variable is highly significant (p < 0.001), '**' means significant at the 1% level, '*' at the 5% level, and a blank next to the variable indicates that it is not significant.
We can easily see that the Number_Vehicles variable is not significant and does not affect the model. We can remove this variable and refit, as sketched below.
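To drop it and refit, something like the following should work (a sketch using update(); it assumes the FitLinReg object created above):
>FitLinReg2 <- update(FitLinReg, . ~ . - Number_Vehicles)
>summary(FitLinReg2)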
If you go through what we've done till now, you will realize that it took us just two commands to fit a multivariate model in R. See how simple life has become!
Happy Ending!
In this way I learnt how to fit a regression model using R. I made a summary of my findings and made a presentation to the clients.
My boss was rather happy with me and I received a hefty bonus that year.
Hadoop Administration Interview Questions and Answers
It is essential to prepare yourself in order to pass an interview and land your dream job. Here’s the first step to achieving this. The following are some frequently asked Hadoop Administration interview questions and answers that might be useful.
Explain checkpointing in Hadoop. Why is it important?
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It's crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health.
The NameNode persists filesystem metadata. At a high level, the NameNode's primary responsibility is to store the HDFS namespace: things like the directory tree, file permissions and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in
two different parts: the fsimage and the edit log. The fsimage is a file
that represents a point-in-time snapshot of the filesystem’s metadata.
However, while the fsimage file format is very efficient to read, it’s
unsuitable for making small incremental updates like renaming a single
file. Thus, rather than writing a new fsimage every time the namespace
is modified, the NameNode instead records the modifying operation in the
edit log for durability. This way, if the NameNode crashes, it can
restore its state by first loading the fsimage then replaying all the
operations (also called edits or transactions) in the edit log to catch
up to the most recent state of the namesystem. The edit log comprises a
series of files, called edit log segments, that together represent all
the namesystem modifications made since the creation of the fsimage.
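How often a checkpoint is taken is configurable. As a reference point (an illustrative sketch; property names differ between versions, e.g. fs.checkpoint.period in core-site.xml on Hadoop 1.x), the Hadoop 2.x hdfs-site.xml properties look like this:
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- checkpoint at least every hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- or after one million uncheckpointed transactions -->
</property>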
What is the default block size in HDFS, and what are the benefits of having such a large block size?
Most block-structured file systems use a
block size on the order of 4 or 8 KB. By contrast, the default block
size in HDFS is 64MB – and larger. This allows HDFS to decrease the
amount of metadata storage required per file. Furthermore, it allows
fast streaming reads of data, by keeping large amounts of data
sequentially organized on the disk. As a result, HDFS is expected to
have very large files that are read sequentially. Unlike a file system
such as NTFS or EXT which has numerous small files, HDFS stores a modest
number of very large files: hundreds of megabytes, or gigabytes each.
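The block size is configurable per cluster in hdfs-site.xml (an illustrative snippet; the property is dfs.block.size in Hadoop 1.x and dfs.blocksize in 2.x, with the value given in bytes):
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB -->
</property>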
What are the two main modules that help you interact with HDFS, and what are they used for?
HDFS is manipulated through the bin/hadoop script, invoked as:
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are: dfs and dfsadmin.
The dfs module, also known as ‘FsShell’,
provides basic file manipulation operations and works with objects
within the file system. The dfsadmin module manipulates or queries the
file system as a whole.
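For example (paths are illustrative):
bin/hadoop dfs -ls /user                    # FsShell: list a directory in HDFS
bin/hadoop dfs -put local.txt /user/data/   # copy a local file into HDFS
bin/hadoop dfsadmin -report                 # cluster-wide view: capacity, live and dead datanodes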
How can I setup Hadoop nodes (data nodes/namenodes) to use multiple volumes/disks?
Datanodes can store blocks in multiple directories, typically located on different local disk drives. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir (dfs.datanode.data.dir in Hadoop 2.x). Datanodes will attempt to place an equal amount of data in each of the directories.
The Namenode also supports multiple directories for storing the namespace image and edit log. To set up multiple directories, specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir (dfs.namenode.name.dir in Hadoop 2.x). The namenode replicates the namespace data across these directories, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
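A sketch of what this looks like in hdfs-site.xml (the mount points are examples):
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/remote/backup/hdfs/name</value>
</property>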
How do you read a file from HDFS?
The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster config file on the local machine, which tells it where the NameNode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: Client validation is checked by username or by strong authentication mechanism like Kerberos.
Step 5: The client’s validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, then the NameNode responds with the first block ID and provides a list of the datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now
contacts the most appropriate datanode directly and reads the block
data. This process repeats until all blocks in the file have been read
or the client closes the file stream.
If a datanode dies while the file is being read, the client library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block-location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or else the read will fail.
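This read path is what the FileSystem API walks through under the covers. A minimal client sketch (the file path is an example, and the cluster configuration is assumed to be on the classpath):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // contacts the NameNode configured in fs.default.name
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/flume/tweets/FlumeData.sample"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);           // block data is streamed from the closest datanode
            }
        }
    }
}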
What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?
Schedulers are responsible for assigning
tasks to open slots on tasktrackers. The scheduler is a plug-in within
the jobtracker. The three types of schedulers are:
- FIFO (First in First Out) Scheduler
- Fair Scheduler
- Capacity Scheduler
How do you decide which scheduler to use?
The Capacity Scheduler (CS) can be used in the following situations:
- When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
- When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
- When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
- When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
- When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
- When you have a lot of variability in the utilization between pools, the Fair Scheduler's pre-emption model achieves much greater overall cluster utilization by giving away otherwise reserved resources when they're not used.
- When you require jobs within a pool to make equal progress rather than running in FIFO order.
Why are the 'dfs.name.dir' and 'dfs.data.dir' parameters used? Where are they specified, and what happens if you don't specify them?
dfs.name.dir specifies the path of the directory in the Namenode's local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in a Datanode's local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the namenode's metadata and the Datanodes' file blocks are stored in /tmp under a hadoop-<username> directory. This is not a safe place: when nodes are restarted, the data will be lost, and it is critical if the Namenode is restarted, as the formatting information will be lost.
What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?
The filesystem checking utility FSCK is used to check and display the health of the file system and of the files and blocks in it. When used with a path (bin/hadoop fsck / -files -blocks -locations -racks) it recursively shows the health of all files under that path, and when used with '/' it checks the entire file system. By default FSCK ignores files still open for writing by a client. To list such files, run FSCK with the -openforwrite option.
FSCK checks the file system, prints out a
dot for each file found healthy, prints a message of the ones that are
less than healthy, including the ones which have over replicated blocks,
under-replicated blocks, mis-replicated blocks, corrupt blocks and
missing replicas.
What are the important configuration files that need to be updated/edited to setup a fully distributed mode of Hadoop cluster 1.x ( Apache distribution)?
The Configuration files that need to be updated to setup a fully distributed mode of Hadoop are:
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- masters
- slaves
These files can be found in the conf directory of your Hadoop installation. If Hadoop daemons are started individually using 'bin/hadoop-daemon.sh start <daemon>', where <daemon> is the name of the daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If Hadoop daemons are started using 'bin/start-dfs.sh' and 'bin/start-mapred.sh', then the masters and slaves configuration files on the namenode machine need to be updated.
Masters – IP address/hostname of the node where the secondarynamenode will run.
Slaves – IP address/hostname of the nodes where the datanodes (and eventually tasktrackers) will run.
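To make the two start-up styles concrete (hostnames are placeholders; commands as in Hadoop 1.x):
# start daemons one by one on the appropriate nodes (masters/slaves may stay empty)
bin/hadoop-daemon.sh start namenode       # on the master
bin/hadoop-daemon.sh start jobtracker     # on the master
bin/hadoop-daemon.sh start datanode       # on each slave
bin/hadoop-daemon.sh start tasktracker    # on each slave
# or start everything from the namenode using the masters/slaves files
bin/start-dfs.sh
bin/start-mapred.sh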
Installing Cassandra in Ubuntu
cd /tmp
wget http://www.us.apache.org/dist/cassandra/2.1.6/apache-cassandra-2.1.6-bin.tar.gz
tar -xvzf apache-cassandra-2.1.6-bin.tar.gz
mv apache-cassandra-2.1.6 ~/cassandra
sudo mkdir /var/lib/cassandra
sudo mkdir /var/log/cassandra
sudo chown -R $USER:$GROUP /var/lib/cassandra
sudo chown -R $USER:$GROUP /var/log/cassandra
sudo gedit .bashrc
export CASSANDRA_HOME=~/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin
sudo sh ~/cassandra/bin/cassandra
sudo sh ~/cassandra/bin/cassandra-cli
Installing Python in Ubuntu
$ sudo apt-get install python2.7
$ sudo apt-get install python2.7-dev
$ sudo apt-get install python-pip
$ pip
$ sudo pip install numpy
$ sudo pip install ipython
$ sudo pip install pandas
Type ipython or python in the terminal to start an interactive shell and verify the installation.
Hive Interview Question
Wipro :-
1. Write the syntax for creating a table in Hive and explain each part. (A sample answer follows the lists below.)
2. What does LOCATION stand for in that syntax?
3. What does the STORED AS clause do? How many file formats are there, and what are their differences?
4. What is a SerDe? Why do you use it? What are the different SerDe formats?
5. How do you process an unbounded XML file with a schema defined in Hive?
6. What are UDF and UDTF? What is the difference between them?
7. What are RC and ORC files, and what are they used for?
Common :-
1. How do you load bulk data into a Hive partition?
2. What are the drawbacks of Hive?
3. Which Hive and Hadoop versions have you worked on?
4. How do you do updates and deletes in Hive?
5. How do you do incremental updates in Hive?
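As a sample answer to the first Wipro question (the table name, columns, SerDe and location are made up for illustration), the statement below creates an external table, names its SerDe explicitly, picks a file format with STORED AS, and pins the data directory with LOCATION:
CREATE EXTERNAL TABLE IF NOT EXISTS tweets (
  id BIGINT,
  user_name STRING,
  tweet_text STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS TEXTFILE
LOCATION '/user/flume/tweets';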
Hadoop Cluster Configuration Files
In an earlier Edureka blog post, they discussed how to create a Hadoop cluster on AWS in 30 minutes. In continuation of that, this post talks about the important Hadoop cluster configuration files:
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- masters
- slaves
All these files are available under the 'conf' directory of the Hadoop installation directory. Let's look at the files and their usage one by one!
hadoop-env.sh
This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for a Hadoop daemon is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java installation on the system.
This file is also used to set other parts of the Hadoop daemon execution environment, such as the heap size (HADOOP_HEAPSIZE), the Hadoop home (HADOOP_HOME) and the log file location (HADOOP_LOG_DIR).
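For example, the relevant entries look something like this (paths and sizes are illustrative):
# hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HEAPSIZE=1000                 # heap for each daemon, in MB
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs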
Note: For the simplicity of understanding the cluster setup, we have configured only necessary parameters to start a cluster.
The following three files are the important configuration files for the runtime environment settings of a Hadoop cluster.
core-site.xml
This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce. The hostname and port are the machine and port on which the NameNode daemon runs and listens; this also tells the NameNode which IP and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.
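A minimal sketch of core-site.xml, assuming a NameNode host of namenode.example.com (in Hadoop 2.x the property is called fs.defaultFS):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>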
hdfs-site.xml
This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes. You can also configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.
The value "true" for the property 'dfs.permissions' enables permission checking in HDFS, and the value "false" turns it off. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
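An illustrative hdfs-site.xml fragment setting both of the properties just mentioned:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
</configuration>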
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.
You can replicate all four of the files explained above to all the DataNodes and the Secondary NameNode. These files can then be adjusted for any node-specific configuration, e.g. in case of a different JAVA_HOME on one of the DataNodes.
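An illustrative mapred-site.xml fragment (the JobTracker host is a placeholder):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:8021</value>
  </property>
</configuration>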
The following two files, 'masters' and 'slaves', determine the master and slave nodes in a Hadoop cluster.
Masters
This file informs the Hadoop daemons about the Secondary NameNode location. The 'masters' file on the master server contains the hostname of the Secondary NameNode server. The 'masters' file on the slave nodes is blank.
Slaves
The 'slaves' file at the master node contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers. The 'slaves' file on a slave server contains the IP address of that slave node. Notice that the 'slaves' file at a slave node contains only its own IP address and not those of any other DataNodes in the cluster.
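For example, on the master node the two files might look like this (hostnames are placeholders):
# masters: host that runs the Secondary NameNode
snn.example.com
# slaves: one DataNode/TaskTracker host per line
slave1.example.com
slave2.example.com
slave3.example.com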
Banking Case Study
Workflow :-
1. Group the data of loan by loan id
2. Group the client data by client id
3. Generate Age from client data and store
4. Get the transaction data for last year
5. Sum up the data based on transaction type and amount
6. Group the card data by disposition id
7. Group the district data by district id
8. Filter out the unemployment data for the years 95 & 96 from the district data
9. Generate the difference between the unemployment data for every district for those two years
10. Group the disposition data
11. Joining :-
join loan,transaction,Account,Disposition,on ac_id as ac_id_join
join ac_id_join,district_info,client on district_id as include_district
join include_district,card on disposition_id as join_done
select loan_amount,loan_duration,loan_status,type,transaction_amount,date,owner_type,district_name,region,avg_salary,unemployment_rate_95,unemployment_rate_96,no_of_enterpreneur/1000,card type,birthday
12. Algorithm used to predict excellent, good and risky customers:
12.1. within 1 year {
if transaction_amount > 10 lac and avg_sal > 10k and loan_status == 'A' and (age > 25 and age <= 65)
write to a file called "good": more loans can be granted, card can be upgraded
12.2. if transaction_amount > 10 lac and avg_sal > 6k and (loan_status == 'A' or loan_status == 'C') and (age > 25 and age <= 55) and unemployment_rate < 0.80
write to a file called "ok": more loans can be granted after completion of the current loan, card can be upgraded after completion of the loan
12.3. if avg_sal > 6k and (loan_status == 'B' or loan_status == 'D') and age > 35 and no_of_entrepreneur > 100
write to a file called "risky": no more loans, card must be downgraded
}
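A sketch of how the first steps might look in Pig Latin (the file names, schemas and field names are assumptions, not the actual case-study files):
-- steps 1-2: load and group the loan and client data
loans   = LOAD '/user/bank/loan.csv'   USING PigStorage(',') AS (loan_id:int, ac_id:int, amount:double, duration:int, status:chararray);
clients = LOAD '/user/bank/client.csv' USING PigStorage(',') AS (client_id:int, birth_number:chararray, district_id:int);
loans_by_id   = GROUP loans BY loan_id;
clients_by_id = GROUP clients BY client_id;
-- step 11: one of the joins, on account id
accounts  = LOAD '/user/bank/account.csv' USING PigStorage(',') AS (ac_id:int, district_id:int, frequency:chararray, acct_date:chararray);
loan_acct = JOIN loans BY ac_id, accounts BY ac_id;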
Banking Domain Case Study in Hadoop and R
In this blog and the next
few ones that will follow, we will analyze a banking domain dataset,
which contains several files with details of its customers. This
database was prepared by Petr Berka and Marta Sochorova.
The Berka dataset is a collection of
financial information from a Czech bank. The dataset deals with over
5,300 bank clients with approximately 1,000,000 transactions.
Additionally, the bank represented in the dataset has extended close to
700 loans and issued nearly 900 credit cards, all of which are
represented in the data.
By the time you finish reading this blog, you will have learned:
- How to analyze a bank's data to predict a customer's quality
- Using this analysis we can categorize a customer into three categories:
- Excellent: Customers whose record is good with the bank
- Good: Customers who have average earning with a good record till now
- Risky: Customers who are in debt to the bank or who have not paid their loans on time
- How to write PIG UDF
- How to connect Hadoop with R
- How to load data from Hadoop to R
How to analyze a bank’s data to predict the customer’s quality
Prerequisites
- Software: Java, Hadoop, Pig, R-base, RStudio, Ubuntu OS (all installed)
- Concepts: Java, Hadoop, Pig
View the detailed case study here.
Creating Your First Map Reduce Programme
Opening the New Java Project wizard
The New Java Project wizard can be used to create a new Java project. There are many ways to open this wizard:
- By clicking on the File menu and choosing New > Java Project
- By right clicking anywhere in the Project Explorer and selecting New > Java Project
- By clicking on the New button in the toolbar and selecting Java Project
Using the New Java Project wizard
The New Java Project Wizard has two pages. On the first page:
- Enter the Project Name
- Select the Java Runtime Environment (JRE) or leave it at the default
- Select the Project Layout, which determines whether there will be a separate folder for the source code and class files. The recommended option is to create separate folders for sources and class files.
You can click on the Finish button to create the project or click on the Next button to change the java build settings.
On the second page you can change the Java Build Settings like setting the Project dependency (if there are multiple projects) and adding additional jar files to the build path.
Writing the Mapper Class
To start with some basic MapReduce code, we will write a Word Count program, which simply counts the number of occurrences of each word in a file and writes the counts as output.
Here is the mapper class, WordCountMapper:
package com.hadoop.training;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.util.StringTokenizer;
public class WordCountMapper extends Mapper<LongWritable,Text,Text,IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map (LongWritable key,Text value, Context context) throws IOException,InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()){
word.set(itr.nextToken());
context.write(word,one);
}
}
}
Writing the Reducer Class
Here is the reducer class, WordCountReducer:
package com.hadoop.training;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
private IntWritable result = new IntWritable();
public void reduce(Text key,Iterable<IntWritable> value, Context context) throws IOException,InterruptedException {
int sum = 0;
for (IntWritable val : value) {
sum +=val.get();
}
result.set(sum);
context.write(key,result);
}
}
Writing the MapReduce driver class
Writing the MapReduce driver class as WordCount
package com.hadoop.training;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static void main (String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: WordCount <input path> <output path>");
System.exit(-1);
}
@SuppressWarnings("deprecation")
Job job = new Job();
job.setJarByClass(WordCount.class);
job.setJobName("Word Count");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Running The Map Reduce programme
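Before running, the classes have to be compiled and packaged into WC.jar. A sketch (it assumes the hadoop command is on your PATH and supports the classpath subcommand; otherwise point -classpath at hadoop-core and its lib jars):
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes com/hadoop/training/*.java
jar -cvf WC.jar -C classes .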
$ hadoop jar WC.jar com.hadoop.training.WordCount hdfs://localhost:8020/user/rajeev/input hdfs://localhost:8020/user/rajeev/output
Eclipse Installation in Ubuntu
- Open a terminal (Ctrl-Alt-T) and switch it to root permissions by entering:
$ sudo su
- Make sure Eclipse Indigo is NOT installed in your Ubuntu. You may need to remove both "eclipse" and "eclipse-platform" packages to get rid of it. If it still gets in the way when trying to install Luna using this easy way, you may need to look at the "hard way" below.
# apt-get remove eclipse eclipse-platform
- Install a Java 1.7 JDK:
# apt-get install openjdk-7-jdk
- Install Maven:
# apt-get install maven
- Get rid of the root access as you won't need it anymore:
# exit
- Download Eclipse. The "for Java EE Developers", "for Java Developers" and "for RCP and RAP Developers" versions all seem to work. The file currently tested to work (note that it is for the 64-bit Ubuntu version) is available at this page.
- Extract the Eclipse installation tarball into your home directory:
$ cd
$ tar -xzvf <path/to/your-tar-file>
- Increase the memory for the Eclipse installation by modifying the ~/eclipse/eclipse.ini file.
- Change the -Xmx setting (line 20) to be at least 1G; 2GB is recommended (i.e. -Xmx2048m).
- Change the -XX:MaxPermSize setting (line 18) to at most 512m. If you have the -Xmx setting set to 1G, then I suggest using a lower value, for example 300m.
- Run the Eclipse:
$ ~/eclipse/eclipse
- If everything seems to work, then configure Eclipse to have a launcher icon on the desktop:
gksudo gedit /usr/share/applications/eclipse.desktop
The above command creates and opens the launcher file for Eclipse in the gedit text editor.
Paste the content below into the opened file and save it.
[Desktop Entry]
Name=Eclipse 4
Type=Application
Exec=/home/rajeev/eclipse/eclipse
Terminal=false
Icon=/home/rajeev/eclipse/icon.xpm
Comment=Integrated Development Environment
NoDisplay=false
Categories=Development;IDE;
Name[en]=Eclipse
Splunk Installation in Ubuntu
sudo dpkg -i Downloads/splunk-6.2.3-264376-linux-2.6-amd64.deb
sudo /opt/splunk/bin/splunk start
http://localhost:8000
Splunk Impala
Splunk Hadoop Connect
[more info]
Installing R in Ubuntu Trusty
Step 1 :- Add the latest trusty link from cran to apt. [click here for reference]
sudo gedit /etc/apt/sources.list
deb http://cran.r-project.org/bin/linux/ubuntu/ trusty/
Step 2 :- Add secure key to check the new added link [click here for more info]
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
Step 3 :- Check the apt by the following command
sudo apt-get update
Step 4 :- Now run the below command to install R
sudo apt-get install r-base
sudo apt-get install r-base-dev
Step 5 :- Now type R in the shell to get into R command prompt
R
Installing R studio [click here and get started for more info]
apt-get install libjpeg62
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.1103-amd64.deb
$ sudo gdebi rstudio-server-0.98.1103-amd64.deb
http://localhost:8787
Installing Impala ODBC Driver in Ubuntu 64 bit
As of now, Cloudera does not provide a Debian package for the Impala ODBC driver, so I downloaded the RPM file for SUSE 11 64-bit and converted it to a Debian package using the commands below.
sudo apt-get install alien dpkg-dev debhelper build-essential
sudo alien ClouderaImpalaODBC-2.5.26.1027-1.x86_64.rpm
Now we will install the driver using the command:-
sudo dpkg -i clouderaimpalaodbc_2.5.26.1027-2_amd64.deb
Configuring ODBC Driver:-
Step 1 :- Edit .bashrc file and make the following entry
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/odbc
export ODBCINI=/etc/odbc.ini
export ODBCSYSINI=/etc
export CLOUDERAIMPALAINI=/opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
[Check the proper location of the odbc.ini file using the command odbcinst -j.]
You can use three environment variables (ODBCINI, ODBCSYSINI and CLOUDERAIMPALAINI) to specify different locations for the odbc.ini, odbcinst.ini and cloudera.impalaodbc.ini configuration files:
- Set ODBCINI to point to your odbc.ini file.
- Set ODBCSYSINI to point to the directory containing the odbcinst.ini file.
- Set CLOUDERAIMPALAINI to point to your cloudera.impalaodbc.ini file.
For example, if your odbc.ini and odbcinst.ini files are located in /etc and your cloudera.impalaodbc.ini file is located in /opt/cloudera/impalaodbc/lib/64, then set the environment variables as in the .bashrc entries shown above.
Step 2 :- ODBC driver managers use configuration files to define and configure ODBC data sources and
drivers. By default, the following configuration files residing in the user’s home directory are used:
.odbc.ini is used to define ODBC data sources, and it is required.
.odbcinst.ini is used to define ODBC drivers, and it is optional.
Also, by default the Cloudera ODBC Driver for Impala is configured using the
cloudera.impalaodbc.ini file, which is located in
/opt/cloudera/impalaodbc/lib/64 for the 64-bit driver on Linux/AIX
Step 3 :- Configuring the odbc.ini File
ODBC Data Source Names (DSNs) are defined in the odbc.ini configuration file. The file is divided
into several sections:
[ODBC] is optional and used to control global ODBC configuration, such as ODBC tracing.
[ODBC Data Sources] is required, listing DSNs and associating DSNs with a driver.
A section having the same name as the data source specified in the [ODBC Data Sources] section
is required to configure the data source.
The following is an example of an odbc.ini configuration file for Linux/AIX:
[ODBC Data Sources]
Sample_Cloudera_Impala_DSN_64=Cloudera Impala ODBC Driver 64-bit
[Sample_Cloudera_Impala_DSN_64]
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
HOST=localhost
PORT=21050
To create a Data Source Name:
1. Open the .odbc.ini configuration file in a text editor.
2. In the [ODBC Data Sources] section, add a new entry by typing the Data Source Name (DSN),
then an equal sign (=), and then the driver name.
3. In the .odbc.ini file, add a new section with a name that matches the DSN you specified in
step 2, and then add configuration options to the section. Specify configuration options as
key-value pairs.
4. Save the .odbc.ini configuration file.
Step 4 :- Configuring the odbcinst.ini File
ODBC drivers are defined in the odbcinst.ini configuration file. The configuration file is optional
because drivers can be specified directly in the odbc.ini configuration file.
The odbcinst.ini file is divided into the following sections:
[ODBC Drivers] lists the names of all the installed ODBC drivers.
A section having the same name as the driver name specified in the [ODBC Drivers] section
lists driver attributes and values.
The following is an example of an odbcinst.ini configuration file for Linux/AIX:
[ODBC Drivers]
Cloudera Impala ODBC Driver 64-bit=Installed
[Cloudera Impala ODBC Driver 64-bit]
Description=Cloudera Impala ODBC Driver (64-bit)
Driver=/opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
To define a driver:
1. Open the .odbcinst.ini configuration file in a text editor.
2. In the [ODBC Drivers] section, add a new entry by typing the driver name and then typing
=Installed
3. In the .odbcinst.ini file, add a new section with a name that matches the driver name you
typed in step 2, and then add configuration options to the section based on the sample
odbcinst.ini file provided in the Setup directory. Specify configuration options as key-value
pairs.
4. Save the .odbcinst.ini configuration file.
Step 5 :- Configuring the cloudera.impalaodbc.ini File
The cloudera.impalaodbc.ini file contains configuration settings for the Cloudera ODBC Driver for
Impala. Settings that you define in the cloudera.impalaodbc.ini file apply to all connections that use the driver.
To configure the Cloudera ODBC Driver for Impala to work with your ODBC driver manager:
1. Open the cloudera.impalaodbc.ini configuration file in a text editor.
2. Edit the DriverManagerEncoding setting. The value is usually UTF-16 or UTF-32 if you are using Linux/Mac OS X, depending on the ODBC driver manager you use. iODBC uses UTF-32, and unixODBC uses UTF-16.
OR
If you are using AIX and the unixODBC driver manager, then set the value to UTF-16. If you
are using AIX and the iODBC driver manager, then set the value to UTF-16 for the 32-bit
driver or UTF-32 for the 64-bit driver.
3. Edit the ODBCInstLib setting. The value is the name of the ODBCInst shared library for the
ODBC driver manager you use. To determine the correct library to specify, refer to your
ODBC driver manager documentation.
The configuration file defaults to the shared library for iODBC. In Linux/AIX, the shared
library name for iODBC is libiodbcinst.so.
4. Optionally, configure logging by editing the LogLevel and LogPath settings. For more
information, see "Configuring Logging Options" on page 28.
5. Save the cloudera.impalaodbc.ini configuration file.
Step 6 :- Check the entry and configuration of ODBC by typing
odbcinst -q -s
isql -v Sample_Cloudera_Impala_DSN_64
Troubleshooting :-
I ran into one error:
[S1000][unixODBC][Cloudera][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function.
which means that the driver is not linked against libodbcinst.so.
Check this first with the command:
ldd /opt/cloudera/impalaodbc/lib/64/libclouderaimpalaodbc64.so
then search for libodbcinst.so
find / -name "libodbcinst.so*"
If not found then install it
sudo apt-get update && sudo apt-get install unixodbc-dev libmyodbc
or
sudo apt-get install unixODBC unixODBC-dev
Then search again for libodbcinst.so and make an entry in .bashrc as:
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libodbcinst.so
For more info click here and here and here and here.