Flume Installation and Streaming Twitter Data Using Flume


Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.
Flume lets Hadoop users make the most of valuable log data. Specifically, Flume allows users to:
  • Stream data from multiple sources into Hadoop for analysis
  • Collect high-volume Web logs in real time
  • Insulate themselves from transient spikes when the rate of incoming data exceeds the rate at which data can be written to the destination
  • Guarantee data delivery
  • Scale horizontally to handle additional data volume
Flume’s high-level architecture is focused on delivering a streamlined codebase that is easy to use and easy to extend. The project team has designed Flume with the following components:
  • Event – a singular unit of data that is transported by Flume (typically a single log entry)
  • Source – the entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
  • Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
  • Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
  • Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
  • Client – the entity that produces and transmits the Event to a Source operating within the Agent


A flow in Flume starts from the Client (Web Server). The Client transmits the event to a Source operating within the Agent. The Source receiving this event then delivers it to one or more Channels. These Channels are drained by one or more Sinks operating within the same Agent. Channels allow decoupling of ingestion rate from drain rate using the familiar producer-consumer model of data exchange. When spikes in client side activity cause data to be generated faster than what the provisioned capacity on the destination can handle, the channel size increases. This allows sources to continue normal operation for the duration of the spike. Flume agents can be chained together by connecting the sink of one agent to the source of another agent. This enables the creation of complex dataflow topologies.
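To make the Source, Channel, and Sink wiring concrete, here is a minimal sketch of an agent configuration. This generic example is not part of the Twitter setup that follows; the agent name (a1), the netcat source, and the logger sink are illustrative assumptions only:

# A single agent named a1 with one source, one channel, and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: turns each line of text received on localhost:44444 into an Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel: buffers events between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink: writes events to the Flume log (an HDFS sink would be used instead for Hadoop)
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1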
Now we will install Apache Flume on our virtual machine.

STEP 1:

Download Flume:
Command: wget http://archive.apache.org/dist/flume/1.4.0/apache-flume-1.4.0-bin.tar.gz



Command: ls



STEP 2:

Extract the files from the Flume tar file.
Command: tar -xvf apache-flume-1.4.0-bin.tar.gz
Command: ls



STEP 3:

Move the apache-flume-1.4.0-bin directory into the /usr/lib/ directory.

Command: sudo mv apache-flume-1.4.0-bin /usr/lib/



STEP 4:

We need to remove protobuf-java-2.4.1.jar and guava-10.0.1.jar from the lib directory of apache-flume-1.4.0-bin (when using hadoop-2.x).

Command: sudo rm /usr/lib/apache-flume-1.4.0-bin/lib/protobuf-java-2.4.1.jar /usr/lib/apache-flume-1.4.0-bin/lib/guava-10.0.1.jar



STEP 5:

Use the link below to download flume-sources-1.0-SNAPSHOT.jar:
https://drive.google.com/file/d/0B-Cl0IfLnRozUHcyNDBJWnNxdHc/view?usp=sharing



Save the file.


STEP 6:

Move the flume-sources-1.0-SNAPSHOT.jar file from the Downloads directory to the lib directory of Apache Flume:

Command: sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /usr/lib/apache-flume-1.4.0-bin/lib/



STEP 7:

Check whether the flume-sources SNAPSHOT jar has been moved to the lib folder of Apache Flume:

Command: ls /usr/lib/apache-flume-1.4.0-bin/lib/flume*


STEP 8:

Copy the contents of flume-env.sh.template to flume-env.sh:

Command: cd /usr/lib/apache-flume-1.4.0-bin/

Command: sudo cp conf/flume-env.sh.template conf/flume-env.sh



STEP 9:

Edit flume-env.sh as shown in the snapshot below.

Command: sudo gedit conf/flume-env.sh



Set JAVA_HOME and FLUME_CLASSPATH as shown in the snapshot below.
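For reference, a minimal sketch of what conf/flume-env.sh might contain after editing. The JDK path below is an assumption; point JAVA_HOME at the Java installation on your own machine, and point FLUME_CLASSPATH at the flume-sources jar copied in Step 6:

# Example flume-env.sh entries (paths are illustrative; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
FLUME_CLASSPATH="/usr/lib/apache-flume-1.4.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"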


Now we have installed Flume on our machine. Let's run Flume to stream Twitter data onto HDFS.
We need to create an application on Twitter and use its credentials to fetch data.

STEP 10:

Open a browser and go to the URL below:

URL:https://twitter.com/



STEP 11:

Enter your Twitter account credentials and sign in:



STEP 12:

Your Twitter home page will open:



STEP 13:

Change the URL to https://apps.twitter.com


STEP 14:

Click on Create New App to create a new application and enter all the details in the application:


STEP 15:

Check Yes, I agree and click on Create your Twitter application:


STEP 16:

Your Application will be created:


STEP 17:

Click on Keys and Access Tokens; you will see the Consumer Key and Consumer Secret.


STEP 18:

Scroll down and click on Create my access token:


Your access token has been created:

Consumer Key (API Key): 4AtbrP50QnfyXE2NlYwROBpTm
Consumer Secret (API Secret): jUpeHEZr5Df4q3dzhT2C0aR2N2vBidmV6SNlEELTBnWBMGAwp3
Access Token: 1434925639-p3Q2i3l2WLx5DvmdnFZWlYNvGdAOdf5BrErpGKk
Access Token Secret: AghOILIp9JJEDVFiRehJ2N7dZedB1y4cHh0MvMJN5DQu7

STEP 19:

Use the link below to download the flume.conf file:
https://drive.google.com/file/d/0B-Cl0IfLnRozdlRuN3pPWEJ1RHc/view?usp=sharing

Save the file.



STEP 20:

Put the flume.conf file in the conf directory of apache-flume-1.4.0-bin:
Command: sudo cp /home/centos/Downloads/flume.conf /usr/lib/apache-flume-1.4.0-bin/conf/


STEP 21:

Edit flume.conf

Command: sudo gedit conf/flume.conf

Carefully replace the highlighted credentials in flume.conf with the credentials (Consumer Key, Consumer Secret, Access Token, Access Token Secret) you received after creating the application. Everything else remains the same; save the file and close it.
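For reference, the Twitter-related portion of flume.conf looks roughly like the sketch below, assuming the Cloudera flume-sources jar downloaded in Step 5. The keywords, HDFS path, and channel sizing shown here are illustrative assumptions; only the four credential lines need your own values:

# Name the components of the agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom Twitter source provided by flume-sources-1.0-SNAPSHOT.jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your Consumer Key>
TwitterAgent.sources.Twitter.consumerSecret = <your Consumer Secret>
TwitterAgent.sources.Twitter.accessToken = <your Access Token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your Access Token Secret>
# Optional keyword filter (illustrative)
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

# HDFS sink: writes tweets to the path used later in this tutorial
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

# Memory channel between the source and the sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000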


STEP 22:

Change the permissions of the Flume directory.

Command: sudo chmod -R 755 /usr/lib/apache-flume-1.4.0-bin/


STEP 23:

Start fetching the data from Twitter:

Command: ./bin/flume-ng agent -n TwitterAgent -c conf -f /usr/lib/apache-flume-1.4.0-bin/conf/flume.conf

Now wait 20-30 seconds and let Flume stream the data onto HDFS. After that, press Ctrl + C to stop the command and end the streaming. (Since you are stopping the process, you may see a few exceptions; you can ignore them.)

STEP 24:

Open the Mozilla browser in your VM and go to /user/flume/tweets in HDFS.

Click on the FlumeData file that was created:
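If you prefer the terminal to the browser, you can also verify the streamed files from the command line. This is a sketch, assuming the hadoop command is on your PATH and HDFS is running:

Command: hadoop fs -ls /user/flume/tweets

Command: hadoop fs -cat /user/flume/tweets/FlumeData.* | head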


If you see data similar to what is shown in the snapshot below, the unstructured data has been streamed from Twitter onto HDFS successfully. Now you can do analytics on this Twitter data using Hive.



Analytics Tutorial: Learn Linear Regression in R

The R-Factor

There is often a gap between what we are taught in college and the knowledge we need to be successful in our professional lives. This is exactly what happened to me when I joined a consultancy firm as a business analyst. At that time I was a fresher coming straight from the cool college atmosphere, newly exposed to the Corporate Heat.
One day my boss called me to his office and told me that one of their clients, a big insurance company, was facing significant losses on auto insurance. They had hired us to identify and quantify the factors responsible for it. My boss emailed me the data that the company had provided and asked me to do a multivariate linear regression analysis on it. My boss told me to use R and make a presentation of the summary.
Now as a statistics student I was quite aware of the principles of a multivariate linear regression, but I had never used R. For those of you who are not aware, R is a statistical programming language. It is a very powerful tool and widely used across the world in analyzing data. Of course, I did not know this at that time.
Anyway, it took me a lot of surfing on the internet and reading books to learn how to fit my model in R, and now I want to help you save that time!
R is an open-source tool easily available on the internet. I'll assume you have it installed on your computer. If not, you can easily download and install it from www.r-project.org/
I have already converted the raw data file from the client into a clean .csv (comma-separated) file. Click here to download the file.
I've saved this on the D drive of my computer in a folder called Linear_Reg_Sample. You can save it anywhere, but remember to change the path wherever a file path is mentioned.
Open the R software that you've installed. It's time to get started!

Let's Start Regression in R

The first thing to do is obviously to read all our data into R. This can be easily done using the command:
>LinRegData <- read.csv(file = "D:\\Linear Reg using R\\Linear_Reg_Sample_Data.csv")
Here we read all the data into an object called LinRegData, using the read.csv() function.
NOTE: If you observe closely, you'll see that we have used \\ instead of \. This is because the backslash is an escape character in R, so it has to be doubled inside a string. Whenever you enter a Windows path this way, make sure to use \\.
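As a quick sketch (the forward-slash form of the same path and the head() preview are additions, not part of the original walkthrough), you can also write the path with forward slashes, which R accepts even on Windows, and preview the first rows to confirm the file loaded:
>LinRegData <- read.csv(file = "D:/Linear Reg using R/Linear_Reg_Sample_Data.csv")
>head(LinRegData)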
Let's see if our data has been read by R. Use the following command to get a summary of the data:
>summary(LinRegData)
This will give the output:
Image 1: Summary of input data
In the output you can see the distribution of data. The min, max, median, mean are shown for all the variables.

Performing the Regression Analysis

Now that the data has been loaded, we need to fit a regression model over it.
We will use the following command in R to fit the model:  >FitLinReg <- lm(Capped_Losses ~ Number_Vehicles + Average_Age + Gender_Dummy + Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, LinRegData)
In this command, we create an object FitLinReg and store the results of our regression model in it. The lm() function is used to fit the model. In the formula, Capped_Losses is the dependent variable, which we are trying to explain using the other variables separated by + signs. The last argument of lm() is the data source.
If no error is displayed, it means our regression is done and the results are stored in FitLinReg. We can see the results using two commands:
 
1. >FitLinReg
This gives the output:
 
 
 
2. >summary(FitLinReg)
This gives the output:
Image 2: Regression output in R

The summary command gives us the coefficient estimate for each variable, along with its standard error, t value, and significance.
The output also tells us the significance level of each variable. For example, a variable marked *** is highly significant (at the 0.001 level), ** indicates significance at the 0.01 level, * at the 0.05 level, and a blank next to the variable indicates that it is not significant.
We can easily see that the Number_Vehicles variable is not significant and does not affect the model. We can remove this variable from the model.
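As a sketch of that refinement (reusing the variable names from the model above; this refit is not shown in the original output), we can drop Number_Vehicles, fit the model again, and compare the summaries:
>FitLinReg2 <- lm(Capped_Losses ~ Average_Age + Gender_Dummy + Married_Dummy + Avg_Veh_Age + Fuel_Type_Dummy, LinRegData)
>summary(FitLinReg2)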
If you look back at what we've done so far, you will realize that it took us just two commands to fit a multivariate model in R. See how simple life has become!

Happy Ending!

In this way I learnt how to fit a regression model using R. I made a summary of my findings and made a presentation to the clients.
My boss was rather happy with me and I received a hefty bonus that year.