In the last few years, Apache Hadoop has emerged as the technology of choice for solving Big Data problems and for improved business analytics. One example of this is how Sears Holdings moved to Hadoop from its traditional Oracle Exadata, Teradata and SAS systems. Another recent big entrant to the Hadoop bandwagon is Walmart, with its own Hadoop implementation.
An earlier Edureka blog discussed how to create a Hadoop cluster on AWS in 30 minutes. In continuation of that, this blog talks about the important Hadoop cluster configuration files.
These files are: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters and slaves. All of them are available under the ‘conf’ directory of the Hadoop installation directory.
Let’s look at the files and their usage one by one!
hadoop-env.sh
This file specifies environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for a Hadoop daemon is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java installation on the system.
This file is also used for setting other parts of the Hadoop daemon execution environment, such as the heap size (HADOOP_HEAPSIZE), the Hadoop home directory (HADOOP_HOME), the log file location (HADOOP_LOG_DIR), etc.
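As an illustration, a minimal hadoop-env.sh might set just these variables (the paths and values shown here are placeholders, not recommendations):

    # Point the Hadoop daemons to the Java installation
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    # Maximum heap size for the daemons, in MB (the default is 1000)
    export HADOOP_HEAPSIZE=2000
    # Directory where the daemon log files are written
    export HADOOP_LOG_DIR=/var/log/hadoop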
Note: To keep the cluster setup easy to understand, we have configured only the parameters necessary to start a cluster.
The following three files are the important configuration files for the runtime environment settings of a Hadoop cluster.
core-site.xml
This file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
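The key property here is fs.default.name, which holds the NameNode’s URI. A minimal core-site.xml might look like this (the hostname below is a placeholder for your NameNode machine):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- URI of the NameNode, in the form hdfs://hostname:port -->
        <name>fs.default.name</name>
        <value>hdfs://namenode-host:8020</value>
      </property>
    </configuration>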
Here, hostname and port are the machine and port on which the NameNode daemon runs and listens. They also inform the NameNode as to which IP address and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.
hdfs-site.xml
This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes. You can also configure hdfs-site.xml to specify the default block replication and permission checking on HDFS. The actual number of replications can also be specified when a file is created; the default is used if replication is not specified at create time.
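For instance, the replication factor can be overridden at create time, or changed later, from the command line (the file paths here are illustrative):

    # Create a file with a replication factor of 2 instead of the default
    hadoop fs -D dfs.replication=2 -put localfile.txt /user/hadoop/localfile.txt
    # Change the replication factor of an existing file to 2
    hadoop fs -setrep -w 2 /user/hadoop/localfile.txt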
The value “true” for the property ‘dfs.permissions’ enables permission checking in HDFS, and the value “false” turns permission checking off. Switching from one value to the other does not change the mode, owner or group of files or directories.
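Putting these two properties together, a minimal hdfs-site.xml could look like the following (the values are only examples):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- Default number of block replicas for each file -->
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <!-- "true" enables permission checking; "false" turns it off -->
        <name>dfs.permissions</name>
        <value>true</value>
      </property>
    </configuration>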
mapred-site.xml
This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.
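A minimal mapred-site.xml would then contain just this one property (the hostname below is a placeholder for your JobTracker machine; 8021 is a commonly used port):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <!-- hostname:port pair on which the JobTracker listens for RPC -->
        <name>mapred.job.tracker</name>
        <value>jobtracker-host:8021</value>
      </property>
    </configuration>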
You can replicate all four files explained above to all the DataNodes and the Secondary NameNode. These files can then be tailored for any node-specific configuration, e.g. in the case of a different JAVA_HOME on one of the DataNodes.
The following two files, ‘masters’ and ‘slaves’, determine the master and slave nodes in the Hadoop cluster.
Masters
This file informs the Hadoop daemon about the location of the Secondary NameNode. The ‘masters’ file on the master server contains the hostname of the Secondary NameNode server. The ‘masters’ file on the slave nodes is blank.
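For example, if the Secondary NameNode were to run on a host named snn-host (a hypothetical name), the ‘masters’ file on the master server would contain just that one line:

    snn-host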
Slaves
The ‘slaves’ file at the master node contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers. The ‘slaves’ file on a slave server contains the IP address of that slave node. Notice that the ‘slaves’ file on a slave node contains only its own IP address and not that of any other DataNodes in the cluster.
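For example, with three slave machines (the hostnames here are hypothetical), the ‘slaves’ file on the master node would read:

    slave1
    slave2
    slave3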