It is essential to prepare yourself in order to pass an interview and land your dream job. Here’s the first step to achieving this. The following are some frequently asked Hadoop Administration interview questions and answers that might be useful.
Explain checkpointing in Hadoop and why it is important.
Checkpointing is an essential part of
maintaining and persisting filesystem metadata in HDFS. It is crucial for
efficient NameNode recovery and restart, and is an important indicator
of overall cluster health.
The NameNode persists filesystem metadata.
At a high level, the NameNode's primary responsibility is to store the HDFS
namespace: the directory tree, file permissions,
and the mapping of files to block IDs. It is essential that this
metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in
two different parts: the fsimage and the edit log. The fsimage is a file
that represents a point-in-time snapshot of the filesystem’s metadata.
However, while the fsimage file format is very efficient to read, it’s
unsuitable for making small incremental updates like renaming a single
file. Thus, rather than writing a new fsimage every time the namespace
is modified, the NameNode instead records the modifying operation in the
edit log for durability. This way, if the NameNode crashes, it can
restore its state by first loading the fsimage then replaying all the
operations (also called edits or transactions) in the edit log to catch
up to the most recent state of the namesystem. The edit log comprises a
series of files, called edit log segments, that together represent all
the namesystem modifications made since the creation of the fsimage.
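To make this concrete, the fsimage and edit log live under the directory (or directories) configured as dfs.name.dir in Hadoop 1.x. As a rough sketch with a hypothetical path, the directory contents and the checkpoint-related settings look something like this:
user@namenode:hadoop$ ls /data/1/dfs/nn/current    # hypothetical dfs.name.dir
VERSION  edits  fsimage  fstime
# Checkpoint frequency in Hadoop 1.x is governed by fs.checkpoint.period (seconds between
# checkpoints, default 3600) and fs.checkpoint.size (edit log size that triggers an early
# checkpoint, default 64 MB).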
What is the default block size in HDFS and what are the benefits of having larger block sizes?
Most block-structured file systems use a
block size on the order of 4 or 8 KB. By contrast, the default block
size in HDFS is 64 MB, and it is often configured even larger. This allows HDFS to decrease the
amount of metadata storage required per file. Furthermore, it allows
fast streaming reads of data by keeping large amounts of data
sequentially organized on disk. As a result, HDFS is designed to
store very large files that are read sequentially. Unlike file systems
such as NTFS or EXT, which hold numerous small files, HDFS stores a modest
number of very large files: hundreds of megabytes or gigabytes each.
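The block size can also be overridden per file at write time. As a sketch on a Hadoop 1.x cluster (the file name and HDFS path here are hypothetical), a 128 MB block size could be requested like this:
user@machine:hadoop$ bin/hadoop dfs -D dfs.block.size=134217728 -put largefile.dat /user/foo/
user@machine:hadoop$ bin/hadoop fsck /user/foo/largefile.dat -files -blocks    # shows how many blocks the file occupies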
What are the two main modules that help you interact with HDFS, and what are they used for?
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which
subset of Hadoop functionality to use. -cmd is the name of a specific
command within this module to execute. Its arguments follow the command
name.
The two modules relevant to HDFS are dfs and dfsadmin.
The dfs module, also known as ‘FsShell’,
provides basic file manipulation operations and works with objects
within the file system. The dfsadmin module manipulates or queries the
file system as a whole.
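For example (the paths are hypothetical), typical dfs commands operate on individual files and directories, while dfsadmin commands report on or change the state of the filesystem as a whole:
user@machine:hadoop$ bin/hadoop dfs -ls /user/foo              # list a directory
user@machine:hadoop$ bin/hadoop dfs -put local.txt /user/foo/  # copy a local file into HDFS
user@machine:hadoop$ bin/hadoop dfsadmin -report               # capacity and datanode status for the whole cluster
user@machine:hadoop$ bin/hadoop dfsadmin -safemode get         # query the namenode's safe mode state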
How can I set up Hadoop nodes (datanodes/namenodes) to use multiple volumes/disks?
Datanodes can store blocks in multiple
directories, typically located on different local disk drives. To
set up multiple directories, specify a comma-separated
list of pathnames as the value of the config parameter
dfs.data.dir (dfs.datanode.data.dir in Hadoop 2.x). Datanodes will attempt to place
equal amounts of data in each of the directories.
The namenode also supports multiple
directories, which store the namespace image and edit log. To
set up multiple directories, specify a comma-separated
list of pathnames as the value of the config parameter
dfs.name.dir (dfs.namenode.name.dir in Hadoop 2.x). The namenode directories are used
to replicate the namespace data, so that the image and log can be
restored from the remaining disks/volumes if one of the disks fails.
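As a sketch with hypothetical mount points, the comma-separated lists go into hdfs-site.xml (Hadoop 1.x property names shown), and dfsadmin can be used to confirm that the combined capacity is visible:
# hdfs-site.xml
#   dfs.data.dir = /data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn
#   dfs.name.dir = /data/1/dfs/nn,/nfs/backup/dfs/nn
user@machine:hadoop$ bin/hadoop dfsadmin -report    # configured capacity should reflect all data volumes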
How do you read a file from HDFS?
The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program
reads the cluster config file on the local machine, which tells it where
the namenode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: The client's identity is validated, either by username or by a strong authentication mechanism such as Kerberos.
Step 5: The client’s validated request is checked against the owner and permissions of the file.
Step 6: If the file
exists and the user has access to it, the NameNode responds with the
first block ID and provides a list of datanodes where a copy of the block can
be found, sorted by their distance to the client (reader).
Step 7: The client now
contacts the most appropriate datanode directly and reads the block
data. This process repeats until all blocks in the file have been read
or the client closes the file stream.
If a datanode dies while the file is being
read, the client library will automatically attempt to read another replica of the
data from another datanode. If all replicas are unavailable, the read
operation fails and the client receives an exception. If
the block location information returned by the NameNode is
outdated by the time the client attempts to contact a datanode, a retry
will occur if there are other replicas, or the read will fail.
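From the user's point of view the whole sequence is hidden behind a single command, and fsck can be used to see the block locations the NameNode hands back. The path below is hypothetical:
user@machine:hadoop$ bin/hadoop dfs -cat /user/foo/weblog.txt   # read the file; blocks are fetched from datanodes
user@machine:hadoop$ bin/hadoop fsck /user/foo/weblog.txt -files -blocks -locations   # block IDs and their datanode locations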
What are schedulers and what are the three types of schedulers that can be used in Hadoop cluster?
Schedulers are responsible for assigning
tasks to open slots on tasktrackers. The scheduler is a plug-in within
the jobtracker. The three types of schedulers are:
- FIFO (First in First Out) Scheduler
- Fair Scheduler
- Capacity Scheduler
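The scheduler plug-in is selected in the jobtracker's configuration. As a sketch for Hadoop 1.x (mapred-site.xml), the relevant property and the commonly used scheduler classes are:
# mapred-site.xml on the jobtracker (Hadoop 1.x)
#   mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.JobQueueTaskScheduler   # FIFO, the default
#   mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler
#   mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.CapacityTaskScheduler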
How do you decide which scheduler to use?
The Capacity Scheduler (CS) can be used in the following situations:
- When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
- When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
- When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
- When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
- When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
- When you have a lot of variability in the utilization between pools, the Fair Scheduler's preemption model delivers much greater overall cluster utilization by giving away otherwise reserved resources when they're not used.
- When you require jobs within a pool to make equal progress rather than running in FIFO order.
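If the Fair Scheduler is chosen, pools are defined in a separate allocation file that the jobtracker configuration points to. A minimal sketch for Hadoop 1.x (the file path is hypothetical):
# mapred-site.xml
#   mapred.fairscheduler.allocation.file = /etc/hadoop/conf/fair-scheduler.xml
#   mapred.fairscheduler.poolnameproperty = user.name    # which job property maps a job to a pool
# fair-scheduler.xml then defines <pool> elements with settings such as minMaps, minReduces and weight.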
Why are the ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used? Where are they specified, and what happens if you don't specify them?
dfs.name.dir specifies the path of the
directory in the NameNode's local file system where HDFS metadata is stored,
and dfs.data.dir specifies the path of the directory in the datanode's local
file system where HDFS file blocks are stored. These parameters are specified
in the hdfs-site.xml config file of all nodes in the cluster, including
master and slave nodes.
If these parameters are not specified,
the NameNode's metadata and the datanodes' block data are
stored under /tmp in a hadoop-<username> directory. This is not a safe
location: when nodes are restarted the data will be lost, which is critical if
the NameNode is restarted, as the formatting information will be lost.
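Concretely, the fallback comes from hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}, so an unconfigured single node ends up looking roughly like this (the username is hypothetical):
# With no explicit dfs.name.dir / dfs.data.dir, HDFS falls back to ${hadoop.tmp.dir}/dfs/...
user@machine:hadoop$ ls /tmp/hadoop-hduser/dfs
data  name
# Anything under /tmp may be wiped on reboot, which is why these paths must be set explicitly in hdfs-site.xml.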
What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?
The file system checking utility FSCK is used
to check and report the health of the file system and the files and blocks in it.
When used with a path (e.g. bin/hadoop fsck <path> -files -blocks -locations
-racks) it recursively reports the health of all files under that path, and
when used with '/' it checks the entire file system. By default, FSCK
ignores files still open for writing by a client. To list such files,
run FSCK with the -openforwrite option.
FSCK checks the file system and prints a
dot for each healthy file it finds; it prints a message for each file that is
less than healthy, including those with over-replicated blocks,
under-replicated blocks, mis-replicated blocks, corrupt blocks, and
missing replicas.
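A couple of typical invocations (the path is hypothetical):
user@machine:hadoop$ bin/hadoop fsck /user/foo -files -blocks -locations -racks   # per-file block report under a path
user@machine:hadoop$ bin/hadoop fsck / -openforwrite                              # include files currently open for writing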
What are the important configuration files that need to be updated/edited to set up a fully distributed Hadoop 1.x cluster (Apache distribution)?
The configuration files that need to be updated to set up a fully distributed mode of Hadoop are:
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- masters
- slaves
These files can be found in the conf
directory of your Hadoop installation. If Hadoop daemons are started individually
using 'bin/hadoop-daemon.sh start <daemon>', where <daemon> is the name of the
daemon, then the masters and slaves files need not be updated and can be
empty. This way of starting daemons requires the command to be issued on the
appropriate nodes to start the appropriate daemons. If Hadoop daemons are
started using 'bin/start-dfs.sh' and 'bin/start-mapred.sh', then the masters
and slaves configuration files on the namenode machine need to be updated.
masters: the IP address/hostname of the node where the secondary namenode will run.
slaves: the IP addresses/hostnames of the nodes where datanodes (and typically tasktrackers) will run.
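As a sketch with hypothetical hostnames, the two files on the namenode machine might look like this:
user@namenode:hadoop$ cat conf/masters
snn.example.com
user@namenode:hadoop$ cat conf/slaves
dn1.example.com
dn2.example.com
dn3.example.com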