Hadoop Administration Interview Questions and Answers




It is essential to prepare yourself in order to pass an interview and land your dream job. Here’s the first step to achieving this. The following are some frequently asked Hadoop Administration interview questions and answers that might be useful.

Explain checkpointing in Hadoop. Why is it important?

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It is crucial for efficient NameNode recovery and restart, and it is an important indicator of overall cluster health.
The NameNode persists filesystem metadata. At a high level, the NameNode's primary responsibility is to store the HDFS namespace: the directory tree, file permissions, and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that represents a point-in-time snapshot of the filesystem's metadata. However, while the fsimage file format is very efficient to read, it is unsuitable for making small incremental updates such as renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage and then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage. Checkpointing is the process of merging the edit log into the fsimage to produce a new, up-to-date fsimage; in Hadoop 1.x this is performed periodically by the Secondary NameNode. Without checkpoints the edit log grows without bound and NameNode restarts become very slow.
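For reference, the checkpoint schedule itself is configurable. A minimal sketch of the relevant Hadoop 1.x properties, set in core-site.xml on the node running the Secondary NameNode (the directory path below is a placeholder; later releases renamed these properties to the dfs.namenode.checkpoint.* form):
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>  <!-- seconds between checkpoints -->
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/1/dfs/snn</value>  <!-- placeholder path where the checkpoint image is written -->
  </property>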

What is the default block size in HDFS and what are the benefits of a larger block size?

Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, and it is often configured even larger. A large block size lets HDFS reduce the amount of metadata the NameNode must store per file. It also allows fast streaming reads, because large amounts of data are laid out sequentially on disk. HDFS is therefore designed for a modest number of very large files that are read sequentially (hundreds of megabytes or gigabytes each), unlike a general-purpose file system such as NTFS or EXT, which typically holds numerous small files.
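The block size is configurable. As a hedged sketch, the Hadoop 1.x property in hdfs-site.xml looks like this, using 128 MB purely as an illustrative value (later releases renamed the property to dfs.blocksize):
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128 MB, expressed in bytes -->
  </property>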

What are two main modules which help you interact with HDFS and what are they used for?

user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
The two modules relevant to HDFS are: dfs and dfsadmin.
The dfs module, also known as ‘FsShell’, provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.
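For example, some typical invocations of each module look like this (Hadoop 1.x command forms; the paths are placeholders):
  bin/hadoop dfs -ls /user/hadoop                  # FsShell: list a directory in HDFS
  bin/hadoop dfs -put localfile.txt /user/hadoop/  # FsShell: copy a local file into HDFS
  bin/hadoop dfsadmin -report                      # report cluster capacity and datanode status
  bin/hadoop dfsadmin -safemode get                # query whether the namenode is in safe mode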

How can I setup Hadoop nodes (data nodes/namenodes) to use multiple volumes/disks?

Datanodes can store blocks in multiple directories, typically located on different local disk drives. To set up multiple directories, specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir (dfs.datanode.data.dir in newer releases). Datanodes will attempt to place equal amounts of data in each of the directories.
The Namenode also supports multiple directories, which store the namespace image and edit logs. To set up multiple directories, specify a comma-separated list of pathnames as the value of the configuration parameter dfs.name.dir (dfs.namenode.name.dir in newer releases). The namenode directories are used to replicate the namespace data, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
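As a hedged illustration, the corresponding hdfs-site.xml entries might look like this (the mount points are placeholders for your own disks):
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>  <!-- one directory per local disk -->
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/1/dfs/nn,/nfs/backup/dfs/nn</value>  <!-- namenode metadata is mirrored to every listed directory -->
  </property>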

How do you read a file from HDFS?

The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster configuration file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: The client's identity is validated, either by the username or by a strong authentication mechanism such as Kerberos.
Step 5: The client’s validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, the NameNode responds with the first block id and provides a list of datanodes on which a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.
If a datanode dies while the file is being read, the client library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block-location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or the read will fail.
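From the command line, all of these steps are hidden behind a single FsShell call, for example (the path is a placeholder):
  bin/hadoop dfs -cat /user/hadoop/input/sample.txt    # stream the file's blocks to stdout
  bin/hadoop dfs -get /user/hadoop/input/sample.txt .  # copy the file to the local working directory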

What are schedulers and what are the three types of schedulers that can be used in a Hadoop cluster?

Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:
  • FIFO (First in First Out) Scheduler
  • Fair Scheduler
  • Capacity Scheduler
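For reference, in Hadoop 1.x the scheduler is a jobtracker plug-in selected through the mapred.jobtracker.taskScheduler property in mapred-site.xml; the sketch below assumes the stock scheduler classes (the default FIFO scheduler is org.apache.hadoop.mapred.JobQueueTaskScheduler):
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>  <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
  </property>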

How do you decide which scheduler to use?

The Capacity Scheduler (CS) can be used in the following situations:
  • When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
  • When you have very little fluctuation within queue utilization. The CS’s more rigid resource allocation makes sense when all queues are at capacity almost all the time.
  • When you have high variance in the memory requirements of jobs and you need the CS’s memory-based scheduling support.
  • When you demand scheduler determinism.
The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
  • When you have a slow network and data locality makes a significant difference to a job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
  • When you have a lot of variability in the utilization between pools, the Fair Scheduler’s pre-emption model delivers much greater overall cluster utilization by giving away otherwise reserved resources when they’re not being used.
  • When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are the ‘dfs.name.dir’ and ‘dfs.data.dir’ parameters used? Where are they specified, and what happens if you don’t specify them?

dfs.name.dir specifies the path of the directory in the Namenode’s local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in each Datanode’s local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the Namenode’s metadata and the Datanodes’ block data are stored under /tmp (in a hadoop-<username> directory). This is not a safe place: the data is lost when nodes are restarted, which is critical for the Namenode, because the filesystem metadata is lost and the namenode would effectively have to be formatted again.

What is file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?

The file system checking utility FSCK is used to check and report the health of the file system and of the files and blocks in it. When used with a path (e.g. bin/hadoop fsck /path -files -blocks -locations -racks), it recursively reports the health of all files under that path; when used with ‘/’, it checks the entire file system. By default, FSCK ignores files still open for writing by a client. To list such files, run FSCK with the -openforwrite option.
FSCK checks the file system, prints a dot for each healthy file it finds, and prints a message for each file that is less than healthy, including files that have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks, or missing replicas.
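Some common invocations (the path is a placeholder):
  bin/hadoop fsck /user/hadoop -files -blocks -locations -racks  # detailed report for everything under the path
  bin/hadoop fsck / -openforwrite                                # include files currently open for writing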

What are the important configuration files that need to be updated/edited to set up a fully distributed Hadoop 1.x cluster (Apache distribution)?

The configuration files that need to be updated to set up a fully distributed mode of Hadoop are:
  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • masters
  • slaves
These files can be found in the conf directory of your Hadoop installation. If Hadoop daemons are started individually using ‘bin/hadoop-daemon.sh start <daemon-name>’, then the masters and slaves files need not be updated and can be empty; this way of starting daemons requires the command to be issued on the appropriate node for each daemon. If Hadoop daemons are started using ‘bin/start-dfs.sh’ and ‘bin/start-mapred.sh’, then the masters and slaves configuration files on the namenode machine need to be updated.
masters – IP address/hostname of the node where the secondarynamenode will run.
slaves – IP addresses/hostnames of the nodes where the datanodes (and, typically, the tasktrackers) will run.
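As a hedged illustration, the two cluster-wide addresses that tie the nodes together are usually set as follows (the hostnames and ports are placeholders):
  In core-site.xml:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>  <!-- URI of the namenode -->
  </property>
  In mapred-site.xml:
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>  <!-- host:port of the jobtracker -->
  </property>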