HDFS
has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to files by
clients. In addition, there are a number of DataNodes, usually one per node in
the cluster, which manage storage attached to the nodes that they run on. HDFS
exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored
in a set of DataNodes. The NameNode executes file system namespace operations
like opening, closing, and renaming files and directories. It also determines
the mapping of blocks to DataNodes. The DataNodes are responsible for serving
read and write requests from the file system’s clients. The DataNodes also
perform block creation, deletion, and replication upon instruction from the
NameNode.
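This split of responsibilities can be sketched as a toy in-memory model. The class and method names below are illustrative only, not Hadoop APIs: the point is that the NameNode holds pure metadata (the namespace and the block-to-DataNode mapping), so operations like rename never move block data.

```java
import java.util.*;

// Toy model of the NameNode's metadata: which blocks make up each file,
// and which DataNodes hold a replica of each block. Illustrative only --
// these are not Hadoop classes.
class ToyNameNode {
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    private final Map<Long, List<String>> blockToDataNodes = new HashMap<>();
    private long nextBlockId = 0;

    // Namespace operation: create a file entry (no data yet).
    void createFile(String path) {
        fileToBlocks.put(path, new ArrayList<>());
    }

    // Allocate a new block for a file and record which DataNodes store it.
    long addBlock(String path, List<String> dataNodes) {
        long id = nextBlockId++;
        fileToBlocks.get(path).add(id);
        blockToDataNodes.put(id, dataNodes);
        return id;
    }

    // Namespace operation: rename touches only metadata -- no block moves.
    void rename(String from, String to) {
        fileToBlocks.put(to, fileToBlocks.remove(from));
    }

    List<Long> blocksOf(String path) { return fileToBlocks.get(path); }
    List<String> locationsOf(long blockId) { return blockToDataNodes.get(blockId); }
}

public class NameNodeModel {
    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.createFile("/logs/app.log");
        long b0 = nn.addBlock("/logs/app.log", List.of("dn1", "dn2", "dn3"));
        nn.rename("/logs/app.log", "/logs/app.log.1");
        System.out.println(nn.blocksOf("/logs/app.log.1")); // [0]
        System.out.println(nn.locationsOf(b0));             // [dn1, dn2, dn3]
    }
}
```

Note how the DataNodes do not appear in the rename at all: in the real system they only create, delete, and replicate blocks when the NameNode instructs them to.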
HDFS
is built using the Java language; any machine that supports Java can run the
NameNode or the DataNode software. Usage of the highly portable Java language
means that HDFS can be deployed on a wide range of machines. A typical
deployment has a dedicated machine that runs only the NameNode software. Each
of the other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the same
machine but in a real deployment that is rarely the case.
HDFS
supports a traditional hierarchical file organization. A user or an application
can create directories and store files inside these directories. HDFS is
designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks; all blocks in a file except the
last block are the same size. The blocks of a file are replicated for fault
tolerance. The block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The replication
factor can be specified at file creation time and can be changed later. The
NameNode makes all decisions regarding replication of blocks. It periodically
receives a Heartbeat and a Blockreport from each of the DataNodes in the
cluster. Receipt of a Heartbeat implies that the DataNode is functioning
properly. A Blockreport contains a list of all blocks on a DataNode. Files in
HDFS are write-once and have strictly one writer at any time.
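As a concrete illustration of the blocking rule above ("all blocks in a file except the last block are the same size"), the sketch below computes the block sizes for a file. The 128 MB block size is an assumption for the example; the actual default is configurable and has varied across Hadoop versions.

```java
import java.util.*;

public class BlockSplit {
    // Split a file of fileSize bytes into block sizes: every block is
    // blockSize bytes except possibly the last, which holds the remainder.
    static List<Long> blockSizes(long fileSize, long blockSize) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > blockSize) {
            sizes.add(blockSize);
            remaining -= blockSize;
        }
        if (remaining > 0) sizes.add(remaining);
        return sizes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with 128 MB blocks: two full blocks plus a 44 MB tail.
        System.out.println(blockSizes(300 * mb, 128 * mb));
        // → [134217728, 134217728, 46137344]
        // With a replication factor of 3, each of these blocks is stored on
        // three DataNodes, so raw storage used is 3 x 300 MB.
    }
}
```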
HDFS
is designed to run on commodity hardware. HDFS is highly fault-tolerant
and is designed to be deployed on low-cost hardware. HDFS provides high-throughput
access to application data and is suitable for applications that have large
data sets. HDFS relaxes a few POSIX requirements to enable streaming access to
file system data.
In order to read data, the
client begins by contacting the NameNode, indicating which file it would like
to read. The client's identity is first validated, either by trusting the client
and allowing it to specify a username or by using a strong authentication
mechanism such as Kerberos, and is then checked against the owner and permissions
of the file. If the file exists and the user has access to it, the NameNode
responds to the client with the first block
ID and the list of DataNodes on which a copy of the block can be found,
sorted by their distance to the client. Distance to the client is measured
according to Hadoop's rack topology, configuration data that indicates which
hosts are located in which racks. If the NameNode is unavailable for some
reason, the client will receive timeouts or exceptions and will be unable to
proceed. With the block IDs and DataNode hostnames, the client can contact the
most appropriate DataNode directly and read the block data it needs. This
process repeats until the client has read all of the blocks in the file.
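The read path above can be sketched as a loop: for each block, take the replica list the NameNode returned (sorted nearest-first) and fetch from the closest reachable DataNode. This is a simplified model, not the real client protocol; all names here are illustrative.

```java
import java.util.*;

public class ReadPath {
    // Pick a DataNode for one block. sortedReplicas is the list the
    // NameNode returned, already ordered by rack distance (nearest first).
    static String readBlock(List<String> sortedReplicas, Set<String> reachable) {
        // Try the nearest DataNode first; fall back to farther replicas.
        for (String dn : sortedReplicas) {
            if (reachable.contains(dn)) {
                return dn; // a real client would stream the block data here
            }
        }
        throw new IllegalStateException("no reachable replica for block");
    }

    public static void main(String[] args) {
        // Two blocks of one file; each replica list is sorted nearest-first.
        List<List<String>> blockLocations = List.of(
            List.of("dn-local", "dn-rack", "dn-remote"),
            List.of("dn-rack", "dn-remote", "dn-local"));
        Set<String> reachable = Set.of("dn-rack", "dn-remote"); // dn-local down

        // The client reads block after block; the process repeats per block.
        for (List<String> replicas : blockLocations) {
            System.out.println("reading from " + readBlock(replicas, reachable));
        }
    }
}
```

The fallback loop is why a single failed DataNode does not fail the read: the client simply moves on to the next-closest replica, and only an unavailable NameNode (which holds the block map) stops the whole process.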