MapReduce
primarily consists of two distinct things: the framework and the programming model. The programming model involves jobs that consist primarily of a map function and a reduce function, and the framework handles parallelizing the work, scheduling parts of the job on worker machines, monitoring for and recovering from failures, and so forth.
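For instance, the classic word-count job can be expressed as just these two functions. The sketch below uses Hadoop's org.apache.hadoop.mapreduce API; the class names and the simple whitespace tokenization are illustrative choices rather than anything mandated by the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in the input line.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // intermediate key-value pair
    }
  }
}

// Reduce: sum the counts emitted for each word.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}
```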
The
two major daemons in the Hadoop MapReduce framework are the jobtracker and the tasktracker.
Jobtracker:
The
jobtracker is the master process, responsible for accepting job submissions from
clients, scheduling tasks to run on worker nodes, and providing administrative
functions such as worker health and task progress monitoring to the cluster.
There is one jobtracker per MapReduce cluster and it usually runs on reliable
hardware since a failure of the master will result in the failure of all
running jobs. Clients and tasktrackers communicate with the jobtracker by way
of remote procedure calls (RPC). The tasktrackers inform the jobtracker as to
their current health and status by way of regular heartbeats. Each heartbeat
contains the total number of map and reduce task slots available, the number occupied, and detailed
information about any currently executing tasks. Based on this information, the jobtracker decides, as part of task scheduling, which tasks of a job should be executed on which worker nodes. After a configurable period of no heartbeats, a
tasktracker is assumed dead. The jobtracker uses a thread pool to process
heartbeats and client requests in parallel.
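Purely as an illustration of the kind of information a heartbeat carries, the payload can be pictured as a simple status object; the class and field names below are hypothetical and are not Hadoop's internal types.

```java
import java.util.List;

// Hypothetical sketch of what a tasktracker reports in each heartbeat;
// the real Hadoop internals differ, but the payload carries the same ideas.
public class HeartbeatStatus {
  String taskTrackerName;        // which worker is reporting
  int maxMapSlots;               // total map task slots configured
  int maxReduceSlots;            // total reduce task slots configured
  int occupiedMapSlots;          // map slots currently in use
  int occupiedReduceSlots;       // reduce slots currently in use
  List<TaskReport> runningTasks; // per-task progress and state

  static class TaskReport {
    String taskAttemptId;
    float progress;              // 0.0 to 1.0
    String state;                // e.g. RUNNING, SUCCEEDED, FAILED
  }
}
```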
When
a job is submitted, information about each task that makes up the job is stored
in memory. This task information is updated with each tasktracker heartbeat while
the tasks are running, providing a near real-time view of task progress and
health. After the job completes, this information is retained for a
configurable window of time or until a specified number of jobs have been
executed.
The
tasks in a MapReduce cluster share worker node resources by space rather than by time: instead of context switching (that is, pausing the execution of one task to give another task time to run), once a task begins executing, it runs to completion.
Tasktracker:
The
tasktracker accepts task assignments from the jobtracker, instantiates the user code, executes those tasks locally, and reports progress back to the jobtracker periodically. There is always a single tasktracker on each worker node. Tasktrackers and datanodes run on the same machines, which makes each node both a compute node and a storage node. Each tasktracker is configured with a specific number of map and reduce task slots that indicate how many tasks of each type it is capable of executing in parallel.
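As a sketch, the slot counts can be read back through the standard Configuration API; the property names (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum) and the assumed default of two slots each are as used in Hadoop 1.x and may differ in other versions.

```java
import org.apache.hadoop.mapred.JobConf;

public class SlotConfig {
  public static void main(String[] args) {
    // JobConf picks up mapred-site.xml (among other resources) from the classpath.
    JobConf conf = new JobConf();

    // MRv1 property names; assumed default of 2 slots each if unset.
    int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

    System.out.println("Map slots per tasktracker:    " + mapSlots);
    System.out.println("Reduce slots per tasktracker: " + reduceSlots);
  }
}
```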
Upon
receiving a task assignment from the jobtracker, the tasktracker executes an
attempt of the task, which is the physical instance of that task running in a separate process. Communication between the task attempt and the tasktracker is maintained via an RPC connection over the loopback interface, known as the umbilical
protocol.
The
tasktracker uses a list of user-specified directories to hold the intermediate
map output and reducer input during job execution.
Programming Model:
In
the programming model, users write a client
application that submits one or more jobs that contain user-supplied map and reduce code and a job
configuration file to a cluster of machines. A job processes an input dataset
specified by the user and usually outputs one as well. Commonly, the input and
output datasets are one or more files on HDFS.
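A minimal driver for the word-count example sketched earlier might look like the following; it assumes the WordCountMapper and WordCountReducer classes from above and takes HDFS input and output paths on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");   // job configuration + user code

    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output datasets, commonly directories on HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit to the cluster and block until the job finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Passing true to waitForCompletion makes the client print job progress while it waits for the framework to report completion.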
A
MapReduce job is made up of four distinct stages, executed in order: client job
submission, map task execution, shuffle and sort, and reduce task execution.
The
client application submits a job to the cluster using the framework APIs. The jobtracker accepts these submissions over the network, so clients may or may not be running on one of the cluster nodes. The input format component decides how to split the input dataset into chunks of data, called input splits, that can be processed in parallel. For each input split, a map task is created that runs the user-supplied map function on each record in the split. Map tasks are executed in parallel; if there are more map tasks to execute than the cluster can handle at once, they are queued and executed in whatever order the framework deems best. The map function takes a key-value pair as input and produces zero or more intermediate
key-value pairs.
Each
key is assigned to a partition using
a component called the partitioner.
The default partitioner implementation is a hash partitioner that takes a hash
of the key, modulo the number of configured reducers in the job, to get a partition
number. The hash implementation ensures the hash of a key is always the same on all machines, so records with the same key are guaranteed to be placed in the
same partition. The intermediate data isn’t physically partitioned, only
logically so.
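The same logic can be written as a custom partitioner against Hadoop's Partitioner API; the sketch below simply mirrors the hash-modulo behaviour described above.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Same idea as the default hash partitioner: hash of the key,
// modulo the number of reduce tasks configured for the job.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```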
The
reduce function is run on the intermediate output data. The shuffle and sort process, which is performed by the reduce tasks before the user’s reduce function is run, ensures that all values for a given key are seen by exactly one reducer, in sorted key order.
Once
the reducer has received its data, it is left with many small bits of its
partition, each of which is sorted by key. A merge sort takes a number of
sorted items and merges them together to form a fully sorted list using a
minimal amount of memory. Each reducer produces a separate output file, usually
in HDFS. Separate files are written so that reducers do not have to coordinate
access to a shared file. The format of the file depends on the output format specified by the author
of the MapReduce job in the job configuration.
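The merge step can be pictured as a standard k-way merge over already-sorted runs. The sketch below is a plain-Java illustration of that idea, not the framework's internal code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge: combines several sorted runs into one sorted
// list while holding only one element per run in memory at a time.
public class MergeSortedRuns {

  static List<String> merge(List<List<String>> sortedRuns) {
    // Priority queue ordered by the head element of each run.
    PriorityQueue<PeekingRun> heap =
        new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
    for (List<String> run : sortedRuns) {
      Iterator<String> it = run.iterator();
      if (it.hasNext()) {
        heap.add(new PeekingRun(it.next(), it));
      }
    }

    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      PeekingRun smallest = heap.poll();
      merged.add(smallest.head);
      if (smallest.rest.hasNext()) {
        heap.add(new PeekingRun(smallest.rest.next(), smallest.rest));
      }
    }
    return merged;
  }

  // Holds the next element of a run plus the iterator for the remainder.
  static class PeekingRun {
    final String head;
    final Iterator<String> rest;
    PeekingRun(String head, Iterator<String> rest) {
      this.head = head;
      this.rest = rest;
    }
  }

  public static void main(String[] args) {
    List<List<String>> runs = Arrays.asList(
        Arrays.asList("apple", "mango"),
        Arrays.asList("banana", "pear"),
        Arrays.asList("cherry", "kiwi"));
    System.out.println(merge(runs)); // [apple, banana, cherry, kiwi, mango, pear]
  }
}
```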
Drawbacks of MapReduce:
MapReduce is a batch data processing system and assumes that jobs will run on the order of minutes, if not hours. It is optimized for full table scan style operations, and it may not work well when attempting to mimic the low-latency, random access patterns found in traditional online transaction processing systems.
MapReduce is overly simplistic. The model may be limiting in cases where certain optimizations are needed.
MapReduce is too low-level compared to higher-level data processing languages such as SQL.
It can be overkill unless there is a need to touch terabytes or more of raw data.
There are entire classes of problems that cannot easily be parallelized, and MapReduce may not be a good solution to those problems.