How MapReduce Works



MapReduce primarily consists of two distinct things: the programming model and the framework that executes it. In the programming model, jobs consist primarily of a map function and a reduce function; the framework handles parallelizing the work, scheduling parts of the job on worker machines, monitoring for and recovering from failures, and so forth.
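
Conceptually, the two user-supplied functions have the shapes sketched below. This is only an illustration of the model in Java; the interface names here are hypothetical and are not Hadoop's actual API.

    // Conceptual shape of the user-supplied functions in the MapReduce
    // programming model (hypothetical interfaces, not Hadoop's own API):
    //
    //   map:    (K1, V1)          -> zero or more (K2, V2) pairs
    //   reduce: (K2, list of V2)  -> zero or more (K3, V3) pairs
    interface Collector<K, V> {
        void collect(K key, V value);   // emit one output pair
    }

    interface MapFunction<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Collector<K2, V2> out);
    }

    interface ReduceFunction<K2, V2, K3, V3> {
        void reduce(K2 key, Iterable<V2> values, Collector<K3, V3> out);
    }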

The two major daemons in the Hadoop MapReduce framework are the jobtracker and the tasktracker.


Jobtracker:
The jobtracker is the master process, responsible for accepting job submissions from clients, scheduling tasks to run on worker nodes, and providing administrative functions such as worker health and task progress monitoring to the cluster. There is one jobtracker per MapReduce cluster, and it usually runs on reliable hardware since a failure of the master will result in the failure of all running jobs.

Clients and tasktrackers communicate with the jobtracker by way of remote procedure calls (RPC). The tasktrackers inform the jobtracker of their current health and status by way of regular heartbeats. Each heartbeat contains the total number of map and reduce task slots available, the number occupied, and detailed information about any currently executing tasks. Based on this information, the jobtracker decides which tasks of a job should be executed on which worker nodes; this is task scheduling. After a configurable period of no heartbeats, a tasktracker is assumed dead. The jobtracker uses a thread pool to process heartbeats and client requests in parallel.
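
As a concrete illustration, in Hadoop 1.x the jobtracker address that clients and tasktrackers connect to, and the heartbeat timeout after which a tasktracker is declared lost, are set in mapred-site.xml. The hostname and values below are placeholders, and the property names assume the classic (pre-YARN) configuration.

    <!-- Excerpt of mapred-site.xml (Hadoop 1.x, illustrative values only) -->
    <property>
      <name>mapred.job.tracker</name>
      <!-- placeholder host:port that clients and tasktrackers connect to -->
      <value>jobtracker.example.com:8021</value>
    </property>
    <property>
      <name>mapred.tasktracker.expiry.interval</name>
      <!-- milliseconds without a heartbeat before a tasktracker is considered lost -->
      <value>600000</value>
    </property>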

When a job is submitted, information about each task that makes up the job is stored in memory. This task information updates with each tasktracker heartbeat while the tasks are running, providing a near real-time view of task progress and health. After the job completes, this information is retained for a configurable window of time or until a specified number of jobs have been executed.

The tasks in a MapReduce cluster share worker node resources, but there is no context switching in the sense of pausing the execution of one task to give another task time to run; when a task executes, it runs to completion.

Tasktracker:
The tasktracker accepts task assignments from the jobtracker, instantiates the user code, executes those tasks locally, and reports progress back to the jobtracker periodically. There is always a single tasktracker on each worker node. Tasktrackers and datanodes run on the same machines, which makes each node both a compute node and a storage node. Each tasktracker is configured with a specific number of map and reduce task slots that indicate how many tasks of each type it is capable of executing in parallel.
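
For example, in the classic Hadoop 1.x configuration the slot counts are per-tasktracker settings in mapred-site.xml; the values shown below are placeholders and should be sized to the node's CPU and memory.

    <!-- Excerpt of mapred-site.xml on a worker node (illustrative values only) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <!-- number of map task slots on this tasktracker -->
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <!-- number of reduce task slots on this tasktracker -->
      <value>2</value>
    </property>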

Upon receiving a task assignment from the jobtracker, the tasktracker executes an attempt of the task, which is the physical instance of that task being executed, in a separate process. Communication between the task attempt and the tasktracker is maintained via an RPC connection over the loopback interface, called the umbilical protocol.

The tasktracker uses a list of user-specified directories to hold the intermediate map output and reducer input during job execution.
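
In Hadoop 1.x this list is typically given by the mapred.local.dir property, usually with one directory per physical disk; the paths below are placeholders.

    <!-- Excerpt of mapred-site.xml (placeholder paths) -->
    <property>
      <name>mapred.local.dir</name>
      <!-- comma-separated local directories for intermediate map output -->
      <value>/data/1/mapred/local,/data/2/mapred/local</value>
    </property>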

Programming Model:
In the programming model, users write a client application that submits one or more jobs, containing user-supplied map and reduce code and a job configuration, to a cluster of machines. A job processes an input dataset specified by the user and usually produces an output dataset as well. Commonly, the input and output datasets are one or more files on HDFS.
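
A minimal driver for such a client application might look like the sketch below, written against the classic org.apache.hadoop.mapred API that matches the jobtracker/tasktracker architecture described in this post. The WordCountMapper and WordCountReducer class names are my own and are sketched later in the post; input and output paths are taken from the command line.

    // A minimal job driver sketch using the classic (Hadoop 1.x) MapReduce API.
    // WordCountMapper and WordCountReducer are assumed to be defined elsewhere
    // (see the sketches later in this post).
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // User-supplied map and reduce code.
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // Types of the final (reducer) output key-value pairs.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Input and output datasets, commonly files on HDFS.
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submit the job to the jobtracker and wait for it to complete.
        JobClient.runJob(conf);
      }
    }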

A MapReduce job is made up of four distinct stages, executed in order: client job submission, map task execution, shuffle and sort, and reduce task execution.

The client application submits a job to the cluster using the framework APIs. The jobtracker accepts these submissions over the network, so clients may or may not be running on one of the cluster nodes. The input format component decides how to split the input dataset into chunks, or input splits, that can be processed in parallel. For each input split, a map task is created that runs the user-supplied map function on each record in the split. Map tasks are executed in parallel; if there are more map tasks to execute than the cluster can handle at once, they are queued and executed in whatever order the framework deems best. The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs.
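
As an example, a word-count style map function over text input might look like the following sketch (classic Hadoop 1.x API; the class name and tokenization are my own choices).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // A sketch of a word-count map function. With TextInputFormat, the input key
    // is the byte offset of the line and the value is the line itself; the map
    // function emits an intermediate (word, 1) pair for every word it sees.
    public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) {
            continue;                  // skip empty tokens from leading whitespace
          }
          word.set(token);
          output.collect(word, ONE);   // zero or more intermediate key-value pairs
        }
      }
    }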



Each key is assigned to a partition using a component called the partitioner. The default partitioner implementation is a hash partitioner, which takes a hash of the key, modulo the number of configured reducers in the job, to get a partition number. Because the hash of a given key is always the same on every machine, records with the same key are guaranteed to be placed in the same partition. The intermediate data isn't physically partitioned, only logically so.
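
The hash partitioning logic amounts to something like the sketch below; the class name is my own, and Hadoop ships its own hash partitioner with essentially this behavior.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // A sketch of hash partitioning: hash of the key, modulo the number of
    // reduce tasks configured for the job.
    public class SimpleHashPartitioner<K, V> implements Partitioner<K, V> {

      public void configure(JobConf job) {
        // No configuration is needed for plain hash partitioning.
      }

      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative,
        // then map the key's hash onto one of the reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }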

The reduce function is run on the intermediate output data. The shuffle and sort phase, which is performed by the reduce tasks before the user's reduce function is run, ensures that all values for a given key are seen by exactly one reducer and that keys are presented in sorted order.

Once the reducer has received its data, it is left with many small bits of its partition, each of which is sorted by key. A merge sort takes a number of sorted items and merges them together to form a fully sorted list using a minimal amount of memory. Each reducer produces a separate output file, usually in HDFS. Separate files are written so that reducers do not have to coordinate access to a shared file. The format of the file depends on the output format specified by the author of the MapReduce job in the job configuration.
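
Continuing the word-count example, a matching reduce function might look like the following sketch. For each key, it receives all of the intermediate values in a single call and writes one summed record to the reducer's output file.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // A sketch of a word-count reduce function. By the time reduce() is called,
    // the shuffle and sort phase has grouped all counts for a given word together.
    public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();                // add up the 1s emitted by the mappers
        }
        output.collect(key, new IntWritable(sum));   // one output record per word
      }
    }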


Drawbacks of MapReduce:



MapReduce is a batch data processing system and assumes that jobs will run on the order of minutes, if not hours. It is optimized for full table scan style operations. It may not work well when attempting to mimic low-latency, random access patterns found in traditional online transaction processing systems.

MapReduce is overly simplistic. The model may be limiting in cases where certain optimizations are needed.

MapReduce is too low-level compared to higher-level data processing languages such as SQL. It can be overkill unless there is a need to process terabytes or more of raw data.

There are entire classes of problems that cannot easily be parallelized, and MapReduce may not be a good solution for them.
