Mapper
The Mapper interface maps input key-value pairs to a set of intermediate key-value pairs.
Maps are the individual tasks which transform input records into intermediate
records. A given input pair may map to zero or many output pairs. Hadoop
attempts to ensure that Mappers run on nodes which hold their portion of the
data locally, to avoid network traffic.
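For reference, the old-API contract (org.apache.hadoop.mapred) looks roughly like the sketch below; the type-parameter names are illustrative, but the map() signature is the one the framework invokes for every input record.

import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Paraphrase of the old-API Mapper contract: one map() call per input record,
// emitting zero or more intermediate (key, value) pairs via the OutputCollector.
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}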
Multiple Mappers
run in parallel, each processing a portion of the input data. A new instance of
Mapper is instantiated in a separate Java process for each map task that makes
up part of the total job input. The individual mappers are intentionally not
provided with a mechanism to communicate with one another in any way. This
allows the reliability of each map task to be governed solely by the
reliability of the local machine. The map() method
receives two parameters in addition to the key and the value:
· The OutputCollector object has a method named collect(), which will forward a (key, value) pair to the reduce phase of the job.
· The Reporter object provides information about the current task; its getInputSplit() method returns an object describing the current InputSplit. It also allows the map task to report additional information about its progress to the rest of the system. The setStatus() method emits a status message back to the user, and the incrCounter() method increments shared performance counters. Each mapper can increment the counters, and the JobTracker collects the increments made by the different processes and aggregates them for retrieval when the job ends. Both objects appear in the sketch after this list.
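Putting these pieces together, a minimal mapper might look like the sketch below. The class name WordCountMapper and the counter group/name are made up for illustration; the OutputCollector and Reporter calls are the ones described above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Report which InputSplit this task is working on as a status message.
    reporter.setStatus("Processing " + reporter.getInputSplit());

    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) {
        // Increment a shared counter; the JobTracker aggregates these
        // increments across all map tasks.
        reporter.incrCounter("WordCountMapper", "EMPTY_TOKENS", 1);
        continue;
      }
      word.set(token);
      // Forward the intermediate (key, value) pair to the reduce phase.
      output.collect(word, ONE);
    }
  }
}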
If there are more map tasks than the cluster can execute at once, they are queued and run in whatever order the framework deems best.
Mapper implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) method and initialize themselves. Similarly, they can use the Closeable.close() method for de-initialization. The framework then calls map(Object, Object, OutputCollector, Reporter) once for each key/value pair in the InputSplit for that task.
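As a sketch of that lifecycle, a mapper can extend MapReduceBase, which supplies empty configure() and close() implementations, and override just the hooks it needs; the configuration key "filter.pattern" used here is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private String pattern;

  @Override
  public void configure(JobConf job) {
    // Initialization hook: read a job parameter before any map() calls.
    // "filter.pattern" is a hypothetical key set by the job driver.
    pattern = job.get("filter.pattern", "");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> output,
                  Reporter reporter) throws IOException {
    // Emit only the lines that contain the configured pattern.
    if (value.toString().contains(pattern)) {
      output.collect(value, NullWritable.get());
    }
  }

  @Override
  public void close() throws IOException {
    // De-initialization hook: release per-task resources (files, connections).
  }
}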
Whenever
possible, Hadoop will attempt to ensure that a Map task on a node is working on
a block of data stored locally on that node via HDFS. If this is not possible,
the Map task will have to transfer the data across the network as it processes
that data.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
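A minimal driver-side sketch, assuming the intermediate keys are Text: Text.Comparator is the raw-byte comparator that ships with Hadoop, and for Text keys it is effectively the default, shown here only to illustrate the hook.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class GroupingConfig {
  public static void configureGrouping(JobConf conf) {
    // Intermediate (map-output) keys are Text; group them for the Reducer
    // using Text's raw-byte comparator.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setOutputKeyComparatorClass(Text.Comparator.class);
  }
}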