Mapper
The Mapper interface maps input key-value pairs to a set of intermediate key-value pairs.
Maps are the individual tasks which transform input records into intermediate
records. A given input pair may map to zero or many output pairs. Hadoop
attempts to ensure that Mappers run on nodes which hold their portion of the
data locally, to avoid network traffic.
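For reference, the old-API contract (org.apache.hadoop.mapred) looks roughly like the sketch below; the type-parameter names are illustrative, but the map() signature is the one the framework invokes for every input record.

import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.mapred.JobConfigurable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Paraphrase of the old-API Mapper contract: one map() call per input record,
// emitting zero or more intermediate (key, value) pairs via the OutputCollector.
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}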
Multiple Mappers
run in parallel, each processing a portion of the input data. A new instance of
Mapper is instantiated in a separate Java process for each map task that makes
up part of the total job input. The individual mappers are intentionally not
provided with a mechanism to communicate with one another in any way. This
allows the reliability of each map task to be governed solely by the
reliability of the local machine. The map() method
receives two parameters in addition to the key and the value:
· The OutputCollector object has a method named collect(), which will forward a (key, value) pair to the reduce phase of the job.
· The Reporter object provides information about the current task; its getInputSplit() method returns an object describing the current InputSplit. It also allows the map task to report additional information about its progress to the rest of the system. The setStatus() method emits a status message back to the user, and the incrCounter() method increments shared performance counters. Each mapper can increment the counters, and the JobTracker collects the increments made by the different processes and aggregates them for retrieval when the job ends. Both objects appear in the sketch after this list.
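Putting these pieces together, a minimal mapper might look like the sketch below. The class name WordCountMapper and the counter group/name are made up for illustration; the OutputCollector and Reporter calls are the ones described above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Report which InputSplit this task is working on as a status message.
    reporter.setStatus("Processing " + reporter.getInputSplit());

    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) {
        // Increment a shared counter; the JobTracker aggregates these
        // increments across all map tasks.
        reporter.incrCounter("WordCountMapper", "EMPTY_TOKENS", 1);
        continue;
      }
      word.set(token);
      // Forward the intermediate (key, value) pair to the reduce phase.
      output.collect(word, ONE);
    }
  }
}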
If there are more map tasks than the cluster can execute at once, they are queued and run in whatever order the framework deems best.
Mapper implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) method and initialize themselves. Similarly, they can use the Closeable.close() method for de-initialization. The framework then calls map(Object, Object, OutputCollector, Reporter) once for each key/value pair in the InputSplit for that task.
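As a sketch of that lifecycle, a mapper can extend MapReduceBase, which supplies empty configure() and close() implementations, and override just the hooks it needs; the configuration key "filter.pattern" used here is hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private String pattern;

  @Override
  public void configure(JobConf job) {
    // Initialization hook: read a job parameter before any map() calls.
    // "filter.pattern" is a hypothetical key set by the job driver.
    pattern = job.get("filter.pattern", "");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> output,
                  Reporter reporter) throws IOException {
    // Emit only the lines that contain the configured pattern.
    if (value.toString().contains(pattern)) {
      output.collect(value, NullWritable.get());
    }
  }

  @Override
  public void close() throws IOException {
    // De-initialization hook: release per-task resources (files, connections).
  }
}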
Whenever
possible, Hadoop will attempt to ensure that a Map task on a node is working on
a block of data stored locally on that node via HDFS. If this is not possible,
the Map task will have to transfer the data across the network as it processes
that data.
All intermediate values associated with a given output key are subsequently grouped by the framework and passed to a Reducer to determine the final output. Users can control the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
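A minimal driver-side sketch, assuming the intermediate keys are Text: Text.Comparator is the raw-byte comparator that ships with Hadoop, and for Text keys it is effectively the default, shown here only to illustrate the hook.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class GroupingConfig {
  public static void configureGrouping(JobConf conf) {
    // Intermediate (map-output) keys are Text; group them for the Reducer
    // using Text's raw-byte comparator.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setOutputKeyComparatorClass(Text.Comparator.class);
  }
}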