Reducer Interface



The Reducer reduces a set of intermediate values that share a key to a smaller set of values. Once the Map tasks have finished, their output is transferred across the network to the Reducers. Although a Reducer may run on the same physical machine as a Map task, there is no concept of data locality for Reducers.

All values associated with a particular intermediate key are guaranteed to go to the same Reducer. A job may have a single Reducer or several. Each key is assigned to a partition by a component called the partitioner. The default implementation is a hash partitioner, which takes the hash of the key modulo the number of reducers configured for the job to get a partition number. Because the hash of a key is the same on every machine, records with the same key are guaranteed to be placed in the same partition. The intermediate data isn't physically partitioned, only logically so.
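The core of the default hash partitioner is essentially the following (the bitwise AND with Integer.MAX_VALUE clears the sign bit, so a negative hashCode() cannot produce a negative partition number):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Essentially what Hadoop's default HashPartitioner does: mask off the
    // sign bit, then take the hash modulo the number of reduce tasks.
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }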

Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
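For example, a Reducer can read a job property during its setup() call; the property name my.join.separator below is made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ConfiguredReducer extends Reducer<Text, Text, Text, Text> {
      private String separator;

      @Override
      protected void setup(Context context) {
        // Context extends JobContext, so getConfiguration() is available here.
        Configuration conf = context.getConfiguration();
        separator = conf.get("my.join.separator", ",");  // hypothetical property
      }
    }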
The Reducer has three primary phases:

1. Shuffle

The process of copying map outputs to the Reducers across the network, over HTTP, is known as shuffling.

2. Sort

The framework merge-sorts the Reducer's inputs by key, bringing together outputs with the same key from different Mappers.
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys are sorted using the entire key but grouped using the grouping comparator, which decides which keys and values are sent in the same call to reduce(). The grouping comparator is specified via Job.setGroupingComparatorClass(Class), and the sort order is controlled by Job.setSortComparatorClass(Class); a sketch of this pattern follows below.
The shuffle and sort phases occur simultaneously; that is, map outputs are merged as they are fetched.
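A minimal sketch of the secondary-sort pattern, assuming a hypothetical CompositeKey that carries the natural key plus the field to sort on:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical composite key: the natural key plus a secondary sort field.
    public class CompositeKey implements WritableComparable<CompositeKey> {
      private String naturalKey = "";
      private long secondary;

      public String getNaturalKey() { return naturalKey; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(naturalKey);
        out.writeLong(secondary);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        naturalKey = in.readUTF();
        secondary = in.readLong();
      }

      // Full sort order: natural key first, then the secondary field.
      @Override
      public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : Long.compare(secondary, other.secondary);
      }
    }

    // Groups keys by the natural key only, so a single reduce() call sees
    // all values for that key, already sorted by the secondary field.
    class NaturalKeyGroupingComparator extends WritableComparator {
      protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);  // true = instantiate keys for comparison
      }

      @Override
      @SuppressWarnings("rawtypes")
      public int compare(WritableComparable a, WritableComparable b) {
        return ((CompositeKey) a).getNaturalKey()
            .compareTo(((CompositeKey) b).getNaturalKey());
      }
    }

In the driver, job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class) wires in the grouping; if the key's compareTo() already gives the desired full order, the default sort comparator can be left as-is.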

3. Reduce

For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. It receives the key along with an Iterable over all the values associated with that key; the values are returned in an undefined order. The reduce() method also receives a Context object, which is used to emit output and report progress in the same manner as in the map() method.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The Reducer outputs zero or more final key/value pairs which are written to HDFS. In practice, the Reducer usually emits a single key/value pair for each input key.
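Putting the pieces together, here is a minimal Reducer in the classic word-count style; it sums the values for each key and emits one final pair per key:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums all values for each key and emits one final key/value pair per key.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();          // iteration order is undefined
        }
        result.set(sum);
        context.write(key, result);    // forwarded to the job's RecordWriter
      }
    }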
