Reducer
The Reducer interface reduces a set of intermediate values that share a key to a smaller set of values. Once the Map tasks have finished, their output is transferred across the network to the Reducers. Although the Reducers may run on the same physical machines as the Map tasks, there is no concept of data locality for the Reducers.
All values associated with a particular intermediate key are guaranteed to go to the same Reducer. A job may have a single Reducer or many. Each key is assigned to a partition by a component called the partitioner. The default implementation is a hash partitioner, which takes the hash of the key modulo the number of reducers configured for the job to compute a partition number. Because the hash of a key is the same on every machine, records with the same key are guaranteed to land in the same partition. The intermediate data is not physically partitioned, only logically.
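Hadoop's built-in HashPartitioner uses exactly this hash-modulo logic; an equivalent sketch looks like the following (the class name here is illustrative, and numReduceTasks is the number of reducers configured for the job):

import org.apache.hadoop.mapreduce.Partitioner;

// Same logic as the default HashPartitioner: mask off the sign bit,
// then take the hash modulo the number of reduce tasks.
public class HashStylePartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Identical keys always produce the same hash, so they always
        // land in the same partition (and therefore the same Reducer).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}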
Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.
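For illustration, a minimal sketch of a reducer that reads a job parameter during setup(); the class name and the "example.min.count" property are made up for this example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that pulls a configuration value in setup().
public class ConfigAwareReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private int threshold;

    @Override
    protected void setup(Context context) {
        // Context extends JobContext, so getConfiguration() is available here.
        Configuration conf = context.getConfiguration();
        // "example.min.count" is an assumed property name, not a Hadoop setting.
        threshold = conf.getInt("example.min.count", 1);
    }
}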
The Reducer has three primary phases:

1. Shuffle
The process of moving map outputs across the network to the reducers over HTTP is known as shuffling.
2. Sort

The framework merge-sorts Reducer inputs by key so as to merge the same-key outputs from different Mappers.
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce. The grouping comparator is specified via Job.setGroupingComparatorClass(Class), while the sort order is controlled by Job.setSortComparatorClass(Class).
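As a sketch, a grouping comparator for a composite Text key of the form "naturalKey\tsecondaryKey" might look like the following; the key layout and class name are assumptions for illustration, not part of Hadoop:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups records by the natural key alone, so all values that share a
// natural key arrive in a single reduce() call even though the full
// composite key (natural key + secondary key) controls the sort order.
public class NaturalKeyGroupingComparator extends WritableComparator {

    public NaturalKeyGroupingComparator() {
        // true: create Text instances so the object-based compare is used.
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Compare only the text before the tab separator (the natural key),
        // ignoring the secondary key that follows it.
        String naturalA = ((Text) a).toString().split("\t", 2)[0];
        String naturalB = ((Text) b).toString().split("\t", 2)[0];
        return naturalA.compareTo(naturalB);
    }
}

It would then be registered with job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class), alongside a full-key comparator passed to job.setSortComparatorClass().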
The shuffle and sort phases occur simultaneously; that is, map outputs are merged as they are fetched.
3. Reduce

For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. It receives the key together with an iterator over all the values associated with that key; the iterator returns those values in an undefined order. The reduce() method also receives OutputCollector and Reporter objects as parameters, which are used in the same manner as in the map() method.
The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The Reducer emits zero or more final key/value pairs, which are written to HDFS. In practice, the Reducer usually emits a single key/value pair for each input key.
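The canonical word-count reducer illustrates this per-key call pattern and output path, here using the newer org.apache.hadoop.mapreduce API, where the Context object plays the role of the OutputCollector and Reporter (this sketch assumes Text keys and IntWritable counts from the map phase):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Called once per key with an iterator over all of that key's values;
// sums the counts and emits a single final key/value pair.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // The iteration order of the values is undefined, but summing
        // is order-independent, so that does not matter here.
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        // context.write() delegates to the job's RecordWriter
        // (TaskInputOutputContext.write(Object, Object)).
        context.write(key, result);
    }
}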