What is MapReduce?
MapReduce is a simple programming model for processing huge data sets in parallel. The basic notion of MapReduce is to divide a task into subtasks, handle the subtasks in parallel, and aggregate the results of the subtasks to form the final output. Programs written in MapReduce are automatically parallelized: programmers do not need to be concerned with the implementation details of parallel processing. Instead, they write two functions: map and reduce. The map phase reads the input (in parallel) and distributes the data to the reducers; auxiliary phases such as sorting, partitioning and combining values can also take place between the map and reduce phases.
MapReduce programs are generally used to process large files. The input and output of the map and reduce functions are expressed in the form of key-value pairs. The usage and details of key-value pairs are discussed in the sections that follow.
A Hadoop MapReduce program also has a component called the Driver. The driver is responsible for initializing the job with its configuration details, specifying the mapper and the reducer classes for the job, informing the Hadoop platform to execute the code on the specified input file(s) and controlling the location where the output files are placed.
map(inKey, inValue) -> list(intermediateKey, intermediateValue)
The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase. The input to the map function is in the form of key-value pairs, even though the input to a MapReduce program is a file or set of files. By default, the value is a data record and the key is the offset of the data record from the beginning of the data file.
The output consists of a collection of key-value pairs which form the input for the reduce function. The content of the key-value pairs depends on the specific implementation.
For example, a common first program implemented in MapReduce is counting the words in a file. The input to each mapper is a line of the file, and its output is a set of key-value pairs in which a word is the key and the number 1 is the value.
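To make this concrete, here is a plain-Java sketch of the word-count map step. It is self-contained and independent of the Hadoop API; the class and method names are ours, chosen for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountMapSketch {
    // Emits one (word, 1) pair per word in the line. The offset parameter
    // plays the role of the default input key (the record's position in the
    // file); word count simply ignores it.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }
}
```

Note that the mapper does no counting at all: every occurrence of a word, even a repeated one, is emitted as its own (word, 1) pair, and the aggregation is left entirely to the reduce phase.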
To increase the processing capacity of the map phase, MapReduce can run several identical mappers in parallel. Since every mapper runs the same function, together they produce the same result as a single map function applied to the entire input.
reduce(intermediateKey, list(intermediateValue)) -> list(outKey, outValue)
The reduce function processes the intermediate values for a particular key generated by the map function and generates the output. Each intermediate key, together with the list of all its values, is handled by exactly one reduce call. Several reducers can run in parallel, since they are independent of one another. The number of reducers is decided by the user; by default, it is 1.
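Continuing the word-count illustration in plain Java (again a sketch with names of our own choosing, not Hadoop's API), the grouping that the platform performs between the two phases and a summing reduce function look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountReduceSketch {
    // Groups intermediate (word, 1) pairs by key -- the shuffle the platform
    // performs between the map and reduce phases, so that each key reaches
    // exactly one reduce call with all of its values.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // One reduce call: sums the list of 1s associated with a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }
}
```

Because each call to reduce sees one key and its complete list of values, reduce calls for different keys share no state and can safely be distributed across parallel reducers.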
The following figure, modified from Pro Hadoop by Jason Venner [page 2], illustrates the MapReduce process and the roles of the map and reduce functions:
[Figure: Overview of MapReduce execution]
How to write a MapReduce program
A typical MapReduce program has three main components: a driver class, a mapper class and a reducer class. The basic driver initializes the job configuration, defines the mapper and the reducer, and specifies the paths of the input and output file(s) for the MapReduce program.
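For orientation only, a typical word-count driver written against Hadoop's org.apache.hadoop.mapreduce API looks roughly like the sketch below. It assumes mapper and reducer classes named WordCountMapper and WordCountReducer exist (hypothetical names for this sketch), and it compiles only with the Hadoop libraries on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Initialize the job with its configuration details
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        // Specify the mapper and reducer classes for the job
        job.setMapperClass(WordCountMapper.class);      // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);    // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Specify the input file(s) and the output location
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Ask the Hadoop platform to execute the job and wait for it
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Each driver method corresponds to one of the driver responsibilities described above: job configuration, mapper/reducer selection, input/output paths, and job submission.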
The mapper and reducer classes in all the examples extend classes from the org.apache.hadoop.mapreduce package.
The Mapper and Reducer classes are generic, so the user can specify type information in the class declaration.
The generic type arguments for the Mapper and Reducer classes are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
The map function inside a Mapper class has a structure like the following pseudocode:
map(inKey, inValue):
    // do processing on the input key-value pair
    ...
    // map can emit more than one intermediate key-value pair
    emit(intermediateKey, intermediateValue)
The reduce function inside a Reducer class has a structure similar to the following pseudocode. Since the output of the map method forms the input of the reduce method, the types of the input key and input value to the reduce method should be the same as the types of the output key and output value of the map method.
reduce(intermediateKey, list(intermediateValue)):
    // the values associated with a particular intermediateKey are iterated
    // and a user-specified operation is performed over them
    // multiple reducers can run in parallel; the number of reducers is
    // specified by the user
    ...
    // outKey will contain the key output from the reducer and outValue
    // will contain the value that is output for that particular key
    // the reduce method can emit more than one output key-value pair
    emit(outKey, outValue)
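The map and reduce steps described above can be simulated end-to-end in a few lines of plain Java. This is a toy, sequential stand-in for what Hadoop does in parallel, with names of our own choosing:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountPipeline {
    // Runs the whole simulated pipeline over a list of input lines:
    // map each line to (word, 1) pairs, group the pairs by word (the
    // shuffle), then reduce each group by summing its values.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    intermediate.add(Map.entry(word, 1));   // map phase
                }
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : intermediate) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum); // shuffle + reduce
        }
        return counts;
    }
}
```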
To get started with writing an actual MapReduce program, visit the Setting up your first Hadoop program page.
MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat, Google Inc.