What is MapReduce?

MapReduce[1] is a simple programming model for processing huge data sets in parallel. The basic idea of MapReduce is to divide a task into subtasks, handle the subtasks in parallel, and aggregate the results of the subtasks to form the final output. Programs written in MapReduce are automatically parallelized: programmers do not need to be concerned with the implementation details of parallel processing. Instead, they write two functions: map and reduce. The map phase reads the input (in parallel) and produces intermediate data, which the framework distributes to the reducers. Auxiliary phases such as sorting, partitioning, and combining values can also take place between the map and reduce phases.

MapReduce programs are generally used to process large files. The input and output of the map and reduce functions are expressed in the form of key-value pairs. The usage and contents of key-value pairs are discussed further in the map and reduce sections below.

A Hadoop MapReduce program also has a component called the Driver. The driver is responsible for initializing the job with its configuration details, specifying the mapper and reducer classes for the job, telling the Hadoop platform to execute the code on the specified input file(s), and controlling where the output files are placed.[2]

map(inKey, inValue) -> list(intermediateKey, intermediateValue)

The purpose of the map phase is to organize the data in preparation for the processing done in the reduce phase. Although the input to a MapReduce program is a file or set of files, the input to the map function is in the form of key-value pairs. By default, the value is a data record and the key is the offset of the data record from the beginning of the data file.
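
As an illustration of that default, here is a minimal plain-Java sketch of how newline-terminated text could be turned into (offset, line) pairs. This is not the Hadoop API; the class and method names are hypothetical, and it assumes one byte per character.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LineRecords {
    // Turn a text blob into (byteOffset, line) key-value pairs, mimicking
    // the default behaviour described above. A sketch only, not Hadoop's
    // actual record reader; assumes single-byte characters.
    static List<Map.Entry<Long, String>> toRecords(String contents) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : contents.split("\n", -1)) {
            // skip the empty fragment that follows a trailing newline
            if (offset < contents.length()) {
                records.add(new AbstractMap.SimpleEntry<>(offset, line));
            }
            offset += line.length() + 1; // +1 for the '\n' delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : toRecords("hello world\nfoo bar")) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

The second record starts at offset 12 because the first line plus its newline occupies bytes 0 through 11.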

The output consists of a collection of key-value pairs which are input for the reduce function. The content of the key-value pairs depends on the specific implementation.

For example, a common first program implemented in MapReduce is counting the words in a file.[1] The input to the mapper is each line of the file, while the output from the mapper is a set of key-value pairs in which a word is the key and the number 1 is the value.
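
That map step can be sketched in plain Java, leaving out the Hadoop API; the class name and `map` helper below are illustrative, not part of any library.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountMap {
    // map: (offset, line) -> list of (word, 1) pairs.
    // The input key (the byte offset) is ignored, as in the classic example.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1)); // emit(word, 1)
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the=1, quick=1, brown=1, fox=1, the=1]
        System.out.println(map(0, "the quick brown fox the"));
    }
}
```

Note that the mapper emits one pair per occurrence, so a repeated word like "the" appears twice; aggregation is left entirely to the reduce phase.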

To increase the processing capacity of the map phase, MapReduce can run many identical mappers in parallel.[3] Because every mapper applies the same map function to a separate portion of the input, running them in parallel produces the same result as running a single mapper over the whole input.

reduce(intermediateKey, list(intermediateValue)) -> list(outKey, outValue)

Each reduce function processes the intermediate values for a particular key generated by the map function and generates the output.[1] All intermediate values for a given key are routed to a single reducer, although one reducer may handle many keys. Several reducers can run in parallel, since they are independent of one another. The number of reducers is decided by the user; by default it is 1.
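
For word count, the reduce step simply sums the list of 1s attached to each word. A plain-Java sketch, again outside the Hadoop API and with illustrative names:

```java
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;

public class WordCountReduce {
    // reduce: (word, [1, 1, ...]) -> (word, count)
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v; // aggregate all intermediate values for this key
        }
        return new AbstractMap.SimpleEntry<>(key, sum); // emit(word, count)
    }

    public static void main(String[] args) {
        // the framework delivers one value per occurrence of "the"
        System.out.println(reduce("the", List.of(1, 1, 1))); // prints the=3
    }
}
```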

The following figure, adapted from http://developer.yahoo.com/hadoop/tutorial/module4.html and from Pro Hadoop by Jason Venner [page 2], illustrates the MapReduce process and the roles of the map and reduce functions:

Overview of MapReduce execution

How to write a MapReduce program

A typical MapReduce program has three main components: a driver class, a mapper class and a reducer class. The basic driver initializes the job configuration, defines the mapper and the reducer, and specifies the paths of the input and output file(s) for the MapReduce program.

The mapper and reducer classes in all the examples extend classes from the package org.apache.hadoop.mapreduce. [4]

The Mapper and Reducer classes are generic, so the user can supply type information in the class declaration.

The generic type arguments for the Mapper and Reducer classes are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
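
To illustrate what the four type parameters mean, here is a simplified stand-in interface. It is not the real org.apache.hadoop.mapreduce.Mapper class, only a sketch of the same generic shape: the declaration fixes the input and output key/value types.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in mirroring Hadoop's generic signature.
interface SimpleMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    List<Map.Entry<KEYOUT, VALUEOUT>> map(KEYIN key, VALUEIN value);
}

public class GenericsDemo {
    public static void main(String[] args) {
        // A word-count mapper declared as <Long, String, String, Integer>:
        // it consumes (Long, String) pairs and emits (String, Integer) pairs.
        SimpleMapper<Long, String, String, Integer> wc = (key, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String w : line.split("\\s+")) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
            return out;
        };
        System.out.println(wc.map(0L, "a b a")); // prints [a=1, b=1, a=1]
    }
}
```

Declaring the types up front lets the compiler reject a mapper whose output types do not match the reducer's declared input types.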

The map function inside a Mapper class has a structure like the following pseudocode.[1]

map(InputKeyType inputKey, InputValueType inputValue):
    // process inputKey and inputValue to form an (intermediateKey, intermediateValue) pair
    Emit(intermediateKey, intermediateValue);
    // map can emit more than one intermediate key-value pair

The reduce function inside a Reducer class has a structure similar to the following pseudocode. Since the output of the map method forms the input of the reduce method, the types of the input key and input value to the reduce method should be the same as the types of the output key and output value of the map method.

reduce(IntermediateKeyType intermediateKey, Iterator values):
    // iterate over the values associated with intermediateKey and apply a user-specified operation to them
    // multiple reducers can run in parallel; the number of reducers is specified by the user
    // outputKey holds the key output by the reducer; outputValue holds the value output for that key
    Emit(outputKey, outputValue);
    // reduce can emit more than one output key-value pair
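
The two pseudocode methods above can be simulated end to end in plain Java for the word-count example. The grouping step below stands in for the shuffle/sort phase that the framework performs between map and reduce; all names are illustrative, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    // Run word count end to end: map each line to (word, 1) pairs,
    // group the pairs by key (the framework's shuffle/sort), then
    // reduce each group by summing its values.
    static Map<String, Integer> wordCount(List<String> lines) {
        // shuffle/sort stand-in: collect all values under their (sorted) key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {                          // map phase
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }
        Map<String, Integer> counts = new TreeMap<>();       // reduce phase
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);                     // emit(word, count)
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or", "not to be")));
    }
}
```

In real Hadoop, the map and reduce loops would run on different machines and the grouping would happen over the network; the data flow, however, is the same.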

To get started with writing an actual MapReduce program, visit the Intersection example.
  1. ^ MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat, Google Inc.
  2. ^ http://developer.yahoo.com/hadoop/tutorial/module4.html
  3. ^ http://ntier.wordpress.com/
  4. ^ http://sonerbalkir.blogspot.com/2010/01/new-hadoop-api-020x.html