Hadoop architecture

This section describes the the various components of Hadoop: parts of the MapReduce job process, the handling of the data, and the architecture of the file system. As the book Hadoop In Action by Chuck Lam [page 22] says "Hadoop employs a master/slave architecture for both distributed storage and distributed computation". In the distributed storage, the NameNode is the master and the DataNodes are the slaves. In the distributed computation, the Jobtracker is the master and the Tasktrackers are the slaves which are explained in the following sections.

MapReduce Job Processing

An entire Hadoop execution of a client request is called a job. Users can submit job requests to the Hadoop framework, and the framework processes the jobs. Before the framework can process a job, the user must specify the following:
  • The location of the input and output files in the distributed file system
  • The input and output formats
  • The classes containing the map and reduce functions
Hadoop has four entities involved in the processing of a job:[1]
  • The user, who submits the job and specifies the configuration.
  • Hadoop architectureThe JobTracker, a program which coordinates and manages the jobs. It accepts job submissions from users, provides job monitoring and control, and manages the distribution of tasks in a job to the TaskTracker nodes.[2] Usually there is one JobTracker per cluster.
  • The TaskTrackers manage the tasks in the process, such as the map task, the reduce task, etc. There can be one or more TaskTracker processes per node in a cluster.
  • The distributed file system, such as HDFS.

The following figure from Pro Hadoop by Jason Venner [page 28] shows the different tasks of a MapReduce job.
Parts of a MapReduce job

The user specifies the job configuration by setting different parameters specific to the job. The user also specifies the number of reducer tasks and the reduce function. The user also has to specify the format of the input, and the locations of the input. The Hadoop framework uses this information to split of the input into several pieces. Each input piece is fed into a user-defined map function. The map tasks process the input data and emit intermediate data. The output of the map phase is sorted and a default or custom partitioning may be applied on the intermediate data. Accordingly, the reduce function processes the data in each partition and merges the intermediate values or performs a user-specified function. The user is expected to specify the types of the output key and the output value of the map and reduce functions. The output of the reduce function is collected to the output files on the disk by the Hadoop framework.

Hadoop Distributed File System (HDFS)

Hadoop can work directly with any mountable distributed file system, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS). It is a fault-tolerant distributed file system that is designed for commonly available hardware. It is well-suited for large data sets due to its high throughput access to application data.[3]

HDFS has the following features:[3]
  • Hadoop is designed to run on clusters of machines.[3]
  • HDFS can handle large data sets.
  • Since HDFS deals with large scale data, it supports a multitude of machines.
  • HDFS provides a write-once-read-many access model.
  • HDFS is built using the Java language making it portable across various platforms.

The main issue while dealing with large scale data is moving the data while performing computations which requires a lot of bandwidth. This can be simplified by shifting the computation closer to the data location, rather than moving the data closer to the application.
  1. ^ Hadoop: The Definitive Guide by Tom White, First Edition [page 153]
  2. ^ Pro Hadoop by Jason Venner [page 6]
  3. ^ http://hadoop.apache.org/common/docs/current/hdfs_design.html