Hadoop [8] is an open-source implementation of the MapReduce programming model that runs in a distributed environment. Hadoop consists of two core components, namely the Hadoop Distributed File System (HDFS) and the MapReduce programming and job management framework. Both HDFS and MapReduce follow a master-slave architecture. A Hadoop program (client) submits a job to the MapReduce framework through the jobtracker, which runs on the master node. The jobtracker assigns tasks to the tasktrackers running on many slave nodes or on a cluster of machines. The tasktrackers regularly send messages called heartbeats to the jobtracker to report their status, such as alive, idle, or busy. If a task fails or times out, or a node dies, the jobtracker automatically re-schedules the affected tasks on available nodes. The HDFS component consists of a single namenode and multiple datanodes. The namenode maintains the metadata about the data blocks stored on each datanode. When a client application reads or writes data in HDFS, it first contacts the namenode for the block locations and then transfers the data directly to or from the corresponding datanodes.
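As an illustration of how a client submits a job to this framework, the sketch below uses the standard org.apache.hadoop.mapreduce API; the class and path names (WordCountDriver, WordCountMapper, WordCountReducer, args[0], args[1]) are assumptions for a hypothetical word-count job and are not taken from [8]. The mapper and reducer classes it refers to are sketched after the next paragraph. The call to waitForCompletion hands the job to the master (the jobtracker in classic Hadoop) and blocks until every map and reduce task has finished.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");      // client-side job definition
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);           // hypothetical mapper, sketched below
        job.setReducerClass(WordCountReducer.class);         // hypothetical reducer, sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path (assumed)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path (assumed)
        // Submits the job to the master node and blocks until all tasks complete.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}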
Both Map and Reduce are blocking operations: data cannot proceed to the next stage until all tasks of the current stage have completed. The output of the mappers must first be written to the map tasks' local disks before it is shuffled to the reducers, and the shuffle does not begin until all map tasks have finished, because the intermediate results are grouped by a sort-merge strategy. The block-level restart, the one-to-one shuffling strategy, and the runtime scheduling limit the performance of each node. The MapReduce framework also lacks the optimized execution plans found in traditional DBMSs and does not optimize data transfer across nodes. As a result of this architecture, Hadoop exhibits high latency and is therefore better suited to batch jobs than to real-time processing.
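A minimal sketch of this blocking behaviour, continuing the hypothetical word-count example (class names are assumptions): every (word, 1) pair the mapper writes is buffered, sorted by key, and spilled to local disk, and the reducer's reduce() method is not invoked until the sort-merge shuffle has collected all values for a key from every finished map task.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each intermediate pair is buffered and sorted locally; nothing is
        // shuffled to a reducer until all map tasks have finished.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The sort-merge shuffle has already grouped every value for this key,
        // so the reduce stage cannot start before the map stage completes.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}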