Pt1420 Unit 3 Problem Analysis Paper


Hadoop MapReduce is a framework for processing large data sets in parallel across a Hadoop cluster. A job runs in two phases, Map and Reduce. In the canonical word-count example, the map phase counts the words in each document, and the reduce phase aggregates the per-document data into word counts spanning the entire collection: the results of the parallel map tasks become the input to a set of parallel reduce tasks, which consolidate the data into a final result. MapReduce takes key/value pairs as input and produces a set of output key/value pairs: the input set (KV1) is mapped across the cluster to produce an intermediate set (KV2), which the reduce phase then aggregates into a new output set (KV3).
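The KV1 → KV2 → KV3 flow described above can be sketched in plain Python. This is a minimal simulation of the word-count job, not the actual Hadoop API: the `shuffle` function stands in for the grouping-by-key that the Hadoop framework performs between the two phases, and all function names here are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: turn the input set (KV1) into intermediate (word, 1) pairs (KV2)."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate values by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts per word, producing the final output set (KV3)."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big cluster", "data processing"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'processing': 1}
```

In a real Hadoop job the map and reduce functions run as parallel tasks on different nodes, and the shuffle moves data over the network; the per-key logic, however, is the same as in this single-process sketch.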

The mapper extracts the support-call identifier (passed to the reducer as the key) and the support-call description (passed to the reducer as the value). For k-means clustering, each map task receives a subset of the input data points together with the initial centroids and is responsible for assigning each point to its nearest centroid (cluster). For every point, the mapper emits a key/value pair in which the key is the cluster identifier and the value is the coordinates of the point. The algorithm uses a combiner to reduce the amount of data transferred from the mapper to the reducer. The Hadoop system performs several setup tasks before the map/reduce phases begin.
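The k-means mapper and combiner described above can be sketched as follows. This is an illustrative single-process simulation, not Hadoop code; the function names, the 2-D points, and the centroid values are assumptions for the example. The combiner pre-aggregates each cluster's points into a (coordinate sum, count) pair, so only one record per cluster leaves the mapper instead of one record per point.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_map(points, centroids):
    """Map: emit (cluster id, point coordinates) for each input point."""
    for p in points:
        yield (nearest_centroid(p, centroids), p)

def combiner(pairs):
    """Combiner: collapse each cluster's points into (coordinate sum, count),
    cutting the data shipped from mapper to reducer to one record per cluster."""
    partial = {}
    for cid, (x, y) in pairs:
        (sx, sy), n = partial.get(cid, ((0.0, 0.0), 0))
        partial[cid] = ((sx + x, sy + y), n + 1)
    return partial

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (2.0, 0.0), (9.0, 11.0)]
partial = combiner(kmeans_map(points, centroids))
print(partial)  # {0: ((3.0, 1.0), 2), 1: ((9.0, 11.0), 1)}
```

The reducer would then merge the partial (sum, count) pairs for each cluster and divide to obtain the new centroid positions for the next iteration.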
