Nt1310 Unit 3.4 Data Analysis


3.4 Displaying meaningful results

Plotting points on a graph for analysis becomes difficult when dealing with extremely large amounts of information or many categories of information. For example, imagine you have 10 billion rows of retail SKU data that you are trying to compare. A user trying to view 10 billion points on the screen will have a hard time telling them apart. One way to resolve this is to cluster the data into a higher-level view in which smaller groups of data become visible. By grouping the data together, or “binning,” you can visualize the data more effectively.
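The binning idea can be illustrated with a minimal Python sketch. It assumes fixed-width bins over a numeric field; the function name `bin_values`, the bin width, and the sample prices are hypothetical and chosen only for illustration, not taken from any real SKU data set.

```python
from collections import defaultdict

def bin_values(values, bin_width):
    """Group numeric values into fixed-width bins and count each bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    counts = defaultdict(int)
    for v in values:
        # Floor division maps each value to the lower edge of its bin.
        lower_edge = (v // bin_width) * bin_width
        counts[lower_edge] += 1
    # Sort by lower edge so the bins come out in plotting order.
    return dict(sorted(counts.items()))

# Hypothetical SKU prices (illustrative only).
prices = [1.99, 2.49, 7.25, 12.00, 12.99, 18.50, 19.99, 24.00]
binned = bin_values(prices, bin_width=10)
for lower, count in binned.items():
    print(f"{lower:.0f} to {lower + 10:.0f}: {count} items")
```

Instead of plotting every row, a chart would then draw one bar per bin, so the number of marks on screen depends on the number of bins rather than on the number of rows.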

3.5 Dealing with outliers

The graphical representations of data made possible by visualization can communicate trends and outliers much faster than tables …

HDFS runs across the nodes in a Hadoop cluster and connects the file systems on many input and output data nodes into one big file system. The present Hadoop ecosystem, as shown in Figure 1, consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), and a number of related components such as Apache Hive, HBase, Oozie, Pig, and ZooKeeper. These components are explained below.

Sqoop: A project for transferring/importing data between relational databases and Hadoop.

Oozie: An orchestration and workflow management system for dependent Hadoop jobs.

Figure 2 gives an overview of the Big Data analysis tools used for efficient and precise data analysis and management. The Big Data analysis and management setup can be understood through the layered structure shown in the figure. Data storage is dominated by the HDFS distributed file system architecture; other available architectures include Amazon Web Services, HBase, and CloudStore. MapReduce handles the data processing tasks for all of these tools and is the data processing tool used effectively in Big Data analysis [13].
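To make the MapReduce processing model concrete, here is a minimal single-process sketch in Python. It only imitates the map, shuffle, and reduce phases in memory and does not use the Hadoop API; the function names and the sample sales records are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (category, units_sold) pair for one input record."""
    category, units_sold = record
    yield category, units_sold

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle, and reduce in sequence over in-memory records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical (category, units_sold) records, illustrative only.
records = [("toys", 3), ("books", 5), ("toys", 4), ("books", 1), ("garden", 2)]
totals = run_job(records)
print(totals)
```

In a real cluster the map and reduce functions would run in parallel on many nodes over data stored in HDFS, and the framework would perform the shuffle across the network; this sketch keeps everything in one process to show only the data flow.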

For handling the velocity and heterogeneity of data, tools such as Hive, Pig, and Mahout are used, which are parts of the Hadoop and HDFS framework. It is interesting to note that for all the tools used, Hadoop over HDFS is the underlying architecture. Oozie and EMR with Flume and ZooKeeper are used for handling the volume and veracity of data, and are standard Big Data management tools [13].

Figure 1: Hadoop Architecture
