Applying Data Mining Techniques Over Big Data

With the rapid development of information technology, data flows in a wide variety of formats: sensor data, tweets, photos, raw data, and unstructured data. Statistics show that about 800,000 petabytes of data were stored worldwide in 2000; today the digital universe holds roughly 1.8 zettabytes (a zettabyte is 10^21 bytes), and this figure is expected to reach 35 zettabytes by 2020. Traditional data management systems cannot scale to this huge amount of raw, unstructured data, which is what is today called big data. In this study, we present the basic concepts and design of big data tools, algorithms, and techniques. We compare classical data mining algorithms with their big data counterparts, using Hadoop/MapReduce as the core implementation platform for scalable algorithms. We implemented the K-means and Apriori algorithms with Hadoop/MapReduce on a five-node Hadoop cluster and report their performance on gigabytes of data. Finally, we explore NoSQL (Not Only SQL) databases for semi-structured, massively large-scale data, using MongoDB as an example, and compare the performance of HDFS (Hadoop Distributed File System) and MongoDB as data stores for these two algorithms.
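To make the MapReduce formulation of K-means concrete, the following is a minimal sketch of a single iteration, assuming the current centroids are passed to each mapper through the job configuration: the mapper assigns each point to its nearest centroid, and the reducer recomputes each centroid as the mean of its assigned points. The class names and the "kmeans.centroids" configuration key are illustrative assumptions, not the exact code used in our experiments.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  // Mapper: assign each input point to the nearest current centroid.
  public static class KMeansMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids; // k centroids, one row of coordinates each

    @Override
    protected void setup(Context ctx) {
      // Assumed convention: centroids passed as "x1,y1;x2,y2;..." in the job config.
      String[] rows = ctx.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[rows.length][];
      for (int i = 0; i < rows.length; i++) {
        String[] parts = rows[i].split(",");
        centroids[i] = new double[parts.length];
        for (int j = 0; j < parts.length; j++) {
          centroids[i][j] = Double.parseDouble(parts[j]);
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Each input line is one point: comma-separated coordinates.
      String[] parts = value.toString().split(",");
      double[] p = new double[parts.length];
      for (int j = 0; j < parts.length; j++) p[j] = Double.parseDouble(parts[j]);

      // Find the centroid with the smallest squared Euclidean distance.
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {
        double d = 0;
        for (int j = 0; j < p.length; j++) {
          d += (p[j] - centroids[i][j]) * (p[j] - centroids[i][j]);
        }
        if (d < bestDist) { bestDist = d; best = i; }
      }
      ctx.write(new IntWritable(best), value); // emit (nearest centroid id, point)
    }
  }

  // Reducer: new centroid = component-wise mean of all points assigned to it.
  public static class KMeansReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> points, Context ctx)
        throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text t : points) {
        String[] parts = t.toString().split(",");
        if (sum == null) sum = new double[parts.length];
        for (int j = 0; j < parts.length; j++) sum[j] += Double.parseDouble(parts[j]);
        count++;
      }
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < sum.length; j++) {
        if (j > 0) sb.append(',');
        sb.append(sum[j] / count);
      }
      ctx.write(clusterId, new Text(sb.toString())); // emit updated centroid
    }
  }
}
```

A driver program would run this job repeatedly, feeding the reducer's output back in as the next set of centroids until they stop moving (or a maximum number of iterations is reached); Apriori follows a similar iterative pattern, with each MapReduce pass counting candidate itemsets of the next size.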
