Large-scale Data Mining Method based on Clustering Algorithm Combined with MAPREDUCE

Yulun Zhang; Chenxu Zhang; Lei Yang; Hongyang Li

doi:10.62051/8p9b3106

Keywords:

Data Mining; Clustering Algorithm; Apache; MapReduce; K-means Algorithm.

Abstract

As information technology continues to develop, both the diversity and the volume of data keep growing. Mining this text data effectively to extract valuable content has become an urgent task in data research. This study combines the MapReduce distributed computing framework with the K-means clustering algorithm to meet the challenges of large-scale data mining. The paper also uses a distributed caching mechanism to avoid repeated resource requests when multiple MapReduce jobs run in collaboration, which improves data mining efficiency. Experimental results, evaluated with both internal and external indicators, show that combining K-means with MapReduce makes full use of MapReduce's distributed and parallel computing characteristics, giving users an efficient and scalable data mining tool. This research offers new methods and insights for large-scale data mining and improves its efficiency and accuracy.
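The abstract does not show how K-means is split into map and reduce phases. As a minimal sketch, assuming the common formulation (mappers assign each point to its nearest centroid, reducers average each group), one iteration can be simulated on a single machine as below. The function names `map_assign`, `reduce_update`, and `kmeans` are hypothetical, and treating the centroid list as the data shipped through the distributed cache is an assumption, not a detail confirmed by the paper.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(points, centroids):
    """Map step: emit (index of nearest centroid, point) for every point.

    `centroids` stands in for the small read-only data that a distributed
    cache would ship to each mapper (an assumption for illustration).
    """
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_update(pairs, centroids):
    """Reduce step: group points by centroid index and average each group."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Chain map and reduce rounds until no centroid moves more than `tol`."""
    for _ in range(max_iter):
        updated = reduce_update(map_assign(points, centroids), centroids)
        moved = max(dist(a, b) for a, b in zip(updated, centroids))
        centroids = updated
        if moved < tol:
            break
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

In a real cluster each round would be a separate MapReduce job, which is where reusing cached centroids and resources across jobs, as the abstract describes, would be expected to save time.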
