Analysis of Parallel Optimisation Strategies Based on MapReduce Models
DOI: https://doi.org/10.62051/ijcsit.v4n3.39

Keywords: MapReduce, Parallel computing, Big data, Spark, Performance tuning, Task scheduling

Abstract
This paper provides an in-depth analysis of parallel optimisation strategies for the MapReduce model, exploring how overall performance can be improved by optimising task allocation and scheduling, improving data locality, and increasing node utilisation. The research methodology comprises a review of existing MapReduce frameworks, on the basis of which a series of improvement strategies is proposed. These strategies raise the utilisation of computing resources by adjusting the granularity of task division, optimising data slicing and distribution, and refining task scheduling algorithms. The results show that reasonable optimisation of the MapReduce model's parallel strategy can significantly improve its performance in large-scale dataset processing, especially in resource-constrained distributed environments. The paper concludes that although the MapReduce model is mature in practice, considerable room remains for optimising its parallel strategy when facing larger-scale and more complex data processing tasks. Future research should pursue finer-grained task scheduling, dynamic resource allocation, and more efficient fault-tolerance mechanisms to continuously improve the parallel processing capability of the MapReduce model.
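The data-locality idea summarised above — prefer to run a map task on a node that already holds a replica of its input split, falling back to remote execution only when no local slot is free — can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the function and variable names (`schedule`, `tasks`, `free_slots`) are hypothetical.

```python
def schedule(tasks, free_slots):
    """Locality-aware assignment sketch.

    tasks: {task_id: set of hosts holding a replica of the task's input split}
    free_slots: {host: number of free map slots}
    Returns {task_id: host}, preferring node-local placement.
    """
    assignment = {}
    remaining = dict(free_slots)
    # Pass 1: node-local assignments (input split is read from local disk,
    # avoiding network transfer).
    for task, replicas in tasks.items():
        for host in replicas:
            if remaining.get(host, 0) > 0:
                assignment[task] = host
                remaining[host] -= 1
                break
    # Pass 2: place the remaining tasks on any host with a free slot
    # (the input split is fetched over the network).
    for task in tasks:
        if task in assignment:
            continue
        for host, slots in remaining.items():
            if slots > 0:
                assignment[task] = host
                remaining[host] -= 1
                break
    return assignment

tasks = {"t1": {"nodeA"}, "t2": {"nodeB"}, "t3": {"nodeA"}}
slots = {"nodeA": 1, "nodeB": 2}
print(schedule(tasks, slots))  # t3 has no local slot left, so it runs remotely
```

In this toy run, t1 and t2 land on their local nodes; t3's only replica host (nodeA) is full, so it is placed on nodeB and pays the cost of a remote read — exactly the trade-off that locality-aware scheduling tries to minimise.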
Copyright (c) 2024 International Journal of Computer Science and Information Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
