Enhancing Spam Email Detection with Machine Learning: A Comparative Study of Logistic Regression and Naive Bayes Using Apache Spark

Authors

  • Zhaoyang Ye

DOI:

https://doi.org/10.62051/gt8zn492

Keywords:

Spam; Machine Learning; Naive Bayes; Logistic Regression; Apache Spark.

Abstract

Spam email undermines both email security and user experience. This study develops a spam email classification system using two machine learning techniques, Logistic Regression and Naive Bayes, within the Apache Spark framework. The Enron email dataset is preprocessed in four steps: text cleaning to remove irrelevant information, tokenization to split the text into individual words, stop-word removal to eliminate common but uninformative words, and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) to quantify the importance of each term within the dataset. The experiments use a subset of the Enron email dataset comprising 11,029 emails, of which 2,996 are labeled as spam. In these experiments, the Naive Bayes model outperforms Logistic Regression, achieving both higher accuracy and a higher F1 score. This result suggests that Naive Bayes is a robust choice for spam email classification and a practical option for improving email security through spam filtering.
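The preprocessing steps named in the abstract (cleaning, tokenization, stop-word removal, TF-IDF weighting) can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's implementation: the stop-word list, the sample emails, and the function names `preprocess` and `tf_idf` are all assumptions made for the example, and it runs without Spark. The smoothed IDF form log((N + 1) / (df + 1)) is the one documented for Spark MLlib; whether the study used that exact variant is not stated.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "you", "your", "for", "at"}


def preprocess(text):
    """Clean, tokenize, and drop stop words from one email body."""
    # Text cleaning: lowercase and replace anything that is not a letter or whitespace.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenization on whitespace, then stop-word removal.
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Weight = raw term count in the document times the term's IDF.
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


# Hypothetical sample emails, not taken from the Enron dataset.
emails = [
    "Win a FREE prize now!!! Claim your prize today.",
    "Meeting moved to 3pm, see the agenda attached.",
    "Claim your free vacation now",
]
tokens = [preprocess(e) for e in emails]
weights = tf_idf(tokens)

for doc_weights in weights:
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1])[:3])
```

In this toy corpus, a term that occurs twice in a single email and nowhere else ("prize") receives a larger weight than a term shared across emails ("free"), which is the behaviour TF-IDF is meant to capture. In a Spark pipeline the same stages would typically map to MLlib transformers such as `Tokenizer`, `StopWordsRemover`, `HashingTF`, and `IDF`, feeding a `NaiveBayes` or `LogisticRegression` estimator.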

Published

25-11-2024

How to Cite

Ye, Z. (2024) “Enhancing Spam Email Detection with Machine Learning: A Comparative Study of Logistic Regression and Naive Bayes Using Apache Spark”, Transactions on Computer Science and Intelligent Systems Research, 7, pp. 78–85. doi:10.62051/gt8zn492.