Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detection

Zhenglin Li; Haibei Zhu; Houze Liu; Jintong Song; Qishuo Cheng

doi:10.62051/ijcsit.v2n1.01

Keywords:

Malware detection; Machine learning; Mal-API-2019 dataset; Cybersecurity threats

Abstract

This study examines malware detection with machine learning by evaluating a range of classification models on the Mal-API-2019 dataset. The aim is to advance cybersecurity capabilities by identifying and mitigating threats more effectively. Both ensemble and non-ensemble methods are explored, including Random Forest, XGBoost, K-Nearest Neighbors (KNN), and Neural Networks. Special emphasis is placed on data pre-processing, particularly TF-IDF representation and Principal Component Analysis (PCA), and on its role in improving model performance. Results indicate that the ensemble methods, particularly Random Forest and XGBoost, achieve higher accuracy, precision, and recall than the other models evaluated, highlighting their effectiveness in malware detection. The paper also discusses limitations and potential future directions, emphasizing the need for continuous adaptation as malware evolves. This research contributes to ongoing discussions in cybersecurity and provides practical insights for developing more robust malware detection systems.
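As an illustration of the TF-IDF representation mentioned in the abstract, the following is a minimal sketch that assumes each sample is a sequence of API call names. The function name `tfidf`, the example traces, and the textbook weighting (term frequency times log of inverse document frequency) are assumptions made for this example; the abstract does not state which TF-IDF variant the authors used.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using textbook TF-IDF.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical API-call traces, invented for illustration only.
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["RegOpenKeyExW", "RegSetValueExW", "CloseHandle"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
]
weights = tfidf(traces)
```

Under this weighting, a call that appears in every trace (here `CloseHandle`) receives zero weight, while calls that appear in fewer traces receive larger weights. Vectors of this kind could then be reduced with PCA and passed to a classifier such as Random Forest, which is the general pipeline the abstract describes.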
