Research on AI-Generated Text Detection Based on Machine Learning Models

Moran Zeng

doi:10.62051/8k1jga32

Keywords:

Support Vector Machine (SVM), Logistic Regression, Naive Bayes (NB) Classifier, Comparison, Improvements, Suggestions.

Abstract

This research aims to help ensure the authenticity of information, safeguard the reliability and credibility of information sources, and prevent the spread of false information, fabricated data, and misleading content. In the academic field, detecting AI-generated papers, articles, and assignments helps maintain academic integrity and prevent academic fraud and plagiarism, which in turn helps improve academic capabilities. This study summarizes the characteristics of three selected models, Logistic Regression, Support Vector Machine (SVM), and the Naive Bayes (NB) classifier, and provides recommendations and directions for improvement in the choice of detection models for AI-generated content. The three models are compared on the same dataset in terms of Accuracy, Precision, and F1-score. Logistic Regression performs best on this type of dataset, with all metrics exceeding 90%. SVM, which the study regards as suitable for large datasets, achieves metrics of around 70% on this dataset. Naive Bayes, which is typically suited to smaller datasets, performs poorly on this dataset, achieving only 50% accuracy.
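
As an illustration of the three evaluation metrics named in the abstract, the following minimal sketch computes Accuracy, Precision, and F1-score from binary labels, treating label 1 as "AI-generated". The function name `binary_metrics`, the model names, and the toy labels are hypothetical and chosen only for illustration; they are not the study's code, data, or results. Recall is included because F1-score is defined from Precision and Recall.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Return accuracy, precision, recall and F1 for binary labels (1 = AI-generated)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # AI text flagged as AI
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # human text flagged as AI
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # AI text missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # human text passed

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a model never predicts the positive class.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data, made up for illustration; not results from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "model_a": [1, 1, 1, 0, 0, 0, 0, 1],  # one miss, one false alarm
    "model_b": [1, 1, 1, 1, 1, 1, 1, 1],  # labels everything as AI-generated
}

for name, y_pred in predictions.items():
    m = binary_metrics(y_true, y_pred)
    print(f"{name}: accuracy={m['accuracy']:.2f} "
          f"precision={m['precision']:.2f} f1={m['f1']:.2f}")
```

In this toy example, a model that labels every text as AI-generated reaches 50% accuracy on a balanced set while still showing perfect recall, which is one reason to report Precision and F1-score alongside Accuracy when comparing detectors.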
