Comparative Evaluation of GPT, BERT, and XLNet: Insights into Their Performance and Applicability in NLP Tasks

Authors

  • Chuxi Zhou

DOI:

https://doi.org/10.62051/h08exg91

Keywords:

Natural Language Processing (NLP); GPT; BERT; XLNet.

Abstract

Natural Language Processing (NLP) is a pivotal area in artificial intelligence, aiming to make computers capable of understanding and generating human language. This study evaluates and compares three prominent NLP models—the Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), and Generalized Autoregressive Pretraining for Language Understanding (XLNet)—to determine their strengths, limitations, and suitability for various tasks. The research involves a comprehensive analysis of these models, utilizing well-established datasets such as the Stanford Question Answering Dataset (SQuAD), General Language Understanding Evaluation (GLUE), Reading Comprehension from Examinations (RACE), and Situations with Adversarial Generations (SWAG). The study explores each model's architecture, pre-training, and fine-tuning processes: GPT's unidirectional approach is assessed for its language generation and handling of long-range dependencies; BERT's bidirectional encoding is examined for its effectiveness in context understanding; and XLNet's permutation-based training is analyzed for its robust contextual comprehension. The experimental results reveal that GPT excels in generative tasks but is constrained by its unidirectional nature. BERT achieves superior accuracy in comprehension tasks but is computationally demanding and susceptible to pre-training bias. XLNet outperforms both GPT and BERT in accuracy and contextual understanding, though at the cost of increased complexity. The results offer significant insight into the effectiveness and applicability of these models, suggesting future research directions such as hybrid models and improvements in efficiency.
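The core distinction the abstract draws—GPT's unidirectional context, BERT's bidirectional context, and XLNet's permutation-based factorization order—can be sketched as attention masks. The toy sequence length and variable names below are illustrative assumptions, not drawn from the paper; this is a minimal sketch of the masking idea, not of any model's actual implementation:

```python
import numpy as np

n = 4  # toy sequence length (assumption for illustration)

# GPT-style unidirectional (causal) mask: position i may attend only
# to positions <= i, so the matrix is lower triangular.
causal = np.tril(np.ones((n, n), dtype=int))

# BERT-style bidirectional mask: every position attends to every other,
# which is why BERT cannot be used directly for left-to-right generation.
bidirectional = np.ones((n, n), dtype=int)

# XLNet-style permutation mask: sample a random factorization order;
# each token may attend only to tokens that precede it in that order,
# so over many sampled orders every token sees context from both sides.
rng = np.random.default_rng(0)
order = rng.permutation(n)            # one sampled factorization order
rank = np.empty(n, dtype=int)
rank[order] = np.arange(n)            # rank[i] = position of token i in the order
perm = (rank[None, :] < rank[:, None]).astype(int)

print(causal)
print(perm)
```

Each matrix entry (i, j) is 1 when position i is allowed to attend to position j. The permutation mask, like the causal one, lets every token condition only on "earlier" tokens, preserving an autoregressive objective while removing the fixed left-to-right order.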

References

[1] Fanni S.C., Febi M., Aghakhanyan G., et al. Natural language processing. Introduction to Artificial Intelligence. Cham: Springer International Publishing, 2023: 87-99.

[2] Kalyanathaya K.P., Akila D., Rajesh P. Advances in natural language processing – a survey of current research trends, development tools and industry applications. International Journal of Recent Technology and Engineering, 2019, 7 (5C): 199-202.

[3] Ittoo A., van den B.A. Text analytics in industry: Challenges, desiderata and trends. Computers in Industry, 2016, 78: 96-107.

[4] Jordan M.I. Serial order: A parallel distributed processing approach. Advances in Psychology. North-Holland, 1997, 121: 471-495.

[5] Vaswani A., Shazeer N., Parmar N., et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.

[6] Radford A., Narasimhan K., Salimans T., et al. Improving language understanding by generative pre-training. 2018.

[7] Devlin J., Chang M.W., Lee K., et al. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018, arXiv preprint: 1810.04805.

[8] Yang Z., Dai Z., Yang Y., et al. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 2019, 32.

[9] Wang A., Singh A., Michael J., et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. 2018, arXiv preprint: 1804.07461.

[10] Rajpurkar P., Zhang J., Lopyrev K., et al. SQuAD: 100,000+ questions for machine comprehension of text. 2016, arXiv preprint: 1606.05250.

[11] Lai G., Xie Q., Liu H., et al. RACE: Large-scale reading comprehension dataset from examinations. 2017, arXiv preprint: 1704.04683.

[12] Mostafazadeh N., Roth M., Louis A., et al. LSDSem 2017 shared task: The story cloze test. Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics. Association for Computational Linguistics, 2017: 46-51.

[13] Zellers R., Bisk Y., Schwartz R., et al. SWAG: A large-scale adversarial dataset for grounded commonsense inference. 2018, arXiv preprint: 1808.05326.

[14] Taylor W.L. Cloze procedure: A new tool for measuring readability. Journalism Quarterly, 1953, 30 (4): 415-433.

Published

25-11-2024

How to Cite

Zhou, C. (2024) “Comparative Evaluation of GPT, BERT, and XLNet: Insights into Their Performance and Applicability in NLP Tasks”, Transactions on Computer Science and Intelligent Systems Research, 7, pp. 415–421. doi:10.62051/h08exg91.