Research on a Chinese Text Information Density Evaluation Model Fusing Semantic and Statistical Features

Authors

  • Zhaoyang Ye

DOI:

https://doi.org/10.62051/ggfj0v47

Keywords:

Information Density; Chinese Text Evaluation; Semantic Features; Statistical Features; Deep Learning; Fusion Model.

Abstract

Traditional methods for evaluating the information content of Chinese text rely primarily on statistical features and overlook semantic and structural complexity. To address this, this study proposes a Chinese text information density evaluation model that fuses semantic and statistical features. The model adopts a dual-channel fusion architecture: the semantic channel uses the pre-trained language model BERT to extract deep contextual embeddings of the text, combined with a Bidirectional Long Short-Term Memory network (BiLSTM) to capture long-range semantic dependencies, while the statistical channel integrates Term Frequency-Inverse Document Frequency (TF-IDF) weights, part-of-speech (POS) distributions, and dependency-relation features. The two heterogeneous feature sets are concatenated and passed through a fusion gating module that combines them and models their non-linear interactions, after which a regression layer outputs a standardized information density score. For training and evaluation, the study operationalizes the definition of information density and constructs a manually annotated dataset of 210 diverse Chinese texts. Experimental results demonstrate that the proposed model significantly outperforms a range of baseline models across all evaluation metrics, validating the effectiveness of the fusion approach for Chinese text information density evaluation. This research provides a new analytical tool for applications such as text quality assessment and locating high-value information.
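
As a concrete illustration of the dual-channel design the abstract describes, the following PyTorch sketch shows one plausible reading of the architecture. All layer sizes, the mean-pooling step, the sigmoid formulation of the fusion gate, and every identifier are assumptions made for illustration only; the paper's actual implementation is not reproduced here.

    # Minimal sketch of the dual-channel fusion architecture
    # (assumed details; not the authors' released code).
    import torch
    import torch.nn as nn

    class DualChannelDensityModel(nn.Module):
        def __init__(self, bert_dim=768, lstm_hidden=256, stat_dim=64):
            super().__init__()
            # Semantic channel: a BiLSTM over BERT token embeddings captures
            # long-range dependencies; its output width is 2 * lstm_hidden.
            self.bilstm = nn.LSTM(bert_dim, lstm_hidden,
                                  batch_first=True, bidirectional=True)
            fused_dim = 2 * lstm_hidden + stat_dim
            # Fusion gate: a sigmoid gate weights each dimension of the
            # concatenated heterogeneous features, and a small MLP then
            # models their non-linear interactions.
            self.gate = nn.Sequential(nn.Linear(fused_dim, fused_dim),
                                      nn.Sigmoid())
            self.interact = nn.Sequential(nn.Linear(fused_dim, 128),
                                          nn.ReLU())
            # Regression head outputs a single standardized density score.
            self.regressor = nn.Linear(128, 1)

        def forward(self, bert_embeddings, stat_features):
            # bert_embeddings: (batch, seq_len, bert_dim) contextual vectors
            # stat_features:   (batch, stat_dim) TF-IDF / POS / dependency stats
            lstm_out, _ = self.bilstm(bert_embeddings)
            semantic = lstm_out.mean(dim=1)        # pool over the sequence
            fused = torch.cat([semantic, stat_features], dim=-1)
            gated = self.gate(fused) * fused       # element-wise gating
            return self.regressor(self.interact(gated)).squeeze(-1)

Trained with a regression loss (for example, mean squared error) against the manually annotated density scores, such a model would mirror the pipeline the abstract outlines: semantic and statistical channels, gated fusion, and a scalar density output.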

Published

19-08-2025

How to Cite

Ye, Z. (2025) “Research on a Chinese Text Information Density Evaluation Model Fusing Semantic and Statistical Features”, Transactions on Computer Science and Intelligent Systems Research, 10, pp. 103–114. doi:10.62051/ggfj0v47.