Research on genetic disease identification and classification model based on combined sampling and random forest feature selection

Authors

  • Jianzhang Li

DOI:

https://doi.org/10.62051/7nttaq96

Keywords:

Gene Identification; Genetic Disease Testing; Class Imbalance; Curse of Dimensionality.

Abstract

Machine learning classification models are widely used in gene identification tasks. However, the high dimensionality of gene loci data and label imbalance remain pressing problems. To address them, this paper proposes a classification model that pairs combined sampling with random forest feature selection. The method applies a random forest to randomly drawn subsamples to select features, aiming to retain the information useful for prediction while alleviating the curse of dimensionality. The feature-selected subsamples can then be used with any classification model. Experiments show that the proposed method outperforms a variety of classic classification models in both accuracy and efficiency.
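The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the sampling scheme, so the sketch assumes that "combined sampling" means undersampling the majority class and oversampling the minority class to a common size. The function names, the synthetic data, the number of subsamples, the number of retained features, and the choice of logistic regression as the downstream classifier are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def balanced_subsample(X, y, rng):
    """Assumed form of combined sampling: draw every class to a common size
    that lies between the minority and majority class counts."""
    classes, counts = np.unique(y, return_counts=True)
    target = int(np.sqrt(counts.min() * counts.max()))
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # Sample with replacement only when the class is smaller than target.
        idx.append(rng.choice(members, size=target, replace=bool(n < target)))
    idx = np.concatenate(idx)
    return X[idx], y[idx]


def select_features(X, y, k, seed):
    """Rank features by random forest importance and keep the top k."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:k]


def fit_ensemble(X, y, n_subsamples=5, k=20, seed=0):
    """Fit one classifier per feature-selected random subsample."""
    rng = np.random.default_rng(seed)
    members = []
    for i in range(n_subsamples):
        Xs, ys = balanced_subsample(X, y, rng)
        cols = select_features(Xs, ys, k, seed + i)
        # Any classifier could be used here; logistic regression is a stand-in.
        clf = LogisticRegression(max_iter=1000).fit(Xs[:, cols], ys)
        members.append((cols, clf))
    return members


def predict(members, X):
    """Average the positive-class probabilities (binary 0/1 labels assumed)."""
    proba = np.mean(
        [clf.predict_proba(X[:, cols])[:, 1] for cols, clf in members], axis=0
    )
    return (proba >= 0.5).astype(int)


# Synthetic stand-in for high-dimensional, imbalanced gene loci data.
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=10,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
members = fit_ensemble(X_train, y_train)
pred = predict(members, X_test)
```

Each subsample gets its own feature subset, so the final prediction averages classifiers that see different loci. That is one way to read the abstract's statement that the feature-selected subsamples can be used with any classification model.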

Published

25-11-2024

How to Cite

Li, J. (2024) “Research on genetic disease identification and classification model based on combined sampling and random forest feature selection”, Transactions on Computer Science and Intelligent Systems Research, 7, pp. 685–693. doi:10.62051/7nttaq96.