Data Prediction and Intervention Effect Analysis Based on Random Forest and DID Algorithm

Yue Zhong; Bin Wu

doi:10.62051/1mh1r427

Keywords:

SHAP; PageRank; PSM; DID; random forest model; negative binomial regression model.

Abstract

This paper applies a negative binomial regression model, a random forest model, the PageRank algorithm, and the PSM-DID method to four tasks: predicting count data, assessing the importance of feature variables, analyzing association networks, and quantifying intervention effects. First, to address over-dispersion in count data, a negative binomial regression model is used; by modeling the outcome with a negative binomial distribution, it relaxes the equal mean and variance assumption of traditional Poisson regression and predicts count variables effectively. Second, a random forest model is built and combined with the SHAP method, which follows the principle of additive feature attribution: the marginal contributions of each feature variable are weighted to quantify its influence on the prediction, so that complex nonlinear relationships can be modeled and features ranked by importance. Third, a node association network is analyzed with the PageRank algorithm: a random walk model with a damping factor is defined, and the stationary distribution over the nodes is computed iteratively to rank the nodes by potential. Finally, the PSM-DID method quantifies the net effect of interventions: propensity score matching reduces selection bias, and the difference-in-differences (DID) method removes the influence of the time trend. Together, these methods handle count data, nonlinear relationships, and network-structured data, improve the stability of the results through mutual validation across complementary models, and provide a structured quantitative framework for data modeling and causal inference in multiple fields.
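As an illustration of two of the steps summarized above, the following minimal Python sketch shows a damped random walk iterated to its stationary distribution (PageRank) and a two-group, two-period difference-in-differences estimate. It is not the authors' implementation. The function names, the toy graph, the damping value of 0.85, and the sample outcome values are all assumptions made for illustration, and the propensity score matching step is assumed to have been carried out already, so the treated and control groups are taken as comparable.

```python
from statistics import fmean


def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Rank nodes by iterating a damped random walk until it stabilizes.

    links maps each node to the list of nodes it points to (hypothetical input format).
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Probability mass on nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its damped rank equally to the nodes it links to.
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Toy inputs, invented for illustration only.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
effect = did_estimate(
    treated_pre=[10, 12],
    treated_post=[15, 17],
    control_pre=[9, 11],
    control_post=[11, 13],
)
print(ranks)
print(effect)
```

In the toy graph, node C receives links from both A and B and therefore ends up with the highest rank. In the DID example, the treated group's mean rises by 5 and the control group's by 2, so the estimated net effect is 3; subtracting the control group's change is what strips out the shared time trend.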
