Comparing Linear Regression and Random Forest for Housing Price Prediction: Insights from the Boston Housing Dataset
DOI:
https://doi.org/10.62051/d5x1ss82
Keywords:
Housing price prediction; Boston housing dataset; machine learning.
Abstract
Housing price prediction is a critical task in real estate and economic analysis, providing valuable insights for stakeholders such as homebuyers, sellers, and policymakers. This study uses the Boston Housing dataset, a benchmark with 505 samples and 14 features, to predict the median value of owner-occupied homes (MEDV) with Linear Regression and Random Forest Regression. Exploratory data analysis reveals non-linear patterns, a right-skewed distribution of MEDV (skewness = 1.11), and strong correlations between MEDV and features such as LSTAT (-0.74) and RM (0.70). The data were standardized and split 80/20 into training and testing sets for model evaluation. Random Forest outperforms Linear Regression, achieving an MSE of 7.58 and an R² of 0.864, compared with 19.38 and 0.652 for Linear Regression. Feature importance analysis identifies LSTAT and RM as the key predictors, underscoring socio-economic and structural influences on price. Random Forest is better at capturing non-linear relationships, while Linear Regression offers the interpretability needed for policy insights. The dataset's historical context and small size limit its applicability to modern markets, so future research should use larger, contemporary datasets and more advanced models.
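The quantities reported in the abstract (skewness, correlation, MSE, and R²) can be illustrated with a minimal sketch. It uses only the Python standard library and a small made-up set of values, not the Boston Housing data, and the function names are hypothetical. The abstract does not say which skewness convention was used; the sketch assumes the population (biased) form, so it is not expected to reproduce the reported 1.11.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def skewness(xs):
    """Population skewness: third central moment over variance^(3/2)."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative values only (assumed, not taken from the dataset).
y_true = [20.0, 22.0, 24.0, 30.0, 50.0]
y_pred = [21.0, 21.0, 25.0, 29.0, 46.0]

print("MSE:", mse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
print("skewness of y_true:", skewness(y_true))
print("corr(y_true, y_pred):", pearson(y_true, y_pred))
```

A lower MSE and a higher R² on the same held-out data indicate a better fit, which is the basis on which the abstract compares the two models.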
License
Copyright (c) 2025 Transactions on Computer Science and Intelligent Systems Research

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.







