Optimal Machine Learning Algorithms for Predicting the Popularity of Songs

Zhuoyang Tao

doi:10.62051/34tchf58

Keywords:

Spotify; song popularity; machine learning.

Abstract

Predicting song popularity has become an active research topic in recent years. This study compares three machine learning models, Linear Regression, Random Forest, and K-Nearest Neighbors (KNN), to identify the most suitable algorithm for predicting the popularity of songs on Spotify from audio features. A dataset containing key musical attributes such as danceability, loudness, energy, tempo, and valence was preprocessed through standardization and one-hot encoding. Model performance was evaluated using metrics including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). The results show that Random Forest outperforms the other models, achieving the lowest prediction error and the highest explanatory power. Feature importance analysis further revealed that duration, speechiness, and emotional characteristics such as energy and valence are more influential in determining a song’s popularity, whereas musical key and mode are less influential. The study concludes that, while audio features offer valuable insights, future work should also consider external factors such as playlist placement and social media trends to improve prediction accuracy.
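
For readers unfamiliar with the four evaluation metrics named in the abstract, the following minimal sketch shows how they are conventionally computed. It is not the study's own code: the function name `regression_metrics` and the sample popularity scores are illustrative assumptions, and only the Python standard library is used.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean Squared Error: average of squared residuals.
    mse = sum(e * e for e in errors) / n
    # Mean Absolute Error: average of absolute residuals.
    mae = sum(abs(e) for e in errors) / n

    # R^2 compares residual variance with the variance around the mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical popularity scores and predictions, for illustration only.
actual = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 73.0, 77.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Lower MSE, RMSE, and MAE indicate smaller prediction errors, while an R² closer to 1 indicates that the model explains more of the variance in popularity. This is the sense in which the abstract describes one model as outperforming the others.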
