A Study of Exploration-Exploitation Strategies in Unconventional Situations

Zhangqi Zheng

doi:10.62051/88exap78

Keywords:

Exploration-exploitation problem; multi-armed bandits; reinforcement learning.

Abstract

The exploration-exploitation problem is a central challenge in Reinforcement Learning (RL), and the Multi-Armed Bandit (MAB) problem serves as its foundation, providing a classical paradigm for exploration and exploitation strategies. With the development of big data and deep learning, applications of RL models in online learning, recommender systems, and other fields have become increasingly complex, giving rise to model variants such as multi-objective and stochastic-adversarial bandits. This paper reviews the limitations of classical algorithms such as ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling in multi-armed bandit systems. It explores potential improvements for unconventional reward settings, including the case where the reward signal is time-varying and arrives with some delay. It also discusses a limitation of the traditional MAB formulation, namely its inability to use contextual information. Two application-driven MAB variants tailored to real-world scenarios, multi-objective and adversarial bandits, are then examined. Finally, the cross-disciplinary characteristics of these variant algorithms are considered to provide algorithmic references for future research.
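For readers unfamiliar with the three classical algorithms named in the abstract, the following is a minimal sketch of how each selects an arm in a stationary Bernoulli bandit. It is not taken from the paper: the function names (`epsilon_greedy`, `ucb1`, `thompson`, `run_bandit`), the Bernoulli reward model, the UCB1 bonus term sqrt(2 ln t / n), the uniform Beta(1, 1) prior, and all parameter values are illustrative assumptions.

```python
import math
import random


def epsilon_greedy(counts, means, t, rng, epsilon=0.1):
    """With probability epsilon explore a random arm, else exploit the best mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: means[a])


def ucb1(counts, means, t, rng):
    """Play every arm once, then maximise mean plus an exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def thompson(counts, means, t, rng):
    """Sample each arm's success rate from its Beta posterior; play the largest."""
    def draw(a):
        wins = means[a] * counts[a]  # number of observed successes
        return rng.betavariate(1.0 + wins, 1.0 + counts[a] - wins)
    return max(range(len(counts)), key=draw)


def run_bandit(policy, arm_probs, horizon, seed=0):
    """Simulate a policy on Bernoulli arms (assumed, hypothetical environment)."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)   # pulls per arm
    means = [0.0] * len(arm_probs)  # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        arm = policy(counts, means, t, rng)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total += reward
    return counts, total
```

Calling, for example, `run_bandit(ucb1, [0.2, 0.8], 1000)` returns the pull counts per arm and the cumulative reward. All three policies assume immediate, stationary, context-free rewards, which is exactly the assumption that the delayed, time-varying, and contextual settings discussed in the abstract relax.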
