Multi-armed Bandits

Hi all, I wanted to get some insights on a problem that I'm trying to model as a bandit. I'm fairly new to the subject, so if anything I say doesn't make sense, please explain. The basic idea is that pulling an arm gets you a reward, but that reward depends on factors that change over time, so pulling the same arm again won't necessarily give the same reward.

I tried epsilon-greedy, and the results more or less make sense. But I'm unsure whether UCB or Thompson sampling with a Gaussian model would be appropriate. My concern is that there seems to be no need to keep pulling an arm whose reward was low on the few occasions it was tried. Given how the reward is designed, a low reward already indicates that the arm need not be pulled. Such arms would then only need to be visited occasionally (as in epsilon-greedy).

So, would this behavior just be something like a cold-start problem, and would Thompson sampling eventually learn not to pick those arms? And how soon would "eventually" be? I would appreciate any insights, and I can clarify more if needed, thanks!
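
To make the "how soon" question concrete, here is a minimal sketch of Gaussian Thompson sampling. It rests on assumptions that are not part of the problem described above: each arm has a fixed mean reward, the changing factors are treated as Gaussian noise with a known standard deviation, and every arm starts from the same wide Normal prior. The function name `gaussian_thompson`, its parameters, and the example means are all hypothetical, chosen only for illustration.

```python
import math
import random


def gaussian_thompson(true_means, noise_sd=1.0, steps=2000,
                      prior_mean=0.0, prior_var=100.0, seed=0):
    """Run Gaussian Thompson sampling and return the pull count per arm.

    Assumptions (illustrative only): stationary arm means, Gaussian
    reward noise with known standard deviation, Normal prior per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # sum of observed rewards per arm
    noise_var = noise_sd ** 2

    for _ in range(steps):
        samples = []
        for a in range(k):
            # Conjugate Normal posterior for the arm mean (known noise variance).
            precision = 1.0 / prior_var + counts[a] / noise_var
            post_mean = (prior_mean / prior_var + sums[a] / noise_var) / precision
            post_sd = math.sqrt(1.0 / precision)
            # Draw one plausible mean for this arm from its posterior.
            samples.append(rng.gauss(post_mean, post_sd))

        # Pull the arm whose sampled mean is largest.
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += reward

    return counts


if __name__ == "__main__":
    # Hypothetical arm means; the last arm is the best one.
    print(gaussian_thompson([0.0, 0.5, 2.0]))
```

Under these assumptions, an arm's posterior narrows roughly as one over the square root of its pull count, so an arm that looks clearly worse after a few pulls is sampled less and less often but never with exactly zero probability. If the rewards really do drift over time, this stationary posterior would likely be too confident, and something like a discounted or sliding-window update may be worth trying instead.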

submitted by /u/Leather_Amount_2268