308374 wrote:If I am not mistaken, the idea of MAB algorithms is to find the best action in a specific state. For example, in the maze code, if we have 4 actions R, L, U and D, I can use MAB to find the best action to choose. Am I correct in this understanding?
If yes, then I want to start making use of the MAB algorithms (the code we have) to improve the agent's decision making from the previous chapter (maze and agent). Would you please give me a few hints on how to employ the outputs of the MAB code in the agent's decision making in the maze?
Thanks
Multi-armed bandits are most useful in situations where the rewards follow some distribution. Slot machines are a great example: every time you pull the lever, there is a probability of getting a reward, and a probability distribution over what the reward will be.
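To make that concrete, here is a minimal sketch (not code from the book) of a slot-machine-style bandit environment, where each arm has its own payout probability and payout amount; the specific numbers are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example: 4 arms, each paying out with some probability and amount.
payout_prob = [0.1, 0.25, 0.4, 0.05]   # chance each arm pays at all
payout_mean = [10.0, 4.0, 2.0, 50.0]   # average payout when it does pay

def pull(arm):
    """Sample one reward from the chosen arm's distribution."""
    if rng.random() < payout_prob[arm]:
        return rng.normal(payout_mean[arm], 1.0)
    return 0.0
```

The bandit's job is to estimate which arm is best purely from repeated noisy samples like these.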
For the maze-running robot, all the rewards are -1 (or 0 for the move to the terminal state).
While it's true that the total potential future reward differs by state (proportional to the distance to the exit), the multi-armed bandit formulation we are using doesn't incorporate any memory. This means the MAB has no conception of total future rewards, only samples of the immediate reward for a given action.
My suspicion is that it won't really learn to exit the maze, but it would be cool if you test it. To start, instead of considering each action on its own, consider each state-action pair. This changes the initialization function (a list comprehension over state-action pairs instead of just actions) and the greedy action selection. A rough sketch of that change is below.
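Here is a minimal sketch of what that modification could look like, assuming a simple epsilon-greedy bandit; the maze size, the action names, and the function names are assumptions for illustration, not the book's exact code:

```python
import random
from collections import defaultdict

actions = ['R', 'L', 'U', 'D']
states = [(r, c) for r in range(5) for c in range(5)]   # assumed 5x5 maze

# One value estimate and pull count per (state, action) pair,
# instead of one per action.
Q = {(s, a): 0.0 for s in states for a in actions}
N = defaultdict(int)

def select_action(state, epsilon=0.1):
    """Epsilon-greedy selection over the actions in the current state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward):
    """Incremental sample-average update for the chosen pair."""
    N[(state, action)] += 1
    Q[(state, action)] += (reward - Q[(state, action)]) / N[(state, action)]
```

Since every non-terminal move still returns -1, the estimates for most pairs will converge toward -1, which is why I suspect this alone won't be enough to find the exit, but it is the natural first step.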