308374 (11)
#1
If I am not mistaken, the idea of MAB algorithms is to find the best action to take in a specific state. For example, in the maze code, if we have four actions R, L, U, and D, I can use a MAB to find the best action to choose. Am I correct in this understanding?

If yes, then I want to start using the MAB algorithms (the code we have) to improve the agent's decision making from the previous chapter (the maze and agent). Would you please give me a few hints on how to use the outputs of the MAB code in the agent's decision making in the maze?

Thanks
Phil Tabor (9)
#2
308374 wrote: If I am not mistaken, the idea of MAB algorithms is to find the best action to take in a specific state. For example, in the maze code, if we have four actions R, L, U, and D, I can use a MAB to find the best action to choose. Am I correct in this understanding?

If yes, then I want to start using the MAB algorithms (the code we have) to improve the agent's decision making from the previous chapter (the maze and agent). Would you please give me a few hints on how to use the outputs of the MAB code in the agent's decision making in the maze?

Thanks

Multi-armed bandits are most useful in situations where the rewards follow some distribution. Slot machines are a great example: every time you pull the lever, there is a probability of getting a reward, and a probability distribution over what that reward will be.
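To make that concrete, here is a minimal sketch of that setting. It is not the book's code; the names, the Gaussian payouts, and the epsilon value are my own assumptions. Each arm is a slot machine with a hidden mean payout, and an epsilon-greedy agent keeps a running average of the rewards it has sampled from each arm.

import numpy as np

# Hedged sketch (my own names/parameters, not the book's code):
# an epsilon-greedy agent pulling k arms, each arm paying out a reward
# drawn from its own Gaussian distribution.
np.random.seed(0)
k = 4                                    # number of arms (slot machines)
true_means = np.random.normal(0, 1, k)   # hidden mean payout of each arm
Q = np.zeros(k)                          # estimated value of each arm
N = np.zeros(k)                          # times each arm has been pulled
eps = 0.1                                # exploration rate

for t in range(1000):
    if np.random.random() < eps:
        a = np.random.randint(k)         # explore: pick a random arm
    else:
        a = int(np.argmax(Q))            # exploit: pick the best estimate
    reward = np.random.normal(true_means[a], 1)   # sample from that arm's distribution
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]       # incremental sample-average update

print("true means:     ", np.round(true_means, 2))
print("estimated means:", np.round(Q, 2))

The key point is that the agent only ever sees samples of the immediate reward for the arm it pulled; the sample-average update Q[a] += (reward - Q[a]) / N[a] is all the "learning" it does.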

For the maze-running robot, every reward is -1 (or 0 for the move into the terminal state).

While it's true that the total potential future reward from each state is different (proportional to the distance to the exit), the multi-armed bandit formulation we are using doesn't incorporate any memory. This means the MAB has no conception of total future rewards, only samples of the immediate reward for a given action.

My suspicion is that it won't really learn to exit the maze, but it would be cool if you tested it. To start, instead of just considering each action, consider each state-action pair. This changes the initialization (a list comprehension over state-action pairs instead of just actions) and the greedy action selection; a rough sketch of that change is below.
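Here is roughly what I mean, as a hedged sketch rather than the book's maze code. The grid size, the action labels, and the dictionary layout are my assumptions; the point is just that the bandit's estimates and counts are keyed on (state, action) pairs instead of actions alone.

import numpy as np

# Hedged sketch -- 'states', 'actions', and the dict layout are assumptions,
# not the book's maze code.
actions = ['U', 'D', 'L', 'R']
states = [(row, col) for row in range(6) for col in range(6)]   # e.g. a 6x6 grid

# initialization: one estimate and one pull-count per state-action pair
Q = {(s, a): 0.0 for s in states for a in actions}
N = {(s, a): 0 for s in states for a in actions}

def choose_action(state, eps=0.1):
    """Epsilon-greedy selection over the actions in the current state."""
    if np.random.random() < eps:
        return np.random.choice(actions)
    values = np.array([Q[(state, a)] for a in actions])
    return actions[int(np.argmax(values))]

def update(state, action, reward):
    """Incremental sample-average update for the chosen state-action pair."""
    N[(state, action)] += 1
    Q[(state, action)] += (reward - Q[(state, action)]) / N[(state, action)]

Because every step's immediate reward is -1, each Q[(state, action)] will just converge toward -1, which is exactly why I suspect this formulation won't learn a path to the exit on its own.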