I am having trouble understanding the update_q() method of the QLearningDecisionPolicy class in Listing 8.10. The method takes an action argument but never uses it. The point of the function is to update the NN model. Intuitively, one needs both the state and the action to calculate Q-values (rewards). As written, the reward is given for a chosen action, and the NN is updated by calculating the Q-values of the current action and the results of the possible next actions.

I would do the optimization step so that the rewards of all actions are calculated, forming a Q-vector that the NN is optimized against.
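To make the point concrete, here is a minimal sketch of one common way to build such a target Q-vector. It is not the code from Listing 8.10: the function name q_target, the discount factor gamma, and the example numbers are all assumptions made for illustration. The idea is that the action argument selects which entry of the Q-vector gets the new estimate, while the other entries keep the network's current predictions.

```python
import numpy as np

def q_target(q_current, q_next, action, reward, gamma=0.9):
    """Hypothetical helper: target Q-vector for one (state, action, reward, next_state) step.

    q_current: Q-values the network predicts for the current state (one per action)
    q_next:    Q-values the network predicts for the next state
    action:    index of the action that was actually taken
    """
    # Copy so the network's own predictions are left untouched.
    target = np.array(q_current, dtype=float)
    # Only the entry for the chosen action is replaced by the Q-learning estimate.
    target[action] = reward + gamma * np.max(q_next)
    return target

# Made-up numbers for three possible actions.
q_current = np.array([0.2, 0.5, 0.1])
q_next = np.array([0.0, 1.0, 0.3])
target = q_target(q_current, q_next, action=2, reward=1.0)
print(target)  # entry 2 becomes 1.0 + 0.9 * 1.0 = 1.9
```

The NN would then be trained to move its output for the current state toward target, so the loss is nonzero only for the action that was taken.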

If the logic is correct, I think it at least needs more explanation.

This is an interesting topic that I have not found covered much in other books!


Nishant Shukla
I agree the logic in the RL chapter needs to be clarified. I'm going to take a deeper look and fix it up by next week. It's definitely a neat and novel example, so it's worth my time doing it right!