
In section 8.4, listing 8.10, the reference value for learning is computed as follows:

action_q_vals[0, next_action_idx] = reward + self.gamma * next_action_q_vals[0,

I don't quite understand why this works in terms of units and scales. The reward is an amount of money, on a scale that depends on the budget and the share prices. The utility in next_action_q_vals should be some "arbitrary" number that depends on the input vector, the weights and the regularisation. We also cannot know the range of the Q-values in advance.

As far as I understand, adding the reward to the utility is supposed to increase or decrease the expected utility in the next step. But it should be possible (even expected, I assume) that the utilities and the rewards end up in very different ranges: say, utility values from 10K to 50K and reward values from 10.0 to 30.0 (especially since we only sell one share at a time). In that case the reward would barely shift the target, so it could be really hard to change the decision for the next step by adding it.

I hope I have made my question clear. Basically, I would at least expect some kind of normalisation of reward (money) and utility (unitless) before the update.
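To make concrete what I mean by normalisation, here is a minimal sketch of what I would have expected. It is not the book's code: `normalise_reward`, `td_target` and `reward_scale` are names I made up for illustration, and dividing by a fixed scale is just one possible choice.

```python
import numpy as np


def normalise_reward(reward, reward_scale):
    """Divide a monetary reward by a fixed scale (hypothetical parameter)
    so that it is roughly of order 1."""
    if reward_scale <= 0:
        raise ValueError("reward_scale must be positive")
    return reward / reward_scale


def td_target(reward, next_q_vals, gamma, reward_scale):
    """Reference value: scaled reward plus the discounted best next Q-value."""
    return normalise_reward(reward, reward_scale) + gamma * float(np.max(next_q_vals))


# Made-up numbers: a reward of 20.0 scaled by 100.0, and three next-step Q-values.
example_target = td_target(
    reward=20.0,
    next_q_vals=np.array([[0.2, 0.5, 0.1]]),
    gamma=0.9,
    reward_scale=100.0,
)
```

With something like this, the reward and the Q-values would at least live on comparable scales before they are added.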

EDIT: Actually, after some more consideration I also don't quite understand why in

action_q_vals[0, next_action_idx] = reward + self.gamma * next_action_q_vals[0,

the entry at next_action_idx is modified and not the entry for the current action. Shouldn't we reward or punish the current action, given the current state and the expectations for the next step?
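For comparison, this is roughly the update I would have expected, written as a small NumPy sketch. It follows the standard Q-learning target as I understand it, not necessarily the book's listing; the function name and the example numbers are my own assumptions.

```python
import numpy as np


def target_for_current_action(q_current, q_next, action_idx, reward, gamma):
    """Return a copy of the current state's Q-values in which only the entry
    of the action actually taken is replaced by the reference value
    reward + gamma * max(next-state Q-values)."""
    target = np.array(q_current, dtype=float, copy=True)
    target[0, action_idx] = reward + gamma * np.max(q_next[0])
    return target


# Made-up Q-values for three actions in the current and in the next state.
q_current = np.array([[1.0, 2.0, 3.0]])
q_next = np.array([[0.5, 4.0, 1.5]])

# Action 0 was taken in the current state and produced a reward of 10.0.
target = target_for_current_action(q_current, q_next, action_idx=0, reward=10.0, gamma=0.9)
```

Here only the entry for the action taken in the current state changes, and the next state enters solely through its best Q-value.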

Also, why does it make sense to use the utility of the next step without checking the actual reward of the next step (as defined in the formula on page 147)?

In general, I would like a bit more explanation in the text of why update_q in listing 8.10 is implemented this way. The general idea of RL is well communicated in the chapter, I would say, but the details of how to apply it in specific situations are still important.

Thanks in advance for your help!