308374 (3)
I cannot fully understand why we have two value functions. The value of a state depends on the action we take in that state; to me, a state has its actions built in. A good state should have a good action chosen (out of the actions allowable in that state). So why do we need to define a separate action-value function? In other words, why is the state value not sufficient?
Miguel Morales (15)
Very interesting question. Think about it this way: the value of a state is like a situation in your life, you won a tennis match, got married, went bankrupt. There is value in those situations regardless of how you act in them. Now, the problem is that with a state-value function we still don't know how to act... you know going bankrupt sucks, but do you know what's the best thing to do if you were in that state? Obviously, there is a relationship between states and actions, since doing one thing while bankrupt is likely of lower value than doing "anything" while not bankrupt. See, the action-value function tries to help you cache the values of state-action pairs. To make things even more interesting, some actions will have different values when taken in slightly different situations. Say you have children: maybe raising your voice is not a good idea, or is it not a good idea in all situations?!? Sometimes investing in the stock market is a good idea, sometimes it is not... so the action-value function is indexed by the state (the situation) and the action (how you react to the event).

One thing I should emphasize is that the MDP/environment and the agent are opposites. The states and the actions available are part of the environment, the MDP. The agent keeps track of value functions only for the sake of finding the best behavior (policies) given an MDP/problem. You can think of it as the environment selecting the states for your agent, and your agent selecting the actions... sure, the environment will have some internal process that is likely based on the action your agent takes, but that process is not only stochastic in nature (meaning you won't have certainty about what will happen) but, in reality, is not even visible to your agent.
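The relationship between the two value functions can be sketched numerically. In the sketch below, the state names, actions, and all numbers are made up for illustration: under a given policy, the value of a state is just the policy-weighted average of the action values in that state.

```python
# Toy illustration (made-up numbers): V(s) is the policy-weighted
# average of Q(s, a) -- the state's value depends on how you act in it.

# Hypothetical action values Q(s, a) for one state with two actions.
Q = {
    ("bankrupt", "panic"): -10.0,
    ("bankrupt", "rebuild"): 5.0,
}

# A hypothetical policy: probability of each action in state "bankrupt".
policy = {"panic": 0.2, "rebuild": 0.8}

# V(s) = sum over actions of pi(a|s) * Q(s, a)
V_bankrupt = sum(policy[a] * Q[("bankrupt", a)] for a in policy)
print(V_bankrupt)  # 0.2 * -10 + 0.8 * 5 = 2.0
```

Notice that the same state gets a different V depending on the policy, while Q keeps the per-action detail that tells you *how* to act.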

Hope that helps. Let me know if you have any follow-up.
308374 (3)
Thanks for the comprehensive reply and Happy New Year!
I sort of understand it, but not fully. I understand that there is a hidden element of time and sequence here. By that I mean the action happens, then as a result of that action we land in a new (or maybe the same) state, and then we get the reward for being in the new state.

So the sequence is this: <s1, a1, s2> --> get the reward for being in s2... and repeat again for new values...
The key here is how to select a1.
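The loop described above can be sketched as the standard agent-environment interaction. Everything here is illustrative, a hypothetical deterministic environment with made-up dynamics, not anything from the book:

```python
def step(state, action):
    """Hypothetical environment: maps (s, a) to (s', r).
    The reward is received on arriving in the next state."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if next_state == 3 else 0.0  # only s'=3 pays off
    return next_state, reward

state = 0
for _ in range(3):
    action = "right"                           # the key question: how to pick this
    next_state, reward = step(state, action)   # <s, a, s'> --> reward
    state = next_state
print(state, reward)  # 3 1.0
```

The environment owns `step` (the agent never sees inside it); the agent only chooses `action` and observes the resulting state and reward.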

It seems I have confused the reward concept with the state value. They appear to be different, and both could be used for action selection, but I am not sure. For that matter, I believe action selection (selecting a1 vs. a2 or a3) can be done based on the reward associated with the target state (not the "value" of that state; in that case, I do not know where that state value would be used, or how it relates to, or is updated by, the reward, if at all. More on this below).

When I am in s1, I have an idea about the result of doing action a1 (meaning the reward I get if I land in s2). It can be bankruptcy (a bad reward) or winning the lottery (a good reward). If I know that reward ahead of time, then I know whether action a1 should be taken or not. So selecting action a1 without knowing the reward of doing so (i.e., being in s2 and collecting its reward) does not make sense.

It is as if I knew for sure that I will not win the lottery (a bad reward) but still selected the action of buying the ticket. If I decide to buy a lottery ticket (the action-selection process), then I already know there is a chance/probability of winning too (the reward of being in that state, i.e., being a lottery winner; a good reward). To me, the value of an action (the action value) comes from the result of doing that action, which is totally related to the reward I get for being in the resulting state, and I understand that an action can have a value based on knowledge of the resulting reward. In that case, what is the state value used for?
Maybe a better question is: if we know the reward of being in a state, why do we need a state value separately?
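One way to see why the immediate reward alone is not enough: a state can pay zero reward right now yet still be very valuable because it leads to reward later. The value of a state is the expected *cumulative* (discounted) reward from there onward, not the one-step reward. A minimal value-iteration sketch on a hypothetical 4-state deterministic chain (all numbers made up):

```python
# Deterministic chain: s0 -> s1 -> s2 -> s3 (terminal).
# Only the transition into s3 pays a reward; all others pay 0.
rewards = {0: 0.0, 1: 0.0, 2: 1.0}  # reward for moving from s to s+1
gamma = 0.9                          # discount factor
V = [0.0, 0.0, 0.0, 0.0]             # V[3] stays 0 (terminal state)

# Value iteration: propagate future reward backward through the chain.
for _ in range(10):
    for s in (2, 1, 0):
        V[s] = rewards[s] + gamma * V[s + 1]

print([round(v, 2) for v in V])  # [0.81, 0.9, 1.0, 0.0]
```

States s0 and s1 both have zero immediate reward, yet their values are nonzero, so an agent choosing by reward alone would see no reason to move toward s3, while an agent choosing by value would.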