In this kind of reinforcement learning we don't come up with a policy ourselves; instead we learn from experience, a bit like learning from a video of past episodes.
We perform the actions prescribed by a fixed policy, evaluate the episode afterwards, and then update the state utilities, which in turn drives updates to the policy.
Simplified task : Policy evaluation
Input : a fixed policy
Transitions : unknown
Rewards : unknown
Goal : Learn the state values.
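To make this setup concrete, here is a minimal sketch in Python. The 3-state chain, the `step` simulator, the action names, and all of the numbers are made up for illustration; the point is only that the policy is handed to us while the transition probabilities and rewards stay hidden inside the simulator.

```python
import random

# pi(s): a fixed policy that is given to us -- we never change it here.
fixed_policy = {"A": "right", "B": "right", "C": "exit"}

def step(state, action):
    """Black-box simulator: the agent only ever sees the sampled
    (next_state, reward, done) triple, never the underlying tables."""
    # (action is ignored here because the fixed policy issues one action per state)
    if state == "A":
        # Moving right from A usually works, but sometimes we stay put.
        return ("B", -1.0, False) if random.random() < 0.9 else ("A", -1.0, False)
    if state == "B":
        return ("C", -1.0, False) if random.random() < 0.9 else ("B", -1.0, False)
    # Exiting from C ends the episode with a final reward.
    return (None, 10.0, True)
```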
However, this is not offline learning: the agent is actually learning by taking actions in the environment.
The agent runs through an episode, exploring the environment until it reaches a terminal state, and keeps track of every state it visits and the reward received at each one. Across all the episodes, we average the total reward collected from each state onward, and that average is saved as the utility value of that state.
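As a rough sketch (not the notes' own code), this direct evaluation could look like the following, assuming undiscounted returns; the episode data at the bottom is made up, and in practice it would be recorded while running the fixed policy in the environment.

```python
from collections import defaultdict

def direct_evaluation(episodes):
    """episodes: list of episodes, each a list of (state, reward) pairs
    recorded while following the fixed policy.  Returns averaged values."""
    totals = defaultdict(float)   # sum of observed returns per state
    counts = defaultdict(int)     # number of visits per state

    for episode in episodes:
        # Walk backwards so the return-to-go is easy to accumulate.
        return_to_go = 0.0
        for state, reward in reversed(episode):
            return_to_go += reward
            totals[state] += return_to_go
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# Example: two short made-up episodes through states A -> B -> C.
episodes = [
    [("A", -1.0), ("B", -1.0), ("C", 10.0)],
    [("A", -1.0), ("B", -1.0), ("B", -1.0), ("C", 10.0)],
]
print(direct_evaluation(episodes))   # V(A) = 7.5, V(B) ~ 8.67, V(C) = 10.0
```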
Act in the environment, and at every transition use that experience to update the value of the state we just left.
When we are in state s, take an action a, and land in s', we experience a reward and compute the new quantity R(s,a,s') + gamma * V(s'); this is called a sample.
At each sample, update V(s) <- (1 - alpha) * V(s) + alpha * sample, where alpha is the learning rate.
So, based on the value of the sample: if we did better than our current estimate said we should, shift the value of that state up a little bit; if we did worse, lower it a little bit.
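A minimal sketch of this temporal-difference update, assuming a discount gamma of 0.9 and a learning rate alpha of 0.1 (both arbitrary choices); the `td_update` helper and the example transition are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.9   # discount factor used in R(s,a,s') + gamma * V(s')
ALPHA = 0.1   # learning rate: how far we shift toward each new sample

V = defaultdict(float)   # running value estimates, starting at 0

def td_update(state, reward, next_state, done):
    """Apply one update after experiencing the transition s -a-> s' with reward r."""
    sample = reward + (0.0 if done else GAMMA * V[next_state])
    # If the sample is higher than our current estimate, V(s) moves up a bit;
    # if it is lower, V(s) moves down a bit.
    V[state] = (1 - ALPHA) * V[state] + ALPHA * sample

# Example: we were in "B", took the policy's action, landed in "C" with reward -1.
td_update("B", -1.0, "C", done=False)
```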
However, we cannot determine the best policy from temporal-difference value learning alone. It only gives us value estimates, i.e., how far the current values are from what they should be.