The agent has to decide which actions to take in order to learn the policy.
We need all the Q-values to get the optimal policy: in each state, pick the action with the highest Q-value.
Algorithm to learn the Q-values, given:
Transitions: unknown
Rewards: unknown
Actions: chosen by the agent
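
One common way to learn Q-values in this setting is tabular Q-learning. The notes do not name a specific algorithm, so treat the sketch below as one possible instance. The 5-state chain environment, the function names (step, q_learning, greedy_policy) and the hyperparameters (alpha, gamma, epsilon) are all assumptions made for illustration. The agent only calls step() and never reads the transition or reward model directly, which matches the "unknown" entries above.

```python
import random

N_STATES = 5   # hypothetical chain of states 0..4; state 4 is terminal
N_ACTIONS = 2  # action 0 = move left, action 1 = move right


def step(state, action):
    """Hypothetical environment. The agent treats this as a black box."""
    if action == 1:
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table from sampled transitions only."""
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit, sometimes explore.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                best = max(Q[state])
                ties = [a for a in range(N_ACTIONS) if Q[state][a] == best]
                action = rng.choice(ties)  # break ties at random
            next_state, reward, done = step(state, action)
            # No bootstrapping from a terminal state.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


def greedy_policy(Q):
    """Pick the action with the highest Q-value in each state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]


if __name__ == "__main__":
    Q = q_learning()
    print(greedy_policy(Q)[:N_STATES - 1])  # policy for non-terminal states
```

The exploration rate epsilon is what lets the agent "decide what actions to take" while it is still learning: with epsilon = 0 it would only ever follow its current estimates.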