Agent has to decide what actions to take in order to learn the policy.

WE have to get all the Q values to get the optimal policy.

Algo. to learn the Q-values.

Transition : unk

Rewards : unk

Actions: to be taken by the agent