Q-learning – finding an optimal policy on the go