Implementation of the OP algorithm as described by Munos and Busoniu.
Due to a lack of time, the implementation works but is not properly documented. I also wanted to implement a "memory" for recurrent states in order to optimize the original algorithm...
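
The note above does not say how the "memory" for recurrent states was meant to work. One possible reading is a lookup table that remembers the model's answer for each (state, action) pair, so that a state reached again during planning is not simulated a second time. The sketch below illustrates only that idea, under the assumption of a deterministic model; `TransitionMemory`, `toy_model`, and the `step` method are hypothetical names and are not taken from this implementation.

```python
from typing import Callable, Dict, Hashable, Tuple

State = Hashable
Action = Hashable
Transition = Tuple[State, float]  # (next_state, reward)


class TransitionMemory:
    """Hypothetical cache: each (state, action) pair is simulated only once."""

    def __init__(self, model: Callable[[State, Action], Transition]) -> None:
        self._model = model
        self._cache: Dict[Tuple[State, Action], Transition] = {}
        self.model_calls = 0  # number of real (uncached) model evaluations

    def step(self, state: State, action: Action) -> Transition:
        key = (state, action)
        if key not in self._cache:
            # First visit: query the model and remember the result.
            self._cache[key] = self._model(state, action)
            self.model_calls += 1
        # Recurrent visit: answer from memory without calling the model.
        return self._cache[key]


def toy_model(state: int, action: int) -> Transition:
    # Assumed deterministic toy model: a ring of 4 states,
    # with reward 1.0 whenever the transition lands on state 0.
    next_state = (state + action) % 4
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward


memory = TransitionMemory(toy_model)
memory.step(3, 1)  # simulated by the model
memory.step(3, 1)  # served from the cache
```

A planner would call `memory.step` wherever it would otherwise call the model directly. This only helps if states are hashable and the model is deterministic; a stochastic model would need to cache the full set of successor states and probabilities instead.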