aac_kungfumaster
bernoulli_bandits
crossentropy
dqn_breakout
environments
frozen_lake
mcts_planning
policy_gradients_pong
qlearning
reinforce_cartpole
rl_for_seq2seq
routing_optimization
survey_of_methods
README.rst
__init__.py
util.py

README.rst

Reinforcement Learning

Frame Buffer

Environments in which a given state conveys everything the agent needs to act optimally are called Markov Decision Processes (MDPs).

While MDPs provide a nice formalism, almost all real-world problems fail to meet this standard. Take, for example, your field of view at this very moment. Can you see what is behind you? Information outside our view is often essential to making decisions about the world.

In addition to being spatially limited, the information available at a given moment is often temporally limited as well. When looking at a photo of a ball being thrown between two people, the lack of motion may make us unable to determine the direction and speed of the ball. In games like Pong, not only the position of the ball but also its direction and speed are essential to making the correct decisions.

Environments which present themselves in a limited way to the agent are referred to as Partially Observable Markov Decision Processes (POMDPs). While they are trickier to solve than their fully observable counterparts, understanding them is essential to solving most realistic tasks.

How can we build a neural agent which still functions well in a partially observable world? The key is to give the agent a capacity for temporal integration of observations.

Within the context of Reinforcement Learning, there are a number of possible ways to accomplish this temporal integration. The solution taken by DeepMind in their original paper on Deep Q-Networks was to stack the frames from the Atari simulator. Instead of feeding the network a single frame at a time, they used an external frame buffer which kept the last four frames of the game in memory and fed this to the neural network.
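The frame-stacking idea can be sketched in a few lines. This is a minimal illustration, not DeepMind's code: the class name `FrameBuffer`, its methods, and the choice to fill the buffer with copies of the first frame at episode start are all assumptions made for the example. Frames can be any objects (for instance image arrays).

```python
from collections import deque


class FrameBuffer:
    """Hypothetical helper: keeps the most recent `size` frames of an episode."""

    def __init__(self, size=4):
        self.size = size
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # At the start of an episode there is no history yet, so (as an
        # assumption of this sketch) the first frame is repeated to fill
        # the buffer.
        self.frames.clear()
        for _ in range(self.size):
            self.frames.append(first_frame)
        return self.stacked()

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return self.stacked()

    def stacked(self):
        # Oldest frame first, newest frame last. This list is what would be
        # fed to the network instead of a single frame.
        return list(self.frames)
```

In use, `reset` would be called with the first observation of each episode and `push` with every later observation; the returned stack replaces the single frame as the network input.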

Recurrent Neural Networks

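A fixed frame buffer has an obvious limit: the agent can only use information from the last four frames, so anything that must be remembered for longer is lost. This can be addressed by moving the temporal integration into the agent itself, which is accomplished by using a recurrent block in our neural agent.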

Agents that use such a recurrent network are referred to as Deep Recurrent Q-Networks (DRQN).

We need to adjust the way our experience buffer stores memories. Since we want to train our network to understand temporal dependencies, we can’t use random batches of individual experiences. Instead, we need to be able to draw traces of experience of a given length. In this implementation, our experience buffer will store entire episodes and randomly draw traces of 8 steps from a random batch of episodes. By doing this, we retain our random sampling while ensuring that the experiences within each trace actually follow one another.
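A minimal sketch of such an episode-based buffer follows. The class name `EpisodeBuffer`, the `capacity` parameter, and the policy of skipping episodes shorter than the trace length are assumptions for illustration; the actual implementation may differ. Each step can be any object, for example a `(state, action, reward, next_state, done)` tuple.

```python
import random


class EpisodeBuffer:
    """Hypothetical buffer that stores whole episodes and samples short traces."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.episodes = []

    def add(self, episode):
        # `episode` is the ordered list of steps from one complete episode.
        if len(self.episodes) >= self.capacity:
            self.episodes.pop(0)  # discard the oldest stored episode
        self.episodes.append(list(episode))

    def sample(self, batch_size, trace_length=8):
        # Assumption of this sketch: episodes shorter than the trace
        # length are simply not eligible for sampling.
        eligible = [ep for ep in self.episodes if len(ep) >= trace_length]
        # Random choice of episodes keeps the sampling decorrelated.
        # random.sample raises ValueError if fewer than `batch_size`
        # eligible episodes are available.
        chosen = random.sample(eligible, batch_size)
        traces = []
        for ep in chosen:
            # Random start point; the slice keeps the steps consecutive.
            start = random.randint(0, len(ep) - trace_length)
            traces.append(ep[start:start + trace_length])
        return traces
```

The result is a list of `batch_size` traces, each holding `trace_length` consecutive steps from a single episode, which can then be unrolled through the recurrent block in order.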

We will be using a technique developed by a group at Carnegie Mellon, who used a DRQN to train a neural network to play the first-person shooter game Doom. Instead of sending all the gradients backwards when training their agent, they sent only the last half of the gradients for a given trace.
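One way to realize this, shown here as a hedged sketch rather than the Carnegie Mellon group's actual code, is to multiply the per-step error by a mask that is zero for the first half of each trace and one for the second half. In an automatic-differentiation framework, the zeroed steps then contribute no gradient. The function name `masked_td_loss` and the use of a mean squared error are assumptions for this example, which uses plain Python lists so the arithmetic is easy to follow.

```python
def masked_td_loss(td_errors, trace_length=8):
    """Mean squared TD error over only the last half of each trace.

    `td_errors` is a list of traces; each trace is a list of
    `trace_length` per-step errors (target Q minus predicted Q).
    """
    half = trace_length // 2
    # 0 for the first half of the trace, 1 for the second half.
    mask = [0.0] * half + [1.0] * (trace_length - half)

    total = 0.0
    count = 0.0
    for trace in td_errors:
        if len(trace) != trace_length:
            raise ValueError("every trace must have trace_length steps")
        for err, m in zip(trace, mask):
            total += m * err ** 2  # masked steps add nothing to the loss
            count += m
    return total / count


# Errors in the first half are ignored; only the last four steps count.
loss = masked_td_loss([[5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0]])
```

The early steps of a trace are still fed through the recurrent block so that its hidden state is warmed up; they are only excluded from the loss.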

See https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-6-partial-observability-and-deep-recurrent-q-68463e9aeefc
