Skip to content

Latest commit

 

History

History

policy_gradient

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

Vanilla Policy Gradient Agent

This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the CartPole problem.

The agent is capable of taking in an observation of the world, then taking actions which provide the optimal reward not just in the present, but over the long run.

To take reward over time into account, we need to update our agent with more than one experience at a time. To accomplish this, we collect experiences in a buffer, and then occasionally use them to update the agent all at once. These sequences of experience are sometimes referred to as rollouts, or experience traces.

Rewards are discounted over time. We use this modified reward as an estimation of the advantage in our loss equation.