Cannot learn problems with a single, terminal reward #31

Open
wagnew3 opened this issue Jun 30, 2017 · 4 comments


wagnew3 commented Jun 30, 2017

Thank you for the easy-to-use and fast A3C implementation. I created a simple problem for rapid testing that gives a reward of 0 on all steps except the terminal step, where it gives either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:

terminal_reward = 0 if done else value

which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:

return experiences[:-1]

which causes the agent to ignore the only meaningful experience in this environment.

This is easily fixed by changing line 107 in ProcessAgent.py to

terminal_reward = reward if done else value

and _accumulate_rewards() in ProcessAgent.py to return all experiences when the agent has taken a terminal step. These changes should generally increase performance, as terminal steps often contain valuable reward signal.
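For concreteness, here is a minimal, self-contained sketch of the accumulation with both proposed changes applied. The `Experience` container, the `done` argument, and the function signature are illustrative assumptions, not GA3C's exact code:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    state: object
    action: int
    reward: float   # overwritten in place with the discounted return


def accumulate_rewards(experiences, discount, terminal_reward, done):
    """Convert per-step rewards into discounted returns, walking backwards."""
    reward_sum = terminal_reward
    for t in reversed(range(len(experiences) - 1)):
        reward_sum = experiences[t].reward + discount * reward_sum
        experiences[t].reward = reward_sum
    # Proposed change #2: keep the terminal experience so its reward is trained on.
    return experiences if done else experiences[:-1]


# Proposed change #1, in the rollout loop that calls the function above:
#     terminal_reward = reward if done else value     # was: 0 if done else value
```

With these changes, the terminal step's reward both seeds the backward accumulation and remains in the batch as a training target; truncated (non-terminal) rollouts still bootstrap from the value estimate and drop the last experience as before.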

@tangbohu

Hi, William! I think you are right. But have you validated it by conducting some experiments?

wagnew3 (Author) commented May 29, 2018

On my toy problem, which only has a nonzero reward on the terminal step, the agent cannot learn without this change. I haven't tested this on more complex problems like Atari games (I imagine the impact shouldn't be too large, as these games usually have many rewards on non-terminal steps).

ifrosio (Collaborator) commented May 29, 2018

Hi, thanks for noticing this. Our implementation of A3C is indeed consistent with the original algorithm (see https://arxiv.org/pdf/1602.01783.pdf, page 14, where the reward is set to 0 for a terminal state). My intuition is that this is done because the expected value of the final state can only be zero (no rewards are expected in the future). Nonetheless, your fix should allow the algorithm to be used in the case of a game with only one, final reward.
That being said, I am not sure that A3C is the best algorithm for this case: you may have to dramatically change some of the hyperparameters (t_max, for instance) to see convergence, and in any case I do not expect convergence to be fast. This also obviously depends on the length of the episodes in your toy game.

@wagnew3
Copy link
Author

wagnew3 commented May 29, 2018

A3C correctly sets the value of terminal states to 0, but keeps the reward these terminal states give ("R ← r_i + γR" in the A3C pseudocode). GA3C sets both the reward and value of the terminal state to 0. In Pong, for example, where the terminal state also has a reward of -1 or 1 (indicating getting scored on or scoring), this causes no useful learning to happen in the experiences of the last round of play.
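A minimal, self-contained sketch of that backup (the standalone function and its names are illustrative; GA3C structures this differently inside _accumulate_rewards):

```python
def n_step_returns(rewards, gamma, done, bootstrap_value):
    """Backup from the A3C pseudocode: R <- r_i + gamma * R.

    The *value* of a terminal state is zero, so we bootstrap from the critic
    only when the rollout was cut off mid-episode; the terminal *reward*
    itself (e.g. the -1/+1 at the end of a Pong rally) is always kept.
    """
    R = 0.0 if done else bootstrap_value
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))


# A 3-step rollout ending with a terminal reward of -1:
print(n_step_returns([0.0, 0.0, -1.0], gamma=0.99, done=True, bootstrap_value=0.0))
# -> approximately [-0.9801, -0.99, -1.0]
```

Zeroing the terminal reward as well, as GA3C currently does, would make every return in this example 0 and leave the policy with no gradient signal at all.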
