Cannot learn problems with a single, terminal reward #31
Comments
Hi, William! I think you are right. But have you validated it by conducting some experiments?
On my toy problem, which only has a nonzero reward on the terminal step, the agent cannot learn without this change. I haven't tested this on more complex problems like Atari games (I imagine the impact shouldn't be too large, as these games usually have many rewards on non-terminal steps).
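For reference, a toy environment in that spirit could look like the sketch below: a short corridor in which every step returns reward 0, and only the terminal step returns +1 or -1 depending on whether the goal was reached. The class name and interface are illustrative, not taken from the GA3C repository or from the actual experiment mentioned above.

```python
class TerminalRewardCorridor:
    """Every step yields reward 0; only the terminal step yields +1 or -1."""

    def __init__(self, length=5, max_steps=10):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.t = 0
        return self.pos  # trivial observation: the agent's position

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.pos = max(0, min(self.length, self.pos + (1 if action == 1 else -1)))
        self.t += 1
        done = self.pos == self.length or self.t >= self.max_steps
        # The only nonzero reward appears on the terminal step.
        reward = (1.0 if self.pos == self.length else -1.0) if done else 0.0
        return self.pos, reward, done
```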
Hi, thanks for noticing this. Our implementation of A3C is indeed consistent with the original algorithm (see https://arxiv.org/pdf/1602.01783.pdf, page 14, where the reward is set to 0 for a terminal state). My intuition is that this is done because the expected value for the final state can only be zero (no rewards are expected in the future). Nonetheless, your fix should also allow the algorithm to be used on games with only one, final reward.
A3C correctly sets the value of terminal states to 0, but keeps the reward these terminal states give ("R ← r_i + γR" in the A3C pseudocode). GA3C sets both the reward and value of the terminal state to 0. In Pong, for example, where the terminal state also has a reward of -1 or 1 (indicating getting scored on or scoring), this causes no useful learning to happen in the experiences of the last round of play.
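To make the distinction concrete, here is a minimal sketch of the return computation the A3C pseudocode describes (the function name and signature are made up for illustration, this is not the repository's code): the bootstrap value R starts at 0 for a terminal state, but every reward r_i, including the terminal one, stays in the backward accumulation.

```python
def discounted_returns(rewards, bootstrap_value, done, gamma=0.99):
    # bootstrap_value: V(s_t) predicted by the critic for the last observed
    # state; ignored when the episode ended, since no future reward is possible.
    R = 0.0 if done else bootstrap_value
    returns = []
    for r in reversed(rewards):   # "R <- r_i + gamma * R" from the A3C pseudocode
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns

# Example: a Pong-style episode whose last step gives -1.
# discounted_returns([0.0, 0.0, -1.0], bootstrap_value=0.0, done=True)
# -> [-0.9801, -0.99, -1.0], so the terminal reward still propagates backwards.
```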
Thank you for the fast, easy-to-use A3C implementation. I created a simple problem for rapid testing that rewards 0 on all steps except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:
terminal_reward = 0 if done else value
which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:
return experiences[:-1]
which causes the agent to ignore the only meaningful experience in this environment.
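To illustrate the effect, here is a tiny standalone computation under the original handling (a sketch, not the ProcessAgent.py code): with the terminal reward zeroed and the final experience dropped, every return fed to the learner is exactly 0, so there is no signal to learn from.

```python
gamma = 0.99
rewards = [0.0, 0.0, 0.0, -1.0]   # -1 only on the terminal step

reward_sum = 0.0                  # terminal_reward = 0 if done else value
returns = []
for r in reversed(rewards[:-1]):  # the terminal experience is discarded
    reward_sum = gamma * reward_sum + r
    returns.append(reward_sum)
print(list(reversed(returns)))    # [0.0, 0.0, 0.0] -> no learning signal
```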
This is easily fixed by changing line 107 in ProcessAgent.py to
terminal_reward = reward if done else value
and changing _accumulate_rewards() in ProcessAgent.py to return all experiences if the agent has taken a terminal step. These changes should generally improve performance, as terminal steps often carry a valuable reward signal.
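For completeness, here is a sketch of what the accumulation might look like with both changes folded into one function. It is simplified relative to the real ProcessAgent.py, which operates on its own Experience class, clips rewards, and computes terminal_reward separately on line 107; the minimal Experience record below is only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    reward: float

def accumulate_rewards(experiences, discount_factor, done, value):
    # Proposed change to line 107: keep the terminal reward instead of zeroing it.
    terminal_reward = experiences[-1].reward if done else value

    # Same backward accumulation as before: fold the bootstrap into the
    # rewards of the earlier steps.
    reward_sum = terminal_reward
    for t in reversed(range(0, len(experiences) - 1)):
        reward_sum = discount_factor * reward_sum + experiences[t].reward
        experiences[t].reward = reward_sum

    # Proposed change to the return value: keep the final experience when the
    # episode ended, since it carries the only reward in a task like the one above.
    return experiences if done else experiences[:-1]
```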