Cannot learn problems with a single, terminal reward #31
Comments
Hi, William! I think you are right. But have you validated it by conducting some experiments?
On my toy problem, which only has a nonzero reward on the terminal step, the agent cannot learn without this change. I haven't tested this on more complex problems like Atari games (I imagine the impact shouldn't be too large, as these games usually have many rewards on non-terminal steps).
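For reference, a toy environment in that spirit could look like the sketch below: a short corridor in which every step returns reward 0, and only the terminal step returns +1 or -1 depending on whether the goal was reached. The class name and interface are illustrative, not taken from the GA3C repository or from the actual experiment mentioned above.

```python
class TerminalRewardCorridor:
    """Every step yields reward 0; only the terminal step yields +1 or -1."""

    def __init__(self, length=5, max_steps=10):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.t = 0
        return self.pos  # trivial observation: the agent's position

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.pos = max(0, min(self.length, self.pos + (1 if action == 1 else -1)))
        self.t += 1
        done = self.pos == self.length or self.t >= self.max_steps
        # The only nonzero reward appears on the terminal step.
        reward = (1.0 if self.pos == self.length else -1.0) if done else 0.0
        return self.pos, reward, done
```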
Hi, thanks for noticing this. Our implementation of A3C is indeed consistent with the original algorithm (see https://arxiv.org/pdf/1602.01783.pdf, page 14, where the reward is set to 0 for a terminal state). My intuition is that this is done because the expected value for the final state can only be zero (no rewards are expected in the future). Nonetheless, your fix should also allow the algorithm to be used on games with only one, final reward.
A3C correctly sets the value of terminal states to 0, but keeps the reward these terminal states give ("R ← r_i + γR" in the A3C pseudocode). GA3C sets both the reward and value of the terminal state to 0. In Pong, for example, where the terminal state also has a reward of -1 or 1 (indicating getting scored on or scoring), this causes no useful learning to happen in the experiences of the last round of play.
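To make the distinction concrete, here is a minimal sketch of the return computation the A3C pseudocode describes (the function name and signature are made up for illustration, this is not the repository's code): the bootstrap value R starts at 0 for a terminal state, but every reward r_i, including the terminal one, stays in the backward accumulation.

```python
def discounted_returns(rewards, bootstrap_value, done, gamma=0.99):
    # bootstrap_value: V(s_t) predicted by the critic for the last observed
    # state; ignored when the episode ended, since no future reward is possible.
    R = 0.0 if done else bootstrap_value
    returns = []
    for r in reversed(rewards):   # "R <- r_i + gamma * R" from the A3C pseudocode
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns

# Example: a Pong-style episode whose last step gives -1.
# discounted_returns([0.0, 0.0, -1.0], bootstrap_value=0.0, done=True)
# -> [-0.9801, -0.99, -1.0], so the terminal reward still propagates backwards.
```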
Thank you for the fast, easy-to-use A3C implementation. I created a simple problem for rapid testing that rewards 0 on all steps except the terminal step, where it rewards either -1 or 1. GA3C cannot learn this problem because of line 107 in ProcessAgent.py:
terminal_reward = 0 if done else value
which causes the agent to ignore the only meaningful reward in this environment, and line 63 in ProcessAgent.py:
return experiences[:-1]
which causes the agent to ignore the only meaningful experience in this environment.
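To illustrate the effect, here is a tiny standalone computation under the original handling (a sketch, not the ProcessAgent.py code): with the terminal reward zeroed and the final experience dropped, every return fed to the learner is exactly 0, so there is no signal to learn from.

```python
gamma = 0.99
rewards = [0.0, 0.0, 0.0, -1.0]   # -1 only on the terminal step

reward_sum = 0.0                  # terminal_reward = 0 if done else value
returns = []
for r in reversed(rewards[:-1]):  # the terminal experience is discarded
    reward_sum = gamma * reward_sum + r
    returns.append(reward_sum)
print(list(reversed(returns)))    # [0.0, 0.0, 0.0] -> no learning signal
```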
This is easily fixed by changing line 107 in ProcessAgent.py to
terminal_reward = reward if done else value
and changing _accumulate_rewards() in ProcessAgent.py to return all experiences if the agent has taken a terminal step. These changes should generally improve performance, as terminal steps often carry a valuable reward signal.
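For completeness, here is a sketch of what the accumulation might look like with both changes folded into one function. It is simplified relative to the real ProcessAgent.py, which operates on its own Experience class, clips rewards, and computes terminal_reward separately on line 107; the minimal Experience record below is only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    reward: float

def accumulate_rewards(experiences, discount_factor, done, value):
    # Proposed change to line 107: keep the terminal reward instead of zeroing it.
    terminal_reward = experiences[-1].reward if done else value

    # Same backward accumulation as before: fold the bootstrap into the
    # rewards of the earlier steps.
    reward_sum = terminal_reward
    for t in reversed(range(0, len(experiences) - 1)):
        reward_sum = discount_factor * reward_sum + experiences[t].reward
        experiences[t].reward = reward_sum

    # Proposed change to the return value: keep the final experience when the
    # episode ended, since it carries the only reward in a task like the one above.
    return experiences if done else experiences[:-1]
```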