Trying to compare this to universe-starter-agent (A3C) #22
Comments
The OpenAI agent uses an LSTM policy network and GAE for the loss function. This repo has a far simpler implementation of A3C, using a vanilla feed-forward network for the policy, and I'm fairly sure a less recent loss formulation (though I haven't confirmed that last point recently). While I personally had high hopes that this implementation would be useful for speeding things up, I've recently gone back to working with the OpenAI framework for my testing. I think some people have been working to get the LSTM policy working with GPU-based A3C, but I haven't seen any working code that improves on the OpenAI-type model. I'd love to be corrected if I'm wrong on any of the above. |
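To make the feed-forward vs. LSTM distinction concrete, here is a minimal numpy sketch (purely illustrative, not either repo's actual code; all shapes and weights are made up): a feed-forward policy maps each preprocessed frame to action probabilities independently, while a recurrent policy carries hidden state (h, c) across frames.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stateless feed-forward policy: each frame's features map to action
# probabilities with no memory of earlier frames.
def ff_policy(frame_feat, W, b):
    return softmax(W @ frame_feat + b)

# Recurrent policy: a hand-rolled LSTM cell keeps (h, c) across frames,
# so the action distribution can depend on recent history.
def lstm_policy_step(frame_feat, h, c, params):
    Wx, Wh, bias, Wout, bout = params
    z = Wx @ frame_feat + Wh @ h + bias          # all four gates, stacked
    i, f, o, g = np.split(z, 4)
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c = sig(f) * c + sig(i) * np.tanh(g)
    h = sig(o) * np.tanh(c)
    return softmax(Wout @ h + bout), h, c

# Toy usage: 16-dim frame features, 4 actions, 8 hidden units, random weights.
rng = np.random.RandomState(0)
feat = rng.randn(16)
print(ff_policy(feat, rng.randn(4, 16), np.zeros(4)))
params = (rng.randn(32, 16), rng.randn(32, 8), np.zeros(32),
          rng.randn(4, 8), np.zeros(4))
probs, h, c = lstm_policy_step(feat, np.zeros(8), np.zeros(8), params)
print(probs)
```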
ok, that explains it. Is "get LSTM policy working with GA3C" an open research problem or merely a matter of implementation details? |
And does Pong happen to be particularly sensitive to LSTM or would it be no different in the other Atari games? |
I did a few tests with the universe starter agent when it was just released. Based on that limited experience, it seemed that the setup was a bit overfit to Pong--performance was reasonable for other games, but exceptionally fast for Pong. But as the previous commenter mentioned, it also uses an LSTM and GAE, which are helpful in some cases. If you run more extensive tests, I'd be curious to know how it performs on a wider suite of games. |
Okay; any ones in particular I should try? |
I did run it on CoasterRacer, and that "never" (for an impatient layperson) seemed to get anywhere; the difference there, compared to another racing game I briefly tried (Dusk Racer), is that it takes a significant amount of effort to ever get even a single reward. |
The appendix of the original A3C paper has a ton of comparisons across different games and models, which should save you some testing. LSTM A3C is widely implemented in open source - a quick search should turn up a few options. The Universe and Miyosuda implementations seem to be the most commonly used. |
Not sure what this refers to; are you saying I could have avoided wasting time on CoasterRacer by being more aware of the comparisons? My goal was just to "play around with openai universe" rather than get deep into testing. If anything, I'd be interested in adding an environment such as MAME or one of the other emulators, which is more obviously an engineering task.
Is this a response to my question about GA3C with LSTM? If so, the implicit assumption is that there are no fundamental issues that would complicate such an endeavour, and that one could, for example, start from the existing A3C implementations. Is that what you're saying? My understanding from the GA3C paper is that they consider it a general approach and that A3C just happened to perform best, so adding an LSTM should not be a big deal. |
also, what would be a better venue to have discussions such as this one? Don't really want to clutter up the project issues. |
I simply meant - there exists a readily available corpus of tests conducted by professional researchers. Use it as you wish. Implementing LSTM policy is simply an engineering issue, albeit a moderately difficult one in this case. Have at it - and please publish if you get good results! There are other issues open in this repo, I believe, where there are already discussions around LSTM/GPU. |
Well, the "which ones should I try" was really offering my "services" to @swtyree: in case I make some more comparisons with my setup anyway it doesn't make a big difference to me which other roms I try, so if someone does have a preference, I might as well support that.
"Publish" sounds intimidating to me, but if I do get anything off the ground, I promise to put the source up on github; perhaps fork and PR here. I probably have to brush up my background in this area a little first (and I definitely have some things I'd like to do first, as mentioned before), so don't hold your breath.
I saw an issue on the universe starter agent, asking about GPU. It doesn't seem to have gone anywhere. |
Please check out the pull requests section. |
GAE being? All I get is Google App Engine, and I don't find a reference to the term in the A3C paper. Edit: Generalized Advantage Estimation.
I'll have a look at that. Should I use a different game from the purportedly overfitted Pong, or would it be fine? I guess we'd know the answer when/if I try... |
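For reference, GAE (generalized advantage estimation, Schulman et al. 2015) forms the advantage as an exponentially weighted sum of TD residuals, A_t = sum_l (gamma*lambda)^l * delta_{t+l} with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t). A minimal numpy sketch (generic illustration with made-up values, not the starter agent's actual code):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # `values` has one extra entry: the bootstrap value of the final state.
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# Toy rollout of 5 steps with made-up value estimates.
print(gae_advantages([1, -1, 0, 1, 0], [0.5, 0.4, 0.2, 0.1, 0.3, 0.0]))
```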
Okay, I checked out the PR, but it breaks the dependencies on the vanilla openai-universe. I'm willing to give it a whirl once the PR is in a usable state, more or less as-is. |
If you look at some results (https://arxiv.org/pdf/1602.01783.pdf#19) from the original paper, there are some good environments such as Amidar, Berzerk and Krull for faster convergence. But DeepMind trained all of these games with the same parameters, whereas the gamma (discount factor) could be chosen for each environment individually to get better results. |
or should I rename it to something that specifically references GAE? |
Ah, I see what you mean; with those games LSTM didn't actually help that much (although the same holds for Pong). GAE doesn't seem to be isolated in the table; I guess I'll have to read the paper a bit more.
In the meantime I'll give Amidar a whirl. You seem to have picked the bold ones in the "A3C FF, 1 day" column; would it also make sense to try Seaquest, if FF 1 day vs. LSTM is what we're looking at? |
@nczempin you can try with |
So after around 24,000 seconds (400 minutes, ~6.7 hours), here's what I get with GA3C on my 3930K, 32 GB and GTX 1060 (6 GB):
It seemed to make progress right from the start, unlike with Pong, where both algorithms seemed to be clueless for a while and then "suddenly get it" and no longer lose, followed by a very long time of very slow growth of the average score (the points it conceded always seemed to be the very first few; once it had won a single point it seemed to go into very similar states). GA3C on Amidar seems to be stuck just under 270; I will now see what I get on the same machine with universe-starter-agent. |
Based on the latest version of our paper, we get more stable and faster convergence for TRAINING_MIN_BATCH_SIZE = 20 ... 40 in Config.py. If you haven't done it yet, you can try this. |
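For anyone following along, that suggestion is (as I understand it) a one-line change to a constant in GA3C's Config.py; the constant name is taken from the comment above, and the specific value here is just one pick from the suggested range:

```python
# GA3C Config.py: batch up at least this many experiences per training call.
# The 20..40 range is the authors' suggestion from the comment above.
TRAINING_MIN_BATCH_SIZE = 32
```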
On Pong again or on any of the other ones I'll try? |
@nczempin |
The improvement with TRAINING_MIN_BATCH_SIZE should be observed for all games (although we tested only a few of them). |
I picked 6 workers because that's how many cores my CPU has, but perhaps up to 12 could have helped, given Hyper-Threading etc. But a naive "analysis" suggests that GA3C still wins in this particular case, because it gets more than double the score. It would be interesting to know how much of the speedup is due to using the CPU cores more efficiently through dynamic load balancing vs. including the GPU. Even just getting a dynamic number of threads, without any GPU-specific improvements, is a big convenience over having to pick the number statically yourself. |
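A toy sketch of the dynamic-balancing idea only (this is not GA3C's actual code; the paper describes adjusting its agent/predictor/trainer counts internally to maximise training throughput, and the throughput curve below is entirely made up): perturb the worker count and keep the change when measured throughput improves.

```python
import random

def measure_tps(num_workers):
    # Stand-in for a real throughput measurement, with a made-up peak
    # at 9 workers and some noise.
    return (min(num_workers, 9) * 1000
            - max(0, num_workers - 9) * 300
            + random.uniform(-20, 20))

workers = 6                    # static starting point, one per physical core
best = measure_tps(workers)
for _ in range(50):
    candidate = max(1, workers + random.choice([-1, 1]))
    tps = measure_tps(candidate)
    if tps > best:             # keep only changes that improve throughput
        workers, best = candidate, tps
print("settled on", workers, "workers")
```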
GA3C |
GA3C |
You are right @etienne87 >> actions should be remapped too, if applicable --> and it's simple enough for left-right flipping, where we have to remap
Yes, they call it
@nczempin if I see everything correctly --> there are no improvements in time from 6 to 12 agents >> |
@nczempin hm, I've also noticed that the episode time becomes a bit shorter for your |
Are you saying either of the A3Cs has implicit assumptions about episode lengths being more uniform? Or just the original one? In general, eventually agents will reach maximum scores. In ALE, for many of the games that roll over the score, this is not actually possible; an episode could potentially play on to infinity. It is an open question there how to handle score wrapping. Depending on how the Python interface is used, agents might be discouraged from wrapping the score. IMHO it's pointless to keep going once you can wrap the score (for an agent; not necessarily for a human). |
I thought I saw a slight improvement in wall-clock time, but I didn't look at it in detail. I guess I should have included the fps images as well. |
No, you just control the training quality by acquiring some (discounted) reward. |
Okay, I think I need to clarify what I was referring to: I don't understand this statement:
It sounded to me like you're saying that episodes terminating early is somehow a problem, because some trait of "current A3C" somehow optimizes for episodes that don't terminate early. |
|
Really? What is the motivation behind this?
Wouldn't normalizing be better than clipping? And for both options, wouldn't knowing the max score be helpful? I can't even imagine how you'd clip or normalize without knowing the max (other than the max seen so far), to be honest. Once (if this gets implemented) (some of) the Atari games get to have maximum values in ALE, e.g.
I also thought about including other reward signals in ALE, but in the end the number of lives is just part of the state, and with behaviour that minimizes losing lives you'd presumably maximize the score. Or maximizing the rewards will turn out to involve avoiding losing lives. |
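To make the clipping-vs-normalizing distinction concrete, here is a small numpy sketch (generic illustration, not either repo's code; the known_max value is hypothetical, since as discussed ALE doesn't currently expose per-game maxima):

```python
import numpy as np

raw = np.array([10.0, 200.0, 30.0, 1000.0])       # made-up per-step scores

clipped = np.sign(raw)                            # DQN/A3C-style clipping:
print(clipped)                                    # every positive reward -> +1

known_max = 1000.0                                # hypothetical per-game maximum
print(raw / known_max)                            # keeps relative magnitudes

running_max = np.maximum.accumulate(np.abs(raw))  # "max seen so far" fallback
print(raw / running_max)
```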
Regular GA3C stuck near 1700 points on Seaquest after 12 hours (but still better than universe starter agent):
|
After a promising start, GAE gets stuck near 212 points:
|
Montezuma's Revenge, basic GA3C (one can hope):
|
I've just added Adventure to ALE; that might be even harder than Montezuma's Revenge with current algorithms. I wonder if intrinsic motivation would help it, like it did Montezuma's (a little bit; Adventure is not quite as dangerous as Montezuma's, but the rewards are even sparser). |
I'm trying to move my changes to ALE into gym; it's quite tedious because they have diverged, and it's not immediately obvious in what way. |
Okay, I seem to have managed to get it to work; here's
Really wondering if it will ever get a +1. Any tips on which implementation I should pick to make this more likely would be appreciated. Would it be of any help to step in and control manually (sorta off-policy learning)? |
Huh, the agent reached a score of 0. That's only possible by timing out of the episode. I hope it doesn't learn to sit idly at home forever...
Which parameters do I need to set so it will eventually explore to bringing the chalice back to the golden castle? I'm guessing there's no hope yet; it may require custom rewards to encourage exploration, opening castle doors, etc. I'm currently looking into providing custom rewarders for |
Unsurprisingly, the GA3C agent got nowhere on Adventure. Perhaps "distance of chalice to yellow castle" and "have discovered chalice" should somehow be added as rewards, but the values would be somewhat arbitrary.
|
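One way to add a "distance of chalice to yellow castle" signal without distorting what counts as optimal play is potential-based reward shaping (Ng et al., 1999). A hypothetical sketch; chalice_distance() and the state dictionary are made up for illustration, and in practice you'd need the game's RAM/object positions rather than raw pixels:

```python
GAMMA = 0.99

def chalice_distance(state):
    # Made-up stand-in: pretend `state` is a dict with object coordinates.
    (cx, cy), (gx, gy) = state["chalice"], state["yellow_castle"]
    return abs(cx - gx) + abs(cy - gy)

def potential(state):
    return -chalice_distance(state)          # closer to the castle = higher

def shaped_reward(env_reward, state, next_state):
    # Potential-based shaping adds dense guidance while leaving the set of
    # optimal policies for the original reward unchanged.
    return env_reward + GAMMA * potential(next_state) - potential(state)

s0 = {"chalice": (50, 40), "yellow_castle": (10, 10)}
s1 = {"chalice": (45, 38), "yellow_castle": (10, 10)}
print(shaped_reward(0.0, s0, s1))            # positive: the agent got closer
```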
In any case, I think I've answered my original question about GA3C vs. universe-starter-agent and will close this huge thread now. |
So ... what are the conclusions? |
GA3C indeed makes better use of available resources; the GAE can help a lot but needs some parameter tweaking that I'm not ready for. But mainly the conclusion is that I need to spend more time learning about how all of these things work before I can actually give a conclusion. So I'll continue to try and help in areas where I can bring in my skills (e. g. adding more games to ALE, perhaps add an environment to Gym, other engineering-focused tasks that will satisfy my curiosity and perhaps help the researchers or others), and go right back to Supervised Learning, with maybe a little bit of generic (not general) AI thrown in, plus work my way through Tensorflow tutorials (possibly look at the other libraries), maybe implement some of the classic algorithms from scratch myself, etc. |
Indeed - I think the main motivation is that a terminal state isn't good (for most Atari games, especially since we use the same set of parameters for all games), and we also can't do any estimation by
Perhaps, but we'd have to do some investigation of the pros and cons w.r.t. reward changes.
It's good to do some reasonable investigation in this direction.
Hm, I don't think so, because we just get the raw image as the state.
Mostly yup, especially if it slows down reward gain. But some
You are right. The
That's great to hear. I'd also recommend looking at the Retro-Learning-Environment (RLE) - it covers not only
Yup, or do something like this
Yeah - that's the main reason to organize your data workflow in a more efficient way. |
Okay, I have to be careful here; what I meant was that the number of lives is (in many cases; ALE does indeed allow querying this, but it's not strictly necessary) part of the internal state of the game, not necessarily of the state that's observed by the agent (which is just pixels). |
But then how does GA3C even see the -1 on Adventure? Or are you saying the original A3C does it, while GA3C doesn't? Adventure provides a good reason not simply to use 0 upon "fail", because there is a difference between "failing because eaten" and "failing because we timed out". Hm, or maybe there isn't. |
No, all rewards for both should be represented (because of clipping) as:

```python
import numpy as np

def discounted_reward(real_rewards, gamma):
    discounted_r = np.zeros_like(real_rewards, dtype=np.float32)
    running_add = 0
    for t in reversed(range(0, discounted_r.size)):
        running_add = running_add * gamma + real_rewards[t]
        discounted_r[t] = running_add
    # standardize to zero mean / unit variance; this is what produces the
    # printed numbers below
    return (discounted_r - discounted_r.mean()) / discounted_r.std()

rew_test = [1, -1, 0, 1, 0]
print(discounted_reward(rew_test, .95))
# [ 0.72187066 -1.31918108  0.80844843  0.91000599 -1.12114406]
```

And if |
Still not sure I understand: Are you saying shooting a regular space invader gives as much reward as shooting a mothership? The discounting is orthogonal to that question. Apart from that, I'm not really in a position to argue about any of this. When I have a better understanding of actor-critic and all that I need to know before that, I may revisit this. So far I've watched the David Silver lectures, and I have some catching up to do. In my engineer mindset I also like to implement all these different techniques like TD(lambda) etc., and obviously there is a lot I have to do before I even get to regular AC. |
Yes, but I don't really know about Space Invaders scores > all of them are clipped in
Discounting the reward is just a technique w.r.t. our horizon of view on the rewards we receive. It could be more optimistic, for example, if we hit "motherships". And it also affects behavior, for example again: |
Not sure I follow. The rewards come from the environment; the algorithm is trying to figure out how to get those rewards, and the algorithms are ranked based on how well they score compared to other algorithms (or humans). If you treat a regular space invader the same as a mothership, that's only indirect: your algorithm knows nothing about different types, it just knows that in state s1 it was better to move slightly to the right and then shoot, to get the (points for the) mothership, rather than to the left to get the points for the regular invader. That is completely general as long as the environment gives out rewards.
As I said, I know what discounting the reward is and what it is for; as in finance, getting a reward now is better than getting a reward tomorrow, and how much better is determined by the discount factor, which is usually a measure of uncertainty; in finance it is dynamic, based on risk vs. reward. But the discount factor doesn't have anything to do with the reward coming from a mothership or not, unless your algorithm takes into account that to get the higher score it also risks dying more often.
And when you have very sparse rewards, it makes sense to discount less (a high gamma), because otherwise the reward might practically disappear after enough steps due to rounding. Although technically, if the rewards are really sparse (like in Adventure, only +1 right at the end), it shouldn't make any difference as long as you don't round to 0: the 1 will always be more than any other reward. I guess in that case (rounding it away) it may even make sense to adjust gamma dynamically: if you keep finishing episodes without getting any rewards, gamma should eventually be increased so that later rewards get counted in the present. |
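A quick back-of-the-envelope check of the "reward vanishing" point above: if the only reward is a +1 that arrives, say, 1000 steps after the start, its discounted value at the start is gamma^1000 (the numbers are illustrative, not from any of the runs in this thread):

```python
steps = 1000
print(0.99 ** steps)    # ~4.3e-05: the terminal +1 is almost invisible early on
print(0.999 ** steps)   # ~0.37:   still clearly non-zero
```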
@nczempin you're definitely right. |
Setting up openai/universe, I used the "universe starter agent" as a smoke test.
After adjusting the number of workers to better utilize my CPU, I saw the default PongDeterministic-v3 start winning after about 45 minutes.
Then I wanted to try GA3C on the same machine; given that you quote results of 6x or better, I expected it to perform at least as well as that result.
However, it turns out that with GA3C the agent only starts winning after roughly 90 minutes.
I'm assuming that either my first (few) run(s) on the starter agent were just lucky, or that my runs on GA3C were unlucky. Also I assume that the starter agent has other changes from the A3C that you compared GA3C against, at least in parameters, possibly in algorithm.
So, what can I (an experienced software engineer but with no background in ML), do to make the two methods more comparable on my machine? Is it just a matter of tweaking a few parameters? Is Pong not a good choice to make the comparison?
I have an i7-3930k, a GTX 1060 (6 GB) and 32 GB of RAM.