
I cannot get FrozenLake-v0 (OpenAI Gym) to perform consistently above 78% #43

jinzishuai opened this issue Jan 29, 2018 · 5 comments


@jinzishuai

FrozenLake-v0 defines "solving" as getting average reward of 0.78 over 100 consecutive trials.

But my results are around 70-75%: https://github.com/jinzishuai/learn2deeplearn/blob/master/learnRL/OpenAIGym/FrozenLake/results.md

We should test this submission, which claims 0.79 ± 0.05: https://gym.openai.com/evaluations/eval_4VyQBhXMRLmG9y9MQA5ePA/
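
For context, a minimal sketch of scoring a fixed 4x4 policy over 100 consecutive episodes, matching the "solved" definition above (this is my sketch, not the repo's code; it assumes the 2018-era gym API where step() returns four values, and the all-zeros policy is just a placeholder):

import gym
import numpy as np

def run_episode(env, policy):
    # One rollout; FrozenLake-v0 gives reward 1.0 only on reaching the goal.
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = int(policy[obs // 4][obs % 4])  # 16 states -> 4x4 grid cell
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = gym.make('FrozenLake-v0')
policy = np.zeros((4, 4), dtype=int)  # placeholder policy: always move Left
score = np.mean([run_episode(env, policy) for _ in range(100)])
print('average reward over 100 consecutive episodes:', score)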

@jinzishuai

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 15:47:57,703] Making new env: FrozenLake-v0
Generation 1 : max score = 0.20
Generation 2 : max score = 0.30
Generation 3 : max score = 0.60
Generation 4 : max score = 0.66
Generation 5 : max score = 0.79
Generation 6 : max score = 0.84
Generation 7 : max score = 0.80
Generation 8 : max score = 0.78
Generation 9 : max score = 0.79
Generation 10 : max score = 0.80
Generation 11 : max score = 0.80
Generation 12 : max score = 0.80
Generation 13 : max score = 0.82
Generation 14 : max score = 0.80
Generation 15 : max score = 0.86
Generation 16 : max score = 0.81
Generation 17 : max score = 0.81
Generation 18 : max score = 0.78
Generation 19 : max score = 0.84
Generation 20 : max score = 0.79
Best policy score = 0.85. Time taken = 52.1701
Best policy = [[0 3 3 3]
 [0 3 0 1]
 [3 1 0 3]
 [3 2 1 0]]
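
Side note for readability (the encoding is gym's, the rendering snippet is mine): FrozenLake actions are 0=Left, 1=Down, 2=Right, 3=Up, so a policy grid like the one above can be rendered as arrows:

import numpy as np

arrows = np.array(list('<v>^'))      # 0=Left, 1=Down, 2=Right, 3=Up
policy = np.array([[0, 3, 3, 3],
                   [0, 3, 0, 1],
                   [3, 1, 0, 3],
                   [3, 2, 1, 0]])
print(arrows[policy])                # fancy indexing turns action codes into arrows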

@jinzishuai

But when I reran it myself, it performed at less than 75%, which is no better than my own algorithms.

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake>python fl_human_policy.py
[2018-01-29 15:51:33,075] Making new env: FrozenLake-v0
policy=
[[ 0  3  3  3]
 [ 0 -1  0 -1]
 [ 3  1  0 -1]
 [-1  2  1 -1]]

7355 out of 10000 runs were successful
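
For reference (my arithmetic, not from the script): with 10000 trials the binomial standard error is sqrt(0.7355 × 0.2645 / 10000) ≈ 0.0044, so this ~73.5% figure is a tight estimate, clearly below the 0.78 threshold.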

@jinzishuai

Original sample size = 100

There is clearly a difference between the first and second computations of the score, due to stochasticity:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 20:29:33,700] Making new env: FrozenLake-v0
Generation 1 : max score = 0.15, recomputed score=0.16
Generation 2 : max score = 0.34, recomputed score=0.28
Generation 3 : max score = 0.65, recomputed score=0.59
Generation 4 : max score = 0.74, recomputed score=0.84
Generation 5 : max score = 0.80, recomputed score=0.84
Generation 6 : max score = 0.82, recomputed score=0.67
Generation 7 : max score = 0.85, recomputed score=0.67
Generation 8 : max score = 0.81, recomputed score=0.74
Generation 9 : max score = 0.81, recomputed score=0.61
Generation 10 : max score = 0.79, recomputed score=0.61
Generation 11 : max score = 0.78, recomputed score=0.70
Generation 12 : max score = 0.83, recomputed score=0.71
Generation 13 : max score = 0.82, recomputed score=0.74
Generation 14 : max score = 0.81, recomputed score=0.67
Generation 15 : max score = 0.77, recomputed score=0.70
Generation 16 : max score = 0.80, recomputed score=0.76
Generation 17 : max score = 0.80, recomputed score=0.67
Generation 18 : max score = 0.83, recomputed score=0.76
Best policy score = 0.84. Time taken = 90.5285
Best policy = [[0 3 3 3]
 [0 2 2 2]
 [3 1 0 2]
 [2 2 1 3]]
best policy score = 0.71
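
This scatter is roughly what sampling noise predicts (my estimate, not from the script): a success rate p ≈ 0.75 estimated from n episodes has standard error sqrt(p(1-p)/n), and the "max score" additionally selects the luckiest population member each generation, so it is biased high relative to an independent recomputation. A quick check of the noise at each sample size:

import math

for n in (100, 1000, 10000):
    # one-sigma noise of a p = 0.75 success rate estimated from n episodes
    print(n, round(math.sqrt(0.75 * 0.25 / n), 4))
# prints roughly: 100 0.0433, 1000 0.0137, 10000 0.0043

At n = 100 that is about ±0.04 per evaluation, which covers most of the gap between the max and recomputed columns above.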

Increasing the sample size to 1000:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 20:31:13,861] Making new env: FrozenLake-v0
Generation 1 : max score = 0.17, recomputed score=0.17
Generation 2 : max score = 0.28, recomputed score=0.29
Generation 3 : max score = 0.60, recomputed score=0.59
Generation 4 : max score = 0.71, recomputed score=0.69
Generation 5 : max score = 0.70, recomputed score=0.70
Generation 6 : max score = 0.70, recomputed score=0.69
Generation 7 : max score = 0.75, recomputed score=0.74
Generation 8 : max score = 0.74, recomputed score=0.73
Generation 9 : max score = 0.74, recomputed score=0.76
Generation 10 : max score = 0.75, recomputed score=0.72
Generation 11 : max score = 0.75, recomputed score=0.73
Generation 12 : max score = 0.75, recomputed score=0.72
Generation 13 : max score = 0.75, recomputed score=0.72
Generation 14 : max score = 0.76, recomputed score=0.73
Generation 15 : max score = 0.76, recomputed score=0.72
Generation 16 : max score = 0.75, recomputed score=0.73
Generation 17 : max score = 0.75, recomputed score=0.71
Generation 18 : max score = 0.75, recomputed score=0.73
Best policy score = 0.77. Time taken = 859.8491
Best policy = [[0 3 3 3]
 [0 3 2 1]
 [3 1 0 0]
 [2 2 1 1]]
best policy score = 0.74

@jinzishuai

jinzishuai commented Jan 30, 2018

Sample size = 10000:

def evaluate_policy(env, policy, n_episodes=10000):
    # Mean success rate of `policy` over n_episodes rollouts;
    # run_episode returns 1.0 if the goal was reached and 0.0 otherwise.
    total_rewards = 0.0
    for _ in range(n_episodes):
        total_rewards += run_episode(env, policy)
    return total_rewards / n_episodes

Results:

E:\ShiJin\learn2deeplearn\learnRL\OpenAIGym\FrozenLake\genetic>python frozenlake_genetic_algorithm.py
[2018-01-29 20:49:12,746] Making new env: FrozenLake-v0
Generation 1 : max score = 0.16, recomputed score=0.16
Generation 2 : max score = 0.30, recomputed score=0.30
Generation 3 : max score = 0.48, recomputed score=0.48
Generation 4 : max score = 0.69, recomputed score=0.70
Generation 5 : max score = 0.74, recomputed score=0.74
Generation 6 : max score = 0.74, recomputed score=0.74
Generation 7 : max score = 0.74, recomputed score=0.74
Generation 8 : max score = 0.73, recomputed score=0.73
Generation 9 : max score = 0.75, recomputed score=0.74
Generation 10 : max score = 0.74, recomputed score=0.74
Generation 11 : max score = 0.74, recomputed score=0.74
Generation 12 : max score = 0.74, recomputed score=0.74
Generation 13 : max score = 0.75, recomputed score=0.74
Generation 14 : max score = 0.74, recomputed score=0.74
Generation 15 : max score = 0.75, recomputed score=0.74
Generation 16 : max score = 0.75, recomputed score=0.74
Generation 17 : max score = 0.75, recomputed score=0.75
Generation 18 : max score = 0.75, recomputed score=0.75
Best policy score = 0.75. Time taken = 9602.4171
Best policy = [[0 3 3 3]
 [0 2 0 1]
 [3 1 0 3]
 [1 2 1 3]]
best policy score = 0.74

@jinzishuai

Conclusion: the expected best score is 74-75%.
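
A possible follow-up to sanity-check that ceiling (my sketch, assuming gym's toy-text envs expose the transition model as env.unwrapped.P): compute the optimal policy by value iteration on the known dynamics, then score it with the same 10k-episode evaluate_policy loop.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
P = env.unwrapped.P                  # P[s][a] = [(prob, next_state, reward, done), ...]
nS, nA = env.observation_space.n, env.action_space.n

V = np.zeros(nS)
gamma = 0.99
while True:
    # one sweep of value iteration over the known transition model
    Q = np.array([[sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
                   for a in range(nA)] for s in range(nS)])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1).reshape(4, 4)  # greedy policy as a 4x4 action grid
print(policy)  # then score it with evaluate_policy(env, policy, n_episodes=10000)

If even this policy evaluates to ~0.74-0.75 over 10k episodes, the cap is a property of the environment (slippery transitions plus the step limit), not of the genetic algorithm.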
