If AlphaZero competes with a TicTacToe beginner, how does it ensure it never loses? (a convergence question) #252
Comments
The alpha-zero agent can learn simple games, such as TicTacToe, very quickly, to the point where it has solved them after only a few iterations. In the case of TicTacToe, I found that the agent learns that, against an optimal player, it can never win. This leads to the agent playing only moves that would result in a draw; it doesn't seem to try to win.
Yeah, the predicted state value for the first player is really high, while the value for the second player is really low; that is enough to say the second player cannot win if the first player plays optimally. AlphaZero may effectively treat the game as too lopsided to be worth training on further... Also, after many iterations, AlphaZero for TicTacToe will draw frequently, so once the earlier training data are removed, the model might forget how games against beginners look. I think it does matter when a beginner is the first player and AlphaZero is the second, since the second player has only trained against an expert who always opens with a particular sequence of moves... I think playing against a random player is a simple but valid solution. For Go or Chess, it probably won't be a problem.
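To make the random-player idea concrete, here is a minimal, self-contained sketch. It is not this repo's actual API; the trained agent would need to be wrapped to match the `player(board, side)` signature used below, and all names are illustrative.

```python
import random

LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    # Return +1 or -1 if that side has three in a row, else 0.
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == 0]

def random_player(board, side):
    # A stand-in for a beginner: pick a uniformly random legal move.
    return random.choice(legal_moves(board))

def play_game(first, second):
    """Play one game; return +1 if `first` wins, -1 if `second` wins, 0 for a draw."""
    board, side = [0] * 9, 1
    players = {1: first, -1: second}
    while legal_moves(board) and winner(board) == 0:
        board[players[side](board, side)] = side
        side = -side
    return winner(board)

# Example: two random players, just to exercise the harness.
results = [play_game(random_player, random_player) for _ in range(1000)]
print({r: results.count(r) for r in (+1, 0, -1)})
```

Replacing the second `random_player` with the trained agent tests exactly the scenario in this thread: if the agent has truly solved the game, it should never lose as the second player, no matter how the beginner opens.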
Yes, maybe a good training method for TicTacToe would be to play fewer games per iteration (maybe around 20?), as there are far fewer possible states. I don't think it would take more than 20 iterations for the model to solve the game, even with this data, so the model would still be training on "beginner" data.
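If the implementation exposes these counts in a config object, the suggestion amounts to something like the sketch below. The parameter names are illustrative, not this repo's actual settings.

```python
# Illustrative hyperparameters; actual names depend on the implementation.
args = {
    "num_iterations": 20,          # TicTacToe's tiny state space needs few iterations
    "games_per_iteration": 20,     # fewer self-play games per iteration, as suggested
    "buffer_iterations_kept": 20,  # keep every iteration so early "beginner" games survive
}
```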
How about adding a little bit of noise to the first few moves of a game to randomize each opening? This might teach it to play all types of openings, not just optimal ones.
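One way to realize this, sketched with hypothetical names (here `policy` is a dict mapping legal moves to MCTS probabilities; nothing below is this repo's API): play a uniformly random legal move with some probability during the first few plies, and sample from the MCTS policy afterwards.

```python
import random

def choose_move(policy, move_number, opening_plies=2, epsilon=0.25):
    """policy: dict mapping legal moves to MCTS probabilities.
    For the first `opening_plies` moves, play a uniformly random legal
    move with probability epsilon to diversify openings; otherwise
    sample from the MCTS policy as usual."""
    moves = list(policy)
    if move_number < opening_plies and random.random() < epsilon:
        return random.choice(moves)
    return random.choices(moves, weights=[policy[m] for m in moves], k=1)[0]
```

For comparison, AlphaGo Zero diversified openings by sampling moves in proportion to root visit counts (temperature 1) for the first 30 moves and adding Dirichlet noise to the root priors; either mechanism keeps non-optimal openings in the training data.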
What I mean is that after a few iterations of self-play and training, I think the MCTS will converge and always place the first piece in the center of the 3x3 board (starting from the center has the highest probability of winning). Future games generated by self-play will then always start from the center.
Since the early dataset will be removed from the buffer (due to the maximum size limitation), the dataset used for training will contain only expert-level games. Because neural networks may forget early patterns as the early training data are removed, I think future MCTS players will forget how to deal with a beginner who may not put the first piece in the center.
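The forgetting mechanism described above is easy to see in a sketch of a size-capped replay buffer. This is illustrative only, assuming a `deque`-based buffer like many AlphaZero implementations use; the cap value is made up.

```python
from collections import deque

MAX_EXAMPLES = 2000                  # illustrative cap, not the repo's value
buffer = deque(maxlen=MAX_EXAMPLES)  # oldest examples are evicted automatically

def add_self_play_examples(examples):
    """examples: list of (state, mcts_policy, outcome) tuples from one iteration.
    Once the buffer is full, extending it silently drops the oldest entries,
    i.e. the early games containing non-center (beginner-like) openings."""
    buffer.extend(examples)
```

Once every retained game opens in the center, the network is never again trained on positions that follow a corner or edge opening, which is exactly the blind spot the question below asks about.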
If I place the first piece in the corner and AlphaZero is the second player, will AlphaZero never lose?