Notes on Training Data
There are two types of training data formats used as input for the learning programs in this project. One is for behavioral cloning (a.k.a. supervised learning), and the other is for offline reinforcement learning. The former is the format obtained by converting Mahjong Soul game records using cryolite/kanachan.annotate. The latter is the format obtained by converting annotation data in the former format using bin/annotate4rl/annotate4rl.py.
Roughly speaking, the training data format for behavioral cloning represents a set of triplets, each consisting of the situation at a decision-making point (see Annotate for the definition of a decision-making point), the actual action taken by the player at that point, and the results of the round and game in which that point appears.
In this format, the annotation of a decision-making point is represented by one text line. Each line is tab-separated into 7 fields, and each field is in turn comma-separated into elements. In each line, the first column is for debugging purposes only, the next 3 columns represent the situation of a decision-making point, the next 2 columns represent the actual action taken by the player at that point, and the final column represents the round and game results.
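The line layout described above can be sketched as a small parser. This is a minimal sketch rather than the project's own loader: the `Annotation` container and `parse_annotation` function are hypothetical names, and the sketch assumes that every element outside the UUID field is written as a decimal integer.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One behavioral-cloning annotation line (hypothetical container)."""
    uuid: str               # field 0: debugging only
    sparse: list[int]       # field 1: situation
    numeric: list[int]      # field 2: situation
    progression: list[int]  # field 3: situation
    candidates: list[int]   # field 4: possible actions
    action_index: int       # field 5: index into `candidates`
    results: list[int]      # field 6: round and game results


def parse_annotation(line: str) -> Annotation:
    """Split one line into its 7 tab-separated, comma-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")

    def ints(field: str) -> list[int]:
        # Assumption: every element is a decimal integer.
        return [int(element) for element in field.split(",")]

    uuid, sparse, numeric, progression, candidates, index, results = fields
    return Annotation(uuid, ints(sparse), ints(numeric), ints(progression),
                      ints(candidates), int(index), ints(results))
```

With such a parser, the action actually taken would be `annotation.candidates[annotation.action_index]`.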
Before explaining the details of each field in an annotation, the following explains the conventions used in annotations.
A 4-player mahjong game has four players, each of whom is distinguished by a "seat". The 0th seat is the dealer (zhuang jia, 荘家) at the start of the game (qi jia, 起家), the 1st seat is the player to the right of the 0th seat (xia jia of qi jia, 起家の下家), the 2nd seat is the player across from the 0th seat (dui mian of qi jia, 起家の対面), and the 3rd seat is the player to the left of the 0th seat (shang jia of qi jia, 起家の上家).
Seat | Meaning |
---|---|
0 | the dealer at the start of the game |
1 | the player to the right of the 0th seat |
2 | the player across from the 0th seat |
3 | the player to the left of the 0th seat |
There are cases where the relative positions of two players need to be represented. For example, complete information about a pon (peng, 碰, ポン) includes who melds the pon and who discarded the melded tile. In such a case, one player is represented by a seat index, and the other is represented by the position relative to the former.
Relseat | Meaning |
---|---|
0 | the player to the right of the player of interest |
1 | the player across from the player of interest |
2 | the player to the left of the player of interest |
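The relationship between seat indices and relative seats can be sketched with modular arithmetic. This is an illustrative sketch that assumes the player to the right of seat `s` is always seat `(s + 1) % 4`, which follows from the seat table above; the function names are hypothetical.

```python
def absolute_seat(seat: int, relseat: int) -> int:
    """Seat index of the player at position `relseat` relative to `seat`."""
    if not 0 <= seat <= 3 or not 0 <= relseat <= 2:
        raise ValueError("seat must be in 0..3 and relseat in 0..2")
    return (seat + relseat + 1) % 4


def relative_seat(seat: int, other: int) -> int:
    """Relseat of `other` as seen from `seat` (the two must differ)."""
    if not 0 <= seat <= 3 or not 0 <= other <= 3 or seat == other:
        raise ValueError("seats must be distinct values in 0..3")
    return (other - seat - 1) % 4
```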
The type of a tile is represented by an integer from 0 to 36, inclusive. Here, 0m, 0p, and 0s denote the red fives.
Tile | Value |
---|---|
0m ~ 9m | 0 ~ 9 |
0p ~ 9p | 10 ~ 19 |
0s ~ 9s | 20 ~ 29 |
1z ~ 7z | 30 ~ 36 |
In some contexts, such as indicating the type of a closed kong (an gang, 暗槓), there is no need to distinguish between black and red tiles. In such a case, the 34 types of tiles excluding the red ones are represented by integers from 0 to 33, inclusive. This encoding is written tile' in the formulas below.
Tile | Value |
---|---|
1m ~ 9m | 0 ~ 8 |
1p ~ 9p | 9 ~ 17 |
1s ~ 9s | 18 ~ 26 |
1z ~ 7z | 27 ~ 33 |
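The two tile encodings above can be sketched as follows. The function names are hypothetical, and the tile strings use the same notation as the tables (for example `0m` for the red 5m and `1z` for East).

```python
SUIT_OFFSET = {"m": 0, "p": 10, "s": 20}


def tile_from_str(name: str) -> int:
    """Encode a tile such as '0m', '7p', or '3z' as an integer in 0..36."""
    number, suit = int(name[0]), name[1]
    if suit == "z":
        if not 1 <= number <= 7:
            raise ValueError(f"invalid honor tile: {name}")
        return 30 + number - 1
    if suit not in SUIT_OFFSET:
        raise ValueError(f"invalid suit: {name}")
    return SUIT_OFFSET[suit] + number


def tile_without_red(tile: int) -> int:
    """Map the 37-value encoding to the 34-value encoding (red 5 -> 5)."""
    if not 0 <= tile <= 36:
        raise ValueError(f"invalid tile: {tile}")
    if tile >= 30:
        return 27 + (tile - 30)
    suit, number = divmod(tile, 10)
    if number == 0:  # a red five counts as an ordinary five
        number = 5
    return suit * 9 + number - 1
```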
The grade (段位) is represented by integers from 0 to 15, inclusive.
Grade | Value |
---|---|
Novice (初心) 1~3 | 0 ~ 2 |
Adept (雀士) 1~3 | 3 ~ 5 |
Expert (雀傑) 1~3 | 6 ~ 8 |
Master (雀豪) 1~3 | 9 ~ 11 |
Saint (雀聖) 1~3 | 12 ~ 14 |
Celestial (魂天) | 15 |
Chows are represented by integers from 0 to 89, inclusive.
Value | Chow (The last element represents the discarded tile) |
---|---|
0 | (2m, 3m, 1m) |
1 | (1m, 3m, 2m) |
2 | (3m, 4m, 2m) |
3 | (1m, 2m, 3m) |
4 | (2m, 4m, 3m) |
5 | (4m, 5m, 3m) |
6 | (4m, 0m, 3m) |
7 | (2m, 3m, 4m) |
8 | (3m, 5m, 4m) |
9 | (3m, 0m, 4m) |
10 | (5m, 6m, 4m) |
11 | (0m, 6m, 4m) |
12 | (3m, 4m, 5m) |
13 | (3m, 4m, 0m) |
14 | (4m, 6m, 5m) |
15 | (4m, 6m, 0m) |
16 | (6m, 7m, 5m) |
17 | (6m, 7m, 0m) |
18 | (4m, 5m, 6m) |
19 | (4m, 0m, 6m) |
20 | (5m, 7m, 6m) |
21 | (0m, 7m, 6m) |
22 | (7m, 8m, 6m) |
23 | (5m, 6m, 7m) |
24 | (0m, 6m, 7m) |
25 | (6m, 8m, 7m) |
26 | (8m, 9m, 7m) |
27 | (6m, 7m, 8m) |
28 | (7m, 9m, 8m) |
29 | (7m, 8m, 9m) |
30 ~ 59 | Likewise for Circle tiles (筒子) |
60 ~ 89 | Likewise for Bamboo tiles (索子) |
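The chow table above appears to follow a regular pattern: rows are ordered by the number of the discarded tile (a red five is ordered as a five), then by the lowest number of the run, with the black variant listed before the red one. The sketch below regenerates the table under that assumption. The rule is inferred from the listed rows rather than taken from the project's code, and `chow_table` is a hypothetical name.

```python
def chow_table() -> list[tuple[str, str, str]]:
    """Rebuild the 90 chow entries; the index in the list is the chow value."""
    table = []
    for suit in "mps":  # Characters, Circles, Bamboos
        for discard in range(1, 10):
            # Every run start..start+2 that contains the discarded number.
            for start in range(max(1, discard - 2), min(7, discard) + 1):
                hand = [n for n in range(start, start + 3) if n != discard]
                black = (hand[0], hand[1], discard)
                table.append(tuple(f"{n}{suit}" for n in black))
                if 5 in black:
                    # Red variant: the five in the run is the red five (0).
                    red = tuple(0 if n == 5 else n for n in black)
                    table.append(tuple(f"{n}{suit}" for n in red))
    return table
```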
Pons are represented by integers from 0 to 39, inclusive.
Value | Pon (The last element represents the discarded tile) |
---|---|
0 | (1m, 1m, 1m) |
1 | (2m, 2m, 2m) |
2 | (3m, 3m, 3m) |
3 | (4m, 4m, 4m) |
4 | (5m, 5m, 5m) |
5 | (0m, 5m, 5m) |
6 | (5m, 5m, 0m) |
7 | (6m, 6m, 6m) |
8 | (7m, 7m, 7m) |
9 | (8m, 8m, 8m) |
10 | (9m, 9m, 9m) |
11 | (1p, 1p, 1p) |
12 | (2p, 2p, 2p) |
13 | (3p, 3p, 3p) |
14 | (4p, 4p, 4p) |
15 | (5p, 5p, 5p) |
16 | (0p, 5p, 5p) |
17 | (5p, 5p, 0p) |
18 | (6p, 6p, 6p) |
19 | (7p, 7p, 7p) |
20 | (8p, 8p, 8p) |
21 | (9p, 9p, 9p) |
22 | (1s, 1s, 1s) |
23 | (2s, 2s, 2s) |
24 | (3s, 3s, 3s) |
25 | (4s, 4s, 4s) |
26 | (5s, 5s, 5s) |
27 | (0s, 5s, 5s) |
28 | (5s, 5s, 0s) |
29 | (6s, 6s, 6s) |
30 | (7s, 7s, 7s) |
31 | (8s, 8s, 8s) |
32 | (9s, 9s, 9s) |
33 | (1z, 1z, 1z) |
34 | (2z, 2z, 2z) |
35 | (3z, 3z, 3z) |
36 | (4z, 4z, 4z) |
37 | (5z, 5z, 5z) |
38 | (6z, 6z, 6z) |
39 | (7z, 7z, 7z) |
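The pon table above can be regenerated programmatically, which may be handy for decoding a pon value back into tiles. This is a minimal sketch and `pon_table` is a hypothetical name; the ordering simply mirrors the listed rows.

```python
def pon_table() -> list[tuple[str, str, str]]:
    """Rebuild the 40 pon entries; the index in the list is the pon value."""
    table = []
    for suit in "mps":
        for n in range(1, 10):
            table.append((f"{n}{suit}",) * 3)
            if n == 5:
                # Red five held in the hand, then red five as the discard.
                table.append((f"0{suit}", f"5{suit}", f"5{suit}"))
                table.append((f"5{suit}", f"5{suit}", f"0{suit}"))
    for n in range(1, 8):  # honor tiles 1z ~ 7z
        table.append((f"{n}z",) * 3)
    return table
```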
The 0th field is the game UUID, which uniquely identifies the game in which the decision-making point appears. This field is for debugging purposes only and is not used for training at all.
The 1st field consists of sparse features. Every element in this field is a non-negative integer. These integers are used as indices for embeddings, which are in turn used as part of the inputs to models. The meaning of each integer is as follows.
Title | Value | Note |
---|---|---|
Room | 0 : Bronze Room (銅の間)<br>1 : Silver Room (銀の間)<br>2 : Gold Room (金の間)<br>3 : Jade Room (玉の間)<br>4 : Throne Room (王座の間) | |
Game Style | 5 : quarter-length game (dong feng zhan, 東風戦)<br>6 : half-length game (ban zhuang zhan, 半荘戦) | |
Seat | 7 ~ 10 | 7 + seat |
Game Wind (Chang, 場) | 11 : East (東場)<br>12 : South (南場)<br>13 : West (西場) | |
Round (Ju, 局) | 14 ~ 17 | 14 + round |
Dora Indicator | 18 ~ 54 | 18 + tile |
2nd Dora Indicator | 55 ~ 91 | optional, 55 + tile |
3rd Dora Indicator | 92 ~ 128 | optional, 92 + tile |
4th Dora Indicator | 129 ~ 165 | optional, 129 + tile |
5th Dora Indicator | 166 ~ 202 | optional, 166 + tile |
# of Left Tiles to Draw | 203 ~ 272 | 203 + (# of left tiles) |
Grade of the player indicated by Seat | 273 ~ 288 | 273 + grade |
Rank of the player indicated by Seat | 289 ~ 292 | 289 + rank |
Grade of the player right next to Seat (Seat の下家) | 293 ~ 308 | 293 + grade |
Rank of the player right next to Seat (Seat の下家) | 309 ~ 312 | 309 + rank |
Grade of the player across from Seat (Seat の対面) | 313 ~ 328 | 313 + grade |
Rank of the player across from Seat (Seat の対面) | 329 ~ 332 | 329 + rank |
Grade of the player left next to Seat (Seat の上家) | 333 ~ 348 | 333 + grade |
Rank of the player left next to Seat (Seat の上家) | 349 ~ 352 | 349 + rank |
Hand (shou pai, 手牌) | 353 ~ 488 | (combination, see below) |
Drawn Tile (zimo pai, 自摸牌) | 489 ~ 525 | optional, 489 + tile |
<PADDING> | 526 | (does not appear in annotation) |
The following is how a tile in the hand is represented:
Tile | Value |
---|---|
Red 5m | 353 |
First 1m | 354 |
Second 1m | 355 |
Third 1m | 356 |
Fourth 1m | 357 |
First 2m | 358 |
..... | ... |
First black 5m | 370 |
Second black 5m | 371 |
Third black 5m | 372 |
First 6m | 373 |
..... | ... |
Fourth 9m | 388 |
Red 5p | 389 |
First 1p | 390 |
..... | ... |
Red 5s | 425 |
First 1s | 426 |
..... | ... |
Fourth 9s | 460 |
First East | 461 |
Second East | 462 |
Third East | 463 |
Fourth East | 464 |
First South | 465 |
..... | ... |
First White Dragon (白) | 477 |
..... | ... |
Fourth Red Dragon (中) | 488 |
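The hand table above can be turned into a formula. The sketch below assumes that the rows elided by "....." continue the visible pattern (one index for the red five, then four indices per tile type, except three for the black five of each suit), which agrees with every value that is listed. The function names are hypothetical.

```python
def hand_tile_index(tile: str, copy: int = 0) -> int:
    """Sparse-feature index of the `copy`-th (0-based) copy of `tile`."""
    number, suit = int(tile[0]), tile[1]
    if suit == "z":
        if not 1 <= number <= 7 or not 0 <= copy <= 3:
            raise ValueError(f"invalid honor tile or copy: {tile}, {copy}")
        return 461 + (number - 1) * 4 + copy
    base = 353 + 36 * "mps".index(suit)  # each suit occupies 36 indices
    if number == 0:  # red five: a single index per suit
        if copy != 0:
            raise ValueError("there is only one red five per suit")
        return base
    limit = 3 if number == 5 else 4  # only three black fives exist
    if not 1 <= number <= 9 or not 0 <= copy < limit:
        raise ValueError(f"invalid tile or copy: {tile}, {copy}")
    if number < 5:
        return base + 1 + (number - 1) * 4 + copy
    if number == 5:
        return base + 17 + copy
    return base + 20 + (number - 6) * 4 + copy


def encode_hand(tiles: list[str]) -> list[int]:
    """Encode a hand, numbering repeated tiles as first, second, and so on."""
    seen: dict[str, int] = {}
    indices = []
    for tile in tiles:
        copy = seen.get(tile, 0)
        indices.append(hand_tile_index(tile, copy))
        seen[tile] = copy + 1
    return indices
```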
The 2nd field consists of numeric features. This field consists of exactly 6 elements. These features are numerically meaningful and directly used as a part of inputs to models. The meaning of each element is as follows.
Element Index | Explanation |
---|---|
0 | The number of counter sticks (ben chang, 本場) |
1 | The number of riichi deposits (供託本数) |
2 | The score of the player indicated by Seat |
3 | The score of the player right next to Seat (Seat の下家) |
4 | The score of the player across from Seat (Seat の対面) |
5 | The score of the player left next to Seat (Seat の上家) |
The 3rd field consists of progression features. This field represents a sequence of non-negative integers, each of which stands for an event in a round of a game. The order of the integers in the sequence is the order in which the events occurred up to the decision-making point. These integers are used as indices for embeddings, which are in turn used as part of the inputs to models. Note, however, that positional encoding must be applied to the embeddings if they are fed to models, such as transformers, that discard the positional (order) information of their input embeddings. The meaning of each integer is as follows.
Title | Values | Note |
---|---|---|
Beginning of Round | 0 | Always starts with this feature |
Discard of Tile (打牌) | 5 ~ 596 | 5 + seat * 148 + tile * 4 + a * 2 + b, where<br>a = 0 : not moqi (手出し)<br>a = 1 : moqi (自摸切り)<br>b = 0 : w/o riichi declaration<br>b = 1 : w/ riichi declaration |
Chow (chi, チー, 吃) | 597 ~ 956 | 597 + seat * 90 + chi |
Pon (peng, ポン, 碰) | 957 ~ 1436 | 957 + seat * 120 + relseat * 40 + peng |
Da Ming Gang (大明槓) | 1437 ~ 1880 | 1437 + seat * 111 + relseat * 37 + tile |
An Gang (暗槓) | 1881 ~ 2016 | 1881 + seat * 34 + tile' |
Jia Gang (加槓) | 2017 ~ 2164 | 2017 + seat * 37 + tile |
<PADDING> | 2165 | (does not appear in annotation) |
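The formulas in the progression table above translate directly into code. The following is an illustrative sketch with hypothetical function names; `tile` uses the 0 ~ 36 encoding, `tile_prime` the 0 ~ 33 encoding, and `chi` and `peng` the chow and pon values defined earlier.

```python
def encode_discard(seat: int, tile: int, moqi: bool, riichi: bool) -> int:
    """Discard event: a = moqi flag, b = riichi declaration flag."""
    return 5 + seat * 148 + tile * 4 + int(moqi) * 2 + int(riichi)


def encode_chow(seat: int, chi: int) -> int:
    return 597 + seat * 90 + chi


def encode_pon(seat: int, relseat: int, peng: int) -> int:
    return 957 + seat * 120 + relseat * 40 + peng


def encode_da_ming_gang(seat: int, relseat: int, tile: int) -> int:
    return 1437 + seat * 111 + relseat * 37 + tile


def encode_an_gang(seat: int, tile_prime: int) -> int:
    return 1881 + seat * 34 + tile_prime


def encode_jia_gang(seat: int, tile: int) -> int:
    return 2017 + seat * 37 + tile
```

Plugging the largest arguments into each function gives the upper end of the corresponding range in the table, which is a quick consistency check on the offsets.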
The 4th field consists of all the possible actions at that decision-making point. They are called option features.
Type of Actions | Value | Note |
---|---|---|
Discarding tile | 0 ~ 147 | tile * 4 + a * 2 + b, where<br>a = 0 : not moqi (手出し)<br>a = 1 : moqi (自摸切り)<br>b = 0 : w/o riichi declaration<br>b = 1 : w/ riichi declaration |
An Gang (暗槓) | 148 ~ 181 | 148 + tile' |
Jia Gang (加槓) | 182 ~ 218 | Represented by the tile newly added to an existing peng.<br>182 + tile |
Zimo Hu (自摸和) | 219 | |
Jiu Zhong Jiu Pai (九種九牌) | 220 | |
Skip | 221 | |
Chow (chi, チー, 吃) | 222 ~ 311 | 222 + chi |
Pon (peng, ポン, 碰) | 312 ~ 431 | 312 + relseat * 40 + peng |
Da Ming Gang (大明槓) | 432 ~ 542 | Represented by the discarded tile.<br>432 + relseat * 37 + tile |
Rong (栄和) | 543 ~ 545 | 543 : from xia jia (下家から)<br>544 : from dui mian (対面から)<br>545 : from shang jia (上家から) |
<VALUE> | 546 | (does not appear in annotation) |
<PADDING> | 547 | (does not appear in annotation) |
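For debugging, an option value can be decoded back into its components by inverting the formulas in the table above. This is a minimal sketch: `describe_action` and the tuple layout it returns are hypothetical, not part of the project.

```python
def describe_action(code: int) -> tuple:
    """Decode an option value (0..545) into (kind, *components)."""
    if not 0 <= code <= 545:
        raise ValueError(f"not an action that appears in annotation: {code}")
    if code <= 147:
        tile, rest = divmod(code, 4)
        moqi, riichi = divmod(rest, 2)
        return ("discard", tile, bool(moqi), bool(riichi))
    if code <= 181:
        return ("an_gang", code - 148)   # tile' (0..33)
    if code <= 218:
        return ("jia_gang", code - 182)  # tile (0..36)
    if code == 219:
        return ("zimo_hu",)
    if code == 220:
        return ("jiu_zhong_jiu_pai",)
    if code == 221:
        return ("skip",)
    if code <= 311:
        return ("chow", code - 222)
    if code <= 431:
        relseat, peng = divmod(code - 312, 40)
        return ("pon", relseat, peng)
    if code <= 542:
        relseat, tile = divmod(code - 432, 37)
        return ("da_ming_gang", relseat, tile)
    return ("rong", code - 543)  # relseat of the discarding player
```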
The 5th field indicates the actual action chosen by the player (indicated by Seat) at that decision-making point. This field is the index to one of the possible actions enumerated in the 4th field.
The 6th field consists of some aspects of the final result of the round and game in which the decision-making point appears. This field consists of exactly 12 elements.
Element Index | Explanation |
---|---|
0 | End-of-round result from the point of view of the player indicated by Seat<br>0 : 自家自摸和 (win of Seat by drawing a tile)<br>1 : 下家自摸和 (win of Seat's right neighbor by drawing a tile)<br>2 : 対面自摸和 (win of the player across from Seat by drawing a tile)<br>3 : 上家自摸和 (win of Seat's left neighbor by drawing a tile)<br>4 : 下家からの自家栄和 (win of Seat on a tile discarded by the right neighbor)<br>5 : 対面からの自家栄和 (win of Seat on a tile discarded by the player across from Seat)<br>6 : 上家からの自家栄和 (win of Seat on a tile discarded by the left neighbor)<br>7 : 下家への放銃 (Seat's deal-in to the right neighbor)<br>8 : 対面への放銃 (Seat's deal-in to the player across from Seat)<br>9 : 上家への放銃 (Seat's deal-in to the left neighbor)<br>10 : 下家へ対面から横移動 (deal-in from the player across from Seat to the right neighbor)<br>11 : 下家へ上家から横移動 (deal-in from the left neighbor to the right neighbor)<br>12 : 対面へ下家から横移動 (deal-in from the right neighbor to the player across from Seat)<br>13 : 対面へ上家から横移動 (deal-in from the left neighbor to the player across from Seat)<br>14 : 上家へ下家から横移動 (deal-in from the right neighbor to the left neighbor)<br>15 : 上家へ対面から横移動 (deal-in from the player across from Seat to the left neighbor)<br>16 : 荒牌平局 (不聴, no tiles left, Seat's hand not ready)<br>17 : 荒牌平局 (聴牌, no tiles left, Seat's hand ready)<br>18 : 途中流局 (interruption of the round) |
1 | Round delta of the score of the player indicated by Seat |
2 | Round delta of the score of the player right next to Seat (Seat の下家) |
3 | Round delta of the score of the player across from Seat (Seat の対面) |
4 | Round delta of the score of the player left next to Seat (Seat の上家) |
5 | End-of-round ranking of the player indicated by Seat |
6 | End-of-round ranking of the player right next to Seat |
7 | End-of-round ranking of the player across from Seat |
8 | End-of-round ranking of the player left next to Seat |
9 | End-of-game ranking of the player indicated by Seat |
10 | End-of-game score of the player indicated by Seat |
11 | Game delta of grading score of the player indicated by Seat |
Roughly speaking, the training data format for offline reinforcement learning consists of a set of triplets (s, a, s') or (s, a, o), which represent state transitions from a decision-making point to either the next consecutive decision-making point or the "terminal state" of the game.
In the former, (s, a, s'), s and s' represent the situations at two consecutive decision-making points as seen from one player's perspective. It follows that s is never the last decision-making point of a game for that player. a represents the action taken by the player at s. In other words, (s, a, s') represents a state transition from s to s', from the perspective of one player.
In the latter, (s, a, o), s represents the situation at the last decision-making point of a game from the perspective of one player. Note that the last decision-making point is defined per player, so there are four (s, a, o) triplets in each game of 4-player mahjong. a represents the action taken by the player at s. In other words, (s, a, o) represents a state transition from s to the "terminal state" of the game, where a is the last action taken by the player in that game and o is the result of the game.
In more detail, the annotation of a state transition from a decision-making point to the next consecutive decision-making point or to the terminal state of the game is represented by one text line. Each line is tab-separated into either 9 or 7 fields, and each field is in turn comma-separated into elements. Lines with 9 tab-separated fields are annotations of state transitions from a decision-making point to the next consecutive decision-making point. Lines with 7 tab-separated fields are annotations of state transitions from a decision-making point to the terminal state of the game.
Each line with 9 tab-separated fields is as follows:
(FIRST SPARSE FEATURES)\t(FIRST NUMERIC FEATURES)\t(FIRST PROGRESSION FEATURES)\t(FIRST OPTION FEATURES)\t(ACTION INDEX)\t(SECOND SPARSE FEATURES)\t(SECOND NUMERIC FEATURES)\t(SECOND PROGRESSION FEATURES)\t(SECOND OPTION FEATURES)
In each line with 9 tab-separated fields, the first 4 fields (FIRST SPARSE FEATURES, FIRST NUMERIC FEATURES, FIRST PROGRESSION FEATURES, and FIRST OPTION FEATURES) represent the situation at the decision-making point before the transition, the next field (ACTION INDEX) represents the action taken by the player that causes the transition, and the final 4 fields (SECOND SPARSE FEATURES, SECOND NUMERIC FEATURES, SECOND PROGRESSION FEATURES, and SECOND OPTION FEATURES) represent the situation at the decision-making point after the transition.
Each line with 7 tab-separated fields is as follows:
(SPARSE FEATURES)\t(NUMERIC FEATURES)\t(PROGRESSION FEATURES)\t(OPTION FEATURES)\t(ACTION INDEX)\t(GAME RANK)\t(GAME SCORE)
In each line with 7 tab-separated fields, the first 4 fields (SPARSE FEATURES, NUMERIC FEATURES, PROGRESSION FEATURES, and OPTION FEATURES) represent the situation of the decision-making point before the transition, the next field (ACTION INDEX) represents the action taken by the player causing the transition to the terminal state, and the final 2 fields represent the result of the game, i.e., the final rank and score at the game end.
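The two line shapes can be told apart by counting the tab-separated fields, as in the sketch below. This is a hedged illustration, not the project's own reader: `parse_transition` is a hypothetical name, and the sketch assumes that all elements, including the game rank and game score, are written as decimal integers.

```python
def parse_transition(line: str):
    """Return (state, action_index, next_state, outcome) for one line.

    Exactly one of `next_state` and `outcome` is None, depending on
    whether the line has 9 fields (s, a, s') or 7 fields (s, a, o).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (7, 9):
        raise ValueError(f"expected 7 or 9 fields, got {len(fields)}")

    def ints(field: str) -> list[int]:
        # Assumption: every element is a decimal integer.
        return [int(element) for element in field.split(",")]

    # Sparse, numeric, progression, and option features before the transition.
    state = tuple(ints(field) for field in fields[:4])
    action_index = int(fields[4])
    if len(fields) == 9:
        next_state = tuple(ints(field) for field in fields[5:9])
        return state, action_index, next_state, None
    # Terminal transition: (GAME RANK, GAME SCORE).
    outcome = (int(fields[5]), int(fields[6]))
    return state, action_index, None, outcome
```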
All the learning programs in this project assume that the training data may be extremely large. This includes the possibility that the training data will not fit in main memory (as opposed to GPU memory) or even on disk. Therefore, the learning programs do not load the whole training data into memory at startup; instead, they read the training data sequentially from the beginning as needed. This way, the learning programs consume very little main memory, no matter how large the training data is. The learning programs also support training data compressed with gzip or bzip2. If the file name of the training data ends with ".gz" or ".bz2", the learning programs automatically decompress the training data as they read it.
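The sequential, decompress-on-the-fly access pattern described above can be sketched with the Python standard library. This is not the project's actual data loader; `iter_lines` is a hypothetical helper that picks a decompressor from the file name suffix and yields one line at a time, so memory use stays roughly constant regardless of file size.

```python
import bz2
import gzip
from pathlib import Path
from typing import Iterator


def iter_lines(path) -> Iterator[str]:
    """Yield lines of a plain, gzip, or bzip2 text file one at a time."""
    path = Path(path)
    if path.suffix == ".gz":
        opener = gzip.open
    elif path.suffix == ".bz2":
        opener = bz2.open
    else:
        opener = open
    # Text mode ("rt") makes all three openers yield str lines lazily.
    with opener(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            yield line.rstrip("\n")
```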
On the other hand, always reading the training data sequentially from the beginning has a downside: users need to shuffle the training data before feeding it to a learning program. In particular, feeding annotated data created by annotate into learning programs without shuffling is strongly discouraged. This is because, in annotated data created by annotate, the annotations for each round are clustered together in one part of the training data, so very similar training samples are quite likely to appear in the same mini-batch. In general, training samples in machine learning are assumed to be independent and identically distributed (i.i.d.), and it is best to avoid such a bias in training samples.
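As an illustration of the shuffling step, the sketch below shuffles the lines of an uncompressed annotation file in memory. It only applies when the file fits in main memory; larger data would need an external-memory shuffle, which is not shown here. `shuffle_lines` and its parameters are hypothetical names.

```python
import random


def shuffle_lines(src: str, dst: str, seed=None) -> None:
    """Write the lines of `src` to `dst` in uniformly random order."""
    with open(src, encoding="utf-8") as stream:
        lines = stream.readlines()
    # Make sure the last line ends with a newline so lines never merge.
    lines = [line if line.endswith("\n") else line + "\n" for line in lines]
    random.Random(seed).shuffle(lines)
    with open(dst, "w", encoding="utf-8") as stream:
        stream.writelines(lines)
```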