[v2] Reward Plug-in

In all the training programs for reinforcement learning in this repository, users MUST provide a Python source file called a reward plug-in, which allows rewards to be designed with great flexibility. The reward is one of the most important elements in determining the direction of a reinforcement learning model, so the ability to design rewards flexibly essentially gives users the freedom to determine the direction of the model.

The only requirement for a reward plug-in is to define and implement the get_reward function with the following signature:

from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    ...

This function is called from reinforcement learning training programs for each mini-batch. Note that the "mini-batch" mentioned here is a collection of training examples specifically constructed for the calculation of rewards, and it is different from the mini-batches actually used as input to the model. Therefore, the size of this mini-batch can differ from that of the training mini-batches, and it can even change with each call to the get_reward function.

The data parameter is an object of the type TensorDict representing a mini-batch. It is a dictionary-like object in which every key is a string or a tuple of strings, and every value is of the type Tensor. The contiguous parameter indicates whether the mini-batch is contiguous (see below for details).

Below, detailed descriptions of each parameter are provided. Note that this is not typical API documentation that lists the methods defined on each parameter and their meanings; instead, it describes the important expressions that can be used with each parameter.

int(data.size(0))

This is an IN parameter. This value is equal to the size of the mini-batch.

data["sparse"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["sparse"], torch.Tensor),
  • assert data["sparse"].device == torch.device("cpu"),
  • assert data["sparse"].dtype == torch.int32,
  • assert data["sparse"].dim() == 2,
  • assert int(data["sparse"].size(0)) == int(data.size(0)), and
  • assert int(data["sparse"].size(1)) == kanachan.constants.MAX_NUM_ACTIVE_SPARSE_FEATURES.

data["sparse"][i] is a vector (one-dimensional tensor) consisting of the sparse features of the state immediately before the action of the i-th training example in the mini-batch, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_SPARSE_FEATURES.

data["numeric"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["numeric"], torch.Tensor),
  • assert data["numeric"].device == torch.device("cpu"),
  • assert data["numeric"].dtype == torch.int32,
  • assert data["numeric"].dim() == 2,
  • assert int(data["numeric"].size(0)) == int(data.size(0)), and
  • assert int(data["numeric"].size(1)) == kanachan.constants.NUM_NUMERIC_FEATURES.

data["numeric"][i] is a vector (one-dimensional tensor) consisting of the numeric features of the state immediately before the action of the i-th training example in the mini-batch.

data["progression"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["progression"], torch.Tensor),
  • assert data["progression"].device == torch.device("cpu"),
  • assert data["progression"].dtype == torch.int32,
  • assert data["progression"].dim() == 2,
  • assert int(data["progression"].size(0)) == int(data.size(0)), and
  • assert int(data["progression"].size(1)) == kanachan.constants.MAX_LENGTH_OF_PROGRESSION_FEATURES.

data["progression"][i] is a vector (one-dimensional tensor) consisting of the progression features of the state immediately before the action of the i-th training example in the mini-batch, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_PROGRESSION_FEATURES.

data["candidates"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["candidates"], torch.Tensor),
  • assert data["candidates"].device == torch.device("cpu"),
  • assert data["candidates"].dtype == torch.int32,
  • assert data["candidates"].dim() == 2,
  • assert int(data["candidates"].size(0)) == int(data.size(0)), and
  • assert int(data["candidates"].size(1)) == kanachan.constants.MAX_NUM_ACTION_CANDIDATES.

data["candidates"][i] is a vector (one-dimensional tensor) consisting of the candidate features of the state immediately before the action of the i-th training example in the mini-batch, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_ACTION_CANDIDATES.

data["action"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["action"], torch.Tensor),
  • assert data["action"].device == torch.device("cpu"),
  • assert data["action"].dtype == torch.int32,
  • assert data["action"].dim() == 1, and
  • assert int(data["action"].size(0)) == int(data.size(0)).

data["action"][i] indicates the action of i-th training example in the mini-batch. The action is represented by the index of one of the candidate features (i.e., data["candidates"]).

data["next", "sparse"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "sparse"], torch.Tensor),
  • assert data["next", "sparse"].device == torch.device("cpu"),
  • assert data["next", "sparse"].dtype == torch.int32,
  • assert data["next", "sparse"].dim() == 2,
  • assert int(data["next", "sparse"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "sparse"].size(1)) == kanachan.constants.MAX_NUM_ACTIVE_SPARSE_FEATURES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "sparse"][i] is a vector (one-dimensional tensor) consisting of the sparse features of the state immediately after that action, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_SPARSE_FEATURES. Otherwise, data["next", "sparse"][i] is filled with kanachan.constants.NUM_TYPES_OF_SPARSE_FEATURES.

data["next", "numeric"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "numeric"], torch.Tensor),
  • assert data["next", "numeric"].device == torch.device("cpu"),
  • assert data["next", "numeric"].dtype == torch.int32,
  • assert data["next", "numeric"].dim() == 2,
  • assert int(data["next", "numeric"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "numeric"].size(1)) == kanachan.constants.NUM_NUMERIC_FEATURES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "numeric"][i] is a vector (one-dimensional tensor) consisting of the numeric features of the state immediately after that action. Otherwise, data["next", "numeric"][i] is filled with 0.

data["next", "progression"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "progression"], torch.Tensor),
  • assert data["next", "progression"].device == torch.device("cpu"),
  • assert data["next", "progression"].dtype == torch.int32,
  • assert data["next", "progression"].dim() == 2,
  • assert int(data["next", "progression"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "progression"].size(1)) == kanachan.constants.MAX_LENGTH_OF_PROGRESSION_FEATURES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "progression"][i] is a vector (one-dimensional tensor) consisting of the progression features of the state immediately after that action, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_PROGRESSION_FEATURES. Otherwise, data["next", "progression"][i] is filled with kanachan.constants.NUM_TYPES_OF_PROGRESSION_FEATURES.

data["next", "candidates"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "candidates"], torch.Tensor),
  • assert data["next", "candidates"].device == torch.device("cpu"),
  • assert data["next", "candidates"].dtype == torch.int32,
  • assert data["next", "candidates"].dim() == 2,
  • assert int(data["next", "candidates"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "candidates"].size(1)) == kanachan.constants.MAX_NUM_ACTION_CANDIDATES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "candidates"][i] is a vector (one-dimensional tensor) consisting of the candidate features of the state immediately after that action, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_ACTION_CANDIDATES. Otherwise, data["next", "candidates"][i] is filled with kanachan.constants.NUM_TYPES_OF_ACTION_CANDIDATES.

data["next", "round_summary"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "round_summary"], torch.Tensor),
  • assert data["next", "round_summary"].device == torch.device("cpu"),
  • assert data["next", "round_summary"].dtype == torch.int32,
  • assert data["next", "round_summary"].dim() == 2,
  • assert int(data["next", "round_summary"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "round_summary"].size(1)) == kanachan.constants.MAX_NUM_ROUND_SUMMARY.

If the i-th training example in the mini-batch corresponds to the last action of a player in a round (that is, data["next", "end_of_round"][i].item() == True), data["next", "round_summary"][i] is a vector (one-dimensional tensor) consisting of the summary of that round, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_ROUND_SUMMARY. Otherwise, data["next", "round_summary"][i] is filled with kanachan.constants.NUM_TYPES_OF_ROUND_SUMMARY.

data["next", "results"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "results"], torch.Tensor),
  • assert data["next", "results"].device == torch.device("cpu"),
  • assert data["next", "results"].dtype == torch.int32,
  • assert data["next", "results"].dim() == 2,
  • assert int(data["next", "results"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "results"].size(1)) == kanachan.constants.RL_NUM_RESULTS.

If the i-th training example in the mini-batch corresponds to the last action of a player in a round (that is, data["next", "end_of_round"][i].item() == True), data["next", "results"][i] is a vector (one-dimensional tensor) consisting of the results of that round. Otherwise, data["next", "results"][i] is filled with 0.

data["next", "end_of_round"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "end_of_round"], torch.Tensor),
  • assert data["next", "end_of_round"].device == torch.device("cpu"),
  • assert data["next", "end_of_round"].dtype == torch.bool,
  • assert data["next", "end_of_round"].dim() == 1, and
  • assert int(data["next", "end_of_round"].size(0)) == int(data.size(0)).

data["next", "end_of_round"][i] indicates whether the i-th training example in the mini-batch corresponds to the last action of a player in a round.

data["next", "end_of_game"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "end_of_game"], torch.Tensor),
  • assert data["next", "end_of_game"].device == torch.device("cpu"),
  • assert data["next", "end_of_game"].dtype == torch.bool,
  • assert data["next", "end_of_game"].dim() == 1, and
  • assert int(data["next", "end_of_game"].size(0)) == int(data.size(0)).

data["next", "end_of_game"][i] indicates whether the i-th training example in the mini-batch corresponds to the last action of a player in a game.

data["next", "done"]

This is an IN-OUT parameter. This parameter will and SHOULD pass all of the following assert statements:

  • assert isinstance(data["next", "done"], torch.Tensor),
  • assert data["next", "done"].device == torch.device("cpu"),
  • assert data["next", "done"].dtype == torch.bool,
  • assert data["next", "done"].dim() == 1, and
  • assert int(data["next", "done"].size(0)) == int(data.size(0)).

The caller of get_reward guarantees the pre-condition data["next", "done"] == data["next", "end_of_game"]. data["next", "done"][i].item() == True indicates that the i-th training example in the mini-batch should be interpreted as the last step of a trajectory. Therefore, in the default state where the get_reward function does not modify data["next", "done"], the last action of a player in each game is interpreted as the last step of a trajectory; in other words, the sequence of actions performed by a player in each game constitutes one trajectory. Furthermore, if the get_reward function assigns data["next", "done"] = data["next", "end_of_round"].detach().clone(), the last action of a player in each round is interpreted as the last step of a trajectory; in other words, the sequence of actions performed by a player in each round then constitutes one trajectory.
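
For example, a plug-in that switches to per-round trajectories could start as follows; the reward computation itself is omitted.

from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    # Mark the last action of each round, rather than of each game, as the
    # last step of a trajectory.
    data["next", "done"] = data["next", "end_of_round"].detach().clone()
    # ... compute `data["next", "reward"]` here ...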

data["next", "reward"]

This is an OUT parameter. The get_reward function SHOULD guarantee the following post-conditions:

  • assert isinstance(data["next", "reward"], torch.Tensor),
  • assert data["next", "reward"].device == torch.device("cpu"),
  • assert data["next", "reward"].dtype in (torch.float64, torch.float32, torch.float16),
  • assert data["next", "reward"].dim() == 1, and
  • assert int(data["next", "reward"].size(0)) == int(data.size(0)).

The value of data["next", "reward"][i] is interpreted as the reward for the action of the i-th training example in the mini-batch. Additionally, to stabilize the training process, it is desirable to keep the mean of data["next", "reward"] as close to zero and its standard deviation as close to one as possible.
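
As a minimal sketch, the following plug-in satisfies all of the post-conditions above by assigning a zero reward to every training example; a real plug-in would replace the zeros with a meaningful reward design.

import torch
from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    # A CPU tensor of a floating-point dtype with one dimension whose length
    # equals the size of the mini-batch, as required by the post-conditions.
    data["next", "reward"] = torch.zeros(batch_size, dtype=torch.float64)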

contiguous

This is an IN parameter. This parameter indicates whether the data parameter is contiguous. The data parameter is said to be contiguous if all of the following conditions are met:

  • data contains all training examples corresponding to the actions performed by a player in a particular game.
  • The mini-batch represented by data is organized in the order in which the actions occurred.
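
As an illustration of what contiguity enables, the following sketch treats a contiguous mini-batch as one ordered trajectory and propagates a hypothetical terminal reward backward with a discount factor. The discount factor _GAMMA, the terminal_reward placeholder, and the zero-reward fallback for the non-contiguous case are assumptions for this sketch, not part of the plug-in interface.

import torch
from tensordict import TensorDict

_GAMMA = 0.99  # hypothetical discount factor

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    reward = torch.zeros(batch_size, dtype=torch.float64)

    if contiguous:
        # The mini-batch is the ordered sequence of all actions performed by
        # one player in one game, so the last example is the final action.
        terminal_reward = 1.0  # placeholder; derive from data["next", "results"] in practice
        for i in range(batch_size):
            reward[i] = terminal_reward * _GAMMA ** (batch_size - 1 - i)

    data["next", "reward"] = reward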

Examples

End-of-Game Raw Points

import torch
from torch import Tensor
from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    sparse: Tensor = data["sparse"]
    results: Tensor = data["next", "results"]
    end_of_game: Tensor = data["next", "end_of_game"]
    data["next", "reward"] = torch.zeros_like(end_of_game, dtype=torch.float64)

    for i in range(batch_size):
        if not end_of_game[i].item():
            continue

        seat = int(sparse[i, 6].item()) - 71
        scores: list[int] = results[i, 4:8].tolist()
        score = scores[seat]

        # Calculate the mean (`_SCORE_MEAN`) and standard deviation (`_SCORE_STDEV`) from the training data in advance.
        data["next", "reward"][i] = (score - _SCORE_MEAN) / _SCORE_STDEV

End-of-Game Ranking

import torch
from torch import Tensor
from tensordict import TensorDict

_REWARD_BY_RANKING = [.....] # See below.
_REWARD = torch.tensor(_REWARD_BY_RANKING, device="cpu", dtype=torch.float64)
_REWARD_MEAN = _REWARD.mean()
_REWARD_STDEV = _REWARD.std()

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    sparse: Tensor = data["sparse"]
    results: Tensor = data["next", "results"]
    end_of_game: Tensor = data["next", "end_of_game"]
    data["next", "reward"] = torch.zeros_like(end_of_game, dtype=torch.float64)

    for i in range(batch_size):
        if not end_of_game[i].item():
            continue

        seat = int(sparse[i, 6].item()) - 71
        scores: list[int] = results[i, 4:8].tolist()
        score = scores[seat]
        ranking = 0
        for j in range(seat):
            if scores[j] >= score:
                ranking += 1
        for j in range(seat + 1, 4):
            if scores[j] > score:
                ranking += 1

        data["next", "reward"][i] = (_REWARD_BY_RANKING[ranking] - _REWARD_MEAN) / _REWARD_STDEV

The code template shown above can be instantiated with the following examples:

  • Top: _REWARD_BY_RANKING = [1.0, -1.0, -1.0, -1.0]
  • Top two: _REWARD_BY_RANKING = [1.0, 1.0, -1.0, -1.0]
  • Tenhou, Tokujo-taku, Half-length Game, 6-dan (天鳳,特上卓,東南戦,六段): _REWARD_BY_RANKING = [75.0, 30.0, 0.0, -120.0]
  • MahjongSoul, Throne Room, Half-length Game, Celestial (雀魂,王座の間,半荘戦,魂天): _REWARD_BY_RANKING = [1.0, 0.4, -0.4, -1.0]

End-of-Game Ranking + Raw Points

import torch
from torch import Tensor
from tensordict import TensorDict

_REWARD_BY_RANKING = [.....] # See below.

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    sparse: Tensor = data["sparse"]
    results: Tensor = data["next", "results"]
    end_of_game: Tensor = data["next", "end_of_game"]
    data["next", "reward"] = torch.zeros_like(end_of_game, dtype=torch.float64)

    for i in range(batch_size):
        if not end_of_game[i].item():
            continue

        seat = int(sparse[i, 6].item()) - 71
        scores: list[int] = results[i, 4:8].tolist()
        score = scores[seat]
        ranking = 0
        for j in range(seat):
            if scores[j] >= score:
                ranking += 1
        for j in range(seat + 1, 4):
            if scores[j] > score:
                ranking += 1

        reward = ..... # See below.
        # Calculate the mean (`_REWARD_MEAN`) and standard deviation (`_REWARD_STDEV`) from the training data in advance.
        data["next", "reward"][i] = (reward - _REWARD_MEAN) / _REWARD_STDEV

The code template shown above can be instantiated with the following examples:

  • MahjongSoul, Jade Room, Half-length Game, Saint 3 (雀魂,玉の間,半荘戦,雀聖3):
    • _REWARD_BY_RANKING = [135.0, 65.0, -5.0, -255.0]
    • reward = _REWARD_BY_RANKING[ranking] + (score - 25000) // 1000
  • M League:
    • _REWARD_BY_RANKING = [500.0, 100.0, -100.0, -300.0]
    • reward = _REWARD_BY_RANKING[ranking] + (score - 30000) // 100