[v2] Reward Plug-in

In all the training programs for reinforcement learning in this repository, users MUST provide a Python source file called a reward plug-in, which allows rewards to be designed with great flexibility. The reward is one of the most important elements in determining the direction of a reinforcement learning model, so the ability to design rewards flexibly essentially gives users the freedom to determine the direction of the model.

The only requirement for a reward plug-in is to define and implement the get_reward function with the following signature:

from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    ...

This function is called from reinforcement learning training programs for each mini-batch. Note that the "mini-batch" mentioned here is a collection of training examples specifically constructed for the calculation of rewards, and it is different from the mini-batches actually used as input to the model. Therefore, the size of this mini-batch can differ from that of the training mini-batches, and it can even change with each call to the get_reward function.

The data parameter is an object of the type TensorDict representing a mini-batch. It is a dictionary-like object in which every key is a string or a tuple of strings, and every value is of the type Tensor. The contiguous parameter indicates whether the mini-batch is contiguous (see below for details).

Below, detailed descriptions of each parameter are provided. Note that this is not typical API documentation that lists the methods defined on each parameter and their meanings; instead, it describes the important expressions that can be used with each parameter.

int(data.size(0))

This is an IN parameter. This value is equal to the size of the mini-batch.

data["sparse"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["sparse"], torch.Tensor),
  • assert data["sparse"].device == torch.device("cpu"),
  • assert data["sparse"].dtype == torch.int32,
  • assert data["sparse"].dim() == 2,
  • assert int(data["sparse"].size(0)) == int(data.size(0)), and
  • assert int(data["sparse"].size(1)) == kanachan.constants.MAX_NUM_ACTIVE_SPARSE_FEATURES.

data["sparse"][i] is a vector (one-dimensional tensor) consisting of the sparse features of the state immediately before the action of the i-th training example in the mini-batch, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_SPARSE_FEATURES.

data["numeric"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["numeric"], torch.Tensor),
  • assert data["numeric"].device == torch.device("cpu"),
  • assert data["numeric"].dtype == torch.int32,
  • assert data["numeric"].dim() == 2,
  • assert int(data["numeric"].size(0)) == int(data.size(0)), and
  • assert int(data["numeric"].size(1)) == kanachan.constants.NUM_NUMERIC_FEATURES.

data["numeric"][i] is a vector (one-dimensional tensor) consisting of the numeric features of the state immediately before the action of the i-th training example in the mini-batch.

data["progression"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["progression"], torch.Tensor),
  • assert data["progression"].device == torch.device("cpu"),
  • assert data["progression"].dtype == torch.int32,
  • assert data["progression"].dim() == 2,
  • assert int(data["progression"].size(0)) == int(data.size(0)), and
  • assert int(data["progression"].size(1)) == kanachan.constants.MAX_LENGTH_OF_PROGRESSION_FEATURES.

data["progression"][i] is a vector (one-dimensional tensor) consisting of the progression features of the state immediately before the action of the i-th training example in the mini-batch, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_PROGRESSION_FEATURES.

data["candidates"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["candidates"], torch.Tensor),
  • assert data["candidates"].device == torch.device("cpu"),
  • assert data["candidates"].dtype == torch.int32,
  • assert data["candidates"].dim() == 2,
  • assert int(data["candidates"].size(0)) == int(data.size(0)), and
  • assert int(data["candidates"].size(1)) == kanachan.constants.MAX_NUM_ACTION_CANDIDATES.

data["candidates"][i] is a vector (one-dimensional tensor) consisting of the candidate features of the state immediately before the action of the i-th training example in the mini-batch, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_ACTION_CANDIDATES.

data["action"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["action"], torch.Tensor),
  • assert data["action"].device == torch.device("cpu"),
  • assert data["action"].dtype == torch.int32,
  • assert data["action"].dim() == 1, and
  • assert int(data["action"].size(0)) == int(data.size(0)).

data["action"][i] indicates the action of i-th training example in the mini-batch. The action is represented by the index of one of the candidate features (i.e., data["candidates"]).

data["next", "sparse"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "sparse"], torch.Tensor),
  • assert data["next", "sparse"].device == torch.device("cpu"),
  • assert data["next", "sparse"].dtype == torch.int32,
  • assert data["next", "sparse"].dim() == 2,
  • assert int(data["next", "sparse"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "sparse"].size(1)) == kanachan.constants.MAX_NUM_ACTIVE_SPARSE_FEATURES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "sparse"][i] is a vector (one-dimensional tensor) consisting of the sparse features of the state immediately after that action, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_SPARSE_FEATURES. Otherwise, data["next", "sparse"][i] is filled with kanachan.constants.NUM_TYPES_OF_SPARSE_FEATURES.

data["next", "numeric"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "numeric"], torch.Tensor),
  • assert data["next", "numeric"].device == torch.device("cpu"),
  • assert data["next", "numeric"].dtype == torch.int32,
  • assert data["next", "numeric"].dim() == 2,
  • assert int(data["next", "numeric"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "numeric"].size(1)) == kanachan.constants.NUM_NUMERIC_FEATURES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "numeric"][i] is a vector (one-dimensional tensor) consisting of the numeric features of the state immediately after that action. Otherwise, data["next", "numeric"][i] is filled with 0.

data["next", "progression"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "progression"], torch.Tensor),
  • assert data["next", "progression"].device == torch.device("cpu"),
  • assert data["next", "progression"].dtype == torch.int32,
  • assert data["next", "progression"].dim() == 2,
  • assert int(data["next", "progression"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "progression"].size(1)) == kanachan.constants.MAX_LENGTH_OF_PROGRESSION_FEATURES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "progression"][i] is a vector (one-dimensional tensor) consisting of the progression features of the state immediately after that action, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_PROGRESSION_FEATURES. Otherwise, data["next", "progression"][i] is filled with kanachan.constants.NUM_TYPES_OF_PROGRESSION_FEATURES.

data["next", "candidates"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "candidates"], torch.Tensor),
  • assert data["next", "candidates"].device == torch.device("cpu"),
  • assert data["next", "candidates"].dtype == torch.int32,
  • assert data["next", "candidates"].dim() == 2,
  • assert int(data["next", "candidates"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "candidates"].size(1)) == kanachan.constants.MAX_NUM_ACTION_CANDIDATES.

If the i-th training example in the mini-batch does not correspond to the last action of a player in a game (that is, data["next", "end_of_game"][i].item() == False), data["next", "candidates"][i] is a vector (one-dimensional tensor) consisting of the candidate features of the state immediately after that action, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_ACTION_CANDIDATES. Otherwise, data["next", "candidates"][i] is filled with kanachan.constants.NUM_TYPES_OF_ACTION_CANDIDATES.

data["next", "round_summary"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "round_summary"], torch.Tensor),
  • assert data["next", "round_summary"].device == torch.device("cpu"),
  • assert data["next", "round_summary"].dtype == torch.int32,
  • assert data["next", "round_summary"].dim() == 2,
  • assert int(data["next", "round_summary"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "round_summary"].size(1)) == kanachan.constants.MAX_NUM_ROUND_SUMMARY.

If the i-th training example in the mini-batch corresponds to the last action of a player in a round (that is, data["next", "end_of_round"][i].item() == True), data["next", "round_summary"][i] is a vector (one-dimensional tensor) consisting of the summary of that round, and the end of the vector is padded with kanachan.constants.NUM_TYPES_OF_ROUND_SUMMARY. Otherwise, data["next", "round_summary"][i] is filled with kanachan.constants.NUM_TYPES_OF_ROUND_SUMMARY.

data["next", "results"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "results"], torch.Tensor),
  • assert data["next", "results"].device == torch.device("cpu"),
  • assert data["next", "results"].dtype == torch.int32,
  • assert data["next", "results"].dim() == 2,
  • assert int(data["next", "results"].size(0)) == int(data.size(0)), and
  • assert int(data["next", "results"].size(1)) == kanachan.constants.RL_NUM_RESULTS.

If the i-th training example in the mini-batch corresponds to the last action of a player in a round (that is, data["next", "end_of_round"][i].item() == True), data["next", "results"][i] is a vector (one-dimensional tensor) consisting of the results of that round. Otherwise, data["next", "results"][i] is filled with 0.

data["next", "end_of_round"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "end_of_round"], torch.Tensor),
  • assert data["next", "end_of_round"].device == torch.device("cpu"),
  • assert data["next", "end_of_round"].dtype == torch.bool,
  • assert data["next", "end_of_round"].dim() == 1, and
  • assert int(data["next", "end_of_round"].size(0)) == int(data.size(0)).

data["next", "end_of_round"][i] indicates whether the i-th training example in the mini-batch corresponds to the last action of a player in a round.

data["next", "end_of_game"]

This is an IN parameter. This parameter will pass all of the following assert statements:

  • assert isinstance(data["next", "end_of_game"], torch.Tensor),
  • assert data["next", "end_of_game"].device == torch.device("cpu"),
  • assert data["next", "end_of_game"].dtype == torch.bool,
  • assert data["next", "end_of_game"].dim() == 1, and
  • assert int(data["next", "end_of_game"].size(0)) == int(data.size(0)).

data["next", "end_of_game"][i] indicates whether the i-th training example in the mini-batch corresponds to the last action of a player in a game.

data["next", "done"]

This is an IN-OUT parameter. This parameter will and SHOULD pass all of the following assert statements:

  • assert isinstance(data["next", "done"], torch.Tensor),
  • assert data["next", "done"].device == torch.device("cpu"),
  • assert data["next", "done"].dtype == torch.bool,
  • assert data["next", "done"].dim() == 1, and
  • assert int(data["next", "done"].size(0)) == int(data.size(0)).

The caller of get_reward guarantees the pre-condition data["next", "done"] == data["next", "end_of_game"]. data["next", "done"][i].item() == True indicates that the i-th training example in the mini-batch should be interpreted as the last step of a trajectory. Therefore, in the default state where the get_reward function does not modify data["next", "done"], the last action of a player in each game is interpreted as the last step of a trajectory; in other words, the sequence of actions performed by a player in each game constitutes one trajectory. Furthermore, if the get_reward function assigns data["next", "done"] = data["next", "end_of_round"].detach().clone(), the last action of a player in each round is interpreted as the last step of a trajectory; in other words, the sequence of actions performed by a player in each round then constitutes one trajectory.
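
For example, a plug-in that switches to per-round trajectories could start as follows; the reward computation itself is omitted.

from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    # Mark the last action of each round, rather than of each game, as the
    # last step of a trajectory.
    data["next", "done"] = data["next", "end_of_round"].detach().clone()
    # ... compute `data["next", "reward"]` here ...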

data["next", "reward"]

This is an OUT parameter. The get_reward function SHOULD guarantee the following post-conditions:

  • assert isinstance(data["next", "reward"], torch.Tensor),
  • assert data["next", "reward"].device == torch.device("cpu"),
  • assert data["next", "reward"].dtype in (torch.float64, torch.float32, torch.float16),
  • assert data["next", "reward"].dim() == 1, and
  • assert int(data["next", "reward"].size(0)) == int(data.size(0)).

The value of data["next", "reward"][i] is interpreted as the reward for the action of the i-th training example in the mini-batch. Additionally, to stabilize the training process, it is desirable to keep the mean of data["next", "reward"] as close to zero and its standard deviation as close to one as possible.
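
As a minimal sketch, the following plug-in satisfies all of the post-conditions above by assigning a zero reward to every training example; a real plug-in would replace the zeros with a meaningful reward design.

import torch
from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    # A CPU tensor of a floating-point dtype with one dimension whose length
    # equals the size of the mini-batch, as required by the post-conditions.
    data["next", "reward"] = torch.zeros(batch_size, dtype=torch.float64)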

contiguous

This is an IN parameter. This parameter indicates whether the data parameter is contiguous. The data parameter is said to be contiguous if all of the following conditions are met:

  • data contains all training examples corresponding to the actions performed by a player in a particular game.
  • The mini-batch represented by data is organized in the order in which the actions occurred.
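
As an illustration of what contiguity enables, the following sketch treats a contiguous mini-batch as one ordered trajectory and propagates a hypothetical terminal reward backward with a discount factor. The discount factor _GAMMA, the terminal_reward placeholder, and the zero-reward fallback for the non-contiguous case are assumptions for this sketch, not part of the plug-in interface.

import torch
from tensordict import TensorDict

_GAMMA = 0.99  # hypothetical discount factor

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    reward = torch.zeros(batch_size, dtype=torch.float64)

    if contiguous:
        # The mini-batch is the ordered sequence of all actions performed by
        # one player in one game, so the last example is the final action.
        terminal_reward = 1.0  # placeholder; derive from data["next", "results"] in practice
        for i in range(batch_size):
            reward[i] = terminal_reward * _GAMMA ** (batch_size - 1 - i)

    data["next", "reward"] = reward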

Examples

End-of-Game Raw Points

import torch
from torch import Tensor
from tensordict import TensorDict

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    sparse: Tensor = data["sparse"]
    results: Tensor = data["next", "results"]
    end_of_game: Tensor = data["next", "end_of_game"]
    data["next", "reward"] = torch.zeros_like(end_of_game, dtype=torch.float64)

    for i in range(batch_size):
        if not end_of_game[i].item():
            continue

        seat = int(sparse[i, 6].item()) - 71
        scores: list[int] = results[i, 4:8].tolist()
        score = scores[seat]

        # Calculate the mean (`_SCORE_MEAN`) and standard deviation (`_SCORE_STDEV`) from the training data in advance.
        data["next", "reward"][i] = (score - _SCORE_MEAN) / _SCORE_STDEV

End-of-Game Ranking

import torch
from torch import Tensor
from tensordict import TensorDict

_REWARD_BY_RANKING = [.....] # See below.
_REWARD = torch.tensor(_REWARD_BY_RANKING, device="cpu", dtype=torch.float64)
_REWARD_MEAN = _REWARD.mean()
_REWARD_STDEV = _REWARD.std()

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    sparse: Tensor = data["sparse"]
    results: Tensor = data["next", "results"]
    end_of_game: Tensor = data["next", "end_of_game"]
    data["next", "reward"] = torch.zeros_like(end_of_game, dtype=torch.float64)

    for i in range(batch_size):
        if not end_of_game[i].item():
            continue

        seat = int(sparse[i, 6].item()) - 71
        scores: list[int] = results[i, 4:8].tolist()
        score = scores[seat]
        ranking = 0
        for j in range(seat):
            if scores[j] >= score:
                ranking += 1
        for j in range(seat + 1, 4):
            if scores[j] > score:
                ranking += 1

        data["next", "reward"][i] = (_REWARD_BY_RANKING[ranking] - _REWARD_MEAN) / _REWARD_STDEV

The code template shown above can be instantiated with the following examples:

  • Top: _REWARD_BY_RANKING = [1.0, -1.0, -1.0, -1.0]
  • Top two: _REWARD_BY_RANKING = [1.0, 1.0, -1.0, -1.0]
  • Tenhou, Tokujo-taku, Half-length Game, 6-dan (天鳳,特上卓,東南戦,六段): _REWARD_BY_RANKING = [75.0, 30.0, 0.0, -120.0]
  • MahjongSoul, Throne Room, Half-length Game, Celestial (雀魂,王座の間,半荘戦,魂天): _REWARD_BY_RANKING = [1.0, 0.4, -0.4, -1.0]

End-of-Game Ranking + Raw Points

import torch
from torch import Tensor
from tensordict import TensorDict

_REWARD_BY_RANKING = [.....] # See below.

def get_reward(data: TensorDict, contiguous: bool) -> None:
    batch_size = int(data.size(0))
    sparse: Tensor = data["sparse"]
    results: Tensor = data["next", "results"]
    end_of_game: Tensor = data["next", "end_of_game"]
    data["next", "reward"] = torch.zeros_like(end_of_game, dtype=torch.float64)

    for i in range(batch_size):
        if not end_of_game[i].item():
            continue

        seat = int(sparse[i, 6].item()) - 71
        scores: list[int] = results[i, 4:8].tolist()
        score = scores[seat]
        ranking = 0
        for j in range(seat):
            if scores[j] >= score:
                ranking += 1
        for j in range(seat + 1, 4):
            if scores[j] > score:
                ranking += 1

        reward = ..... # See below.
        # Calculate the mean (`_REWARD_MEAN`) and standard deviation (`_REWARD_STDEV`) from the training data in advance.
        data["next", "reward"][i] = (reward - _REWARD_MEAN) / _REWARD_STDEV

The code template shown above can be instantiated with the following examples:

  • MahjongSoul, Jade Room, Half-length Game, Saint 3 (雀魂,玉の間,半荘戦,雀聖3):
    • _REWARD_BY_RANKING = [135.0, 65.0, -5.0, -255.0]
    • reward = _REWARD_BY_RANKING[ranking] + (score - 25000) // 1000
  • M League:
    • _REWARD_BY_RANKING = [500.0, 100.0, -100.0, -300.0]
    • reward = _REWARD_BY_RANKING[ranking] + (score - 30000) // 100