Confusion over shape of returns_to_go in get_batch #38

Open
DaveyBiggers opened this issue Mar 14, 2022 · 0 comments
Hi, I'm trying to understand the following code in gym/experiment.py/get_batch():

rtg.append(discount_cumsum(traj['rewards'][si:], gamma=1.)[:s[-1].shape[1] + 1].reshape(1, -1, 1))
if rtg[-1].shape[1] <= s[-1].shape[1]:
    rtg[-1] = np.concatenate([rtg[-1], np.zeros((1, 1, 1))], axis=1)
...
tlen = s[-1].shape[1]

(from https://github.com/kzl/decision-transformer/blob/master/gym/experiment.py)

As far as I can understand it, it's creating a sequence of (tlen + 1) rtg values, then checking whether that sequence came out with length <= tlen, and padding it with an extra zero if so. (I'm struggling to see how that situation would ever arise.)
A few lines later, the general padding code pre-pads everything with zeros to length max_len, except for rtg, which ends up with length max_len + 1.
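
To make the shape bookkeeping concrete, here is a minimal sketch of the two cases (assuming discount_cumsum just computes the cumulative sum of future rewards; the trajectory length, max_len and si values are made up):

import numpy as np

def discount_cumsum(x, gamma):
    # returns-to-go: out[t] = sum over k >= 0 of gamma**k * x[t + k]
    out = np.zeros_like(x)
    out[-1] = x[-1]
    for t in reversed(range(x.shape[0] - 1)):
        out[t] = x[t] + gamma * out[t + 1]
    return out

rewards = np.ones(10, dtype=np.float32)   # toy trajectory of length 10
max_len = 4

# window well inside the trajectory: the slice yields tlen + 1 values
si = 2
tlen = min(si + max_len, len(rewards)) - si            # 4, analogous to s[-1].shape[1]
rtg_seq = discount_cumsum(rewards[si:], gamma=1.)[:tlen + 1]
print(rtg_seq.shape)                                   # (5,) -> longer than tlen, no extra value appended

# window ending exactly at the last timestep: only tlen values are available
si = 6
tlen = min(si + max_len, len(rewards)) - si            # 4
rtg_seq = discount_cumsum(rewards[si:], gamma=1.)[:tlen + 1]
print(rtg_seq.shape)                                   # (4,) == tlen -> the `<=` branch appends a zero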

I don't understand the purpose of this extra value, especially since it seems to get stripped anyway by the SequenceTrainer:

state_preds, action_preds, reward_preds = self.model.forward(
    states, actions, rewards, rtg[:,:-1], timesteps, attention_mask=attention_mask,
)
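
(For concreteness, a toy illustration of what that slice does to the shapes; B and K below are placeholder batch size and context length, not names from the repo:)

import numpy as np

B, K = 2, 20                    # hypothetical batch size and context length
rtg = np.zeros((B, K + 1, 1))   # returns-to-go come out of get_batch with one extra step
print(rtg[:, :-1].shape)        # (2, 20, 1) -- the trailing value is dropped before forward()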

Am I missing something?
Thanks!
