
Unable to reproduce result for validation set #40

Open
sgdgp opened this issue Jul 27, 2020 · 22 comments


sgdgp commented Jul 27, 2020

Hi,
Thanks for the amazing dataset and for sharing your code.
I am unable to reproduce the results for the seen validation set.
I downloaded the checkpoints you provided and I am using best_seen.pth.
I am getting SR 0.0097 and GC 0.0659, whereas the result on val seen in the paper is SR 0.037 and GC 0.1.

Could you point out anything I might have missed?

To start the X server, I used:
sudo nvidia-xconfig -a --use-display-device=None --virtual=1024x786
sudo /usr/bin/X :0 &
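
As a sanity check that the virtual display is actually usable, I run the following (my own check, not from the repo docs; assumes glxinfo from mesa-utils is installed):

nvidia-smi                                    # an Xorg process should be listed on the GPU
DISPLAY=:0 glxinfo | grep "OpenGL renderer"   # should print the GPU renderer rather than an error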

I see two warnings:
UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details. warnings.warn("Default upsampling behavior when mode={} is changed "

UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead. warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")

The second warning won't affect the results, but I wanted to confirm whether upsampling with align_corners was intended, or whether this warning also appeared on your end and I should simply ignore it.
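
For context, here is a minimal sketch of what I understand the explicit fixes to look like (illustrative only, not the repo's actual code; the nn.Upsample and sigmoid calls are just stand-ins for wherever the model uses them):

import torch
import torch.nn as nn

# Passing align_corners=True explicitly restores the pre-0.4.0 default
# behaviour and silences the first warning (the new default is False).
up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

x = torch.randn(1, 3, 7, 7)
y = up(x)

# torch.sigmoid is the drop-in replacement for the deprecated
# nn.functional.sigmoid, so the second warning does not change results.
p = torch.sigmoid(y)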

@MohitShridhar
Collaborator

Hi @sgdgp, some users have reported that they had to train the model themselves instead of using the pretrained models. I still haven't figured out the source of this issue, but it seems like only certain users are affected by this. I'll mention it in the FAQ.

You might have better luck with the Docker setup.


sgdgp commented Jul 29, 2020

Thanks @MohitShridhar. Also, the training is done with decoder teacher forcing enabled, right?

@MohitShridhar
Collaborator

Ah no, leave it at the default (False). You can use the settings specified in the training example.
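
For clarity, this is roughly what the flag controls conceptually (an illustrative sketch, not the repo's actual decoder code; the decoder here is just a toy linear step):

import torch
import torch.nn as nn

# Toy decoder step: maps the previous action embedding to logits over actions.
step = nn.Linear(8, 8)
gt_actions = torch.randn(5, 8)        # stand-in for ground-truth action embeddings
teacher_forcing = False               # the default setting

prev = torch.zeros(8)
outputs = []
for t in range(5):
    logits = step(prev)
    outputs.append(logits)
    # With teacher forcing the decoder sees the ground truth at the next step;
    # without it, it conditions on its own previous output.
    prev = gt_actions[t] if teacher_forcing else logits.detach()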


sgdgp commented Jul 29, 2020

Oh I see. Thanks!

@MohitShridhar
Collaborator

Not sure if this is causing the issue, but check that your versions of torch and torchvision are consistent with requirements.txt.
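
For example (assuming a standard pip environment), a quick way to compare against requirements.txt:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
pip freeze | grep -E "^torch(vision)?=="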


sgdgp commented Aug 17, 2020

I am trying with the dockerfile. I will update on the status soon.


IgorDroz commented Oct 6, 2020

@sgdgp Have you managed to reproduce the paper's results after all?

@MohitShridhar How did you choose the model that produces the results in your paper?
I tried both best_seen and best_unseen, and both perform worse.

@MohitShridhar
Collaborator

@IgorDroz we picked the best_seen model.

You can try training the model yourself (from scratch) if the problem persists.

@IgorDroz

@MohitShridhar

I trained from scratch and got (valid_seen):
SR: 12/820 = 0.015
GC: 172/2109 = 0.082
PLW SR: 0.011
PLW GC: 0.072

while in the paper you achieved:
SR: 0.032
GC: 0.1
PLW SR: 0.021
PLW GC: 0.07

The only difference between my run and yours is the initialization, but you still got 2x better results in SR.

Additionally, I wanted to ask you about testing. Is it only done via leaderboard submission, or, since the challenge has finished, will you be able to release code along with the actual ground truth of the test set?

Thanks,
Igor


MohitShridhar commented Oct 24, 2020

> The only difference between my run and yours is the initialization, but you still got 2x better results in SR.

Sorry, what's the initialization difference? And also, is this inside a Docker container?

> or, since the challenge has finished, will you be able to release code along with the actual ground truth of the test set?

No. The leaderboard is a perpetual benchmark for ALFRED. As with any benchmark in the community, the test set will remain a secret to prevent cheating/overfitting. To evaluate on the test set, use the leaderboard submission.

@IgorDroz

@MohitShridhar The initialization of the neural net, i.e. the initial weights. And no, it is not inside a Docker container.

@MohitShridhar
Collaborator

@IgorDroz can you report your torch and torchvision versions along with CUDA and GPU specs? Also, which resnet checkpoint are you using from torchvision?

@IgorDroz

@MohitShridhar
torch==1.1.0
torchvision==0.3.0
CUDA version: 11.1
GPU: Tesla K80
NVIDIA driver version: 455.23.05

How can I check the resnet checkpoint?

@MohitShridhar
Collaborator

@IgorDroz, it's usually inside $HOME/.cache/torch/checkpoints/. I am using resnet34-333f7ec4.pth.
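
For reference, one way to check which checkpoint torchvision is actually loading (a sketch; assumes the default cache location and a pretrained torchvision model):

ls $HOME/.cache/torch/checkpoints/
# the cached filename identifies which weights file is in use

python -c "import torchvision.models as models; models.resnet18(pretrained=True)"
# downloads the checkpoint into the cache on first use, otherwise reuses it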

@IgorDroz

> @IgorDroz, it's usually inside $HOME/.cache/torch/checkpoints/. I am using resnet34-333f7ec4.pth.

@MohitShridhar Sorry for the late answer. That is probably the difference then: I use resnet18-5c106cde.pth. Now it makes sense, thanks!

@MohitShridhar
Collaborator

Oops, sorry. I just checked again. I am also using resnet18-5c106cde.pth, so it's probably not the issue.

The next thing to try would be to run this inside Docker to make sure the setup is exactly the same.


IgorDroz commented Jan 18, 2021

@MohitShridhar Hi again,

Just saw your answer. I am still not able to reproduce your results. Docker shouldn't really matter, since the environment is the same, so I should be able to get results similar to yours...

A recap of what I tried and what I got:

  1. I used your pre-trained model (https://github.com/askforalfred/alfred/tree/master/models#pre-trained-model) and ran evaluation.
    The results are:
    SR: 8/820 = 0.01
    GC: 143/2109 = 0.068
    PLW SR: 0.003
    PLW GC: 0.038

Which results did you achieve with this model? They are pretty far from what you reported in the paper:
SR: 0.032
GC: 0.1
PLW SR: 0.021
PLW GC: 0.07

  2. I also trained from scratch and got:
    SR: 8/820 = 0.01
    GC: 143/2109 = 0.068
    PLW SR: 0.007
    PLW GC: 0.049
    (which is quite similar to the results I got using your pretrained model)

This time I used a P100 GPU like you, yet the results are different.
How can that be? I will attach my packages:

ai2thor==2.1.0
cached-property==1.5.2
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cycler==0.10.0
decorator==4.4.2
Flask==1.1.2
h5py==3.1.0
idna==2.10
itsdangerous==1.1.0
Jinja2==2.11.2
kiwisolver==1.3.1
MarkupSafe==1.1.1
matplotlib==3.3.3
networkx==2.5
numpy==1.19.5
opencv-python==4.5.1.48
pandas==1.2.0
Pillow==8.1.0
progressbar2==3.53.1
protobuf==3.14.0
pyparsing==2.4.7
python-dateutil==2.8.1
python-utils==2.4.0
pytz==2020.5
PyYAML==5.3.1
requests==2.25.1
revtok==0.0.3
six==1.15.0
tensorboardX==1.8
torch==1.1.0
torchvision==0.3.0
tqdm==4.56.0
urllib3==1.26.2
vocab==0.0.5
Werkzeug==1.0.1

@MohitShridhar
Collaborator

@IgorDroz Docker is a way to ensure that the setup is completely identical (CUDA, torch, torchvision, etc.).

Check out this work, and their reproduced results. Their models are also substantially better than the baselines reported in the ALFRED paper.

I am not sure what else could be causing this issue. Sorry.

@IgorDroz

@MohitShridhar I will definitely check their work out, thanks!
I noticed that there is another work with even better results on the leaderboard; do you have their paper by any chance?

@MohitShridhar
Collaborator

@IgorDroz I don't think the leaderboard topper has made their paper/code publicly available. It's probably a recent submission (or to be submitted), so you'd have to wait for the anonymity period to end.

@IgorDroz

@MohitShridhar okay, thanks a lot!


dnandha commented Aug 10, 2021

Cannot reproduce the results either using the pre-trained best_seen model (and resnet18-5c106cde.pth). I'm on torch==1.9.0 (py3.7_cuda10.2_cudnn7.6.5_0), and the results look similar to the ones posted above by other users.

SR: 8/820 = 0.010
GC: 142/2109 = 0.067
PLW SR: 0.003
PLW GC: 0.038

Was anyone able to reproduce the results at all? Just asking.
