
Unable to reproduce result for validation set #40

Open
sgdgp opened this issue Jul 27, 2020 · 22 comments


sgdgp commented Jul 27, 2020

Hi,
Thanks for the amazing dataset and for sharing your code.
I am unable to reproduce the results for the seen validation set.
I downloaded the checkpoints you provided and I am using best_seen.pth.
I am getting SR 0.0097 and GC 0.0659, whereas the result on val seen in the paper is SR 0.037 and GC 0.1.

Could you point out anything I might have missed?

To start the X server, I used:
sudo nvidia-xconfig -a --use-display-device=None --virtual=1024x786
sudo /usr/bin/X :0 &
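
As a sanity check that the virtual display is actually usable, I run the following (my own check, not from the repo docs; assumes glxinfo from mesa-utils is installed):

nvidia-smi                                    # an Xorg process should be listed on the GPU
DISPLAY=:0 glxinfo | grep "OpenGL renderer"   # should print the GPU renderer rather than an error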

I see two warnings:
UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details. warnings.warn("Default upsampling behavior when mode={} is changed "

UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead. warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")

The second warning won't affect the results, but I wanted to confirm whether upsampling with align_corners was intended, or whether this warning also appeared on your end and I should simply ignore it.
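
For context, here is a minimal sketch of what I understand the explicit fixes to look like (illustrative only, not the repo's actual code; the nn.Upsample and sigmoid calls are just stand-ins for wherever the model uses them):

import torch
import torch.nn as nn

# Passing align_corners=True explicitly restores the pre-0.4.0 default
# behaviour and silences the first warning (the new default is False).
up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)

x = torch.randn(1, 3, 7, 7)
y = up(x)

# torch.sigmoid is the drop-in replacement for the deprecated
# nn.functional.sigmoid, so the second warning does not change results.
p = torch.sigmoid(y)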

@MohitShridhar
Collaborator

Hi @sgdgp, some users have reported that they had to train the model themselves instead of using the pretrained models. I still haven't figured out the source of this issue, but it seems like only certain users are affected by this. I'll mention it in the FAQ.

You might have better luck with the Docker setup.


sgdgp commented Jul 29, 2020

Thanks @MohitShridhar. Also, the training is done with decoder teacher forcing enabled, right?

@MohitShridhar
Collaborator

Ah no, leave it at the default (False). You can use the settings specified in the training example.
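
For clarity, this is roughly what the flag controls conceptually (an illustrative sketch, not the repo's actual decoder code; the decoder here is just a toy linear step):

import torch
import torch.nn as nn

# Toy decoder step: maps the previous action embedding to logits over actions.
step = nn.Linear(8, 8)
gt_actions = torch.randn(5, 8)        # stand-in for ground-truth action embeddings
teacher_forcing = False               # the default setting

prev = torch.zeros(8)
outputs = []
for t in range(5):
    logits = step(prev)
    outputs.append(logits)
    # With teacher forcing the decoder sees the ground truth at the next step;
    # without it, it conditions on its own previous output.
    prev = gt_actions[t] if teacher_forcing else logits.detach()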


sgdgp commented Jul 29, 2020

Oh I see. Thanks!

@MohitShridhar
Collaborator

Not sure if this is causing the issue, but check that your versions of torch and torchvision are consistent with requirements.txt.
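
For example (assuming a standard pip environment), a quick way to compare against requirements.txt:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
pip freeze | grep -E "^torch(vision)?=="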


sgdgp commented Aug 17, 2020

I am trying with the dockerfile. I will update on the status soon.


IgorDroz commented Oct 6, 2020

@sgdgp Have you managed to reproduce the paper's results after all?

@MohitShridhar How did you choose the model that produces the results in your paper?
I tried both best_seen and best_unseen, and both perform worse.

@MohitShridhar
Collaborator

@IgorDroz we picked the best_seen model.

You can try training the model yourself (from scratch) if the problem persists.

@IgorDroz

@MohitShridhar

I trained from scratch and got (valid_seen):
SR: 12/820 = 0.015
GC: 172/2109 = 0.082
PLW SR: 0.011
PLW GC: 0.072

while in the paper you achieved:
SR: 0.032
GC: 0.1
PLW SR: 0.021
PLW GC: 0.07

The only difference between my run and yours is the initialization, but you still got 2x better results in SR.

Additionally, I wanted to ask you about testing. Is it only done via leaderboard submission, or, since the challenge has finished, will you be able to release code along with the actual ground truth of the test set?

Thanks,
Igor


MohitShridhar commented Oct 24, 2020

> The only difference between my run and yours is the initialization, but you still got 2x better results in SR.

Sorry, what's the initialization difference? And also, is this inside a Docker container?

> or, since the challenge has finished, will you be able to release code along with the actual ground truth of the test set?

No. The leaderboard is a perpetual benchmark for ALFRED. As with any benchmark in the community, the test set will remain a secret to prevent cheating/overfitting. To evaluate on the test set, use the leaderboard submission.

@IgorDroz

@MohitShridhar The initialization of the neural net, i.e. the initial weights. And no, it is not inside a Docker container.

@MohitShridhar
Collaborator

@IgorDroz can you report your torch and torchvision versions along with CUDA and GPU specs? Also, which resnet checkpoint are you using from torchvision?

@IgorDroz

@MohitShridhar
torch==1.1.0
torchvision==0.3.0
CUDA version: 11.1
GPU: Tesla K80
NVIDIA driver version: 455.23.05

How can I check the resnet checkpoint?

@MohitShridhar
Collaborator

@IgorDroz, it's usually inside $HOME/.cache/torch/checkpoints/. I am using resnet34-333f7ec4.pth.
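
For reference, one way to check which checkpoint torchvision is actually loading (a sketch; assumes the default cache location and a pretrained torchvision model):

ls $HOME/.cache/torch/checkpoints/
# the cached filename identifies which weights file is in use

python -c "import torchvision.models as models; models.resnet18(pretrained=True)"
# downloads the checkpoint into the cache on first use, otherwise reuses it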

@IgorDroz

> @IgorDroz, it's usually inside $HOME/.cache/torch/checkpoints/. I am using resnet34-333f7ec4.pth.

@MohitShridhar Sorry for the late answer. That is probably the difference then: I use resnet18-5c106cde.pth. Now it makes sense, thanks!

@MohitShridhar
Collaborator

Oops, sorry. I just checked again. I am also using resnet18-5c106cde.pth, so it's probably not the issue.

The next thing to try would be to run this inside Docker to make sure the setup is exactly the same.


IgorDroz commented Jan 18, 2021

@MohitShridhar Hi again,

Just saw your answer. I am still not able to reproduce your results. Docker shouldn't really matter, since the environment is the same, so I should be able to get results similar to yours...

A recap of what I tried and what I got:

  1. I used your pre-trained model (https://github.com/askforalfred/alfred/tree/master/models#pre-trained-model) and ran evaluation.
    The results are:
    SR: 8/820 = 0.01
    GC: 143/2109 = 0.068
    PLW SR: 0.003
    PLW GC: 0.038

Which results did you achieve with this model? They are pretty far from what you reported in the paper:
SR: 0.032
GC: 0.1
PLW SR: 0.021
PLW GC: 0.07

  2. I also trained from scratch and got:
    SR: 8/820 = 0.01
    GC: 143/2109 = 0.068
    PLW SR: 0.007
    PLW GC: 0.049
    (which is quite similar to the results I got using your pretrained model)

This time I used a P100 GPU like you, yet the results are different.
How can that be? I will attach my packages:

ai2thor==2.1.0
cached-property==1.5.2
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
cycler==0.10.0
decorator==4.4.2
Flask==1.1.2
h5py==3.1.0
idna==2.10
itsdangerous==1.1.0
Jinja2==2.11.2
kiwisolver==1.3.1
MarkupSafe==1.1.1
matplotlib==3.3.3
networkx==2.5
numpy==1.19.5
opencv-python==4.5.1.48
pandas==1.2.0
Pillow==8.1.0
progressbar2==3.53.1
protobuf==3.14.0
pyparsing==2.4.7
python-dateutil==2.8.1
python-utils==2.4.0
pytz==2020.5
PyYAML==5.3.1
requests==2.25.1
revtok==0.0.3
six==1.15.0
tensorboardX==1.8
torch==1.1.0
torchvision==0.3.0
tqdm==4.56.0
urllib3==1.26.2
vocab==0.0.5
Werkzeug==1.0.1

@MohitShridhar
Collaborator

@IgorDroz Docker is a way to ensure that the setup is completely identical (CUDA, torch, torchvision, etc.).

Check out this work, and their reproduced results. Their models are also substantially better than the baselines reported in the ALFRED paper.

I am not sure what else could be causing this issue. Sorry.

@IgorDroz

@MohitShridhar I will definitely check their work out, thanks!
I noticed that there is another work with even better results on the leaderboard; do you have their paper by any chance?

@MohitShridhar
Collaborator

@IgorDroz I don't think the leaderboard topper has made their paper/code publicly available. It's probably a recent submission (or to be submitted), so you'd have to wait for the anonymity period to end.

@IgorDroz

@MohitShridhar okay, thanks a lot!


dnandha commented Aug 10, 2021

Cannot reproduce the results either using the pre-trained best_seen model (and resnet18-5c106cde.pth). I'm on torch==1.9.0 (py3.7_cuda10.2_cudnn7.6.5_0), and the results look similar to the ones posted above by other users.

SR: 8/820 = 0.010
GC: 142/2109 = 0.067
PLW SR: 0.003
PLW GC: 0.038

Was anyone able to reproduce the results at all? Just asking.
