MGNet pretraining goes wrong #15

Open
chengzhag opened this issue Sep 24, 2020 · 15 comments

@chengzhag

Hi Yinyu:

I tried to pretrain MGNet with python main.py configs/mgnet.yaml --mode train and test it with python main.py configs/mgnet.yaml --mode test.

However, over 50 epochs of training, the learning rate quickly dropped to a seemingly unreasonable 1e-08, and the best chamfer_loss has been stuck at 5.67 since the 6th epoch.
log.txt

Also, the test results of the best checkpoint look like this:
log.txt

Is there anything I missed?
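
For readers hitting the same symptom: a learning rate collapsing to 1e-08 is what a reduce-on-plateau style scheduler does when the monitored loss stops improving early. The thread does not say which scheduler or settings produced this log, so the following is only a minimal pure-Python sketch of that logic (simplified from schedulers such as PyTorch's `ReduceLROnPlateau`, without threshold or cooldown). The values of `factor`, `patience`, and `min_lr`, and the loss curve itself, are illustrative assumptions.

```python
def simulate_plateau_lr(losses, lr=1e-4, factor=0.1, patience=5, min_lr=1e-8):
    """Return the learning rate used after each epoch.

    The rate is multiplied by `factor` whenever the loss has failed to
    improve for more than `patience` consecutive epochs, and never goes
    below `min_lr`. All default values are hypothetical.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        history.append(lr)
    return history


# Illustrative loss curve: improves for 6 epochs, then stalls for 44 more.
losses = [9.0, 8.0, 7.0, 6.5, 6.0, 5.67] + [5.7] * 44
history = simulate_plateau_lr(losses)
print(history[0], history[-1])
```

Under these assumed settings the rate is cut every 6 stalled epochs and reaches the floor well before epoch 50, so the low rate is a consequence of the stalled loss rather than a separate fault.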

@yinyunie
Collaborator

Hi,

I think you should follow our paper and train it in stages, as TMNet (referenced in our work) does. A general strategy is to first train the AtlasNet part (by setting tmn_subnetworks=1 in mgnet.yaml). After it converges, load the weights (by setting the weight path in mgnet.yaml) and keep them fixed while training the second stage.

@chengzhag
Author

I have trained with tmn_subnetworks set to 1. However, when I tried to load the weights and fix them to train the second stage, I couldn't find an option to fix the loaded weights.

The option 'train.freeze' seems to control which submodules are fixed, but it can't be used to fix the weights of the first stage.
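
As general background on fixing loaded weights in PyTorch (not a description of this repository's 'train.freeze' option), the usual approach is to set `requires_grad = False` on the parameters to keep and pass only the remaining ones to the optimizer. In the sketch below, `TwoStageDecoder`, `stage1`, `stage2`, and `freeze_module` are hypothetical stand-ins, not names from the MGNet code.

```python
import torch
from torch import nn


class TwoStageDecoder(nn.Module):
    """Hypothetical stand-in for a model with two sequential stages."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(8, 8)
        self.stage2 = nn.Linear(8, 3)

    def forward(self, x):
        return self.stage2(torch.relu(self.stage1(x)))


def freeze_module(module):
    """Stop gradient updates for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()  # also fixes dropout/batch-norm behaviour, if present


model = TwoStageDecoder()
freeze_module(model.stage1)

# Only the parameters that still require gradients go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

before = model.stage1.weight.clone()
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
optimizer.step()
```

After the step, `model.stage1.weight` is unchanged and has no gradient, while `stage2` has been updated.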

@chengzhag
Author

I noticed that, apart from the differences in optimizer settings (learning rate 1e-4 vs 1e-3, different scheduler) between the code and the paper, the batch_size settings differ too (2 vs 32). Can I follow the README to reproduce the results of the paper, or do I need other modifications not mentioned in the README?

@chengzhag
Author

Also, with tmn_subnetworks set to 1, the training and testing losses look like this:
Screenshot 2020-09-27 181821
Screenshot 2020-09-27 181849

It looks like there is something wrong with the edge, face, and boundary losses.

@yinyunie
Collaborator

Hi,

Boundary loss only applies to points on open boundaries, which appear in the second stage (tmn_subnetworks=2), so it will be 0 when tmn_subnetworks=1. The first stage performs shape deformation and the second stage performs topology modification.

Edge loss is a regularization term to penalize extra-long edges. It will not change much during training.

Face loss is to classify whether a point on edges/faces should be removed.
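
To make the edge term concrete, here is a minimal sketch of one common form of edge regularizer: the mean squared edge length over a mesh. Whether MGNet uses exactly this form or weighting is an assumption; the function name `edge_length_loss` and the toy triangle are illustrative only.

```python
def edge_length_loss(vertices, edges):
    """Mean squared length of the given edges.

    vertices: list of (x, y, z) tuples
    edges: list of (i, j) index pairs into `vertices`
    """
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(vertices[i], vertices[j]))
    return total / len(edges)


# Toy example: a right triangle with two unit legs.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 0)]
loss = edge_length_loss(vertices, edges)
print(loss)
```

Because the term depends only on edge lengths, which change little once the overall shape has settled, it is consistent with the observation that this loss stays nearly flat during training.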

We will update our README with more detail after our deadline. Here is our training strategy; you can also follow the strategy in this work:

We first set 'tmn_subnetworks=1' and turn off the edge classifier by setting 'with_edge_classifier=False' in config.yaml for training (this is equivalent to AtlasNet). After it converges, set 'with_edge_classifier=True' to train the edge classifier in the first stage. These are the first-stage modules.

After that, we fix the above modules and train the second-stage decoder using this function. You can add the line
self.mesh_reconstruction.module.freeze_by_stage(2, ['decoder'])
at this place, and remember to set 'with_edge_classifier=True' and 'tmn_subnetworks=2'.
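
The strategy above can be summarized as a three-step schedule. The sketch below only records the settings named in this comment ('tmn_subnetworks', 'with_edge_classifier', and whether the first-stage modules are fixed); the `StageConfig` class, its field `freeze_first_stage`, the `init_from` chaining, and the step labels are hypothetical bookkeeping, not part of the repository's config format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    name: str                        # hypothetical label for the run
    tmn_subnetworks: int             # option named in the thread
    with_edge_classifier: bool       # option named in the thread
    freeze_first_stage: bool = False # hypothetical flag for the freeze step
    init_from: Optional[str] = None  # assumed: start from the previous step's weights


SCHEDULE = [
    # Step 1: deformation only, edge classifier off (equivalent to AtlasNet).
    StageConfig("step1", tmn_subnetworks=1, with_edge_classifier=False),
    # Step 2: same stage, edge classifier turned on.
    StageConfig("step2", tmn_subnetworks=1, with_edge_classifier=True,
                init_from="step1"),
    # Step 3: first-stage modules fixed, second-stage decoder trained.
    StageConfig("step3", tmn_subnetworks=2, with_edge_classifier=True,
                freeze_first_stage=True, init_from="step2"),
]


def check_schedule(schedule):
    """Verify that each step starts from the weights of the one before it."""
    for previous, current in zip(schedule, schedule[1:]):
        if current.init_from != previous.name:
            raise ValueError(f"{current.name} should start from {previous.name}")
    return True


check_schedule(SCHEDULE)
for step in SCHEDULE:
    print(step)
```

Each step is run to convergence before the next one begins, and the weight path in the config is pointed at the previous step's checkpoint.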

@chengzhag
Author

Thanks a lot for your patience and detailed explanation! I'll try the steps and refer to the work.

@chengzhag
Author

Hi Yinyu:
I followed the three steps (MGN1, MGN2, MGN3) and got the following results:
image
It seems that the third step didn't improve the chamfer loss at all. What did I do wrong?

The test Avg_Chamfer of stage 3 is 9.70, not as good as the 8.36 in your paper or the 8.14 of the downloaded MGNet checkpoint.

Another question: is the Avg_Chamfer reported by your test code the same metric as in your paper? The paper mentions that an ICP algorithm is applied to the output, but this is not in the code.
image
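
For context on why two Chamfer numbers may not be directly comparable, here is a minimal sketch of a symmetric Chamfer distance between two point sets. Conventions vary between codebases (squared vs unsquared distances, sum vs mean, extra scale factors, alignment such as ICP beforehand), so this is an assumed generic definition and not necessarily what the test code's Avg_Chamfer or the paper computes.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def directed_chamfer(src, dst):
    """Mean squared distance from each point in src to its nearest point in dst."""
    return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: sum of the two directed terms."""
    return directed_chamfer(a, b) + directed_chamfer(b, a)


# Toy point sets that differ in a single coordinate.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
cd = chamfer_distance(pred, gt)
print(cd)
```

Any rigid alignment applied before this computation changes the nearest-neighbour pairs, so a score measured with alignment and one measured without it are different quantities.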

@chengzhag
Author

I tried another run. It looks like the learning rate in the first step happened to start decreasing 30 epochs later, which resulted in a better chamfer score after the first step:
image

However, the test chamfer becomes worse after the third step, which is strange. The best chamfer I got is 9.07, after the second step of training. This is still not very close to your results.

May I get some more tips about the training process? Is there something wrong with my procedure?

@WenM1222

@pidan1231239 Sorry for asking an unrelated question. How did you visualize the training process? Is this implemented in the source code?

@chengzhag
Author

I used Weights & Biases and added a few lines of code.

@WenM1222

Thank you for your fast reply! I will also give it a try

@chengzhag
Author

No problem!

@Wi-sc

Wi-sc commented Dec 2, 2020

@pidan1231239 Hi, have you reproduced the results reported in the paper? I'm also trying to do it but only got 0.103016 (average chamfer distance).

@chengzhag
Author

The downloaded checkpoint achieves a Chamfer loss of 0.008187. This is before ICP alignment, so the test code is probably not exactly what was used for the final evaluation.

In my best try, the loss went down to 0.01028 with the batch size changed to 32 as in the paper. However, because of memory limitations, I used two GPUs in the second stage and one in the others. I don't know if there is something I missed.

@Cindy0725

> Noticed that apart from the difference of optimizer settings (learning rate 1e-4 vs 1e-3, different scheduler) between the code and the paper, the batch_size settings are different too (2 vs 32). Can I go through README to reproduce the results of the paper or do I need another modification not mentioned in README?

Hi, the author didn't reply to this question. I am also curious whether we should follow the batch size and learning rate in the paper or those in this GitHub repository. The batch size, learning rate, and number of epochs are all different.
