Clean batch normalization should use batch mean and var for training #3
Yeah, that's how batch normalization is usually done, but in the Theano code published along with the paper I didn't see how training and testing are distinguished. Here the update is done if it's on the clean path. How different is the accuracy when this change is made?
The accuracy improves from 99.29% to 99.41% by using the batch mean and batch var during training. Those are just single runs, but the lower one is pretty far outside the error bounds of the paper's results. I don't think the update of the moving averages is the problem. I think the problem comes from always returning the normalization based on the running averages when the update code is called. These lines:
The pseudocode on page 5 of the paper also looks to me like the _batch_ mean and var are used to normalize in the decoding step. It is probably enough just to make sure
I have updated the code. I'm testing it, but it takes too long on my laptop. It would be great if you could confirm that the code now produces the results presented in the paper.
Even after making the changes to variable initialization, learning rate, and batch norm, the accuracy doesn't improve over 99.29%. @mikowals, did you make any other changes? Also, in the last line of the code you posted above, it's supposed to be
Looking at the code on master now, have you set

I have fixed the error pointed out above and am rerunning the code that previously got 99.41% accuracy to see if it was some sort of accident. I will report back.
I hadn't set the validation set size to 0, but even after making the correction I get almost the same results. I'll verify it again. I found a difference in the update of
The final accuracy was 99.33%. I got that result on 2 training runs. If I put my typo back (

I am lost as to why the clean, labelled path impacts training and why this implementation is not able to match the paper's results.
I wonder if the remaining difference is the implementation of Adam in TensorFlow vs. Blocks. Blocks has different defaults and also an extra decay term that is not available in TensorFlow. The accuracies don't really appear stable after 150 epochs, and for me they bounce in a range between 99.25 and 99.45 from about 50 epochs onwards.
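For reference, the core Adam update rule (Kingma & Ba) is the same everywhere; what differs between frameworks are the default hyperparameters and any extra decay applied on top, which is what the comment above speculates might explain the remaining gap. A plain NumPy sketch of the core rule, with TensorFlow-style defaults shown purely for illustration (the Blocks-specific decay term is not reproduced here):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a single parameter array.

    Defaults here mirror common TensorFlow values; Blocks ships
    different defaults plus an additional decay term, so two
    otherwise-identical trainings can diverge slightly.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

With `t = 1` the bias correction cancels the moment decay exactly, so the first step is roughly `lr * sign(grad)`.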
The reported error rate for the fully labelled setting is 0.608 ± 0.013, which means the 99.41% accuracy you obtained concurs with the results of the paper. When I run the code, the accuracy never goes above 99.29%. The first 99% appears after 70 epochs, and it bounces in a range between 99 and 99.29 after 100 epochs. I wonder what's different between our implementations.
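As a quick arithmetic check on how the reported error rate maps onto an accuracy band:

```python
# Paper reports error 0.608 ± 0.013 (%) for the fully labelled setting.
error, half_width = 0.608, 0.013
lo = 100 - (error + half_width)
hi = 100 - (error - half_width)
print(f"accuracy band: {lo:.3f}% to {hi:.3f}%")  # accuracy band: 99.379% to 99.405%
```

So 99.41% sits at the top of that band, while 99.29% falls well below it.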
So, without the typo, there is no significant difference in accuracy after distinguishing between training and testing? Actually, I didn't notice any separation between training and testing in the original implementation. I think I'll also try out the update method for
Apparently, using a placeholder for a conditional in a TensorFlow graph does not work with a simple
I had that doubt earlier, but when I tried it out it was working. Let me check again.
Yes, a simple
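The failure mode being discussed is general to any deferred-execution graph: a Python `if` runs once, while the graph is being *built*, so it cannot branch on a value that is only fed in later; the branch has to be recorded in the graph itself (which is what `tf.cond` does in TensorFlow). A toy illustration with a hypothetical `Placeholder` class standing in for the real thing:

```python
class Placeholder:
    """Stands in for a value supplied only when the graph runs,
    like a tf.placeholder. It has no value at construction time."""
    def __bool__(self):
        raise TypeError("placeholder has no value at graph-construction time")

def cond(pred_placeholder, true_fn, false_fn):
    """Toy graph-style conditional: record BOTH branches now and
    choose between them only when a value is finally fed in."""
    return lambda fed_value: true_fn() if fed_value else false_fn()

training = Placeholder()

# if training: ...  # a plain Python `if` would raise here: no value yet

branch = cond(training, lambda: "batch stats", lambda: "running averages")
print(branch(True))   # prints "batch stats"
print(branch(False))  # prints "running averages"
```

The real `tf.cond` works on tensors rather than Python callables returning strings, but the principle is the same: the conditional must be part of the graph, not of the Python control flow that builds it.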
I think you are using `noise_std > 0` to separate both the clean vs. corrupted paths and training vs. evaluation. This causes a problem because during evaluation the batch norm mean and var should always be based on the training example averages, while during training batch norm is meant to introduce regularization via noise by using the batch mean and var.

I changed the code so that `update_batch_norm` only ran during training on the clean path and always normalized with the mean and var of the batch. Like this:

I passed a boolean placeholder to the encoder to separate training loops from evaluation loops. Then inside the encoder I used `batch_norm` to normalize by the running averages outside of training steps.

This may still not be completely right, since I was making all examples labeled examples. With this and the variable initialization fix, I trained with 60k labeled examples down to 0.59% error.
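The scheme described above — batch statistics during training (with a side update of the running averages), accumulated running averages at evaluation — can be sketched in NumPy. This is a toy, not the repository's TensorFlow code; the class name, decay constant, and epsilon are all illustrative:

```python
import numpy as np

class BatchNorm:
    """Toy batch norm that keeps training and evaluation distinct."""

    def __init__(self, size, decay=0.99, eps=1e-10):
        self.running_mean = np.zeros(size)
        self.running_var = np.ones(size)
        self.decay, self.eps = decay, eps

    def __call__(self, x, training):
        if training:
            # Training: normalize with THIS batch's statistics ...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ... and fold them into the averages used later at eval time.
            self.running_mean = self.decay * self.running_mean + (1 - self.decay) * mean
            self.running_var = self.decay * self.running_var + (1 - self.decay) * var
        else:
            # Evaluation: use the accumulated training-set averages only.
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)
```

The bug under discussion amounts to taking the `else` branch (running averages) even while training, which removes the batch-noise regularization that batch norm is supposed to provide.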