
Adds constant memory backprop #2

Open
wants to merge 15 commits into master
Conversation

@anshuln (Collaborator) commented Jul 18, 2019

The architecture is changed a bit, adding a new abstract class for layers to inherit from. Some changes are also made to the Generator class. Unit tests for the gradient computations are added, as well as a function to build a data generator from images.
However, there seems to be a small bug in the gradient computations involving the AffineCoupling layer, which requires some math to be figured out.

@anshuln anshuln requested a review from AlexanderMath July 18, 2019 13:49
@AlexanderMath (Owner)
Great job, the code looks very promising. Unfortunately I'm a bit sleep deprived (so I'll focus on smaller refactoring issues today), and I'm busy tomorrow, but I'll probably have all Sunday to merge the code. I'll try to summarize the changes below to make sure I understand them. Please read my summary and highlight any potential misunderstandings on my behalf.

Overall. There are changes in five files. The main merging work will be with respect to the files containing Generator and Layers, since the code for data loader and unit tests doesn't really interact with previous code.

Generator class: Previously, the Generator inherited the functions train_on_batch, fit and fit_generator from Sequential, which inherited them from Model. The new fit loops through the data and calls train_on_batch, which has the O(1) memory backprop.
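To check my understanding, the training loop would look roughly like this (a minimal sketch; model.loss, the argument order of compute_gradients, and where the latent-space loss and the per-layer log-det terms enter are my assumptions, not the PR's exact code):

```python
import tensorflow as tf

def train_on_batch_constant_memory(model, X, optimizer):
    # Forward pass without recording activations on a tape.
    Z = model.call(X)

    # Gradient of the latent-space part of the loss w.r.t. the final output;
    # the log-det terms are assumed to be handled per layer below.
    with tf.GradientTape() as tape:
        tape.watch(Z)
        loss = model.loss(Z)
    dy = tape.gradient(loss, Z)

    grads, variables = [], []
    y = Z
    # Walk backwards through the layers, reconstructing each input with
    # call_inv instead of storing it, so memory stays O(1) in depth.
    for layer in reversed(model.layers):
        x = layer.call_inv(y)
        dy, layer_grads = layer.compute_gradients(x, dy)
        grads += list(layer_grads)
        variables += layer.trainable_variables
        y = x

    optimizer.apply_gradients(zip(grads, variables))
    return loss
```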

Merging these changes seems fairly straightforward, since the previous Generator class had no functionality.

I understood all the TODOs with one exception. On line 232, at dy, grads = layer.compute_gradients(...), there is a TODO stating 'implementing scaling here'. What does this mean? Line 32 similarly says "TODO: imlpement scaling here -- DONE". Is this TODO already handled?

I really appreciated all the other TODOs; they were very easy to read and understand, and I believe I'll be able to implement all of them.

Layers: There is a new virtual class called LayerWithGrads which all layers should inherit from. It implements compute_gradients, which most layers inherit, except AdditiveCoupling and AffineCoupling. It forces all children to implement call and call_inv. I believe it makes sense to also force the children to implement log_det, what do you think?
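Roughly, I picture the base class like this (a sketch of the interface as I understand it, not a copy of the PR's code; the method signatures are my guess):

```python
import tensorflow as tf

class LayerWithGrads(tf.keras.layers.Layer):
    """Base class for invertible layers used by the constant-memory backprop."""

    def call(self, X):
        raise NotImplementedError("Forward pass must be implemented by the child layer.")

    def call_inv(self, Z):
        raise NotImplementedError("Inverse pass must be implemented by the child layer.")

    def log_det(self):
        raise NotImplementedError("Must return a tensor so it can take part in backprop.")

    def compute_gradients(self, x, dy):
        """Default gradient rule; the coupling layers would override this."""
        with tf.GradientTape() as tape:
            tape.watch(x)
            y = self.call(x)
        # Gradient w.r.t. the input (passed on to the previous layer) and the weights.
        grads = tape.gradient(y, [x] + self.trainable_variables, output_gradients=dy)
        return grads[0], grads[1:]
```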

I imagine the biggest difficulty is with respect to the Coupling Layers, especially getting them to work with the multi-scale architecture simultaneously. I will probably be able to get everything except that to work on Sunday.

@AlexanderMath (Owner)
Out of curiosity: did you find any way of benchmarking memory usage on the GPU? The O(sqrt(L)) gradient checkpointing was implemented by OpenAI, and they have a very nice graph showing the memory consumption as the number of residual blocks increases. When I finish all the merging, I'll try to figure out how they measured memory consumption and make a similar plot which also shows the increase in computation time.

[Graph from the gradient-checkpointing repository showing memory consumption as the number of residual blocks increases.]

https://github.com/cybertronai/gradient-checkpointing

@AlexanderMath AlexanderMath self-assigned this Jul 19, 2019
@AlexanderMath AlexanderMath added the enhancement label Jul 19, 2019
@anshuln (Collaborator, Author) commented Jul 19, 2019

Couldn't have summarized it better myself!
I just have one comment regarding the Generator class: the loss functions have been changed to take 0 or 1 arguments, as opposed to 2 arguments in the original code, so some refactoring will be required.
Regarding the TODOs in the layer class, they referred to the fact that log_det is divided by a scaling factor before being added to the loss function. This issue has been handled in the layer.compute_gradients function.
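In other words, the relevant part of compute_gradients looks roughly like this (a sketch; the name scaling and the sign convention are illustrative, not the repo's exact code):

```python
import tensorflow as tf

def compute_gradients(self, x, dy, scaling=1.0):
    # 'scaling' stands for the factor the log-determinant is divided by
    # before it enters the loss (e.g. dimensionality times batch size).
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = self.call(x)
        scaled_log_det = self.log_det() / scaling
    # Backprop through both the forward output and the scaled log-det term.
    grads = tape.gradient(
        [y, scaled_log_det],
        [x] + self.trainable_variables,
        output_gradients=[dy, tf.ones_like(scaled_log_det)])
    return grads[0], grads[1:]
```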
Also, log_det should probably be part of the abstract class, since flow models definitely use it. The only issue is that log_det needs to return a tensor for backprop to work, and we need to somehow enforce that condition.

@AlexanderMath (Owner)
The code is now included on the main branch. I'll leave this open for discussion, and close it when we're happy with the constant-memory backprop code. Below are a few initial thoughts; we probably want to discuss these issues over Skype some day. I'm leaving them here both to open up discussion and as notes for myself on things that need to be taken care of.

(1) In train_on_batch there is a forward pass self.call(X). After measuring memory consumption before and after the call with nvidia-ml-py3 (pip install nvidia-ml-py3), it seems it stores activations; at the very least, the memory consumption afterwards was 1 GB larger than before the call (using a model with 100 ActNorm layers and just 2400 weights). There are a few things I need to understand about @tf.function, eager execution and how this all relates to the graph in TensorFlow 2.0 before I decide how to fix this.
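For reference, the measurement was done roughly like this (pynvml is the module installed by nvidia-ml-py3; model and X stand for whatever model and batch are being tested):

```python
import pynvml  # installed via: pip install nvidia-ml-py3

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

def gpu_mem_used_mb():
    # Total memory currently allocated on the device, in MB.
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024 ** 2

before = gpu_mem_used_mb()
Z = model.call(X)   # the forward pass under investigation
after = gpu_mem_used_mb()
print("Memory increase during call: %.0f MB" % (after - before))
```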

(2) Should we remove compile and use fit to specify the optimizer?
I agree this makes sense from our perspective, but I think it violates the Keras API, and thus believe it is a bad idea. That said, I agree it is a bit messy. The framework will only support loss='maximum_likelihood', so compile doesn't have any responsibility besides the optimizer (metrics are by default set to what one needs anyway). Because of this, it seems nicer to pass the optimizer to fit and remove compile, which is what was done in the pull request.

That said, I currently believe it is better to follow the Keras API so people familiar with it won't be confused when they find there is no compile function.

There is a similar issue with metrics. The Keras API specifies that a metric takes two arguments, y_true and y_pred; however, likelihood computations only use y_pred. In this sense specifying y_true is redundant, but adding this redundancy makes the code follow the Keras API. Performance-wise there is no loss since we are only passing along a pointer, so I think (even though it is annoying from our perspective) it will be nicer from the perspective of users who are used to Keras.
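Concretely, the Keras-compatible signature would look something like this (a sketch, not the exact code in the repo; it assumes y_pred already holds the per-example log-likelihood produced by the model):

```python
import tensorflow as tf

def maximum_likelihood_loss(y_true, y_pred):
    # y_true is ignored; it is only there to satisfy the two-argument
    # Keras loss/metric signature.
    return -tf.reduce_mean(y_pred)
```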

(3) I changed the original fit function to take a parameter memory. If it is set to "linear" it uses the fit method inherited from Model; if set to "constant" it uses our new fit method. At some point I think we also want the O(sqrt(L)) strategy for architectures where the inverse takes a lot of time to compute, e.g. iResNet or Residual Flow. A rough sketch of the dispatch is shown below.
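Roughly, the dispatch looks like this (a sketch; fit_constant_memory is a hypothetical name for the new method, and the sqrt branch is only a placeholder for now):

```python
import tensorflow as tf

class Generator(tf.keras.Sequential):
    def fit(self, X, optimizer=None, memory="linear", **kwargs):
        if memory == "linear":
            # Standard Keras fit inherited from Model: stores all activations.
            return super().fit(X, **kwargs)
        if memory == "constant":
            # The O(1) backprop from this PR: activations are reconstructed
            # with call_inv during the backward pass.
            return self.fit_constant_memory(X, optimizer=optimizer, **kwargs)
        if memory == "sqrt":
            # Planned: O(sqrt(L)) gradient checkpointing for architectures
            # where the inverse is expensive (iResNet, Residual Flow).
            raise NotImplementedError("O(sqrt(L)) checkpointing is not implemented yet.")
        raise ValueError("memory must be 'linear', 'constant' or 'sqrt'")
```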

(4) On line 233 of invtf/generator_const_backprop.py there was a line:
[Screenshot of the line in question, which uses the undefined variable gradientsrads.]
The variable gradientsrads is not defined anywhere; I searched the entire file. I assumed this was a typo and changed it to gradients, which is computed on the line above. Subsequent code runs with no errors; however, if I train the model with fit and change memory from "linear" to "constant", the model obtains a different negative log-likelihood (e.g. 8.3 for constant and 8.1 for linear; constant also sometimes gives a very bad 20-40). After debugging, it seems (when both losses are around 8) that one approach gets a larger determinant loss and a lower latent-space loss, so there might be an issue with scaling, but I have no clear picture of this bug.

(5) TQDM is one of my favorite libraries, but it introduces an unnecessary dependency that users of the library will have to install. Furthermore, Keras has a nice progress bar we can use instead; see the picture below:

[Screenshot of the Keras progress bar during training.]

I have removed TQDM from the fit function and plan to remove it from fit_generator as well.
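The built-in progress bar can be driven directly from our own training loop, e.g. (steps_per_epoch and data_generator are placeholders for the actual training setup):

```python
import tensorflow as tf

progbar = tf.keras.utils.Progbar(target=steps_per_epoch)
for step in range(steps_per_epoch):
    X = next(data_generator)              # one batch at a time
    loss = model.train_on_batch(X)
    progbar.update(step + 1, values=[("loss", float(loss))])
```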

(6) I renamed the LayerWithGrads class to InvLayer; I think this will be easier for users to understand. I also added the log_det function. When testing, I got an issue with compute_gradients if I had a layer with no trainable weights, for example Squeeze or Normalize.
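A possible fix is to handle the empty-variable case explicitly in the base compute_gradients, something like this (a sketch under the same assumed interface as above):

```python
import tensorflow as tf

def compute_gradients(self, x, dy):
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = self.call(x)
    if not self.trainable_variables:
        # Layers like Squeeze or Normalize have no weights: only propagate
        # the gradient with respect to the input.
        return tape.gradient(y, x, output_gradients=dy), []
    grads = tape.gradient(y, [x] + self.trainable_variables, output_gradients=dy)
    return grads[0], grads[1:]
```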

I'll probably dedicate most of tomorrow (Sunday) to working on this. Next week I'll be out of office, kayaking for a week. We also get NeurIPS reviews back this week, so when I return the following week I'll have to answer those ASAP; after that I'll get back to working on this.

@anshuln (Collaborator, Author) commented Jul 20, 2019

if I train the model with fit and change memory from "linear" to "constant" the model obtains different negative log likelihood (e.g. 8.3 for constant and 8.1 for linear, however, constant sometimes gives very bad 20-40).

Does this model have AffineCoupling layers in it? Could you try the gradient unit test with your architecture to check whether the computations are correct?

When testing I got an issue with the compute_gradients if I had a layer with no trainable weights, for example Squeeze or Normalize.

What is the error exactly? The unit tests involving these layers were passing on my system.

After testing memory consumption before and after using tool pip install nvidia-ml-py3, it seems it stores activations; at the very least the memory consumption afterwards was 1 GB larger than before the call (used a model with 100 ActNorm layers with just 2400 weights)

I am not sure what is causing this; it may be that the call function is not doing proper resource management when variables are overwritten.

We should probably have a discussion about these issues tomorrow.
