-
We have an internal issue tracker, and the oldest issue I opened on it that is still open (before we even open-sourced refiners) is: "Find a way to avoid changing the state dict when we add blocks without weights" :) So yes, this is something we have been discussing a lot internally. We don't really have a perfect solution for this right now, but there are always workarounds.
But if you have ideas, that would be great! Ideally I'd like the solution to also work when we e.g. insert a chain in a model for clarity / easier targeting, and to keep somewhat semantic keys (i.e. not just ordered keys named 0001, 0002, etc., which works but has other issues).
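To make the issue concrete, here is a minimal sketch, assuming refiners' `fl.Chain` naming scheme where children are keyed by class name (with `_1`, `_2` suffixes for duplicates); exact key names may vary across versions:

```python
import refiners.fluxion.layers as fl

flat = fl.Chain(fl.Linear(8, 8), fl.Linear(8, 8))
print(list(flat.state_dict().keys()))
# ['Linear_1.weight', 'Linear_1.bias', 'Linear_2.weight', 'Linear_2.bias']

# Grouping the same two layers under a sub-chain, purely for clarity or
# easier targeting, adds no weights, yet every key gets a new prefix:
grouped = fl.Chain(fl.Chain(fl.Linear(8, 8), fl.Linear(8, 8)))
print(list(grouped.state_dict().keys()))
# ['Chain.Linear_1.weight', 'Chain.Linear_1.bias', ...]
```

A checkpoint saved from `flat` then fails to load into `grouped`, even though the tensors are identical.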
-
The big fundamental question is: should the keys of the `state_dict` be human-readable?
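For contrast, a small sketch of the two options (torch's ordinal keys vs refiners' semantic keys; the exact formats are assumptions and may differ by version):

```python
import torch.nn as nn
import refiners.fluxion.layers as fl

# Ordinal keys (torch Sequential): compact but opaque; they still shift
# if you insert a module earlier in the list.
print(list(nn.Sequential(nn.Linear(4, 4), nn.ReLU()).state_dict().keys()))
# ['0.weight', '0.bias']

# Semantic keys (refiners Chain): readable and targetable, but tied to
# the shape of the module tree, so structural refactors shift them.
print(list(fl.Chain(fl.Linear(4, 4), fl.ReLU()).state_dict().keys()))
# ['Linear.weight', 'Linear.bias']
```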
-
Hello refiners,

I'm experimenting with the trainer, and in particular I'm facing a problem loading/saving model weights.

The sequence of the trainer is:

1. `trainer.prepare_models` loads the checkpoint on a non-injected model
2. `on_train_begin` injects the `dropout_adapter`
3. `on_checkpoint_save` saves the checkpoint (using `model.state_dict()`)

The names of the Dropout-impacted layers are changed in step 2. As a result, the model saved in `on_checkpoint_save` is not compatible with the loading in `trainer.prepare_models`, and I cannot smoothly save/load the model.

**Toy example**
The injection of the dropout adapter changes the keys of the weights in `state_dict()`, as the sketch below illustrates.
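A minimal sketch of the effect, with `fl.Identity` standing in for the weightless dropout layer (the exact key names are an assumption about Chain's naming scheme):

```python
import refiners.fluxion.layers as fl

model = fl.Chain(fl.Linear(8, 8))
print(list(model.state_dict().keys()))
# before injection: ['Linear.weight', 'Linear.bias']

# The injection wraps the target Linear inside a new Chain node:
linear = model.ensure_find(fl.Linear)
model.replace(linear, fl.Chain(linear, fl.Identity()))
print(list(model.state_dict().keys()))
# after injection: ['Chain.Linear.weight', 'Chain.Linear.bias']
```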
What I'm not clear about is the target behavior:

A. should `.inject(parent)` change the names of the weights, and we should fix the save/load sequence in the trainer?
B. should `.inject(parent)` not change the names of the weights in `state_dict()` when the adapter is not injecting new weights?
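If B is not desirable, one workaround I could imagine for the current behavior is remapping keys at load time. A hypothetical sketch (the `Chain.` prefix is an assumption about how the adapter nests its target):

```python
import torch

def strip_adapter_prefix(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Drop the extra "Chain." level that the weightless adapter introduced,
    # so a checkpoint saved after injection loads into a non-injected model.
    return {key.removeprefix("Chain."): value for key, value in state_dict.items()}
```

Alternatively, the trainer could `eject()` the adapter before saving and re-`inject()` it afterwards, so the checkpoint keeps the pre-injection keys.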
I can help on this if needed.