
The value of each transformer block in the inference phase is particularly large #109

lgs00 opened this issue Dec 3, 2024 · 3 comments
lgs00 commented Dec 3, 2024

When I run inference with CogVideoX-1.5-T2V, I found something interesting: after processing by self.norm1 (a CogVideoXLayerNormZero layer), the values of norm_hidden_states are particularly large. What is the reason? This happens at this line of diffusers: https://github.com/huggingface/diffusers/blob/30f2e9bd202c89bb3862c8ada470d0d1ac8ee0e5/src/diffusers/models/transformers/cogvideox_transformer_3d.py#L127
norm_hidden_states, norm_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(hidden_states, encoder_hidden_states, temb)
[screenshot: norm_hidden_states values after norm1]

a-r-r-o-w (Owner) commented Dec 3, 2024

The actual normalization happens in an internal layer, where the values are small; they are then scaled and shifted (adaptive layer norm):

https://github.com/huggingface/diffusers/blob/fc72e0f2616ff993733eaa0310f0253646e0c525/src/diffusers/models/normalization.py#L483

This comes from the original implementation and is not particular to any changes in the diffusers implementation, so it is expected.
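The mechanism described above can be illustrated with a minimal numpy sketch. This is not the diffusers CogVideoXLayerNormZero code; the function names, shapes, and the fixed scale/shift values are illustrative assumptions. The point is that the normalization itself produces small values, but the conditioning-derived scale and shift applied afterwards can make the returned tensor arbitrarily large.

```python
# Minimal sketch of adaptive layer norm (AdaLN). Hypothetical names and
# shapes; in CogVideoX the scale/shift come from the timestep embedding,
# here they are just constants for illustration.
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain normalization: per-token zero mean and unit variance,
    # so values here are small.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_layer_norm(x, scale, shift):
    # Adaptive layer norm: normalize first, then modulate with a
    # conditioning-derived scale and shift.
    normed = layer_norm(x)
    return normed * (1 + scale) + shift

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64)) * 50   # pre-norm activations (tokens x dim)

normed = layer_norm(x)              # small, roughly unit-variance values
# A large conditioning-derived scale blows the output back up,
# which is what shows up when you print norm_hidden_states.
out = ada_layer_norm(x, scale=30.0, shift=5.0)
```

So inspecting the returned norm_hidden_states shows the post-modulation values, not the small internal layer-norm output; the magnitude depends on the learned projection of the conditioning embedding.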

lgs00 (Author) commented Dec 3, 2024


Thank you for your answer, but when I run inference with CogVideoX-1.0-5B, norm_hidden_states looks normal. I wonder why the values grow so much in 1.5, while the result in 1.0 is more consistent with our expectations. I am curious what causes this change; norm_hidden_states should not normally exceed 100.
The result for 1.0 is as follows:
[screenshot: norm_hidden_states values for CogVideoX-1.0-5B]

a-r-r-o-w (Owner) commented Dec 3, 2024
I think this question is better suited to the original CogVideo repo, because the behavior probably stems from training, and the model authors might be able to provide a better answer.
