
CLIPA Torch Inference #579

Closed · wants to merge 2 commits

Conversation

@zw615 (Contributor) commented Jul 26, 2023

Add code for gelu_approximate, ln_pre, pooling_style, global_average_pooling, bert_tokenizer, and no attention_mask. Also add some transform options, along with example model configs and eval scripts.
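For readers skimming the diff, a hypothetical sketch of what one of these model configs could look like, written as a Python dict (the actual configs in this PR are JSON files, and the key names below are illustrative, not necessarily the ones used):

```python
# Illustrative CLIPA-style model config as a Python dict; key names are
# approximations of the options listed above, not the exact schema in this PR.
CLIPA_EXAMPLE_CFG = {
    "embed_dim": 768,
    "vision_cfg": {
        "image_size": 224,
        "layers": 12,
        "width": 768,
        "patch_size": 16,
        "ln_pre": False,                         # optionally drop the pre-transformer LayerNorm
        "pool_style": "global_average_pooling",  # average spatial tokens instead of a class token
        "act_kwargs": {"approximate": "tanh"},   # tanh-approximate GELU
    },
    "text_cfg": {
        "context_length": 32,
        "width": 768,
        "layers": 12,
        "tokenizer": "bert-base-uncased",        # BERT tokenizer instead of the CLIP BPE one
        "no_causal_mask": True,                  # text tower runs without an attention mask
    },
}
```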

@zw615 mentioned this pull request on Sep 22, 2023
@rwightman (Collaborator) commented Oct 4, 2023

@zw615 are there weights and expected eval metrics that go with the models for this change? I don't see any URLs, etc. I have similar changes outstanding for ported SigLIP weights and want to merge them together.

Also, I noticed the support for the big_vision ViT variant; big_vision ViTs (and their LMs) usually have a head with weight & bias, but here we have no bias (due to OpenAI's defaults).

@zw615 (Contributor, Author) commented Oct 4, 2023

@rwightman Sure, there are. For now they are posted in our clipa_torch repo (https://github.com/UCSC-VLAA/CLIPA/tree/master/clipa_torch), but I can merge them into, say, pretrained.py if needed.

I guess the changes needed for SigLIP bear a lot of similarity to this PR; maybe reusing them would be more efficient? As for the bias in both the stem and head of the ViT, we happened to change that in our very early exploration, so weights trained with the CLIPA codebase also have no bias in the ViT stem and head layers. Thanks for the detailed observation! This is another slight difference that should be taken into account when merging these code changes.

@rwightman (Collaborator) commented Oct 4, 2023

@zw615 thanks

I will have to combine these model changes and the SigLIP ones carefully; there is a lot of overlap and a few differences. Some of the naming, etc. in this PR is too 'big vision' specific: while it was clearly influenced by that repo, many of the differences are just classic ViT / LM choices vs. the OpenAI CLIP-specific ones this repo was based on.

It also appears some awkwardness was caused by past bad calls in OpenCLIP (we merged a default global pool impl that included the class token, which doesn't make much sense). Looks like your v1 models used that, but v2 does it the more sensible way where only spatial tokens are included in the GAP?
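For concreteness, a minimal sketch of the two vision pooling behaviours being contrasted (assuming tokens of shape [batch, 1 + H*W, dim] with the class token first; this is not the actual OpenCLIP code):

```python
import torch

def pool_with_cls(tokens: torch.Tensor) -> torch.Tensor:
    # Old OpenCLIP default GAP: average over the class token AND the spatial tokens.
    return tokens.mean(dim=1)

def pool_spatial_only(tokens: torch.Tensor) -> torch.Tensor:
    # v2 / Big Vision style: drop the class token, average only the spatial tokens.
    return tokens[:, 1:].mean(dim=1)

x = torch.randn(2, 1 + 14 * 14, 1024)  # [batch, 1 + H*W, dim]
print(pool_with_cls(x).shape, pool_spatial_only(x).shape)  # both torch.Size([2, 1024])
```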

@zw615 (Contributor, Author) commented Oct 4, 2023

@rwightman Yes, I agree. Most of the changes proposed by this PR are actually inherited from the Big Vision implementation, but there are still a few differences between CLIPA and Big Vision that also need attention (e.g. the bias term in the stem and head). Given that SigLIP is based on a newer version of Big Vision, merging them could be quite tricky. I would say the wise thing to do is to figure out all necessary changes before actually merging.

As for the pooling style, we found subtle differences across various implementations (open_clip, Big Vision, etc.). We didn't pay much attention to this minor detail in v1, but found Big Vision's way of pooling only the spatial tokens more sensible and adopted it in v2. That is why we added those pool_style options to the model config. Also, note that the text encoder in Big Vision uses a different pooling style as well (fetch the last token before projection).

@rwightman (Collaborator) commented Oct 4, 2023

@zw615 yup, all that makes sense, and I've pieced together most of the 'why' now; I have to figure out a nice way of combining it all.

In my SigLIP changes I am using the timm ViT because it's already compatible with the 'classic' ViT in big_vision, with the addition of an attention pooling mechanism (different from the one already in OpenCLIP). However, adding those options to the OpenCLIP ViT for CLIPA support makes sense too.

I already had a pool type arg, and yeah, the text transformer supports last-token pooling as in big_vision, no masking, and a different tokenizer from both CLIPA and OpenCLIP. All of that can be combined sensibly, I feel.

The activation difference (tanh approx) is actually a much less significant difference numerically (when I tested) than the sigmoid approx, but we can maybe incorporate that into a more generic activation config (rather than a specific arg for just the 'approximation'). That way we could switch to other activations in the future without adding yet another cfg.
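A rough sketch of the two options being weighed here, a dedicated 'approximation' flag vs. a generic activation config (the helper names and the activation table are assumptions, not the final API):

```python
import torch.nn as nn

# Option A: a flag that only covers the GELU approximation.
def make_act_from_flag(gelu_approximate: bool = False) -> nn.Module:
    return nn.GELU(approximate="tanh" if gelu_approximate else "none")

# Option B: a generic activation name in the config, so future activations
# don't need yet another cfg field.
_ACT_LAYERS = {
    "gelu": nn.GELU,
    "gelu_tanh": lambda: nn.GELU(approximate="tanh"),
    "relu": nn.ReLU,
    "silu": nn.SiLU,
}

def make_act_from_cfg(act_layer: str = "gelu") -> nn.Module:
    return _ACT_LAYERS[act_layer]()
```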

Just working through differences and coming up with a plan...

@rwightman (Collaborator)

@zw615 also agree 100% that for average pooling, excluding the class token (or other extra, non-spatial tokens) is the best approach. I've had it like that in timm for > 2 years now. The original JAX and big_vision never supported enabling a class token with GAP, but sometimes it's useful, esp. for fine-tuning.

In other places it's used, but there is definitely a lack of consistency wrt including the class token in the average or not.

Of additional note, having a 'class token' but excluding it from GAP essentially makes it a register (https://twitter.com/TimDarcet/status/1707769575981424866), so there are a number of models out there, including the EVA02 ViTs, that are '1 register' models already :)

@zw615 (Contributor, Author) commented Oct 5, 2023

@rwightman

  • Yes, the changes in the model are more than what could be accommodated with the timm ViT. They include gelu_approximate, ln_pre, pooling_style, global_average_pooling, bert_tokenizer, and no attention_mask. Changes in factory.py are necessary. I think right now this PR has implemented those options in the model config in a simple way, but I do agree that a more generic activation arg would be better for future development. I have updated this argument in the new commit. Let me know what you think.
  • And do not forget the changes in transform.py. We used a slightly different transform config, like bilinear interpolation and square resizing (see the sketch after this list). These differences do not have much impact on model training, but a mismatched data augmentation could lead to a performance drop at inference.
  • As for the weight URLs and evaluation metrics of the trained CLIPA weights (both v1 and v2), I have included them in clipa.md in this PR (CLIPA Torch Train #578). Note that the weights are stored on Google Drive; I am not sure that is the best way to do it. Maybe it is better to have them in pretrained.py so that the code automatically downloads those weights, yet the current open_clip implementation does not support direct downloading from Google Drive. I cannot post them on a GitHub release either, as some weights are too big and not allowed by GitHub. I am open to suggestions in this regard.
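For the transform point above, a minimal torchvision sketch of the kind of eval-time mismatch being described; the exact sizes and the "bicubic short-side resize plus center crop" baseline are illustrative assumptions:

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Common CLIP-style eval preprocessing: resize the short side (bicubic), then center crop.
clip_style = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# CLIPA-style preprocessing as described above: square resize with bilinear interpolation.
clipa_style = transforms.Compose([
    transforms.Resize((224, 224), interpolation=InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])
```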

@rom1504 (Collaborator) commented Oct 6, 2023

Storing the weights on Hugging Face would probably be easiest and most coherent with what we have been doing in open_clip.

@rwightman (Collaborator)

FYI @zw615 @rom1504 I'm integrating this via #660 with my SigLIP changes. I've already done a fairly significant merge from main and will be adding SigLIP and refactoring, so anything added here after this point will not be included.

@rwightman (Collaborator) commented Oct 9, 2023

@zw615 FYI, too late now, but found a few quirks...

  1. The CLIPA "big vision" GAP impl is actually different from big vision, so OpenCLIP will possibly need 3 variations to cover (sketched below):
  • original OpenCLIP (CLIPA-v1 used this?, OpenCLIP did not): average pool over cls + spatial tokens, final norm on pooled output
  • CLIPA-v2 (Big Vision): average pool over spatial tokens (regardless of whether a cls token is used or not), final norm on pooled output
  • Big Vision (SigLIP): average pool over spatial tokens (no cls token enabled), final norm BEFORE pooling
  2. The bert tokenization + text pool combo doesn't make sense. Bert inserts the CLS token at the start of the sentence, and there is no causal mask used, so this is fine. But the big vision 'last' pooling used in CLIPA-v2 takes the last token in the sentence, where it'd make more sense to use the first. Note that in big vision, the bert tokenizer is used with the LiT text towers, which use the flaxformer bert that pools the 0th (cls) token. The big vision custom text tower that's used in SigLIP etc. pools the last token, but it's used with a different tokenizer, a T5 sentencepiece-based one where eos = padding = 1.

I don't believe there is a big vision model that uses 'last' + the bert tokenizer.
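For reference, a minimal sketch of the three pooled-output variants listed above (assuming image tokens of shape [batch, 1 + N, dim] with an optional class token at position 0, and using ln_final as a stand-in for whichever final LayerNorm the tower applies):

```python
import torch
import torch.nn as nn

dim = 1024
ln_final = nn.LayerNorm(dim)
tokens = torch.randn(2, 1 + 196, dim)  # [batch, cls + spatial, dim]

# 1. Original OpenCLIP GAP: average over cls + spatial tokens, norm the pooled vector.
pooled_openclip = ln_final(tokens.mean(dim=1))

# 2. CLIPA-v2: average only the spatial tokens (whether or not a cls token exists),
#    then norm the pooled vector.
pooled_clipa_v2 = ln_final(tokens[:, 1:].mean(dim=1))

# 3. Big Vision / SigLIP: final norm applied BEFORE pooling, then average the spatial
#    tokens (Big Vision has no cls token; it is dropped here only to reuse `tokens`).
pooled_siglip = ln_final(tokens)[:, 1:].mean(dim=1)
```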

@zw615 (Contributor, Author) commented Oct 13, 2023

@rwightman Ah, oh... sorry for the late reply. You are right. The GAP in CLIPA is actually a mixture of the original OpenCLIP's GAP and Big Vision's GAP: it averages all spatial tokens, but applies the final norm on the pooled output.

And about the bert tokenizer + 'last' text pool: yes, the tokenizer inserts the CLS token at the start, but the text transformer uses the last token in pooling. I have checked Big Vision's implementation, and I found that LiT also uses pool_type='last'? (https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/models/proj/image_text/text_transformer.py#L50)
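A small sketch of the text pooling difference under discussion, assuming a BERT-style tokenizer that places [CLS] at position 0 and a tower output x of shape [batch, seq_len, dim]; this is illustrative, not the code in either repo:

```python
import torch

def pool_first(x: torch.Tensor) -> torch.Tensor:
    # Use position 0 -- the BERT [CLS] token, as the flaxformer BERT tower does in LiT.
    return x[:, 0]

def pool_last(x: torch.Tensor) -> torch.Tensor:
    # Use the final position, as big_vision's text transformer does; with a
    # sentencepiece tokenizer where eos == padding ("sticky eos") this is still an
    # eos token, but with BERT-style tokenization it is typically just padding.
    return x[:, -1]

x = torch.randn(2, 32, 768)  # [batch, seq_len, dim]
print(pool_first(x).shape, pool_last(x).shape)  # torch.Size([2, 768]) twice
```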

Either way, I think these are only slight implementation differences, and should not have much impact on final performance. What can I do to make this right?

@rwightman (Collaborator) commented Oct 13, 2023

@zw615 they were observations (and a request for confirmation that you were aware), so nothing to make right. They do work acceptably, as per the good results; an ablation would be required to determine what would work best.

I was under the impression LiT was using the below; text_transformer.py was added with SigLIP a week or so back:

https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/models/proj/flaxformer/bert.py#L58

via

https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/configs/proj/image_text/lit_coco.py#L104

And SigLIP, with the 'last' pooling text transformers, is using these tokenizers, m4 and c4-en (https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/pp/ops_text.py#L35-L47), which don't have a bos/cls token, only padding == sticky eos.

@rwightman (Collaborator)

@zw615 the actionable item: would you like to put the weights for CLIPA-v2 for the final merge in OpenCLIP under a UCSC org? I can provide scripts / cmd line to do that once I've finalized the PR and done my own test push... or I can put them somewhere else (under the timm org with attribution in the model card).
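Not the actual script being offered here, but one hedged way to push a checkpoint to a Hugging Face organization with huggingface_hub (the repo id and file name are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes `huggingface-cli login` has already been run

repo_id = "UCSC-VLAA/CLIPA-v2-example"  # placeholder org/model name
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_file(
    path_or_fileobj="open_clip_pytorch_model.bin",  # local checkpoint path (placeholder)
    path_in_repo="open_clip_pytorch_model.bin",
    repo_id=repo_id,
)
```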

@zw615 (Contributor, Author) commented Oct 16, 2023

@rwightman Sure! I can do that. Let me know when you are finished. And yeah, scripts to do that are definitely needed.

@rwightman (Collaborator)

@zw615 I've merged it; I didn't want to hold up merging, so it'll use the weights I uploaded for now. Once you give me an organization you'd like to move them to, I think I can have an HF admin move the ones I uploaded. Then you (or anyone else) in that organization can update the README as you like. Also, it looks like you might have a 224x224 bigG weight with 82.7 top-1? The link for that on your repo is the same as the 336 one; I'll upload a 224 one if you can provide a different drive link.

@rwightman closed this Oct 20, 2023
@zw615 (Contributor, Author) commented Oct 21, 2023

@rwightman Thanks for the great work! And for spotting the wrong link! A silly mistake, haha. I have updated the CLIPA readme to add the 224 bigG weight, but for your convenience, you can also find it here.

As for the HF organization, I have no prior experience with it. I will discuss this with the other co-authors, and hopefully we can sort it out quickly. For now, I see those weights are under your name, which should be a good temporary solution.

@zw615 (Contributor, Author) commented Oct 21, 2023

@rwightman And oh, I just found out you @'ed me in another pull request. Somehow I didn't get the notification. I will reply to that PR as well.
