CLIPA Torch Inference #579
Conversation
@zw615 Are there weights and expected eval metrics that go with the models for this change? I don't see any URLs, etc. I have similar changes outstanding for ported SigLIP weights and want to merge them together.

Also, I noticed the support for the big_vision ViT variant: big_vision ViTs (and their LMs) usually have a head with both weight & bias, but here we have no bias (due to OpenAI's defaults).
@rwightman Sure, there is. For now they are posted in our CLIPA repo readme. I guess the changes needed for SigLIP bear a lot of similarity with this PR; maybe reusing them would be more efficient?

As for the bias in both the stem and head of the ViT, we happened to change that in our very early exploration, so weights trained with the CLIPA codebase also have no bias in the ViT stem and head layers. Thanks for the detailed observation! This would be another slight difference that should be taken into account when merging these code changes.
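For readers following along, here is a minimal sketch of the bias difference being discussed; the variable names and dimensions are illustrative, not the actual open_clip or big_vision code.

```python
import torch
import torch.nn as nn

embed_dim, output_dim, patch_size = 768, 512, 16

# big_vision-style "classic" ViT stem and head: bias terms present
bv_stem = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size, bias=True)
bv_head = nn.Linear(embed_dim, output_dim, bias=True)

# OpenAI-CLIP-style (and, per the comment above, CLIPA) stem and projection: no bias
clip_stem = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size, bias=False)
clip_proj = nn.Parameter(embed_dim ** -0.5 * torch.randn(embed_dim, output_dim))  # used as x @ clip_proj
```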
@zw615 Thanks. I will have to combine these model changes and the SigLIP ones carefully; there is a lot of overlap and a few differences. Some of the naming, etc. in this PR is too 'big_vision' specific: while it was clearly influenced by that repo, many of the differences are just classic ViT / LM vs OpenAI CLIP differences that this repo was based on. It also appears some awkwardness was caused by past bad calls in OpenCLIP (we merged a default global pool impl that included the class token, which doesn't make much sense). Looks like your v1 models used that, but v2 does it the more sensible way, where only spatial tokens are included in GAP?
@rwightman Yes, I agree. Most of the changes proposed by this PR are actually inherited from the big_vision implementation, but there are still a few differences between CLIPA and big_vision that also need attention (e.g. the bias term in the stem and head). Given that SigLIP is based on a newer version of big_vision, merging them could be quite tricky; I would say the wise thing to do is to figure out all the necessary changes before actually merging.

As for the pooling style, we found subtle differences across the various implementations (open_clip, big_vision, etc.). We didn't pay much attention to this minor detail in v1, but found big_vision's way of only pooling the spatial tokens more sensible and adopted it in v2. That is why we added those pooling options in this PR.
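A toy sketch of the pooling conventions being compared above; the pool-type names below are illustrative, not necessarily the config values used in this PR.

```python
import torch

def global_pool(tokens: torch.Tensor, pool_type: str = "avg", num_prefix_tokens: int = 1):
    """tokens: (batch, seq_len, dim) transformer output, with any class/register
    tokens occupying the first `num_prefix_tokens` positions."""
    if pool_type == "avg_all":   # older OpenCLIP-style GAP: class token included in the mean
        return tokens.mean(dim=1)
    elif pool_type == "avg":     # big_vision-style GAP: average the spatial tokens only
        return tokens[:, num_prefix_tokens:].mean(dim=1)
    elif pool_type == "tok":     # plain class-token pooling, for comparison
        return tokens[:, 0]
    raise ValueError(pool_type)

x = torch.randn(2, 1 + 196, 768)   # 1 class token + 14x14 patch tokens
pooled = global_pool(x, "avg")     # (2, 768)
```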
@zw615 Yup, all of that makes sense, and I've pieced together most of the 'why' now; I have to figure out a nice way of combining it all. In my SigLIP changes I am using the timm ViT because it's already compatible with the 'classic' ViT in big_vision, with the addition of an attention pooling mechanism (different from the one already in OpenCLIP). However, adding those options to the OpenCLIP ViT for CLIPA support makes sense too. I already had a pool type arg, and yeah, the text transformer supports last-token pooling as in big_vision, no masking, and a different tokenizer from both CLIPA and OpenCLIP. All of that can be combined sensibly, I feel.

The activation difference (tanh approx) is actually a much less significant difference numerically (when I tested) than the sigmoid approx, but I can maybe incorporate that into a more generic activation config (rather than a specific arg just for the 'approximation'). That way we could switch to other activations in the future without adding yet another cfg. Just working through the differences and coming up with a plan...
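As a rough illustration of the 'more generic activation config' idea, a hypothetical helper that resolves an activation by name; the function and names below are assumptions for illustration, not open_clip's actual API.

```python
import torch
import torch.nn as nn

class QuickGELU(nn.Module):
    # OpenAI CLIP's sigmoid-based GELU approximation: x * sigmoid(1.702 * x)
    def forward(self, x):
        return x * torch.sigmoid(1.702 * x)

def get_act_layer(name: str = "gelu") -> nn.Module:
    # Resolve the activation from a single config string, so future activations
    # need no extra config flags (e.g. no dedicated "gelu_approximate" arg).
    acts = {
        "gelu": nn.GELU(),                         # exact (erf) GELU, the OpenCLIP ViT default
        "gelu_tanh": nn.GELU(approximate="tanh"),  # tanh approximation, the big_vision / JAX default
        "quick_gelu": QuickGELU(),                 # sigmoid approximation, OpenAI CLIP's choice
    }
    return acts[name]

act = get_act_layer("gelu_tanh")
```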
@zw615 Also agree 100% that for average pooling, excluding the class (or other extra, non-spatial) tokens is the best approach. It has been like that in timm for > 2 years now. The original JAX and big_vision ViTs never supported enabling a class token with GAP, but sometimes it's useful, especially for fine-tuning. In other places it's used, but there is definitely a lack of consistency w.r.t. including the class token in the average or not. Of additional note, having a 'class token' but excluding it from GAP essentially makes it a register (https://twitter.com/TimDarcet/status/1707769575981424866), so there are a number of models out there, including the EVA02 ViTs, that are '1 register' models already :)
Storing the weights on Hugging Face would probably be easiest and most coherent with what we have been doing in OpenCLIP.
@zw615 FYI, too late now, but I found a few quirks... I don't believe there is a big_vision model that uses 'last' pooling + the BERT tokenizer.
@rwightman Ah... sorry for the late reply. You are right. The GAP in CLIPA is actually a mixture of the original OpenCLIP GAP and big_vision's GAP: it averages all spatial tokens, but applies the final norm on the pooled output. And about the BERT tokenizer + 'last' text pool: yes, the tokenizer inserts the CLS token at the start, but the text transformer uses the last token in pooling. I have checked big_vision's implementation, and I found that LiT also uses pool_type='last'? (https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/models/proj/image_text/text_transformer.py#L50). Either way, I think these are only slight implementation differences and should not have much impact on final performance. What can I do to make this right?
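A small sketch of the norm/pool ordering difference described above, assuming a single class token and a shared final LayerNorm; this is my reading of the two conventions and is illustrative only.

```python
import torch
import torch.nn as nn

ln_final = nn.LayerNorm(768)
x = torch.randn(2, 1 + 196, 768)  # class token + 14x14 spatial tokens

# big_vision-style: final norm over the whole token sequence, then average the spatial tokens
pooled_bv = ln_final(x)[:, 1:].mean(dim=1)

# CLIPA-style, as described above: average the spatial tokens first, then apply the final norm
pooled_clipa = ln_final(x[:, 1:].mean(dim=1))
```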
@zw615 They were observations (and a request for confirmation that you were aware), so nothing to make right. They do work acceptably, as per the good results; an ablation would be required to determine what would work best. I was under the impression LiT was using something else; the text_transformer.py you linked was added with SigLIP a week or so back. And SigLIP, with the 'last' pooling text transformers, is using the m4 and c4-en tokenizers (https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/pp/ops_text.py#L35-L47), which don't have a bos/cls token, only padding == sticky eos.
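For context, a toy sketch of the two text-pooling conventions referenced above; the pool-type names are illustrative, not open_clip's actual config values.

```python
import torch

def text_pool(tokens: torch.Tensor, text: torch.Tensor, pool_type: str = "last"):
    """tokens: (batch, seq_len, dim) text transformer output; text: (batch, seq_len) token ids."""
    if pool_type == "last":
        # big_vision / SigLIP-style: take the final sequence position (padding acts as a sticky EOS)
        return tokens[:, -1]
    elif pool_type == "argmax":
        # OpenAI CLIP-style: take the features at the EOT position (the highest token id)
        return tokens[torch.arange(tokens.shape[0]), text.argmax(dim=-1)]
    raise ValueError(pool_type)
```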
@zw615 The actionable item: would you like to put the weights for CLIPA-v2 for the final merge in OpenCLIP under a UCSC org? I can provide scripts / a command line to do that once I've finalized the PR and done my own test push... or I can put them somewhere else (under the timm org, with attribution in the model card).
@rwightman Sure! I can do that. Let me know when you are finished. And yeah, scripts to do that are definitely needed.
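A hedged sketch of one way to push checkpoints to an organization on the Hub using huggingface_hub directly; the org/repo names and local path below are placeholders, and the actual open_clip push script may differ.

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes `huggingface-cli login` has already been run
repo_id = "YOUR-ORG/CLIPA-v2-bigG-14-336"  # placeholder repo id

api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./clipa_v2_bigG_14_336",  # local dir with the checkpoint and config files
    repo_id=repo_id,
    repo_type="model",
)
```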
@zw615 I've merged it. I didn't want to hold up merging, so it'll use the weights I uploaded for now. Once you give me an organization you'd like to move them to, I think I can have an HF admin move the ones I uploaded; then you (or anyone else) in that organization can update the README as you like. Also, it looks like you might have a 224x224 bigG weight with 82.7 top-1? The link for that on your repo is the same as the 336 one; I'll upload a 224 one if you can provide a different drive link.
@rwightman Thanks for the great work! And for spotting the wrong link, a silly mistake, haha. I have updated the CLIPA readme to add the 224 bigG weight, but for your convenience, you can also find it here. As for the HF organization, I have no prior experience with it. I will discuss this with the other co-authors, and hopefully we can sort it out real quick. For now, I see those weights are under your name, which should be a good temporary solution.
@rwightman And oh, I just found that you @-mentioned me in another pull request. Somehow I didn't get the notification. I will reply to that PR as well.
Add code for gelu_approximate, ln_pre, pooling_style, global_average_pooling, bert_tokenizer, and no attention_mask. Also add some transform options, and include some example model configs and eval scripts.
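As a rough illustration only: the options listed above might surface in a model config along the lines of the sketch below. The key names are guesses for illustration and may not match the JSON configs actually added in this PR or the names used after the merge.

```python
# Illustrative, hypothetical CLIPA-style model config (key names are assumptions)
clipa_style_cfg = {
    "embed_dim": 1024,
    "vision_cfg": {
        "image_size": 336,
        "layers": 32,
        "width": 1280,
        "patch_size": 14,
        "pool_type": "avg",                      # global average pooling over spatial tokens
        "act_kwargs": {"approximate": "tanh"},   # tanh-approximate GELU
    },
    "text_cfg": {
        "context_length": 32,
        "vocab_size": 32000,
        "width": 1024,
        "layers": 24,
        "pool_type": "last",                     # last-token pooling
        "no_causal_mask": True,                  # no attention mask
        "hf_tokenizer_name": "bert-base-uncased",  # BERT tokenizer
    },
}
```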