Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Model type for tt/ee not found, assuming encoder-only architecture #283

Closed
josephrocca opened this issue Sep 7, 2023 · 1 comment · Fixed by #276
Closed
Labels
question Further information is requested

Comments

@josephrocca
Copy link
Contributor

josephrocca commented Sep 7, 2023

Reporting this as requested by the warning message, but as a question because I'm not entirely sure if it's a bug:

image

Here's the code I ran:

let quantized = false; // change to `true` for a much smaller model (e.g. 87mb vs 345mb for image model), but lower  accuracy
let { AutoProcessor, CLIPVisionModelWithProjection, RawImage, AutoTokenizer, CLIPTextModelWithProjection } = await import('https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js');
let imageProcessor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
let visionModel = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16', {quantized});
let tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
let textModel = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16', {quantized});

function cosineSimilarity(A, B) {
  if(A.length !== B.length) throw new Error("A.length !== B.length");
  let dotProduct = 0, mA = 0, mB = 0;
  for(let i = 0; i < A.length; i++){
    dotProduct += A[i] * B[i];
    mA += A[i] * A[i];
    mB += B[i] * B[i];
  }
  mA = Math.sqrt(mA);
  mB = Math.sqrt(mB);
  let similarity = dotProduct / (mA * mB);
  return similarity;
}

// get image embedding:
let image = await RawImage.read('https://i.imgur.com/RKsLoNB.png');
let imageInputs = await imageProcessor(image);
let { image_embeds } = await visionModel(imageInputs);
console.log(image_embeds.data);

// get text embedding:
let texts = ['a photo of an astronaut'];
let textInputs = tokenizer(texts, { padding: true, truncation: true });
let { text_embeds } = await textModel(textInputs);
console.log(text_embeds.data);

let similarity = cosineSimilarity(image_embeds.data, text_embeds.data);
console.log(similarity);
@josephrocca josephrocca added the question Further information is requested label Sep 7, 2023
@xenova
Copy link
Collaborator

xenova commented Sep 7, 2023

Thanks for the report - will fix :) It appears to be using the name of the minified class, which is why it is tt or ee.

e.g., to fix it, you can just use https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.js instead of https://cdn.jsdelivr.net/npm/@xenova/[email protected]/dist/transformers.min.js

xenova added a commit that referenced this issue Sep 8, 2023
@xenova xenova linked a pull request Sep 8, 2023 that will close this issue
xenova added a commit that referenced this issue Sep 8, 2023
* Add `CodeLlamaTokenizer`

* Add `codellama` for testing

* Update default quantization settings

* Refactor `PretrainedModel`

* Remove unnecessary error message

* Update llama-code-tokenizer test

* Add support for `GPTNeoX` models

* Fix `GPTNeoXPreTrainedModel` config

* Add support for `GPTJ` models

* Add support for `WavLM` models

* Update list of supported models

- CodeLlama
- GPT NeoX
- GPT-J
- WavLM

* Add support for XLM models

* Add support for `ResNet` models

* Add support for `BeiT` models

* Fix casing of `BeitModel`

* Remove duplicate code

* Update variable name

* Remove `ts-ignore`

* Remove unnecessary duplication

* Update demo model sizes

* [demo] Update default summarization parameters

* Update default quantization parameters for new models

* Remove duplication in mapping

* Update list of supported marian models

* Add support for `CamemBERT` models

* Add support for `MBart` models

* Add support for `OPT` models

* Add `MBartTokenizer` and `MBart50Tokenizer`

* Add example of multilingual translation with MBart models

* Add `CamembertTokenizer`

* Add support for `HerBERT` models

* Add support for `XLMTokenizer`

* Fix `fuse_unk` config

* Do not remove duplicate keys for `Unigram` models

See https://huggingface.co/camembert-base for an example of a Unigram tokenizer that has two tokens with the same value (`<unk>`)

* Update HerBERT supported model text

* Update generate_tests.py

* Update list of supported models

* Use enum object instead of classes for model types

Fixes #283

* Add link to issue

* Update dependencies for unit tests

* Add `sentencepiece` as a testing requirement

* Add `protobuf` to test dependency

* Remove duplicated models to test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants