Add support for SigLIP models (#473)
* Add support for SigLIP models

* Skip siglip tokenizer tests

* Move SigLIP-specific zero-shot-image-classification logic to pipeline
xenova authored Dec 27, 2023
1 parent 9b84d7b commit e2d17b9
Showing 10 changed files with 218 additions and 15 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -328,6 +328,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -63,6 +63,7 @@
1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu.
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
18 changes: 18 additions & 0 deletions scripts/convert.py
@@ -381,6 +381,24 @@ def main():
device=conv_args.device,
)

elif config.model_type == 'siglip' and conv_args.split_modalities:
# Handle special case for exporting text and vision models separately
from .extra.siglip import SiglipTextModelOnnxConfig, SiglipVisionModelOnnxConfig
from transformers.models.siglip import SiglipTextModel, SiglipVisionModel

text_model = SiglipTextModel.from_pretrained(model_id)
vision_model = SiglipVisionModel.from_pretrained(model_id)

export_models(
models_and_onnx_configs={
"text_model": (text_model, SiglipTextModelOnnxConfig(text_model.config)),
"vision_model": (vision_model, SiglipVisionModelOnnxConfig(vision_model.config)),
},
output_dir=output_model_folder,
opset=conv_args.opset,
device=conv_args.device,
)

# TODO: Enable once https://github.com/huggingface/optimum/pull/1552 is merged
# elif config.model_type == 'clap' and conv_args.split_modalities:
# # Handle special case for exporting text and audio models separately
33 changes: 33 additions & 0 deletions scripts/extra/siglip.py
@@ -0,0 +1,33 @@
# Support exporting vision and text models separately:
# Adapted from https://github.com/huggingface/optimum/issues/1186#issuecomment-1637641760

from optimum.exporters.onnx.model_configs import SiglipTextOnnxConfig, ViTOnnxConfig
from typing import Dict


class SiglipVisionOnnxConfig(ViTOnnxConfig):
pass


class SiglipTextModelOnnxConfig(SiglipTextOnnxConfig):
@property
def outputs(self) -> Dict[str, Dict[int, str]]:
return {
"last_hidden_state": {0: "batch_size", 1: "sequence_length"},
"pooler_output": {0: "batch_size"},
}

def generate_dummy_inputs(self, framework: str = "pt", **kwargs):
dummy_inputs = super().generate_dummy_inputs(framework=framework, **kwargs)
if framework == "pt":
import torch
dummy_inputs["input_ids"] = dummy_inputs["input_ids"].to(dtype=torch.int64)
return dummy_inputs

class SiglipVisionModelOnnxConfig(SiglipVisionOnnxConfig):
@property
def outputs(self) -> Dict[str, Dict[int, str]]:
return {
"last_hidden_state": {0: "batch_size"},
"pooler_output": {0: "batch_size"},
}
9 changes: 8 additions & 1 deletion scripts/supported_models.py
@@ -778,7 +778,14 @@
'nvidia/mit-b5',
],
},

'siglip': {
# Zero-shot image classification and feature extraction
# (with and without `--split_modalities`)
# NOTE: requires --opset 13
'zero-shot-image-classification': [
'nielsr/siglip-base-patch16-224',
],
},
'speecht5': {
# Text-to-audio/Text-to-speech
'text-to-audio': [
123 changes: 122 additions & 1 deletion src/models.js
@@ -3159,6 +3159,125 @@ export class CLIPVisionModelWithProjection extends CLIPPreTrainedModel {
//////////////////////////////////////////////////


//////////////////////////////////////////////////
// SigLIP models
export class SiglipPreTrainedModel extends PreTrainedModel { }

/**
* SigLIP Text and Vision Model with projection layers on top
*
* **Example:** Perform zero-shot image classification with a `SiglipModel`.
*
* ```javascript
* import { AutoTokenizer, AutoProcessor, SiglipModel, RawImage } from '@xenova/transformers';
*
* // Load tokenizer, processor, and model
* const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
* const processor = await AutoProcessor.from_pretrained('Xenova/siglip-base-patch16-224');
* const model = await SiglipModel.from_pretrained('Xenova/siglip-base-patch16-224');
*
* // Run tokenization
* const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
* const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
*
* // Read image and run processor
* const image = await RawImage.read('http://images.cocodataset.org/val2017/000000039769.jpg');
* const image_inputs = await processor(image);
*
* // Run model with both text and pixel inputs
* const output = await model({ ...text_inputs, ...image_inputs });
* // {
* // logits_per_image: Tensor {
* // dims: [ 1, 2 ],
* // data: Float32Array(2) [ -1.6019744873046875, -10.720091819763184 ],
* // },
* // logits_per_text: Tensor {
* // dims: [ 2, 1 ],
* // data: Float32Array(2) [ -1.6019744873046875, -10.720091819763184 ],
* // },
* // text_embeds: Tensor {
* // dims: [ 2, 768 ],
* // data: Float32Array(1536) [ ... ],
* // },
* // image_embeds: Tensor {
* // dims: [ 1, 768 ],
* // data: Float32Array(768) [ ... ],
* // }
* // }
* ```
*/
export class SiglipModel extends SiglipPreTrainedModel { }

/**
* The text model from SigLIP without any head or projection on top.
*
* **Example:** Compute text embeddings with `SiglipTextModel`.
*
* ```javascript
* import { AutoTokenizer, SiglipTextModel } from '@xenova/transformers';
*
* // Load tokenizer and text model
* const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
* const text_model = await SiglipTextModel.from_pretrained('Xenova/siglip-base-patch16-224');
*
* // Run tokenization
* const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];
* const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });
*
* // Compute embeddings
* const { pooler_output } = await text_model(text_inputs);
* // Tensor {
* // dims: [ 2, 768 ],
* // type: 'float32',
* // data: Float32Array(1536) [ ... ],
* // size: 1536
* // }
* ```
*/
export class SiglipTextModel extends SiglipPreTrainedModel {

/** @type {PreTrainedModel.from_pretrained} */
static async from_pretrained(pretrained_model_name_or_path, options = {}) {
// Update default model file name if not provided
options.model_file_name ??= 'text_model';
return super.from_pretrained(pretrained_model_name_or_path, options);
}
}

/**
* The vision model from SigLIP without any head or projection on top.
*
* **Example:** Compute vision embeddings with `SiglipVisionModel`.
*
* ```javascript
* import { AutoProcessor, SiglipVisionModel, RawImage } from '@xenova/transformers';
*
* // Load processor and vision model
* const processor = await AutoProcessor.from_pretrained('Xenova/siglip-base-patch16-224');
* const vision_model = await SiglipVisionModel.from_pretrained('Xenova/siglip-base-patch16-224');
*
* // Read image and run processor
* const image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
* const image_inputs = await processor(image);
*
* // Compute embeddings
* const { pooler_output } = await vision_model(image_inputs);
* // Tensor {
* // dims: [ 1, 768 ],
* // type: 'float32',
* // data: Float32Array(768) [ ... ],
* // size: 768
* // }
* ```
*/
export class SiglipVisionModel extends CLIPPreTrainedModel {
/** @type {PreTrainedModel.from_pretrained} */
static async from_pretrained(pretrained_model_name_or_path, options = {}) {
// Update default model file name if not provided
options.model_file_name ??= 'vision_model';
return super.from_pretrained(pretrained_model_name_or_path, options);
}
}
//////////////////////////////////////////////////
// ChineseCLIP models
export class ChineseCLIPPreTrainedModel extends PreTrainedModel { }
@@ -4902,6 +5021,7 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
['clip', ['CLIPModel', CLIPModel]],
['clipseg', ['CLIPSegModel', CLIPSegModel]],
['chinese_clip', ['ChineseCLIPModel', ChineseCLIPModel]],
['siglip', ['SiglipModel', SiglipModel]],
['mobilebert', ['MobileBertModel', MobileBertModel]],
['squeezebert', ['SqueezeBertModel', SqueezeBertModel]],
['wav2vec2', ['Wav2Vec2Model', Wav2Vec2Model]],
@@ -5190,7 +5310,8 @@ for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) {
const CUSTOM_MAPPING = [
['CLIPTextModelWithProjection', CLIPTextModelWithProjection, MODEL_TYPES.EncoderOnly],
['CLIPVisionModelWithProjection', CLIPVisionModelWithProjection, MODEL_TYPES.EncoderOnly],

['SiglipTextModel', SiglipTextModel, MODEL_TYPES.EncoderOnly],
['SiglipVisionModel', SiglipVisionModel, MODEL_TYPES.EncoderOnly],
['ClapTextModelWithProjection', ClapTextModelWithProjection, MODEL_TYPES.EncoderOnly],
['ClapAudioModelWithProjection', ClapAudioModelWithProjection, MODEL_TYPES.EncoderOnly],
]
9 changes: 7 additions & 2 deletions src/pipelines.js
@@ -1791,7 +1791,7 @@ export class ZeroShotImageClassificationPipeline extends Pipeline {

// Run tokenization
const text_inputs = this.tokenizer(texts, {
padding: true,
padding: this.model.config.model_type === 'siglip' ? 'max_length' : true,
truncation: true
});

@@ -1801,11 +1801,16 @@ export class ZeroShotImageClassificationPipeline extends Pipeline {
// Run model with both text and pixel inputs
const output = await this.model({ ...text_inputs, pixel_values });

const function_to_apply =
this.model.config.model_type === 'siglip'
? batch => batch.sigmoid().data
: batch => softmax(batch.data);

// Compare each image with each candidate label
const toReturn = [];
for (const batch of output.logits_per_image) {
// Compute softmax per image
const probs = softmax(batch.data);
const probs = function_to_apply(batch);

const result = [...probs].map((x, i) => ({
score: x,
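The sigmoid-versus-softmax switch above is what makes SigLIP checkpoints usable through the existing `zero-shot-image-classification` pipeline: each candidate label receives an independent probability, so scores need not sum to 1. Below is a minimal usage sketch, not part of this diff; the checkpoint ID is taken from the JSDoc examples above, and the exact output shape may differ slightly.

```javascript
import { pipeline } from '@xenova/transformers';

// Create a zero-shot image classification pipeline backed by a SigLIP checkpoint.
const classifier = await pipeline(
    'zero-shot-image-classification',
    'Xenova/siglip-base-patch16-224',
);

const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const labels = ['2 cats', '2 dogs'];

// Because SigLIP logits go through a sigmoid rather than a softmax,
// each label gets an independent score between 0 and 1.
const output = await classifier(url, labels);
console.log(output);
// Array of { score, label } objects (illustrative; exact values depend on the model)
```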
21 changes: 16 additions & 5 deletions src/processors.js
@@ -211,8 +211,8 @@ export class ImageFeatureExtractor extends FeatureExtractor {
constructor(config) {
super(config);

this.image_mean = this.config.image_mean;
this.image_std = this.config.image_std;
this.image_mean = this.config.image_mean ?? this.config.mean;
this.image_std = this.config.image_std ?? this.config.std;

this.resample = this.config.resample ?? 2; // 2 => bilinear
this.do_rescale = this.config.do_rescale ?? true;
@@ -396,6 +396,17 @@ export class ImageFeatureExtractor extends FeatureExtractor {
return [pixelData, imgDims];
}

/**
* Rescale the image's pixel values by `this.rescale_factor`.
* @param {Float32Array} pixelData The pixel data to rescale.
* @returns {void}
*/
rescale(pixelData) {
for (let i = 0; i < pixelData.length; ++i) {
pixelData[i] = this.rescale_factor * pixelData[i];
}
}

/**
* @typedef {object} PreprocessedImage
* @property {HeightWidth} original_size The original size of the image.
@@ -532,9 +543,7 @@ export class ImageFeatureExtractor extends FeatureExtractor {
let imgDims = [image.height, image.width, image.channels];

if (this.do_rescale) {
for (let i = 0; i < pixelData.length; ++i) {
pixelData[i] = this.rescale_factor * pixelData[i];
}
this.rescale(pixelData);
}

if (do_normalize ?? this.do_normalize) {
@@ -679,6 +688,7 @@ export class DPTFeatureExtractor extends ImageFeatureExtractor { }
export class GLPNFeatureExtractor extends ImageFeatureExtractor { }
export class CLIPFeatureExtractor extends ImageFeatureExtractor { }
export class ChineseCLIPFeatureExtractor extends ImageFeatureExtractor { }
export class SiglipImageProcessor extends ImageFeatureExtractor { }
export class ConvNextFeatureExtractor extends ImageFeatureExtractor { }
export class ConvNextImageProcessor extends ConvNextFeatureExtractor { } // NOTE extends ConvNextFeatureExtractor
export class ViTFeatureExtractor extends ImageFeatureExtractor { }
@@ -1764,6 +1774,7 @@ export class AutoProcessor {
OwlViTFeatureExtractor,
CLIPFeatureExtractor,
ChineseCLIPFeatureExtractor,
SiglipImageProcessor,
ConvNextFeatureExtractor,
ConvNextImageProcessor,
SegformerFeatureExtractor,
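For context, `preprocess` applies the (now factored-out) `rescale` step before normalization: each pixel value is multiplied by `rescale_factor`, then shifted and scaled by the per-channel `image_mean`/`image_std`, which now fall back to `config.mean`/`config.std`. The sketch below is a standalone illustration of those two steps with assumed config values; it is not code from this diff.

```javascript
// Illustrative sketch only (assumed config values, interleaved RGB pixel data).
const rescale_factor = 1 / 255;        // typical `rescale_factor`
const image_mean = [0.5, 0.5, 0.5];    // `config.image_mean ?? config.mean`
const image_std = [0.5, 0.5, 0.5];     // `config.image_std ?? config.std`

function rescaleAndNormalize(pixelData, channels = 3) {
    for (let i = 0; i < pixelData.length; ++i) {
        const c = i % channels;                         // channel of this value
        const rescaled = pixelData[i] * rescale_factor; // do_rescale step
        pixelData[i] = (rescaled - image_mean[c]) / image_std[c]; // do_normalize step
    }
    return pixelData;
}

// A single white RGB pixel ([255, 255, 255]) maps to [1, 1, 1].
console.log(rescaleAndNormalize(new Float32Array([255, 255, 255])));
```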
15 changes: 9 additions & 6 deletions src/tokenizers.js
@@ -2490,7 +2490,7 @@ export class PreTrainedTokenizer extends Callable {
* @param {string|string[]} text The text to tokenize.
* @param {Object} options An optional object containing the following properties:
* @param {string|string[]} [options.text_pair=null] Optional second sequence to be encoded. If set, must be the same type as text.
* @param {boolean} [options.padding=false] Whether to pad the input sequences.
* @param {boolean|'max_length'} [options.padding=false] Whether to pad the input sequences.
* @param {boolean} [options.add_special_tokens=true] Whether or not to add the special tokens associated with the corresponding model.
* @param {boolean} [options.truncation=null] Whether to truncate the input sequences.
* @param {number} [options.max_length=null] Maximum length of the returned list and optionally padding length.
@@ -2551,11 +2551,13 @@
// At this point, tokens is batched: [batch_size, tokens]
// However, array may be jagged. So, we pad to max_length

let maxLengthOfBatch = max(tokens.map(x => x.length))[0];

// If null, we calculate max length from sequences
if (max_length === null) {
max_length = maxLengthOfBatch;
if (padding === 'max_length') {
max_length = this.model_max_length;
} else {
// Calculate max length from sequences
max_length = max(tokens.map(x => x.length))[0];
}
}

// Ensure it is less than model max length
@@ -4115,7 +4117,7 @@ export class WhisperTokenizer extends PreTrainedTokenizer {
}
export class CodeGenTokenizer extends PreTrainedTokenizer { }
export class CLIPTokenizer extends PreTrainedTokenizer { }

export class SiglipTokenizer extends PreTrainedTokenizer { }

/**
* @todo This model is not yet supported by Hugging Face's "fast" tokenizers library (https://github.com/huggingface/tokenizers).
@@ -4221,6 +4223,7 @@ export class AutoTokenizer {
WhisperTokenizer,
CodeGenTokenizer,
CLIPTokenizer,
SiglipTokenizer,
MarianTokenizer,
BloomTokenizer,
NllbTokenizer,
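The padding change above is what the SigLIP examples rely on: when `padding: 'max_length'` is passed and no explicit `max_length` is given, sequences are padded to the tokenizer's `model_max_length` instead of the longest sequence in the batch. A small illustrative sketch follows; it is not part of this diff, and the reported dims are indicative of the behaviour rather than exact values.

```javascript
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/siglip-base-patch16-224');
const texts = ['a photo of 2 cats', 'a photo of 2 dogs'];

// Previous behaviour (still available): pad to the longest sequence in the batch.
const dynamic = tokenizer(texts, { padding: true, truncation: true });

// New behaviour used by the SigLIP examples: pad every sequence to model_max_length.
const fixed = tokenizer(texts, { padding: 'max_length', truncation: true });

console.log(dynamic.input_ids.dims); // [2, <longest sequence in the batch>]
console.log(fixed.input_ids.dims);   // [2, tokenizer.model_max_length]
```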
3 changes: 3 additions & 0 deletions tests/generate_tests.py
@@ -40,6 +40,9 @@
# TODO: remove when https://github.com/huggingface/transformers/issues/26547 is fixed
'speecht5',

# TODO: remove when https://github.com/huggingface/transformers/pull/26522 is merged
'siglip',

# TODO: remove when https://github.com/huggingface/transformers/issues/28164 is fixed
'roformer',

