Add support for PaliGemma (& PaliGemma 2)
xenova committed Dec 6, 2024
1 parent ead1f22 commit edbf767
Showing 6 changed files with 124 additions and 5 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -375,6 +375,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) by the PaliGemma Google team.
1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/abs/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -90,6 +90,7 @@
1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al.
1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby.
1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby.
1. **[PaliGemma](https://huggingface.co/docs/transformers/main/model_doc/paligemma)** (from Google) released with the papers [PaliGemma: A versatile 3B VLM for transfer](https://arxiv.org/abs/2407.07726) and [PaliGemma 2: A Family of Versatile VLMs for Transfer](https://arxiv.org/abs/2412.03555) by the PaliGemma Google team.
1. **[PatchTSMixer](https://huggingface.co/docs/transformers/main/model_doc/patchtsmixer)** (from IBM) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/abs/2306.09364) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[PatchTST](https://huggingface.co/docs/transformers/main/model_doc/patchtst)** (from Princeton University, IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.
1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.
37 changes: 32 additions & 5 deletions src/models.js
@@ -558,7 +558,9 @@ async function decoderForward(self, model_inputs, is_encoder_decoder = false) {
new_model_inputs.use_cache_branch = boolTensor(!!past_key_values);
}
if (session.inputNames.includes('position_ids') && new_model_inputs.attention_mask && !new_model_inputs.position_ids) {
new_model_inputs.position_ids = createPositionIds(new_model_inputs, past_key_values);
// NOTE: Handle a special case for paligemma models, where positions are 1-indexed
const start_index = self.config.model_type === 'paligemma' ? 1 : 0;
new_model_inputs.position_ids = createPositionIds(new_model_inputs, past_key_values, start_index);
}

// Unpack the `past_key_values` object into model inputs
@@ -694,14 +696,14 @@ async function imageTextToTextForward(self, {
* @param {Tensor} attention_mask
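* @param {number} [start_index=0] Starting value for the cumulative sum (1 for PaliGemma, which uses 1-indexed position ids)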
* @returns {{data: BigInt64Array, dims: number[]}}
*/
function cumsum_masked_fill(attention_mask) {
function cumsum_masked_fill(attention_mask, start_index = 0) {
const [bz, seq_len] = attention_mask.dims;
const attn_mask_data = attention_mask.data;

const data = new BigInt64Array(attn_mask_data.length);
for (let i = 0; i < bz; ++i) {
const start = i * seq_len;
let sum = BigInt(0);
let sum = BigInt(start_index);
for (let j = 0; j < seq_len; ++j) {
const index = start + j;
if (attn_mask_data[index] === 0n) {
@@ -728,10 +730,10 @@ function cumsum_masked_fill(attention_mask) {
* position_ids = position_ids[:, -input_ids.shape[1] :]
* ```
*/
function createPositionIds(model_inputs, past_key_values = null) {
function createPositionIds(model_inputs, past_key_values = null, start_index = 0) {
const { input_ids, inputs_embeds, attention_mask } = model_inputs;

const { data, dims } = cumsum_masked_fill(attention_mask);
const { data, dims } = cumsum_masked_fill(attention_mask, start_index);
let position_ids = new Tensor('int64', data, dims);
if (past_key_values) {
const offset = -(input_ids ?? inputs_embeds).dims.at(1);
@@ -3548,6 +3550,30 @@ export class Florence2ForConditionalGeneration extends Florence2PreTrainedModel
}
}

export class PaliGemmaPreTrainedModel extends PreTrainedModel {
forward_params = [
'input_ids',
// 'inputs_embeds',
'attention_mask',
'pixel_values',
'position_ids',
'past_key_values',
];
}

export class PaliGemmaForConditionalGeneration extends PaliGemmaPreTrainedModel {
_merge_input_ids_with_image_features(kwargs) {
const vision_hidden_size = kwargs.image_features.dims.at(-1);
const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);

return default_merge_input_ids_with_image_features({
// @ts-ignore
image_token_id: this.config.image_token_index,
...kwargs,
image_features: reshaped_image_hidden_states,
})
}
}

//////////////////////////////////////////////////
// Idefics3 Models
@@ -7000,6 +7026,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
]);

const MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
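The position-id change in `src/models.js` boils down to a masked cumulative sum over the attention mask that can start at 0 or at 1. Below is a minimal standalone sketch of that behaviour, using plain arrays instead of the library's `Tensor` class; the helper name is ours and is not part of the commit.

```js
// Simplified illustration of the start_index behaviour: position ids are a cumulative
// sum over the attention mask, with masked positions filled with 1. Gemma-style models
// start counting at 0; PaliGemma starts at 1.
function positionIdsFromMask(attention_mask /* number[][] */, start_index = 0) {
  return attention_mask.map((row) => {
    let sum = start_index;
    return row.map((m) => (m === 0 ? 1 : sum++)); // sum++ yields the value before incrementing
  });
}

console.log(positionIdsFromMask([[1, 1, 1, 1]], 0)); // [[0, 1, 2, 3]]
console.log(positionIdsFromMask([[1, 1, 1, 1]], 1)); // [[1, 2, 3, 4]]  (paligemma)
console.log(positionIdsFromMask([[0, 0, 1, 1]], 1)); // [[1, 1, 1, 2]]  (left-padded input)
```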
83 changes: 83 additions & 0 deletions src/models/paligemma/processing_paligemma.js
@@ -0,0 +1,83 @@
import { Processor } from "../../base/processing_utils.js";
import { AutoImageProcessor } from "../auto/image_processing_auto.js";
import { AutoTokenizer } from "../../tokenizers.js";

const IMAGE_TOKEN = "<image>";

function build_string_from_input(
prompt,
bos_token,
image_seq_len,
image_token,
num_images,
) {
return `${image_token.repeat(image_seq_len * num_images)}${bos_token}${prompt}\n`
}

export class PaliGemmaProcessor extends Processor {
static tokenizer_class = AutoTokenizer
static image_processor_class = AutoImageProcessor
static uses_processor_config = false;

/**
* @typedef {import('../../utils/image.js').RawImage} RawImage
*/

// `images` is required, `text` is optional
async _call(/** @type {RawImage|RawImage[]} */ images, text = null, kwargs = {}) {
if (!text) {
console.warn(
"You are using PaliGemma without a text prefix. It will perform as a picture-captioning model."
)
text = ""
}

if (!Array.isArray(images)) {
images = [images]
}

if (!Array.isArray(text)) {
text = [text]
}

const bos_token = this.tokenizer.bos_token;
const image_seq_length = this.image_processor.config.image_seq_length;
let input_strings;
if (text.some((t) => t.includes(IMAGE_TOKEN))) {
input_strings = text.map(
sample => {
const expanded_sample = sample.replaceAll(IMAGE_TOKEN, IMAGE_TOKEN.repeat(image_seq_length));
const bos_rfind_index = expanded_sample.lastIndexOf(IMAGE_TOKEN);
const bos_index = bos_rfind_index === -1 ? 0 : bos_rfind_index + IMAGE_TOKEN.length;
return expanded_sample.slice(0, bos_index) + bos_token + expanded_sample.slice(bos_index) + "\n";
}
)
} else {
console.warn(
"You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special " +
"image tokens in the text, as many tokens as there are images per each text. It is recommended to " +
"add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images " +
"each text has and add special tokens."
)

input_strings = text.map(
sample => build_string_from_input(
sample,
bos_token,
image_seq_length,
IMAGE_TOKEN,
images.length,
)
)
}

const text_inputs = this.tokenizer(input_strings, kwargs);
const image_inputs = await this.image_processor(images, kwargs);

return {
...image_inputs,
...text_inputs,
}
}
}
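Taken together, the new processor and model classes can be used roughly as follows. This is a usage sketch, not part of the commit: the checkpoint id and image URL are placeholders, and the surrounding calls (`AutoProcessor.from_pretrained`, `RawImage.fromURL`, `generate`, `batch_decode`) follow the library's existing image-text-to-text examples.

```js
import { AutoProcessor, PaliGemmaForConditionalGeneration, RawImage } from '@huggingface/transformers';

// Placeholder checkpoint id: substitute an actual ONNX export of a PaliGemma model.
const model_id = 'onnx-community/paligemma-example';

const processor = await AutoProcessor.from_pretrained(model_id);
const model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id);

// The `<image>` placeholder in the prompt is expanded to `image_seq_length` copies,
// and the BOS token is inserted after them (see processing_paligemma.js above).
const image = await RawImage.fromURL('https://example.com/cat.jpg');
const inputs = await processor(image, '<image>caption en');

// Generate, then decode only the newly produced tokens.
const outputs = await model.generate({ ...inputs, max_new_tokens: 100 });
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
console.log(processor.tokenizer.batch_decode(new_tokens, { skip_special_tokens: true }));
```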
1 change: 1 addition & 0 deletions src/models/processors.js
@@ -4,6 +4,7 @@ export * from './idefics3/processing_idefics3.js';
export * from './janus/processing_janus.js';
export * from './jina_clip/processing_jina_clip.js';
export * from './owlvit/processing_owlvit.js';
export * from './paligemma/processing_paligemma.js';
export * from './pyannote/processing_pyannote.js';
export * from './qwen2_vl/processing_qwen2_vl.js';
export * from './sam/processing_sam.js';
6 changes: 6 additions & 0 deletions src/tokenizers.js
@@ -2605,6 +2605,12 @@ export class PreTrainedTokenizer extends Callable {
this.unk_token = this.getToken('unk_token');
this.unk_token_id = this.model.tokens_to_ids.get(this.unk_token);

this.bos_token = this.getToken('bos_token');
this.bos_token_id = this.model.tokens_to_ids.get(this.bos_token);

this.eos_token = this.getToken('eos_token');
this.eos_token_id = this.model.tokens_to_ids.get(this.eos_token);

this.model_max_length = tokenizerConfig.model_max_length;

/** @type {boolean} Whether or not to strip the text when tokenizing (removing excess spaces before and after the string). */
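The newly exposed `bos_token` / `eos_token` fields are plain properties on any loaded tokenizer. A small sketch (the checkpoint id is a placeholder):

```js
import { AutoTokenizer } from '@huggingface/transformers';

// Placeholder checkpoint id: any tokenizer whose config defines bos/eos tokens will do.
const tokenizer = await AutoTokenizer.from_pretrained('onnx-community/paligemma-example');

console.log(tokenizer.bos_token, tokenizer.bos_token_id); // e.g. '<bos>' and its id
console.log(tokenizer.eos_token, tokenizer.eos_token_id); // e.g. '<eos>' and its id
```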
