Add support for DonutSwin models (Closes #318) #320

Merged (20 commits) on Sep 26, 2023

xenova (Collaborator) commented Sep 19, 2023

TODO

  • Step-by-step Document Image Classification (postponed due to a bug in optimum)
  • Step-by-step Document Parsing
  • Step-by-step Document Visual Question Answering (DocVQA)
  • Make it usable via pipeline API

Example usage

Examples adapted from https://huggingface.co/docs/transformers/model_doc/donut

Document parsing

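import { AutoProcessor, AutoTokenizer, AutoModelForVision2Seq, RawImage } from '@xenova/transformers';

let model_id = 'Xenova/donut-base-finetuned-cord-v2';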

// Prepare image inputs
let processor = await AutoProcessor.from_pretrained(model_id);
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/receipt.png';
let image = await RawImage.read(url);
let image_inputs = await processor(image);

// Prepare decoder inputs
let task_prompt = "<s_cord-v2>";
let tokenizer = await AutoTokenizer.from_pretrained(model_id);
let decoder_input_ids = tokenizer(task_prompt, {
    add_special_tokens: false,
}).input_ids;

// Create the model
let model = await AutoModelForVision2Seq.from_pretrained(model_id);

// Run inference
let output = await model.generate(image_inputs.pixel_values, {
    decoder_input_ids,
    max_length: model.config.decoder.max_position_embeddings,
});

// Decode output
let decoded = tokenizer.batch_decode(output)[0];
console.log(decoded);
// <s_cord-v2><s_menu><s_nm> CINNAMON SUGAR</s_nm><s_unitprice> 17,000</s_unitprice><s_cnt> 1 x</s_cnt><s_price> 17,000</s_price></s_menu><s_sub_total><s_subtotal_price> 17,000</s_subtotal_price></s_sub_total><s_total><s_total_price> 17,000</s_total_price><s_cashprice> 20,000</s_cashprice><s_changeprice> 3,000</s_changeprice></s_total></s>
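
The decoded string above is a flat sequence of `<s_key>…</s_key>` tags. In the Python library, `DonutProcessor.token2json` turns such a sequence into structured data. The sketch below shows roughly how a similar conversion could look in JavaScript. `parseDonutFields` is a hypothetical helper written for illustration, not part of transformers.js, and it is deliberately simplified: it assumes tags of the same name are never nested inside each other and it does not handle `<sep/>` list separators.

```javascript
// Hypothetical helper (not a transformers.js API): converts Donut's tagged
// output into a nested object. Simplified sketch, see assumptions above.
function parseDonutFields(text) {
    const result = {};
    let rest = text;
    while (rest.length > 0) {
        // Find the next opening tag, e.g. <s_menu>
        const open = rest.match(/<s_([^>]+)>/);
        if (!open) break;

        const key = open[1];
        const start = open.index + open[0].length;
        const closeTag = `</s_${key}>`;
        const end = rest.indexOf(closeTag, start);

        if (end === -1) {
            // Unclosed tag (such as the task prompt <s_cord-v2>): skip it
            rest = rest.slice(start);
            continue;
        }

        // Recurse if the content holds further tags, otherwise keep the text
        const inner = rest.slice(start, end);
        const value = inner.includes('<s_') ? parseDonutFields(inner) : inner.trim();

        // Repeated keys are collected into an array
        if (Object.prototype.hasOwnProperty.call(result, key)) {
            result[key] = [].concat(result[key], value);
        } else {
            result[key] = value;
        }
        rest = rest.slice(end + closeTag.length);
    }
    return result;
}

// Shortened version of the decoded receipt output shown above
const sequence = '<s_cord-v2><s_menu><s_nm> CINNAMON SUGAR</s_nm><s_price> 17,000</s_price></s_menu><s_total><s_total_price> 17,000</s_total_price></s_total></s>';
const fields = parseDonutFields(sequence);
console.log(JSON.stringify(fields, null, 2));
```

With this sketch, the item name would be available as `fields.menu.nm` and the total as `fields.total.total_price`.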

Document Visual Question Answering (DocVQA)

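import { AutoProcessor, AutoTokenizer, AutoModelForVision2Seq, RawImage } from '@xenova/transformers';

let model_id = 'Xenova/donut-base-finetuned-docvqa';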

// Prepare image inputs
let processor = await AutoProcessor.from_pretrained(model_id);
let url = 'https://i.imgur.com/i3asmW8.png';
let image = await RawImage.read(url);
let image_inputs = await processor(image);

// Prepare decoder inputs
let question = 'What is the invoice number?';
let task_prompt = `<s_docvqa><s_question>${question}</s_question><s_answer>`;
let tokenizer = await AutoTokenizer.from_pretrained(model_id);
let decoder_input_ids = tokenizer(task_prompt, {
    add_special_tokens: false,
}).input_ids;

// Create the model
let model = await AutoModelForVision2Seq.from_pretrained(model_id);

// Run inference
let output = await model.generate(image_inputs.pixel_values, {
    decoder_input_ids,
    max_length: model.config.decoder.max_position_embeddings,
});

// Decode output
let decoded = tokenizer.batch_decode(output)[0];
console.log(decoded);
// <s_docvqa><s_question> What is the invoice number?</s_question><s_answer> us-001</s_answer></s>
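
The decoded DocVQA output still contains the prompt and the question. To keep only the answer, one option is a regular expression over the `<s_answer>` tag, similar to what the Python Donut example does. This is a minimal sketch, and `extractAnswer` is a hypothetical helper name, not a transformers.js function.

```javascript
// Hypothetical helper (not a transformers.js API): returns the text between
// <s_answer> and </s_answer>, or null if no answer tag pair is present.
function extractAnswer(decoded) {
    const match = decoded.match(/<s_answer>([\s\S]*?)<\/s_answer>/);
    return match ? match[1].trim() : null;
}

// Decoded output copied from the example above
const sample = '<s_docvqa><s_question> What is the invoice number?</s_question><s_answer> us-001</s_answer></s>';
console.log(extractAnswer(sample));
// us-001
```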

xenova linked an issue on Sep 19, 2023 that may be closed by this pull request

HuggingFaceDocBuilderDev commented Sep 19, 2023

The documentation is not available anymore as the PR was closed or merged.

xenova (Collaborator, Author) commented Sep 24, 2023

Adding Document Image Classification might be delayed due to huggingface/optimum#1412

xenova merged commit d307f27 into main on Sep 26, 2023 (4 checks passed)
xenova changed the title from "Add support for DonutSwim models (Closes #318)" to "Add support for DonutSwin models (Closes #318)" on Sep 26, 2023
xenova deleted the add-donut-support branch on December 13, 2023 00:38

Successfully merging this pull request may close these issues.

[Model request] Add support for donut models