Description #1

TongkunGuan · 2024-06-11T14:31:04Z

How should I understand the following passage? "We employ the HERSHEY font at a size of 10px. On average, one 16x16 patch accommodates approximately 1.5 OPT text tokens. A 224x224 text image contains about 294 text tokens. Consequently, a visual encoder operating on this rendered text image requires only 1/3 of tokens to encode an equivalent amount of text, compared to the text tokenizer in language models."
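For anyone checking the numbers, here is a minimal sketch of the arithmetic behind the quoted figures. It assumes the visual encoder is ViT-style and emits one token per non-overlapping 16x16 patch; that assumption and the function names are mine, not taken from the paper or the repository.

```python
def patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image (assumed ViT-style tiling)."""
    per_side = image_size // patch_size
    return per_side * per_side


def text_tokens_in_image(image_size: int, patch_size: int, tokens_per_patch: float) -> float:
    """Approximate number of text tokens rendered into the image."""
    return patch_count(image_size, patch_size) * tokens_per_patch


# Figures quoted in the passage: 224x224 image, 16x16 patches, ~1.5 OPT tokens per patch.
patches = patch_count(224, 16)                      # visual tokens: 14 * 14
text_tokens = text_tokens_in_image(224, 16, 1.5)    # text tokens rendered in the image
ratio = patches / text_tokens                       # visual tokens per text token

print(patches, text_tokens, round(ratio, 3))
```

Under these assumptions the image yields 196 visual tokens for about 294 text tokens, so the visual encoder uses roughly 2/3 as many tokens as the text tokenizer. One possible reading is that the "1/3" in the passage refers to the saving (about 1/3 fewer tokens) rather than to the remaining fraction, but that is an interpretation and not something the quoted text confirms.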
