---
title: "Smart Vision-Language Reasoners"
collection: publications
category: conferences
permalink: /publication/2024-07-26-smart-vision-language-reasoners
excerpt: 'Images in math AI considered harmful? Not quite.
This paper demonstrates improved performance on question-answering problems in math
by customizing the neural network architecture to pool information from both the vision and text backbones.
The improvements come from a custom QF layer, which includes multihead self-attention layers
as well as a cross-attention layer (vision & text). We fine-tune the model using the SMART-101 dataset presented at CVPR 2023.'
date: 2024-07-26
venue: 'ICML 2024'
paperurl: 'https://smarter-vlm.github.io/smarter-vlm/'
citation: 'Roberts, Lucas. (2024). "Smart Vision-Language Reasoners." <i>ICML 2024</i>.'
---

The architecture
================

We freeze the vision and text backbones and add layers on top, both to promote
pooling of visual and textual features and to reduce the cost of fine-tuning.
The QF layer and the QF fusion layer contain multihead self-attention and
cross-attention. Fully connected layers sit on top, and on the decoder side
we use a GRU layer. We chose the GRU because it usually performs as well as, or
better than, alternatives like the LSTM, while having fewer parameters to update
on the backward pass.
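
The flow above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact implementation: the layer names, feature dimensions, and number of heads are my own assumptions, and the frozen-backbone features are simulated with random tensors.

```python
import torch
import torch.nn as nn

class QFLayer(nn.Module):
    """Illustrative QF-style fusion layer (hypothetical shapes/names):
    multihead self-attention on each modality, then cross-attention in
    which question tokens attend to image-patch features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.text_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, img_feats):
        t, _ = self.text_self_attn(text_feats, text_feats, text_feats)
        v, _ = self.img_self_attn(img_feats, img_feats, img_feats)
        fused, _ = self.cross_attn(t, v, v)  # text queries attend to vision
        return self.norm(fused + t)          # residual connection

class FusionHead(nn.Module):
    """Frozen backbones (stand-ins here) -> QF fusion -> GRU decoder -> answer logits."""
    def __init__(self, dim=256, n_answers=5):
        super().__init__()
        self.qf = QFLayer(dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, text_feats, img_feats):
        fused = self.qf(text_feats, img_feats)
        _, h = self.decoder(fused)           # final GRU hidden state
        return self.classifier(h[-1])

# Stand-in features: batch of 2, 12 text tokens and 49 image patches, dim 256.
model = FusionHead()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 5])
```

Only the head is trained; because the backbones stay frozen, the backward pass touches comparatively few parameters, which is what keeps fine-tuning affordable.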

The data
========

The SMART-101 dataset consists of a collection of questions with answers.
Each problem contains a base collection of text and 5 candidate
answers. Each problem is actually a class (or collection) of problems:
the actual data are generated by code, and images are generated with
OpenCV for each of the 101 problem classes. Human-level performance on
these data is quantified via the Math Kangaroo program.

For more details on SMART-101 see [smart-101](https://smartdataset.github.io/smart101/).
For more details on Math Kangaroo see the [math kangaroo site](https://mathkangaroo.org/mks).
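
To make the "class of problems" idea concrete, here is a toy sketch of one programmatically generated instance. The field names and the arithmetic template are entirely hypothetical; the real SMART-101 generators and schema differ, and the rendered image is represented only by a path.

```python
import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    """Illustrative (not official) schema for one generated puzzle item."""
    problem_class: int   # one of the 101 root puzzle classes
    question: str        # generated question text
    options: list        # exactly 5 candidate answers
    answer_index: int    # index of the correct option
    image_path: str      # path to the rendered puzzle image

def make_dummy_instance(class_id, rng):
    """Generate a toy arithmetic instance for one problem class."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    correct = a + b
    # Four distinct distractors offset from the correct answer.
    options = [correct] + [correct + d for d in rng.sample(range(1, 10), 4)]
    rng.shuffle(options)
    return PuzzleInstance(
        problem_class=class_id,
        question=f"What is {a} + {b}?",
        options=options,
        answer_index=options.index(correct),
        image_path=f"images/class_{class_id:03d}/{a}_{b}.png",
    )

rng = random.Random(0)
inst = make_dummy_instance(class_id=7, rng=rng)
print(inst.question, inst.options)
```

The point is that each of the 101 classes is a template from which arbitrarily many concrete question/answer/image triples can be sampled.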

The findings
============

We find improved performance over the baselines used in the SMART-101 paper.

Critiques
=========

Some common critiques I have heard, at the ICML workshop and since:

1. We didn't run enough epochs of fine-tuning.

We were limited by GPU compute time, and the vision and text backbones
are large models. Keep in mind this wasn't work sponsored by my employer, so we
did not have unlimited access to A100 GPUs to fine-tune and perform extensive
ablations. We followed a recipe outlined by [Andrej Karpathy](https://karpathy.github.io/2019/04/25/recipe/), which remains excellent advice despite the fast-moving nature of the space.

2. Images in math AI considered harmful.

Another common critique I heard is that many other math AI papers have
investigated the use of images and found them either unhelpful
or harmful ("images in math AI considered harmful"). If you read these other
papers you will find, in all the ones I've read, that they neither customize
the network architectures nor have cross-attention layers that pool information
from the textual and image backbones.
We sought to disprove the claim that images in math AI are harmful, and we did.

In fact, this was a panel discussion question during the workshop: "Is text
alone enough?" The consensus was that while images may not be necessary, they are sufficient.

That's math speak for: images will help, because there is a lot of
data/information contained inside them, but math problems can be solved without images.

3. Images in math AI found not to help.

This is a finding in the MathVerse paper, i.e. the model learns to shortcut the
vision features and rely primarily on the text features of the problem.
My commentary here is the same as for item 2 above: those architectures did not
pool the visual and textual information.

My Opinion on Images
====================

While I do not disagree with the premises of the panel, relying on text alone
seems to me a bit like bringing a knife to a gunfight, or, to use a less
violent metaphor, playing chess [blindfold or sans voir](https://en.wikipedia.org/wiki/Blindfold_chess) against your opponent. There are some quite talented chess players, and some can play blindfold extremely well; however, most will say their performance playing blindfold is hindered relative to playing sighted.

To be honest, though, neither I nor others in the community have an "answer" to
the question of images in math AI.

Conclusion
==========

While it remains to be seen whether purely text models can perform as well in
the math AI domain, our work suggests that there are several aspects of the
problem hitherto unconsidered.

If you are a researcher or institution who would like to work together in this
space, or fund our investigations, please get in touch!

My email contact is there on the left-hand side of the screen, only a click away.