About the model explanation #10
I see only a small difference in the backbone: the paper uses a ViT, while this work uses a CNN.
Hey, thanks for the feedback. This work is inspired by Facebook AI's DETR (Detection Transformer), which aims to do object detection with transformers. The paper you've linked is very recent work on a similar topic, but they have not provided any implementation.
Thank you for your reply. I think I understand the structure of your work. Thank you!!
Hi @saahiluppal, I am trying to understand where the object detection part is occurring in the code, and what exact algorithm you're using.
Hey, the image is fed to a ResNet, and this backbone gives us the feature embedding along with the corresponding mask for the image. That is the versatility of the attention mechanism.
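For anyone trying to locate that pipeline in the code, here is a minimal PyTorch sketch (not the repository's actual code) of the idea: a ResNet backbone produces a feature map, which is flattened into image tokens and fed to a transformer encoder-decoder that decodes caption tokens. The class name `CaptionSketch`, the vocabulary/sequence sizes, and the layer widths are illustrative assumptions, and the image padding mask mentioned above is omitted for brevity.

```python
# Minimal sketch of a ResNet-backbone + transformer captioning model (illustrative only).
import torch
import torch.nn as nn
import torchvision

class CaptionSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, max_len=128):
        super().__init__()
        # ResNet-50 without its pooling/classification head -> 2048-channel feature map
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)  # project to transformer width
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # learned positions for caption tokens
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids
        feats = self.input_proj(self.backbone(images))          # (B, d_model, h, w)
        memory = feats.flatten(2).permute(0, 2, 1)              # (B, h*w, d_model) image "tokens"
        pos = torch.arange(captions.size(1), device=captions.device)
        tgt = self.token_embed(captions) + self.pos_embed(pos)  # (B, T, d_model)
        # causal mask so each caption position only attends to earlier tokens
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            captions.size(1)).to(captions.device)
        out = self.transformer(src=memory, tgt=tgt, tgt_mask=tgt_mask)
        return self.head(out)                                   # (B, T, vocab_size) logits

model = CaptionSketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 30522])
```

The key point is that no separate object detector is involved: the decoder's cross-attention attends directly over the flattened backbone features when predicting each caption token.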
PS: Recent research suggests that doing "Object Detection" prior to "Image Captioning" doesn't bring any additional improvement; it just increases complexity.
Hi. Would you let me know which paper you referenced? Thank you.
I read it in the ablation studies of some paper; I'm not sure which one.
Have you found which paper the structure of this code is based on? Thanks
Hi. Thank you for your impressive work.
I've read your work and want to understand your model clearly.
From #2, I know there is no paper, but I found a paper similar to your work.
Does the figure below explain your work?
Thank you!