
Is MinVIS truly online? #2

Open

timmeinhardt opened this issue Aug 30, 2022 · 2 comments

Comments

timmeinhardt commented Aug 30, 2022

First of all, congratulations on this paper. It was a very interesting read. However, I think MinVIS technically cannot be considered an online method. You process each frame separately, but an online method must not use information from future frames for the decision making on the current frame. In this line

out_logits = sum(out_logits)/len(out_logits)

you compute mean scores for each query and class across the entire sequence. These scores are later used for the top-k selection of the final outputs. While your frame processing might be online, using information from all frames at once means your decision making is not. Please clarify what I might be misunderstanding, or share your point of view on the matter. Thank you!
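For reference, a minimal sketch of the decision making I am referring to, assuming the usual Mask2Former-style [num_queries, num_classes + 1] logits per frame (not the exact MinVIS code):

import torch
import torch.nn.functional as F

def select_tracks_offline(out_logits_per_frame, num_topk=10):
    """out_logits_per_frame: list of [num_queries, num_classes + 1] tensors,
    one entry per frame of the clip."""
    # sequence-level average -> one score per (query, class) pair,
    # so the selection for every frame depends on future frames
    out_logits = sum(out_logits_per_frame) / len(out_logits_per_frame)
    scores = F.softmax(out_logits, dim=-1)[:, :-1]   # drop the "no object" class
    topk_scores, topk_idx = scores.flatten().topk(num_topk)
    query_idx = torch.div(topk_idx, scores.shape[1], rounding_mode="floor")
    class_idx = topk_idx % scores.shape[1]
    return topk_scores, query_idx, class_idx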

JialianW commented Sep 1, 2022

To my understanding, this is only for evaluation purposes. VIS requires each tracklet to have a score in order to compute mAP, and the most straightforward way to obtain a score for a tracklet is to average across frames. I don't think this part would be used in real streaming applications.
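To illustrate, a minimal sketch of that evaluation-only scoring, assuming a result entry roughly in the YouTube-VIS submission format (hypothetical helper, not the MinVIS code):

def tracklet_result(video_id, category_id, per_frame_scores, per_frame_masks):
    # One entry per tracklet; the only sequence-level quantity needed is the
    # single "score" field required by the mAP evaluation.
    return {
        "video_id": video_id,
        "category_id": category_id,
        "score": sum(per_frame_scores) / len(per_frame_scores),  # frame average
        "segmentations": per_frame_masks,  # one mask (or None) per frame
    }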

timmeinhardt (Author) commented Sep 1, 2022

I agree that the GT and prediction file design of YouTube-VIS and OVIS both invite processing their data in an offline fashion, but it is possible to generate the required full-sequence tracks even with truly online methods. For example, IDOL not only processes sequences online but also never uses information from future frames for the mask/score prediction of the current frame. This requires full frame-to-frame track management, and to satisfy the VIS GT format missing/occluded frames are filled in retroactively with zeros (see here).

In a real-world streaming application such online track management would also be necessary. You cannot apply your current method to a video stream and produce reasonable tracks once objects get occluded or enter/leave the sequence. Most importantly, when it comes to the evaluation and comparability of methods, computing scores over the full sequence is usually considered offline. Using global information/averaging would probably benefit IDOL as well. Hence, I think the comparison is not fair.
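To make the distinction concrete, a sketch of one possible truly online variant: the (query, class) decision for frame t uses only logits seen up to frame t (a causal running mean here), and frames where a track is missing are filled in afterwards so the full-sequence VIS result format is still satisfied. Hypothetical helper names; this is neither the IDOL nor the MinVIS implementation.

import torch
import torch.nn.functional as F

def online_topk(running_sum, num_frames_seen, frame_logits, num_topk=10):
    """Update a causal running mean of the logits and select top-k tracks
    using only information available at the current frame."""
    running_sum = running_sum + frame_logits
    mean_logits = running_sum / num_frames_seen
    scores = F.softmax(mean_logits, dim=-1)[:, :-1]   # drop the "no object" class
    topk_scores, topk_idx = scores.flatten().topk(num_topk)
    query_idx = torch.div(topk_idx, scores.shape[1], rounding_mode="floor")
    class_idx = topk_idx % scores.shape[1]
    return running_sum, topk_scores, query_idx, class_idx

def pad_missing_frames(track_masks_by_frame, num_frames):
    """Retroactively fill frames where the track was not detected with None,
    so the submitted tracklet covers the whole sequence."""
    return [track_masks_by_frame.get(t, None) for t in range(num_frames)]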
