Is MinVIS truly online? #2
Comments
To my understanding, this is only for evaluation purposes. VIS requires each tracklet to have a score in order to compute mAP, and the most straightforward way to get a score for a tracklet is to average across frames. I don't think this part would be used in real streaming applications.
I agree that the GT and prediction file designs of YouTube-VIS and OVIS both invite processing their data in an offline fashion, but it is possible to generate appropriate full-sequence tracks even for truly online methods. For example, IDOL not only processes sequences online but also never uses information from future frames for the mask/score prediction of the current frame. This requires full frame-to-frame track management, and to satisfy the VIS GT format they retroactively fill in missing/occluded frames with zeros (see here). In a real-world streaming application, online track management would also be necessary. You cannot apply your current method to a video stream and produce reasonable tracks when objects get occluded or enter/leave the sequence. Most importantly, when it comes to the evaluation and comparability of methods, computing scores over the full sequence is usually considered offline. Using global information/averaging would probably benefit IDOL as well. Hence, I think the comparison is not fair.
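To make the zero-filling step concrete, here is a minimal sketch of how an online tracker could export a full-sequence track. It is not IDOL's actual code: `pad_track` and its arguments are hypothetical names, and masks are plain nested lists rather than the RLE encoding the benchmarks use.

```python
def pad_track(track_masks, num_frames, height, width):
    """Expand a sparse online track to one mask per frame.

    track_masks maps a frame index to a 2D binary mask (list of rows).
    Frames in which the track was missing or occluded get an all-zero mask,
    so the result has exactly num_frames entries.
    """
    padded = []
    for t in range(num_frames):
        if t in track_masks:
            padded.append(track_masks[t])
        else:
            # Fresh all-zero mask for every absent frame (no shared rows).
            padded.append([[0] * width for _ in range(height)])
    return padded


# Hypothetical track that only exists in frame 1 of a 3-frame clip.
track = {1: [[1, 0], [0, 1]]}
padded = pad_track(track, num_frames=3, height=2, width=2)
```

The padding happens only at export time, so the per-frame predictions themselves never depend on later frames.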
First of all, congratulations on this paper. It was a very interesting read. However, I think that technically MinVIS cannot be considered an online method. You process each frame separately, but an online method must not use information from future frames when making decisions about the current frame. In this line
MinVIS/minvis/video_maskformer_model.py
Line 308 in 3038871
you compute mean scores for each query and class across the entire sequence. These scores are later used for the topk selection of the final outputs. While your frame processing might be online, using information from all frames at once means your decision making is not. Please clarify what I might be misunderstanding, or share your point of view on the matter. Thank you!
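To illustrate the concern, here is a toy sketch, not the MinVIS code: the function names and the score values are made up, and a single score per query stands in for the per-class scores. It contrasts a mean over the whole sequence with a causal running mean that only sees frames up to the current one.

```python
def sequence_mean_scores(frame_scores):
    """Offline: average each query's score over every frame of the clip."""
    num_frames = len(frame_scores)
    num_queries = len(frame_scores[0])
    return [
        sum(frame[q] for frame in frame_scores) / num_frames
        for q in range(num_queries)
    ]


def running_mean_scores(frame_scores):
    """Causal: at frame t, average only the frames seen so far (0..t)."""
    totals = [0.0] * len(frame_scores[0])
    history = []
    for t, scores in enumerate(frame_scores, start=1):
        totals = [acc + s for acc, s in zip(totals, scores)]
        history.append([acc / t for acc in totals])
    return history


def top_k(scores, k):
    """Indices of the k highest-scoring queries."""
    return sorted(range(len(scores)), key=lambda q: scores[q], reverse=True)[:k]


# Made-up scores: 3 frames (rows) x 3 queries (columns).
frame_scores = [
    [0.9, 0.2, 0.5],
    [0.3, 0.6, 0.5],
    [0.1, 0.9, 0.5],
]

offline_choice = top_k(sequence_mean_scores(frame_scores), k=1)
online_choices = [top_k(s, k=1) for s in running_mean_scores(frame_scores)]
```

With these numbers the sequence-level mean selects query 1 for the whole clip, while the causal version selects query 0 in the first two frames and only switches to query 1 at the last frame. The selection for the early frames therefore depends on scores from later frames.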