This document provides a brief introduction to the usage of MinVIS.
Please see Getting Started with Detectron2 for full usage.
We provide a script, `train_net_video.py`, that is made to train all the configs provided in MinVIS.
To train a model with `train_net_video.py`, first set up the corresponding datasets following `datasets/README.md`, then download the COCO pre-trained instance segmentation weights (R50, Swin-L) and put them in the current working directory. Once these are set up, run:
```
python train_net_video.py --num-gpus 8 \
  --config-file configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml
```
If the COCO pre-trained weights are stored elsewhere, append `MODEL.WEIGHTS /path/to/pretrained_weights` at the end of the command to point to their location. In addition, the configs are made for 8-GPU training for ResNet-50 and 16-GPU training for Swin-L.
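For instance, the R50 command above with an explicit weights override appended (the weights path itself is a placeholder) would look like:

```
python train_net_video.py --num-gpus 8 \
  --config-file configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml \
  MODEL.WEIGHTS /path/to/pretrained_weights
```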
Since we use the AdamW optimizer, it is not clear how to scale the learning rate with the batch size. To train on 1 GPU, you need to choose the learning rate and batch size yourself:
```
python train_net_video.py \
  --config-file configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml \
  --num-gpus 1 SOLVER.IMS_PER_BATCH SET_TO_SOME_REASONABLE_VALUE SOLVER.BASE_LR SET_TO_SOME_REASONABLE_VALUE
```
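As an illustration only: the values below assume the config's default learning rate is 0.0001 at batch size 32 and apply a simple linear-scaling heuristic, which, as noted above, is not guaranteed to work well with AdamW. Check the config and tune for your hardware.

```
# Illustrative placeholder values only; tune SOLVER.IMS_PER_BATCH and SOLVER.BASE_LR yourself.
python train_net_video.py \
  --config-file configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml \
  --num-gpus 1 SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.0000125
```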
To evaluate a model's performance, use:

```
python train_net_video.py \
  --config-file configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml \
  --eval-only MODEL.WEIGHTS /path/to/checkpoint_file
```
For more options, see `python train_net_video.py -h`.
- Pick a trained model and its config file. To start, you can pick from the model zoo, for example `configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml`.
- We provide `demo.py` to visualize the outputs of a trained model. Run it with:
```
cd demo_video/
python demo.py --config-file ../configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml \
  --input /path/to/video/frames \
  --output /output/folder \
  [--other-options] \
  --opts MODEL.WEIGHTS /path/to/checkpoint_file
```
The configs are made for training, so you need to specify `MODEL.WEIGHTS` to point to a trained checkpoint when running the demo. The input is a folder containing video frames saved as images, for example `ytvis_2019/valid/JPEGImages/00f88c4f0a`.
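As a concrete example: assuming the dataset lives under `datasets/` as set up earlier, and using a hypothetical output folder name (the checkpoint path is still a placeholder), the demo on that example folder could be run as:

```
cd demo_video/
python demo.py --config-file ../configs/youtubevis_2019/video_maskformer2_R50_bs32_8ep_frame.yaml \
  --input ../datasets/ytvis_2019/valid/JPEGImages/00f88c4f0a \
  --output ../demo_output \
  --opts MODEL.WEIGHTS /path/to/checkpoint_file
```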