Paper: Attention-Based Multi-Object Tracking of Ground Targets for UAVs (基于注意力机制的无人机对地多目标跟踪)
What's new:
- Added JDE with Swin-S and Swin-B backbones
- Updated cfg files and train.py
This repo implements JDE (Joint Detection and Embedding) with a Swin Transformer backbone on the VisDrone2019-MOT dataset. The code is built on JDE and Swin Transformer.
The structure of the model is as follows:
Results on the VisDrone2019-MOT test set (tracked with ByteTrack, high threshold = 0.6, low threshold = 0.2):
Model | IDF1 | Recall | Precision | FP | FN | MOTA | MOTP | FPS
---|---|---|---|---|---|---|---|---
JDE(with DarkNet53 backbone) | 45.0 | 48.7 | 91.4 | 5777 | 64672 | 42.4 | 0.235 | 17.84 |
JDE(with Swin-T backbone) | 48.2 | 54.6 | 88.7 | 8784 | 57202 | 45.9 | 0.249 | 23.55 |
JDE(with Swin-S backbone) | 49.5 | 56.6 | 85.5 | 12094 | 54779 | 45.1 | 0.263 | 15.78 |
JDE(with Swin-B backbone) | 47.2 | 53.9 | 87.6 | 9589 | 58191 | 44.3 | 0.247 | 15.87 |
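For reference, the ByteTrack step above splits detections by confidence and associates tracks in two rounds. A minimal sketch of that split with the thresholds used here (the function below is illustrative, not code from this repo):

```python
import numpy as np

def split_detections(dets, high_thresh=0.6, low_thresh=0.2):
    """Split detections by confidence score, ByteTrack-style.

    dets: (N, 5) array of [x1, y1, x2, y2, score].
    Returns (high-score dets, low-score dets); scores below low_thresh are dropped.
    """
    scores = dets[:, 4]
    high = dets[scores >= high_thresh]
    low = dets[(scores >= low_thresh) & (scores < high_thresh)]
    return high, low

# First associate existing tracks with `high`, then match the remaining
# unmatched tracks against `low` (typically by IoU only).
```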
Training details: JDE with the Swin-T backbone is trained with:
- Swin-T ImageNet pretrained model
- half of the training dataset (27 seqs)
- batch size = 32
- AdamW optimizer, initial lr = 3e-4
- 40 epochs, tested with the best-mAP model from training (the 33rd epoch); lr × 0.1 at the 31st and 37th epochs (following the Swin Transformer paper)
- 2 Tesla A100 GPUs, about 5 hours
JDE with DarkNet53 is trained similarly. JDE with Swin-S is also similar, but the batch size is reduced to 24 due to GPU memory; for JDE with Swin-B the batch size is 16.
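These settings map to standard PyTorch components; a minimal sketch, assuming weight decay 1e-2 (inferred from the weights folder name below) and a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for the JDE-Swin model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)
# lr x 0.1 at the 31st and 37th epochs, as described above
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[31, 37], gamma=0.1)

for epoch in range(40):
    # ... run one training epoch here ...
    scheduler.step()
```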
Trained model:
Baidu link (JDE with Swin-T): link, extraction code: ngm1
TODO:
- Train on the MOT17 dataset to compare with DarkNet again.
- Try to reach better results on VisDrone.
Following the JDE installation instructions is enough. My environment is:
- python=3.7.0 pytorch=1.7.0 torchvision=0.8.0 cudatoolkit=11.0
You also need:
- py-motmetrics (`pip install motmetrics`)
- cython-bbox (`pip install cython_bbox`)
- opencv
In order to use the Swin Transformer backbone, please install mmdetection:
pip install openmim
mim install mmdet
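To check that mmdetection is installed and exposes the Swin backbone, something like the following should work (the import path is from mmdet 2.x; treat it as an assumption if your version differs):

```python
import mmdet
print(mmdet.__version__)

# The Swin backbone used by this repo comes from mmdetection.
from mmdet.models.backbones import SwinTransformer
print(SwinTransformer)
```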
First, generate the image and annotation path files following the JDE format (see the appendix).
For the VisDrone dataset, you can run the following.
Partial train dataset (27 seqs):
python generate_labels_for_VisDronev2.py --if_certain_seqs
full train dataset:
python generate_labels_for_VisDronev2.py
Generate the test dataset paths:
python generate_labels_for_VisDronev2.py --split 'VisDrone2019-MOT-test-dev'
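For reference, these commands turn VisDrone MOT annotations (one annotations/&lt;seq&gt;.txt per sequence, with lines `frame,id,x,y,w,h,score,category,truncation,occlusion`) into JDE-style per-image label files. A rough, simplified sketch of that conversion (the function name and paths are hypothetical, and the real script also handles category filtering):

```python
import os
from collections import defaultdict

def visdrone_seq_to_jde(ann_file, label_dir, img_w, img_h):
    """Convert one VisDrone MOT annotation file to JDE-style per-frame labels.

    VisDrone line:  frame,id,x,y,w,h,score,category,truncation,occlusion
    JDE label line: class identity x_center/img_w y_center/img_h w/img_w h/img_h
    """
    per_frame = defaultdict(list)
    with open(ann_file) as f:
        for line in f:
            fields = line.strip().split(',')
            frame, tid = int(fields[0]), int(fields[1])
            x, y, w, h = map(float, fields[2:6])
            if int(fields[6]) == 0:   # score 0 marks ignored regions in the GT
                continue
            cx, cy = (x + w / 2) / img_w, (y + h / 2) / img_h
            # class fixed to 0 here; the real script maps VisDrone categories
            per_frame[frame].append(
                f"0 {tid} {cx:.6f} {cy:.6f} {w / img_w:.6f} {h / img_h:.6f}")

    os.makedirs(label_dir, exist_ok=True)
    for frame, lines in sorted(per_frame.items()):
        with open(os.path.join(label_dir, f"{frame:07d}.txt"), 'w') as out:
            out.write('\n'.join(lines) + '\n')
```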
Then train.
If you want to use the Swin-T pretrained model, please download it from the Swin Transformer repo (choose the Swin-T for Mask R-CNN), rename it 'swin_t.pth', and put it in 'weights/'.
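If you want to see what is inside the downloaded checkpoint: the Mask R-CNN checkpoints from the Swin repo keep the backbone weights under a 'backbone.' prefix inside 'state_dict'. A small inspection sketch (how train.py actually consumes the file may differ):

```python
import torch

# Load the renamed checkpoint and pull out only the backbone tensors.
ckpt = torch.load('weights/swin_t.pth', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)
backbone_sd = {k[len('backbone.'):]: v
               for k, v in state_dict.items() if k.startswith('backbone.')}

# backbone.load_state_dict(backbone_sd, strict=False)  # `backbone` = the Swin module
print(len(backbone_sd), 'backbone tensors')
```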
Train with the Swin backbone:
python train.py --backbone 'swin' --cfg 'cfg/yolov3_1088x608_newanchor3-swin_t.cfg'
If you want to use Swin-S, switch the cfg to yolov3_1088x608_newanchor3-swin_s.cfg.
If you want to train on your own dataset, please modify the anchors in the cfg file. You can use k-means clustering to choose your anchor sizes:
python choose_anchors.py
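If you want to see what the clustering does, a minimal plain k-means over box width/height is sketched below (independent of choose_anchors.py; `boxes_wh` is assumed to be (w, h) pairs in pixels at the training resolution):

```python
import numpy as np

def kmeans_anchors(boxes_wh, k=12, iters=100, seed=0):
    """Cluster (w, h) box sizes into k anchors with plain k-means."""
    rng = np.random.default_rng(seed)
    boxes_wh = np.asarray(boxes_wh, dtype=np.float64)
    centers = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to the nearest anchor (Euclidean distance in w/h space)
        d = np.linalg.norm(boxes_wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([
            boxes_wh[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # sort anchors by area, as YOLO-style cfgs usually list them
    return centers[np.argsort(centers.prod(axis=1))]
```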
With multiple GPUs:
CUDA_VISIBLE_DEVICES=2,3 python train.py --backbone 'swin' --cfg 'cfg/yolov3_1088x608_newanchor3-swin_t.cfg'
After training, you can test the model by:
python track.py --cfg 'cfg/yolov3_1088x608_newanchor3-swin_t.cfg' --weights 'weights/vis_40Epochs_anchor3_lr3e-4_swin_wd1e-2/best_mAP.pt' --test_visdrone --byte_track --save-images
Generally you need to modify the weights path. If you don't want to use ByteTrack or save images, remove the '--byte_track' and '--save-images' flags.
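For reference, MOTA/MOTP/IDF1 as reported above can be computed with py-motmetrics; a minimal self-contained example with made-up boxes:

```python
import motmetrics as mm

acc = mm.MOTAccumulator(auto_id=True)

# One frame: 2 ground-truth objects, 2 hypotheses, boxes as [x, y, w, h].
gt_ids, hyp_ids = [1, 2], [1, 2]
gt_boxes = [[10, 10, 20, 40], [60, 10, 20, 40]]
hyp_boxes = [[12, 11, 20, 40], [61, 12, 20, 40]]
dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update(gt_ids, hyp_ids, dists)

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=['mota', 'motp', 'idf1'], name='demo')
print(summary)
```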
For more details, check run_JDE.txt.
Appendix:
JDE annotation format (see JDE):
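From the original JDE (Towards-Realtime-MOT) repo: each image `.../images/<name>.jpg` has a label file `.../labels_with_ids/<name>.txt`, with one line per box in the form `class identity x_center/img_w y_center/img_h w/img_w h/img_h`, all box values normalized to [0, 1], e.g. `0 17 0.532 0.621 0.043 0.089` (the numbers here are only an illustration). The path files generated above simply list the image paths, one per line.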