By Yutong Lin*, Yuhui Yuan*, Zheng Zhang*, Chen Li, Nanning Zheng and Han Hu*
This repo is the official implementation of "DETR Doesn’t Need Multi-Scale or Locality Design".
We present an improved DETR detector that maintains a “plain” nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that re-introduce architectural inductive biases of multi-scale and locality into the decoder.
We show that two simple techniques are surprisingly effective within a plain design: 1) a box-to-pixel relative position bias (BoxRPB) term that guides each query to attend to the corresponding object region; 2) masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and reduces the dependence on multi-scale feature maps.
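To make the BoxRPB idea more concrete, here is a minimal pure-Python sketch. It is not the repository's implementation. It assumes the bias is split into an x-axis term and a y-axis term that are summed, and it uses a hand-written rule, `toy_axis_bias` (hypothetical), where the real model would use a small learned network. All function names and the example box are illustrative.

```python
import math


def axial_box_bias(box, width, height, axis_bias):
    """Return a height x width bias map for one box.

    box is (x1, y1, x2, y2) in pixel units. axis_bias(d1, d2) is a
    hypothetical stand-in for a learned per-axis network: it receives the
    offsets from a pixel coordinate to the two box edges on that axis.
    """
    x1, y1, x2, y2 = box
    bias_x = [axis_bias(x1 - x, x2 - x) for x in range(width)]
    bias_y = [axis_bias(y1 - y, y2 - y) for y in range(height)]
    # Assumed axial decomposition: the 2D bias is the sum of both terms.
    return [[bias_x[x] + bias_y[y] for x in range(width)] for y in range(height)]


def toy_axis_bias(d1, d2):
    # Toy rule, not a trained network: zero while the pixel lies between
    # the two edges, increasingly negative with distance outside them.
    if d1 <= 0 <= d2:
        return 0.0
    return -2.0 * min(abs(d1), abs(d2))


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_bias(logits, bias):
    """Add the bias to flattened content logits, then normalise."""
    flat_bias = [b for row in bias for b in row]
    return softmax([l + b for l, b in zip(logits, flat_bias)])


width, height = 8, 6
box = (2, 1, 5, 3)  # illustrative box, inclusive pixel coordinates
bias = axial_box_bias(box, width, height, toy_axis_bias)
uniform_logits = [0.0] * (width * height)  # content term held constant
weights = attention_with_bias(uniform_logits, bias)
inside = sum(
    weights[y * width + x]
    for y in range(height)
    for x in range(width)
    if box[0] <= x <= box[2] and box[1] <= y <= box[3]
)
print(f"attention mass inside the box: {inside:.3f}")
```

With the content logits held constant, the bias alone moves most of the attention weight onto pixels inside the box, which is the guiding effect described above.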
BoxRPB | MIM PT. | Reparam. | AP | Paper Position | CFG | CKPT |
---|---|---|---|---|---|---|
✗ | ✗ | ✗ | 37.2 | Tab2 Exp1 | cfg | ckpt |
✓ | ✗ | ✗ | 46.1 | Tab2 Exp2 | cfg | ckpt |
✓ | ✓ | ✗ | 48.7 | Tab2 Exp5 | cfg | ckpt |
✓ | ✓ | ✓ | 50.9 | Tab2 Exp6 | cfg | ckpt |
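The per-component gains implied by the table can be read off with a few lines of arithmetic. The snippet below only restates the AP values listed above; the row labels are shorthand chosen for this example.

```python
# AP values copied from the results table (Tab2 Exp1, Exp2, Exp5, Exp6).
results = [
    ("baseline", 37.2),
    ("+ BoxRPB", 46.1),
    ("+ MIM PT.", 48.7),
    ("+ Reparam.", 50.9),
]


def ap_gains(rows):
    """Return (label, gain over the previous row) for each added component."""
    return [
        (label, round(ap - prev_ap, 1))
        for (_, prev_ap), (label, ap) in zip(rows, rows[1:])
    ]


gains = ap_gains(results)
for label, gain in gains:
    print(f"{label}: +{gain} AP")

total_gain = round(results[-1][1] - results[0][1], 1)
print(f"total: +{total_gain} AP")
```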
# create conda environment
conda create -n plain_detr python=3.8 -y
conda activate plain_detr
# install pytorch (other versions may also work)
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt
We have tested with the Docker image superbench/dev:cuda11.8. Other Docker images may also work.
# run docker
sudo docker run -it -p 8022:22 -d --name=plain_detr --privileged --net=host --ipc=host --gpus=all -v /:/data superbench/dev:cuda11.8 bash
sudo docker exec -it plain_detr bash
# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt
Please download the COCO 2017 dataset and organize it as follows:
code_root/
└── data/
    └── coco/
        ├── train2017/
        ├── val2017/
        └── annotations/
            ├── instances_train2017.json
            └── instances_val2017.json
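Before launching training, it can help to confirm that the dataset sits where the layout above expects it. The following is a small sketch for that check. The helper `missing_coco_paths` is hypothetical and not part of the repository; it only tests that the listed paths exist, not that their contents are valid.

```python
from pathlib import Path

# Relative paths taken from the directory layout above.
EXPECTED_COCO_PATHS = [
    "train2017",
    "val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
]


def missing_coco_paths(code_root):
    """Return the expected entries that are absent under <code_root>/data/coco."""
    coco_root = Path(code_root) / "data" / "coco"
    return [rel for rel in EXPECTED_COCO_PATHS if not (coco_root / rel).exists()]


# Run from the repository root; "." is assumed to be code_root.
missing = missing_coco_paths(".")
if missing:
    print("Missing under data/coco:", ", ".join(missing))
else:
    print("COCO layout looks complete.")
```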
Please run the following script to download the supervised and masked-image-modeling pretrained models.
(We adopt Swin Transformer v2 as the default backbone. If you are interested in the pre-training, please refer to Swin Transformer v2 (paper, github) and SimMIM (paper, github) for more details.)
bash tools/prepare_pt_model.sh
GPUS_PER_NODE=<num gpus> ./tools/run_dist_launch.sh <num gpus> <path to config file>
On each node, run the following script:
MASTER_ADDR=<master node IP address> GPUS_PER_NODE=<num gpus> NODE_RANK=<rank> ./tools/run_dist_launch.sh <num gpus> <path to config file>
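When several nodes are involved, writing out the launch line for each rank by hand is error-prone. The sketch below formats the multi-node command shown above for each node. The helper `node_launch_command` and all example values are hypothetical. It also assumes that the positional `<num gpus>` argument is the total GPU count across nodes, which this README does not state; check `tools/run_dist_launch.sh` before relying on it.

```python
import shlex


def node_launch_command(master_addr, gpus_per_node, node_rank, num_gpus, config_path):
    """Format the multi-node launch line for one node.

    num_gpus fills the script's positional <num gpus> slot. Treating it as
    the total GPU count across all nodes is an assumption of this sketch.
    """
    env = (
        f"MASTER_ADDR={shlex.quote(master_addr)} "
        f"GPUS_PER_NODE={gpus_per_node} "
        f"NODE_RANK={node_rank}"
    )
    return f"{env} ./tools/run_dist_launch.sh {num_gpus} {shlex.quote(config_path)}"


# Placeholder values for two nodes with 8 GPUs each; not repository settings.
for rank in range(2):
    print(node_launch_command("10.0.0.1", 8, rank, 16, "path/to/config"))
```

The function only builds strings; each printed line still has to be run on the corresponding node.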
To evaluate a plain-detr model, please run the following script:
<path to config file> --eval --resume <path to plain-detr model>
You could also use ./tools/run_dist_launch.sh to evaluate a model on multiple GPUs.
- While we have eliminated multi-scale designs for the backbone output and decoder input, the generation of proposals still depends on multi-scale features. We have performed trials using single-scale features for proposals (not included in the paper), but this led to a performance drop of about 1 mAP.
- Most of our experiments were conducted on 16 GPUs with 1 image per GPU. We have tested our released checkpoints with a larger batch size and found that the performance of the first three models drops significantly. We are now reviewing our implementation and will update our code to support larger batch sizes for both training and inference.
If you find Plain-DETR useful in your research, please consider citing:
@inproceedings{lin2023detr,
title={DETR Does Not Need Multi-Scale or Locality Design},
author={Lin, Yutong and Yuan, Yuhui and Zhang, Zheng and Li, Chen and Zheng, Nanning and Hu, Han},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={6545--6554},
year={2023}
}