Skip to content

Latest commit

 

History

History
129 lines (89 loc) · 6.11 KB

README.md

File metadata and controls

129 lines (89 loc) · 6.11 KB

Plain-DETR

By Yutong Lin*, Yuhui Yuan*, Zheng Zhang*, Chen Li, Nanning Zheng and Han Hu*

This repo is the official implementation of "DETR Doesn’t Need Multi-Scale or Locality Design".

Introduction

We present an improved DETR detector that maintains a “plain” nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that re-introduce architectural inductive biases of multi-scale and locality into the decoder.

We show that two simple technologies are surprisingly effective within a plain design: 1) a box-to-pixel relative position bias (BoxRPB) term to guide each query to attend to the corresponding object region; 2) masked image modeling (MIM)-based backbone pre-training to help learn representation with fine-grained localization ability and to remedy dependencies on the multi-scale feature maps.

Main Results

BoxRPB MIM PT. Reparam. AP Paper Position CFG CKPT
37.2 Tab2 Exp1 cfg ckpt
46.1 Tab2 Exp2 cfg ckpt
48.7 Tab2 Exp5 cfg ckpt
50.9 Tab2 Exp6 cfg ckpt

Installation

Conda

# create conda environment
conda create -n plain_detr python=3.8 -y
conda activate plain_detr

# install pytorch (other versions may also work)
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt

Docker

We have tested with the docker image superbench/dev:cuda11.8. Other dockers may also work.

# run docker
sudo docker run -it -p 8022:22 -d --name=plain_detr --privileged --net=host --ipc=host --gpus=all -v /:/data superbench/dev:cuda11.8 bash
sudo docker exec -it plain_detr bash

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt

Usage

Dataset preparation

Please download COCO 2017 dataset and organize them as following:

code_root/
└── data/
    └── coco/
        ├── train2017/
        ├── val2017/
        └── annotations/
        	├── instances_train2017.json
        	└── instances_val2017.json

Pretrained models preparation

Please run the following script to download supervised and mask-image-modeling pretrained models.

(We adopts Swin Transformer v2 as the default backbone. If you are interested in the pretraining, please refer to Swin Transformer v2 (paper, github) and SimMIM (paper, github) for more details.)

bash tools/prepare_pt_model.sh

Training

Training on single node

GPUS_PER_NODE=<num gpus> ./tools/run_dist_launch.sh <num gpus> <path to config file>

Training on multiple nodes

On each node, run the following script:

MASTER_ADDR=<master node IP address> GPUS_PER_NODE=<num gpus> NODE_RANK=<rank> ./tools/run_dist_launch.sh <num gpus> <path to config file> 

Evaluation

To evalute a plain-detr model, please run the following script:

 <path to config file> --eval --resume <path to plain-detr model>

You could also use ./tools/run_dist_launch.sh to evaluate a model on multiple GPUs.

Limitation & Discussion

  • While we have eliminated multi-scale designs for the backbone output and decoder input, the generation of proposals still depends on multi-scale features.

    We have performed trials utilizing single-scale features for proposals(not included in the paper), but it led to ~1 mAP performance drop.

Known issues

  • Most of our experiments are conducted on 16 GPUs with 1 image per GPU. We have tested our released checkpoints with larger batch size and found that the performance of first three models drops significantly.

    We are now reviewing our implementation and will update our code to support larger batch size for both training and inference.

Citing Plain-DETR

If you find Plain-DETR useful in your research, please consider citing:

inproceedings{lin2023detr,
  title={DETR Does Not Need Multi-Scale or Locality Design},
  author={Lin, Yutong and Yuan, Yuhui and Zhang, Zheng and Li, Chen and Zheng, Nanning and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6545--6554},
  year={2023}
}