This is the official PyTorch implementation of Exploring Target Representations for Masked Autoencoders.
- January 2024 - The paper is accepted by ICLR 2024.
- November 2022 - Release the code and pre-trained models.
- September 2022 - Release the pre-print on arXiv.
For installation and preparation, please follow MAE and iBOT. This repo is built upon `python==3.6`, `timm==0.4.12`, and `pytorch==1.9.0`.
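As a quick sanity check before training, the pinned versions above can be compared against the active environment. The snippet below is a minimal sketch and not part of this repository: the helper names `find_mismatches` and `installed_versions` are hypothetical, and it assumes that a local build suffix such as `+cu111` or a Python patch version such as `3.6.13` should still count as a match.

```python
import sys

# Versions stated in this README; the checker itself is only an illustrative helper.
EXPECTED = {"python": "3.6", "timm": "0.4.12", "torch": "1.9.0"}


def find_mismatches(found, expected=EXPECTED):
    """Return {name: (found, expected)} for every entry whose version differs.

    A found version matches if it equals the expected string or extends it
    (e.g. "1.9.0+cu111" matches "1.9.0", and "3.6.13" matches "3.6").
    """
    mismatches = {}
    for name, want in expected.items():
        have = found.get(name)
        ok = have is not None and (
            have == want
            or have.startswith(want + ".")
            or have.startswith(want + "+")
        )
        if not ok:
            mismatches[name] = (have, want)
    return mismatches


def installed_versions():
    """Collect the versions visible to the current interpreter (None if missing)."""
    found = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in ("timm", "torch"):
        try:
            module = __import__(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, (have, want) in find_mismatches(installed_versions()).items():
        print("{}: found {}, README states {}".format(name, have, want))
```

Running the file prints one line per mismatch and nothing when the environment agrees with the versions listed above.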
See pre-training instruction for details.
See downstream instruction for details.
We provide the pre-trained model (pt. model) and the fine-tuned model (ft. model) of dBOT for each experimental setup. You can download the pre-trained models for downstream tasks. A ✓ in the asym. enc-dec column denotes that the decoder is appended after the encoder, with a fixed delayed mask and sin-cos position embedding. A ✘ denotes that the vanilla ViT is used, without a delayed mask and with relative position embedding.
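To make the sin-cos position embedding concrete, here is a minimal pure-Python sketch of the fixed 2D sin-cos scheme commonly used in MAE-style ViTs. It is an illustration under assumptions, not the code in this repository: the function names are hypothetical, the base of 10000 and the row-then-column channel layout are the conventional choices, and no class-token slot is included.

```python
import math


def sincos_1d(dim, position):
    """Fixed sin-cos embedding of one scalar position; returns a list of length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying towards 1/10000.
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    angles = [position * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def sincos_2d(dim, grid_size):
    """Embeddings for a grid_size x grid_size patch grid, in row-major order.

    Half of the channels encode the row index and half encode the column index.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    table = []
    for row in range(grid_size):
        for col in range(grid_size):
            table.append(sincos_1d(dim // 2, row) + sincos_1d(dim // 2, col))
    return table
```

Because the table depends only on the grid size and the embedding width, it can be computed once and kept fixed (not learned) during training, which is what "fixed" refers to in this kind of embedding.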
Arch. | Teacher | asym. enc-dec | cls. | det. | seg. | download | | |
---|---|---|---|---|---|---|---|---|
ViT-B | ViT-B | ✓ | 84.5% | 52.7 | 49.5 | pt. model | ft. model | pt. log |
ViT-B | ViT-L | ✓ | 84.6% | 53.1 | 50.1 | pt. model | ft. model | pt. log |
ViT-B | ViT-H | ✓ | 84.6% | 53.5 | 50.8 | pt. model | ft. model | pt. log |
ViT-B | CLIP-B/16 | ✘ | 85.7% | 53.6 | 52.9 | pt. model | ft. model | pt. log |
ViT-L | ViT-L | ✓ | 86.6% | 56.0 | 54.5 | pt. model | ft. model | pt. log |
ViT-L | ViT-H | ✓ | 86.8% | 56.1 | 55.2 | pt. model | ft. model | pt. log |
ViT-L | CLIP-L/14 | ✘ | 87.8% | 56.8 | 56.2 | pt. model | ft. model | pt. log |
ViT-H | ViT-H | ✓ | 87.4% | - | - | pt. model | ft. model | pt. log |
ViT-H | CLIP-L/14 | ✘ | 88.5% | - | - | pt. model | ft. model | pt. log |
ViT-H448 | ViT-H | ✓ | 88.0% | - | - | pt. model | ft. model | pt. log |
ViT-H448 | CLIP-L/14 | ✘ | 89.1% | - | - | pt. model | ft. model | pt. log |
🎯 This branch implements dBOT with the default asymmetric encoder-decoder architecture. For the symmetric architecture, in which we use CLIP as the pre-trained teacher, please see the beit branch.
To demonstrate the models' differences in terms of their weights and outputs, we conduct a property analysis using averaged attention distance and singular value decomposition. We first compute the averaged attention distance for each attention head of different Transformer blocks. The results are averaged over the IN1K validation set.
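For readers unfamiliar with the metric, the sketch below shows one common way to compute the attention distance of a single head: the attention-weighted Euclidean distance between each query patch and all key patches, averaged over queries. It is a minimal illustration rather than the analysis script used here. The function name is hypothetical, the default `patch_size=16` is an assumption, and the class token is assumed to have been dropped from the attention matrix beforehand.

```python
import math


def mean_attention_distance(attn, grid_size, patch_size=16):
    """Mean attention distance (in pixels) of one attention head.

    attn: list of rows, where attn[q][k] is the attention weight from query
          patch q to key patch k. Patches are in row-major order and each
          row is expected to sum to 1.
    """
    n = grid_size * grid_size
    if len(attn) != n or any(len(row) != n for row in attn):
        raise ValueError("attn must be an N x N matrix with N = grid_size ** 2")
    # Pixel coordinates of each patch on the image grid.
    coords = [(r * patch_size, c * patch_size)
              for r in range(grid_size) for c in range(grid_size)]
    total = 0.0
    for q in range(n):
        qy, qx = coords[q]
        # Expected distance covered by the attention of query patch q.
        total += sum(w * math.hypot(qy - ky, qx - kx)
                     for w, (ky, kx) in zip(attn[q], coords))
    return total / n
```

A head that attends only to the query patch itself yields 0, while a head that attends uniformly yields the mean pairwise patch distance. Per-head values would then be averaged over images, as described above.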
We also compute the percentage of top-k (k varying from 1 to 5) singular values of the embedding w.r.t. each layer:
The student networks distilled from differently initialized teachers exhibit similar behaviors, which indicates that the choice of teacher network does not matter when teachers are bootstrapped.
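One way to read this statistic is as the share of the total singular-value mass held by the k largest singular values of a layer's embedding matrix. The sketch below follows that reading, which is an assumption: the exact normalization used for the analysis is not stated here. The function names are hypothetical, and the singular values are assumed to have been computed already by any SVD routine.

```python
def topk_singular_value_share(singular_values, k):
    """Percentage of the summed singular values held by the k largest ones."""
    if k < 1:
        raise ValueError("k must be at least 1")
    values = sorted((float(s) for s in singular_values), reverse=True)
    total = sum(values)
    if total == 0.0:
        return 0.0
    return 100.0 * sum(values[:k]) / total


def topk_profile(singular_values, max_k=5):
    """Shares for k = 1 .. max_k, matching the range of k used in the text."""
    return [topk_singular_value_share(singular_values, k)
            for k in range(1, max_k + 1)]
```

A profile that rises quickly towards 100 suggests the embedding is dominated by a few directions, while a flatter profile suggests the information is spread across many.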
This repository is built upon the MAE repository and the iBOT repository.
This project is under the Apache 2.0 license, as found in the LICENSE file.
Please consider citing dBOT and giving a star if dBOT helps your research:
@article{liu2022exploring,
title={Exploring target representations for masked autoencoders},
author={Liu, Xingbin and Zhou, Jinghao and Kong, Tao and Lin, Xianming and Ji, Rongrong},
journal={arXiv preprint arXiv:2209.03917},
year={2022}
}