Currently, only PCL (VCP) is supported. The code for VCOP and 3DRotNet is still being refactored.
One sentence to conclude this paper: if you are developing novel pretext-task-based methods for video self-supervised learning, do not hesitate to combine them with a contrastive learning loss, which is simple to use and can boost performance.
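The core idea can be sketched in a few lines: keep the pretext-task loss unchanged and add a contrastive term over two augmented views of the same clip. The function names and the weight `lam` below are illustrative assumptions, not the repo's exact API.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the joint objective: a one-direction InfoNCE term over
# in-batch negatives is added to an arbitrary pretext-task loss.
def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (N, N) similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)

def joint_loss(pretext_loss, z1, z2, lam=1.0):
    # `lam` balances the two terms; its value here is an assumption.
    return pretext_loss + lam * info_nce(z1, z2)
```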
- This paper presents a joint optimization method for self-supervised video representation learning that achieves high performance without proposing new pretext tasks;
- The effectiveness of our proposal is validated on 3 pretext-task baselines and 4 different network backbones;
- The proposal is flexible enough to be applied to other methods.
This is the experimental environment used when preparing this demo code.
- Ubuntu 18.04.4 LTS
- conda 4.8.4
- PyTorch 1.4.0
[Warning] In other projects, we have encountered cases where a different PyTorch version (e.g., 1.7.0) produced totally different results.
- python 3.8.3
- cuda 10.1
- accimage
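If `accimage` is used, torchvision must be told to switch from PIL to the accimage backend; a minimal sketch (assuming accimage is already installed, e.g. from conda-forge):

```python
import torchvision

# Use accimage for faster JPEG decoding; torchvision falls back to PIL
# if this line is omitted.
torchvision.set_image_backend('accimage')
```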
We used resized RGB frames from this repo; frames for videos in the UCF101 and HMDB51 datasets can be downloaded directly, without decoding.
Tip: there is a folder called `TSP_Flows` inside the `v_LongJump_j18_c03` folder in the UCF101 dataset, and you may run into problems if you do not handle it. One solution is to delete this folder.
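For example, a small cleanup script along these lines (the dataset root is a placeholder you must adapt):

```python
import shutil
from pathlib import Path

# Remove the stray TSP_Flows folder shipped inside v_LongJump_j18_c03;
# adjust dataset_root to where you extracted the frames.
dataset_root = Path('/path/to/dataset/jpegs_256')
stray = dataset_root / 'v_LongJump_j18_c03' / 'TSP_Flows'
if stray.is_dir():
    shutil.rmtree(stray)
```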
The folder structure looks like `path/to/dataset/jpegs_256/video_id/frames.jpg`.
Then, edit `datasets/ucf101.py` and `datasets/hmdb51.py` to specify the dataset path: change `*_dataset_path` on line 19.
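The edit is a one-line change; a hypothetical example (both the variable value and the path are placeholders):

```python
# datasets/ucf101.py, line 19 -- point this at your local frame folder
ucf101_dataset_path = '/path/to/dataset/jpegs_256'
```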
```
python train_vcp_contrast.py
```
Default settings are:
- Method: PCL (VCP)
- Backbone: R3D
- Modality: Res
- Augmentation: RandomCrop, ColorJitter, RandomGrayScale, GaussianBlur, RandomHorizontalFlip
- Dataset: UCF101
- Split: 1
These settings are also used for the following steps, so we do not need to specify `--model=r3d --modality=res`.
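For reference, `Res` denotes the residual-frame modality, i.e., per-pixel differences between consecutive RGB frames. Below is a minimal sketch of the residual computation and the listed augmentations; all parameter values are illustrative assumptions rather than the repo's exact settings, and `transforms.GaussianBlur` requires torchvision >= 0.8 (older setups typically implement the blur themselves).

```python
import torch
from torchvision import transforms

# "Res" modality: per-pixel difference of consecutive frames.
# `clip` is a (T, C, H, W) float tensor; the result has T-1 frames.
def to_residual(clip: torch.Tensor) -> torch.Tensor:
    return clip[1:] - clip[:-1]

# Illustrative version of the listed augmentation pipeline
# (crop size, jitter strengths, and probabilities are assumptions).
train_transform = transforms.Compose([
    transforms.RandomCrop(112),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=7),
    transforms.RandomHorizontalFlip(),
])
```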
Training takes around 33 hours on one V100 in our experimental environment.
Models will be saved to `./logs/exp_name`, where `exp_name` is generated directly from the corresponding settings.
```
python retrieve_clips.py --ckpt=/path/to/ssl/best_model --dataset=ucf101
python ft_classify.py --ckpt=/path/to/ssl/best_model --dataset=ucf101
```
The testing process will automatically run after training is done.
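For context, the retrieval numbers below follow the usual nearest-neighbor protocol; here is a minimal sketch of how Top-k retrieval accuracy can be computed from extracted features (our reading of the common protocol, not necessarily the repo's exact code):

```python
import torch
import torch.nn.functional as F

# Clip retrieval: test-set features query the training set; a query counts
# as a Top-k hit if any of its k nearest training clips shares its class.
def topk_retrieval(train_feat, train_lbl, test_feat, test_lbl,
                   ks=(1, 5, 10, 20, 50)):
    train_feat = F.normalize(train_feat, dim=1)
    test_feat = F.normalize(test_feat, dim=1)
    sim = test_feat @ train_feat.t()               # cosine similarities
    nn_idx = sim.topk(max(ks), dim=1).indices      # indices of nearest clips
    nn_lbl = train_lbl[nn_idx]                     # labels of nearest clips
    hits = nn_lbl == test_lbl.unsqueeze(1)
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```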
Our PCL outperforms a set of methods by a large margin. Here we list video retrieval results using R3D-18 as the network backbone. For more results, please refer to our paper.
Methods | Backbone | Top1 | Top5 | Top10 | Top20 | Top50 |
---|---|---|---|---|---|---|
Random | R3D-18 | 15.3 | 25.1 | 32.1 | 40.8 | 53.7 |
3DRotNet | R3D-18 | 14.2 | 25.2 | 33.5 | 43.7 | 59.5 |
VCP | R3D-18 | 22.1 | 33.8 | 42.0 | 51.3 | 64.7 |
RTT | R3D-18 | 26.1 | 48.5 | 59.1 | 69.6 | 82.8 |
PacePred | R3D-18 | 23.8 | 38.1 | 46.4 | 56.6 | 69.8 |
IIC | R3D-18 | 36.8 | 54.1 | 63.1 | 72.0 | 83.3 |
PCL (3DRotNet) | R3D-18 | 33.7 | 53.5 | 64.1 | 73.4 | 85.0 |
PCL (VCP) | R3D-18 | 55.1 | 71.2 | 78.9 | 85.5 | 92.3 |
The table below lists action recognition results on the UCF101 and HMDB51 datasets; results for other methods are taken from the corresponding papers. Because this is the most widely used metric, we report results with 4 different network backbones.
The table does not include methods that use additional data modalities such as sound and text.
Method | Date | Pre-train | ClipSize | Network | UCF | HMDB |
---|---|---|---|---|---|---|
OPN | 2017 | UCF | 227x227 | VGG | 59.6 | 23.8 |
DPC | 2019 | K400 | 16x224x224 | R3D-34 | 75.7 | 35.7 |
CBT | 2019 | K600+ | 16x112x112 | S3D | 79.5 | 44.6 |
SpeedNet | 2020 | K400 | 64x224x224 | S3D-G | 81.1 | 48.8 |
MemDPC | 2020 | K400 | 40x224x224 | R-2D3D | 78.1 | 41.2 |
3D-RotNet | 2018 | K400 | 16x112x112 | R3D-18 | 62.9 | 33.7 |
ST-Puzzle | 2019 | K400 | 16x112x112 | R3D-18 | 65.8 | 33.7 |
DPC | 2019 | K400 | 16x128x128 | R3D-18 | 68.2 | 34.5 |
RTT | 2020 | UCF | 16x112x112 | R3D-18 | 77.3 | 47.5 |
RTT | 2020 | K400 | 16x112x112 | R3D-18 | 79.3 | 49.8 |
PCL (3DRotNet) | 2021 | UCF | 16x112x112 | R3D-18 | 82.8 | 47.2 |
PCL (VCP) | 2021 | UCF | 16x112x112 | R3D-18 | 83.4 | 48.8 |
PCL (VCP) | 2021 | K400 | 16x112x112 | R3D-18 | 85.6 | 48.0 |
VCOP | 2019 | UCF | 16x112x112 | R3D | 64.9 | 29.5 |
VCP | 2020 | UCF | 16x112x112 | R3D | 66.0 | 31.5 |
PRP | 2020 | UCF | 16x112x112 | R3D | 66.5 | 29.7 |
IIC | 2020 | UCF | 16x112x112 | R3D | 74.4 | 38.3 |
PCL (VCOP) | 2021 | UCF | 16x112x112 | R3D | 78.2 | 40.5 |
PCL (VCP) | 2021 | UCF | 16x112x112 | R3D | 81.1 | 45.0 |
VCOP | 2019 | UCF | 16x112x112 | C3D | 65.6 | 28.4 |
VCP | 2020 | UCF | 16x112x112 | C3D | 68.5 | 32.5 |
PRP | 2020 | UCF | 16x112x112 | C3D | 69.1 | 34.5 |
RTT | 2020 | K400 | 16x112x112 | C3D | 69.9 | 39.6 |
PCL (VCOP) | 2021 | UCF | 16x112x112 | C3D | 79.8 | 41.8 |
PCL (VCP) | 2021 | UCF | 16x112x112 | C3D | 81.4 | 45.2 |
VCOP | 2019 | UCF | 16x112x112 | R(2+1)D | 72.4 | 30.9 |
VCP | 2020 | UCF | 16x112x112 | R(2+1)D | 66.3 | 32.2 |
PRP | 2020 | UCF | 16x112x112 | R(2+1)D | 72.1 | 35.0 |
RTT | 2020 | UCF | 16x112x112 | R(2+1)D | 81.6 | 46.4 |
PacePred | 2020 | UCF | 16x112x112 | R(2+1)D | 75.9 | 35.9 |
PacePred | 2020 | K400 | 16x112x112 | R(2+1)D | 77.1 | 36.6 |
PCL (VCOP) | 2021 | UCF | 16x112x112 | R(2+1)D | 79.2 | 41.6 |
PCL (VCP) | 2021 | UCF | 16x112x112 | R(2+1)D | 79.9 | 45.6 |
PCL (VCP) | 2021 | K400 | 16x112x112 | R(2+1)D | 85.7 | 47.4 |
If you find our work helpful for your research, please consider citing the paper:
```
@article{tao2021pcl,
  title={Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2010.15464},
  year={2021}
}
```
If you find the residual input helpful for video-related tasks, please consider citing the following papers:
```
@article{tao2020rethinking,
  title={Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2001.05661},
  year={2020}
}

@inproceedings{tao2020motion,
  title={Motion Representation Using Residual Frames with 3D CNN},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  booktitle={2020 IEEE International Conference on Image Processing (ICIP)},
  pages={1786--1790},
  year={2020},
  organization={IEEE}
}
```