Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Learning

arXiv paper: https://arxiv.org/abs/2010.15464

This repository currently supports PCL (VCP). Code for the VCOP and 3DRotNet variants is still being refactored.

To summarize the paper in one sentence: if you are developing novel pretext-task-based methods for self-supervised video representation learning, do not hesitate to combine them with a contrastive learning loss, which is simple to use and can boost performance.
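As a rough illustration of this idea, here is a conceptual sketch of such a joint objective in PyTorch (the loss weighting, temperature, and exact contrastive formulation are illustrative assumptions, not the code of this repository):

import torch
import torch.nn.functional as F

def infonce_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two differently augmented clips of the same videos
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)            # positive pairs lie on the diagonal

def pcl_loss(pretext_logits, pretext_labels, z1, z2, weight=1.0):
    # joint objective: pretext-task loss (e.g., VCP classification) + weighted contrastive loss
    pretext = F.cross_entropy(pretext_logits, pretext_labels)
    return pretext + weight * infonce_loss(z1, z2)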

Highlights

  1. This paper presents a joint optimization method for self-supervised video representation learning, which achieves high performance without proposing new pretext tasks;
  2. The effectiveness of our proposal is validated with 3 pretext-task baselines and 4 different network backbones;
  3. The proposal is flexible enough to be applied to other methods.

Requirements

This is the experimental environment used when preparing this demo code:

  • Ubuntu 18.04.4 LTS
  • conda 4.8.4
  • PyTorch 1.4.0
  • python 3.8.3
  • cuda 10.1
  • accimage

[Warning] In other projects, we have found that a different PyTorch version (e.g., 1.7.0) can lead to totally different results.

Usage

Data preparation

I used the resized RGB frames from this repo. Frames for the UCF101 and HMDB51 datasets can be downloaded directly, without decoding the videos yourself.

Tip: there is a folder called TSP_Flows inside the v_LongJump_j18_c03 folder of the UCF101 frames, and you may run into problems if you do not handle it. One solution is simply to delete this folder.

The folder structure should look like path/to/dataset/jpegs_256/video_id/frames.jpg.

Then, edit datasets/ucf101.py and datasets/hmdb51.py to specify the dataset path: change *_dataset_path in line #19.
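As a rough illustration (the variable name below is an assumption following the *_dataset_path pattern, not the literal file contents), the edited line would look something like this:

# datasets/ucf101.py, around line #19 (illustrative sketch; variable name assumed)
ucf101_dataset_path = '/path/to/dataset/jpegs_256/'  # frames at <root>/<video_id>/<frame>.jpg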

Training the self-supervised learning part with our PCL

python train_vcp_contrast.py

Default settings are

  • Method: PCL (VCP)
  • Backbone: R3D
  • Modality: Res
  • Augmentation: RandomCrop, ColorJitter, RandomGrayScale, GaussianBlur, RandomHorizontalFlip
  • Dataset: UCF101
  • Split: 1

These settings are also kept fixed for the following steps, so there is no need to specify --model=r3d --modality=res explicitly.
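For reference, spelling out these two defaults on the command line would look like the following; the remaining defaults listed above are assumed to be handled inside the script and are not shown here.

python train_vcp_contrast.py --model=r3d --modality=res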

Training takes around 33 hours on one V100 in our experimental environment.

Models will be saved to ./logs/exp_name, where exp_name is generated directly from the corresponding settings.

Evaluation using video retrieval

python retrieve_clips.py --ckpt=/path/to/ssl/best_model --dataset=ucf101
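For clarity, below is a minimal sketch of how top-k clip retrieval is typically evaluated; this illustrates the metric only and is not the actual retrieve_clips.py implementation (feature extraction by the pretrained backbone is assumed to have been done beforehand).

import numpy as np

def topk_retrieval_accuracy(train_feats, train_labels, test_feats, test_labels,
                            ks=(1, 5, 10, 20, 50)):
    # cosine similarity between every test clip and every training clip
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test_feats @ train_feats.T                  # (num_test, num_train)
    ranked = np.argsort(-sims, axis=1)                 # neighbours, most similar first
    hits = train_labels[ranked] == test_labels[:, None]
    # a test clip counts as correct at top-k if any of its k nearest training clips shares its label
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}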

Fine-tuning on video recognition

python ft_classify.py --ckpt=/path/to/ssl/best_model --dataset=ucf101

The testing process will run automatically after fine-tuning is done.
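As a rough sketch of what fine-tuning from a self-supervised checkpoint involves (the checkpoint format and key names below are assumptions; ft_classify.py handles this itself and may differ):

import torch

def load_ssl_backbone(model, ckpt_path):
    # load the self-supervised checkpoint; some checkpoints wrap the weights in 'state_dict'
    state = torch.load(ckpt_path, map_location='cpu')
    if isinstance(state, dict) and 'state_dict' in state:
        state = state['state_dict']
    model_state = model.state_dict()
    # keep only backbone weights whose name and shape match the recognition model,
    # so pretext/contrastive heads are dropped and a fresh classifier head is trained
    matched = {k: v for k, v in state.items()
               if k in model_state and v.shape == model_state[k].shape}
    model.load_state_dict(matched, strict=False)
    return model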

Results

Video retrieval

Our PCL outperforms a range of methods by a large margin. Here we list results using R3D-18 (3D ResNet-18) as the network backbone; all numbers are top-k retrieval accuracy (%). For more results, please refer to our paper.

| Method          | Backbone | Top1 | Top5 | Top10 | Top20 | Top50 |
|-----------------|----------|------|------|-------|-------|-------|
| Random          | R3D-18   | 15.3 | 25.1 | 32.1  | 40.8  | 53.7  |
| 3DRotNet        | R3D-18   | 14.2 | 25.2 | 33.5  | 43.7  | 59.5  |
| VCP             | R3D-18   | 22.1 | 33.8 | 42.0  | 51.3  | 64.7  |
| RTT             | R3D-18   | 26.1 | 48.5 | 59.1  | 69.6  | 82.8  |
| PacePred        | R3D-18   | 23.8 | 38.1 | 46.4  | 56.6  | 69.8  |
| IIC             | R3D-18   | 36.8 | 54.1 | 63.1  | 72.0  | 83.3  |
| PCL (3DRotNet)  | R3D-18   | 33.7 | 53.5 | 64.1  | 73.4  | 85.0  |
| PCL (VCP)       | R3D-18   | 55.1 | 71.2 | 78.9  | 85.5  | 92.3  |

Video recognition

The table below lists action recognition accuracy (%) on the UCF101 and HMDB51 datasets; results for other methods are taken from the corresponding papers. Because this is the most widely used evaluation, we report results with 4 different network backbones.

Methods that use other data modalities, such as sound or text, are not included in this table.

| Method          | Date | Pre-train | ClipSize   | Network | UCF  | HMDB |
|-----------------|------|-----------|------------|---------|------|------|
| OPN             | 2017 | UCF       | 227x227    | VGG     | 59.6 | 23.8 |
| DPC             | 2019 | K400      | 16x224x224 | R3D-34  | 75.7 | 35.7 |
| CBT             | 2019 | K600+     | 16x112x112 | S3D     | 79.5 | 44.6 |
| SpeedNet        | 2020 | K400      | 64x224x224 | S3D-G   | 81.1 | 48.8 |
| MemDPC          | 2020 | K400      | 40x224x224 | R-2D3D  | 78.1 | 41.2 |
| 3D-RotNet       | 2018 | K400      | 16x112x112 | R3D-18  | 62.9 | 33.7 |
| ST-Puzzle       | 2019 | K400      | 16x112x112 | R3D-18  | 65.8 | 33.7 |
| DPC             | 2019 | K400      | 16x128x128 | R3D-18  | 68.2 | 34.5 |
| RTT             | 2020 | UCF       | 16x112x112 | R3D-18  | 77.3 | 47.5 |
| RTT             | 2020 | K400      | 16x112x112 | R3D-18  | 79.3 | 49.8 |
| PCL (3DRotNet)  | -    | UCF       | 16x112x112 | R3D-18  | 82.8 | 47.2 |
| PCL (VCP)       | -    | UCF       | 16x112x112 | R3D-18  | 83.4 | 48.8 |
| PCL (VCP)       | -    | K400      | 16x112x112 | R3D-18  | 85.6 | 48.0 |
| VCOP            | 2019 | UCF       | 16x112x112 | R3D     | 64.9 | 29.5 |
| VCP             | 2020 | UCF       | 16x112x112 | R3D     | 66.0 | 31.5 |
| PRP             | 2020 | UCF       | 16x112x112 | R3D     | 66.5 | 29.7 |
| IIC             | 2020 | UCF       | 16x112x112 | R3D     | 74.4 | 38.3 |
| PCL (VCOP)      | -    | UCF       | 16x112x112 | R3D     | 78.2 | 40.5 |
| PCL (VCP)       | -    | UCF       | 16x112x112 | R3D     | 81.1 | 45.0 |
| VCOP            | 2019 | UCF       | 16x112x112 | C3D     | 65.6 | 28.4 |
| VCP             | 2020 | UCF       | 16x112x112 | C3D     | 68.5 | 32.5 |
| PRP             | 2020 | UCF       | 16x112x112 | C3D     | 69.1 | 34.5 |
| RTT             | 2020 | K400      | 16x112x112 | C3D     | 69.9 | 39.6 |
| PCL (VCOP)      | -    | UCF       | 16x112x112 | C3D     | 79.8 | 41.8 |
| PCL (VCP)       | -    | UCF       | 16x112x112 | C3D     | 81.4 | 45.2 |
| VCOP            | 2019 | UCF       | 16x112x112 | R(2+1)D | 72.4 | 30.9 |
| VCP             | 2020 | UCF       | 16x112x112 | R(2+1)D | 66.3 | 32.2 |
| PRP             | 2020 | UCF       | 16x112x112 | R(2+1)D | 72.1 | 35.0 |
| RTT             | 2020 | UCF       | 16x112x112 | R(2+1)D | 81.6 | 46.4 |
| PacePred        | 2020 | UCF       | 16x112x112 | R(2+1)D | 75.9 | 35.9 |
| PacePred        | 2020 | K400      | 16x112x112 | R(2+1)D | 77.1 | 36.6 |
| PCL (VCOP)      | -    | UCF       | 16x112x112 | R(2+1)D | 79.2 | 41.6 |
| PCL (VCP)       | -    | UCF       | 16x112x112 | R(2+1)D | 79.9 | 45.6 |
| PCL (VCP)       | -    | K400      | 16x112x112 | R(2+1)D | 85.7 | 47.4 |

Citation

If you find our work helpful for your research, please consider citing the paper

@article{tao2021pcl,
  title={Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Learning},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2010.15464},
  year={2021}
}

If you find the residual input helpful for video-related tasks, please consider citing the following papers:

@article{tao2020rethinking,
  title={Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2001.05661},
  year={2020}
}

@inproceedings{tao2020motion,
  title={Motion Representation Using Residual Frames with 3D CNN},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  booktitle={2020 IEEE International Conference on Image Processing (ICIP)},
  pages={1786--1790},
  year={2020},
  organization={IEEE}
}

Acknowledgements

Part of this code is reused from IIC and VCOP.
