TimeSformer Video Classification Model

Content

Introduction
Data
Train
Test
Inference
Reference

Introduction

We have improved the TimeSformer model and obtained PP-TimeSformer, a more accurate and practical 2D video classification model. Without increasing the number of parameters or the amount of computation, its accuracy on UCF-101, Kinetics-400 and other data sets exceeds that of the original version. On Kinetics-400, Top-1 accuracy is 0.62 points above the mmaction2 result for the base model and 1.57 points above it with distillation and 16 frames, as shown in the table below.

| Version | Top1 |
| :------ | :----: |
| Ours (distill+16frame) | 79.49 |
| Ours (distill) | 78.82 |
| Ours | 78.54 |
| mmaction2 | 77.92 |

Data

For K400 data download and preparation, please refer to Kinetics-400 data preparation.

For UCF101 data download and preparation, please refer to UCF-101 data preparation.

Train

Kinetics-400 data set training

Download and add pre-trained models

Download the image pre-trained model ViT_base_patch16_224_miil_21k.pdparams to use as the backbone initialization parameters, or download the weights with wget:
```
wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ViT_base_patch16_224_pretrained.pdparams
```

Open PaddleVideo/configs/recognition/pptimesformer/pptimesformer_k400_videos.yaml and fill in the path of the downloaded weights under pretrained:

MODEL:
    framework: "RecognizerTransformer"
    backbone:
        name: "VisionTransformer_tweaks"
        pretrained: fill in the path here
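If you prefer to fill in the path from a script rather than by hand, the edit above can be sketched as a small text substitution. This is only an illustration: `set_pretrained` is a hypothetical helper, not part of PaddleVideo, and it assumes the first `pretrained:` key in the file is the backbone one, as in the fragment above.

```python
import re

# Hypothetical helper (not a PaddleVideo API): rewrite the first `pretrained:` line.
_PRETRAINED = re.compile(r"^([ \t]*pretrained:).*$", flags=re.MULTILINE)


def set_pretrained(yaml_text, weight_path):
    """Return yaml_text with the first `pretrained:` value set to weight_path.

    Indentation is preserved; the rest of the file is left untouched.
    """
    return _PRETRAINED.sub(
        lambda m: f'{m.group(1)} "{weight_path}"', yaml_text, count=1
    )


config = (
    "MODEL:\n"
    '    framework: "RecognizerTransformer"\n'
    "    backbone:\n"
    '        name: "VisionTransformer_tweaks"\n'
    "        pretrained: fill in the path here\n"
)
print(set_pretrained(config, "data/ViT_base_patch16_224_miil_21k.pdparams"))
```

A plain text substitution keeps comments and key order in the YAML file intact, which a load-and-dump round trip through a YAML library would not guarantee.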

Start training

Training on the Kinetics-400 data set uses 8 GPUs. The launch command is as follows:

# videos data format
python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir=log_pptimesformer main.py --validate -c configs/recognition/pptimesformer/pptimesformer_k400_videos.yaml

Enable AMP mixed-precision training to speed up the training process. The launch command is as follows:

export FLAGS_conv_workspace_size_limit=800 # MB
export FLAGS_cudnn_exhaustive_search=1
export FLAGS_cudnn_batchnorm_spatial_persistent=1
# videos data format
python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir=log_pptimesformer main.py --amp --validate -c configs/recognition/pptimesformer/pptimesformer_k400_videos.yaml

You can also customize the parameter configuration to train or test on different data sets. The recommended naming pattern for configuration files is model_dataset name_file format_data format_sampling method.yaml. Please refer to config for parameter usage.

Test

The PP-TimeSformer model is validated during training. Search the training log for the keyword best to find the validation accuracy of the best model so far. A log example is as follows:
```
Already save the best model (top1 acc)0.7258
```
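Searching the log can also be scripted. The sketch below assumes that every "best" line has exactly the shape shown in the example above; `best_top1` is a hypothetical helper, and where the log file lives depends on your `--log_dir` setting.

```python
import re

# Matches lines shaped like the log example above.
BEST_LINE = re.compile(
    r"Already save the best model \(top1 acc\)\s*([0-9]*\.?[0-9]+)"
)


def best_top1(log_lines):
    """Return the most recently reported best top-1 accuracy, or None."""
    best = None
    for line in log_lines:
        match = BEST_LINE.search(line)
        if match:
            # Later lines overwrite earlier ones: each one reports a new best.
            best = float(match.group(1))
    return best


sample_log = [
    "some unrelated training output",
    "Already save the best model (top1 acc)0.7101",
    "Already save the best model (top1 acc)0.7258",
]
print(best_top1(sample_log))
```

To use it on a real log, pass an open file object instead of `sample_log`, since iterating a file yields its lines.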

In test mode, PP-TimeSformer samples with UniformCrop, which is slightly slower but more accurate than the RandomCrop used for validation during training. The topk Acc recorded in the training log therefore does not represent the final test score. After training is completed, run the best model in test mode to obtain the final metric. The command is as follows:

python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir=log_pptimesformer main.py --test -c configs/recognition/pptimesformer/pptimesformer_k400_videos.yaml -w "output/ppTimeSformer/ppTimeSformer_best.pdparams"

When the test configuration uses the following parameters, the test indicators on the validation data set of Kinetics-400 are as follows:

| backbone | Sampling method | num_seg | target_size | Top-1 | checkpoints |
| :------- | :-------------- | :-----: | :---------: | :---: | :---------- |
| Vision Transformer | UniformCrop | 8 | 224 | 78.54 | ppTimeSformer_k400_8f.pdparams |
| Vision Transformer | UniformCrop | 8 | 224 | 78.82 | ppTimeSformer_k400_8f_distill.pdparams |
| Vision Transformer | UniformCrop | 16 | 224 | 79.49 | ppTimeSformer_k400_16f_distill.pdparams |

During testing, PP-TimeSformer uses linspace sampling. Temporally, num_seg sparse sampling points (endpoints included) are generated uniformly from the first frame to the last frame of the video sequence to be sampled. Spatially, 3 regions are sampled along the long side, at both ends and in the middle (left, middle, right or top, middle, bottom). One clip is sampled per video.
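The sampling strategy described above can be illustrated with a short pure-Python sketch. It is not the PaddleVideo implementation: the function names are hypothetical, rounding to the nearest frame is an assumption, and the crop size is assumed to equal the short side after resizing.

```python
def linspace_indices(num_frames, num_seg):
    """Return num_seg frame indices spread evenly over [0, num_frames - 1].

    Both endpoints are included. Rounding to the nearest frame is an
    assumption of this sketch.
    """
    if num_seg == 1:
        return [0]
    step = (num_frames - 1) / (num_seg - 1)
    return [int(round(i * step)) for i in range(num_seg)]


def uniform_crop_boxes(width, height, crop_size):
    """Return three (left, top, right, bottom) boxes along the long side."""
    if width >= height:
        # Landscape: left, middle, right crops, centered vertically.
        top = (height - crop_size) // 2
        lefts = [0, (width - crop_size) // 2, width - crop_size]
        return [(x, top, x + crop_size, top + crop_size) for x in lefts]
    # Portrait: top, middle, bottom crops, centered horizontally.
    left = (width - crop_size) // 2
    tops = [0, (height - crop_size) // 2, height - crop_size]
    return [(left, y, left + crop_size, y + crop_size) for y in tops]


print(linspace_indices(300, 8))
# [0, 43, 85, 128, 171, 214, 256, 299]
print(uniform_crop_boxes(320, 224, 224))
# [(0, 0, 224, 224), (48, 0, 272, 224), (96, 0, 320, 224)]
```

With num_seg set to 8 and a 300-frame video, the first and last frames are always part of the clip, and each of the 3 crops covers the full short side of the frame.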

Inference

Export inference model

python3.7 tools/export_model.py -c configs/recognition/pptimesformer/pptimesformer_k400_videos.yaml \
                                -p data/ppTimeSformer_k400_8f.pdparams \
                                -o inference/ppTimeSformer

The above command will generate the model structure file ppTimeSformer.pdmodel and the model weight file ppTimeSformer.pdiparams required for prediction.

For the meaning of each parameter, please refer to [Model Inference Method](../../start.md#2-Model Reasoning)

Inference with the prediction engine

python3.7 tools/predict.py --input_file data/example.avi \
                           --config configs/recognition/pptimesformer/pptimesformer_k400_videos.yaml \
                           --model_file inference/ppTimeSformer/ppTimeSformer.pdmodel \
                           --params_file inference/ppTimeSformer/ppTimeSformer.pdiparams \
                           --use_gpu=True \
                           --use_tensorrt=False

The output example is as follows:

Current video file: data/example.avi
        top-1 class: 5
        top-1 score: 0.9997474551200867

Using the ppTimeSformer model trained on Kinetics-400 to predict data/example.avi, the output top-1 category id is 5 with a confidence of about 0.9997. According to the category id and name correspondence table data/k400/Kinetics-400_label_list.txt, the predicted category name is archery.
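The id-to-name lookup can be sketched as follows. The sketch assumes each line of the label list holds a numeric id followed by the class name, separated by whitespace; check the actual file before relying on this. `load_labels` is a hypothetical helper, and the sample lines are illustrative rather than copied from the file.

```python
def load_labels(lines):
    """Map class id -> class name from lines shaped like '5 archery'."""
    labels = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        # Skip blank or malformed lines instead of failing.
        if len(parts) == 2 and parts[0].isdigit():
            labels[int(parts[0])] = parts[1]
    return labels


# Illustrative sample (assumed format), standing in for the real label file.
sample_lines = ["4 applying_cream", "5 archery", "6 arm_wrestling"]
labels = load_labels(sample_lines)
print(labels[5])
```

To read the real table, pass an open file object for data/k400/Kinetics-400_label_list.txt in place of `sample_lines`.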

Reference

Is Space-Time Attention All You Need for Video Understanding?, Gedas Bertasius, Heng Wang, Lorenzo Torresani
Distilling the Knowledge in a Neural Network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean
Averaging Weights Leads to Wider Optima and Better Generalization, Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov
ImageNet-21K Pretraining for the Masses, Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy
