Hate-LLaMA: An Instruction-tuned Audio-Visual Language Model for Hate Content Detection

Hate speech detection in online videos is an important but challenging problem, especially with the rise of video-sharing platforms. Existing solutions rely primarily on unimodal models focused on text or image inputs, with less emphasis on multimodal models that analyze both the visual and audio aspects of videos. We present Hate-LLaMA, an instruction-tuned audio-visual language model finetuned on HateMM, a labeled hate speech video dataset. Hate-LLaMA is a finetuned version of Video-LLaMA: it accepts video input and classifies hate speech by analyzing both visual frames and audio in a multimodal fashion. Hate-LLaMA detects hate content with an accuracy of 71%.

Another major challenge in video hate speech detection is the scarcity of labeled video datasets. To address this, we also propose a benchmark dataset of around 300 videos, consisting of 33% hate and 67% non-hate content.

Examples

[Example output: non-hate classification]

Prerequisites

Environment Setup

conda env create -f environment.yml
conda activate hatellama
pip install -r requirements.txt

Checkpoints and dataset

Download the checkpoints below and move them to the /ckpt folder.

  • Download meta-llama/Llama-2-7b-chat-hf from Hugging Face (see the sketch after this list).

  • Download the checkpoints for the finetuned audio and video branches of Hate-LLaMA and the ImageBind encoder from here.

  • To download our curated benchmark, click here

  • For the HateMM dataset, please refer to the HateMM repository.
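
For reference, the LLaMA-2 base weights can be fetched programmatically with huggingface_hub. The snippet below is a minimal sketch; the ckpt/Llama-2-7b-chat-hf target directory is an assumption about the expected layout, and access to the gated repository must be set up beforehand (e.g. via huggingface-cli login).

# Hypothetical download helper; requires `pip install huggingface_hub` and
# authorized access to the gated meta-llama/Llama-2-7b-chat-hf repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="ckpt/Llama-2-7b-chat-hf",  # assumed checkpoint location under /ckpt
)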

DEMO

To run the demo:

pip install -r requirements-demo.txt

python3 app.py

Executing the demo requires one GPU (preferably an RTX 8000 or A100).

Finetuning

To adapt the dataset to the instruction-tuning format, use the convert-data.py script (a conceptual sketch of this conversion is shown below).
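
The snippet below illustrates, at a high level, what such a conversion involves: mapping each labeled video to an instruction-style question/answer record. The annotation file name, its columns, and the output JSON schema are assumptions for illustration and may differ from what convert-data.py actually produces.

# Hypothetical sketch of converting labeled videos to instruction-tuning records.
# The CSV columns (video_path, label) and the output schema are assumptions.
import csv
import json

def convert(annotations_csv, out_json):
    records = []
    with open(annotations_csv, newline="") as f:
        for row in csv.DictReader(f):
            label = "hate" if row["label"].strip() == "1" else "non-hate"
            records.append({
                "video": row["video_path"],
                "QA": [{
                    "q": "Does this video contain hate speech? Answer hate or non-hate.",
                    "a": label,
                }],
            })
    with open(out_json, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    convert("hatemm_annotations.csv", "hatemm_instructions.json")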

For pretrained Video-LLaMA checkpoints, please refer to the Video-LLaMA repository.

To finetune the audio and video branches using these pretrained checkpoints, configure the checkpoint paths and hyperparameters inside audiobranch_stage2_finetune.yaml and visionbranch_stage2_finetune.yaml, then run:

conda activate hatellama

# Finetune the Vision-language branch
torchrun --nproc_per_node=4 train.py --cfg-path  ./train_configs/visionbranch_stage2_finetune.yaml

# Finetune the Audio-language branch
torchrun --nproc_per_node=4 train.py --cfg-path  ./train_configs/audiobranch_stage2_finetune.yaml

Finetuning was performed on 4 RTX 8000 GPUs (hence --nproc_per_node=4).

Inference

To evaluate the model's performance on the test sets:

python inference.py --gpu-id=0 --cfg-path="eval_configs/video_llama_eval_withaudio_stage3.yaml" --ckpt_root="output/"

To compute accuracy and F1 score:

unzip Results.npz
python compute_metrics.py whole_results.npy
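
Conceptually, the metric computation boils down to the following minimal sketch. The layout of whole_results.npy, assumed here to be an (N, 2) array of ground-truth and predicted labels, is an assumption; the actual compute_metrics.py may read a different structure.

# Hypothetical sketch of the accuracy/F1 computation over the saved results.
import sys
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Assumed layout: column 0 holds ground-truth labels, column 1 holds predictions.
results = np.load(sys.argv[1], allow_pickle=True)
y_true = results[:, 0].astype(int)
y_pred = results[:, 1].astype(int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"F1 score: {f1_score(y_true, y_pred):.4f}")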

Additional Information

We also provide the code used to crawl the BitChute platform and curate the benchmark in /benchmark.
The updated code to run the HateMM baseline for our benchmark is provided in /baseline.

Acknowledgements

We are grateful to the following open-source repositories that helped us build this project:

  1. Video-LLaMA
  2. HateMM
