Hate speech detection in online videos is an important but challenging problem, especially with the rise of video-sharing platforms. Existing solutions rely primarily on unimodal models focused on text or image inputs, with less emphasis on multimodal models that analyze both the visual and audio content of videos. We present Hate-LLaMA, an instruction-tuned audio-visual language model obtained by finetuning Video-LLaMA on HateMM, a labeled hate speech video dataset. Hate-LLaMA accepts video input and classifies hate speech by jointly analyzing the visual frames and the audio track, detecting hate content with an accuracy of 71%.
Another major challenge of hate speech detection in videos is the scarcity of labeled video datasets, so we also propose a benchmark dataset of around 300 videos consisting of 33% hate and 67% non-hate content.
conda env create -f environment.yml
conda activate hatellama
pip install -r requirements.txt
Download and move the checkpoints to the /ckpt folder:
- Download meta-llama/Llama-2-7b-chat-hf from Hugging Face.
- Download the checkpoints for the finetuned Hate-LLaMA audio and video branches and the ImageBind encoder from here.
- To download our curated benchmark, click here.
- For the HateMM dataset, please refer to the original HateMM release.
To run the demo:
pip install -r requirements-demo.txt
python3 app.py
Running the demo requires one GPU (preferably an RTX8000 or A100).
To adapt the dataset to the instruction-tuning format, use the convert-data.py Python script.
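For reference, the conversion essentially pairs each annotated video with a yes/no hate-speech instruction. The sketch below illustrates this idea; the annotation file name, column names, and prompt wording are assumptions for illustration only, and convert-data.py defines the actual format.

```python
# Minimal sketch of turning HateMM annotations into instruction-tuning records.
# File name, column names, and prompt wording are illustrative assumptions;
# convert-data.py is the authoritative implementation.
import csv
import json

samples = []
with open("HateMM_annotation.csv") as f:  # hypothetical annotation file
    for row in csv.DictReader(f):
        samples.append({
            "video": row["video_file_name"],  # hypothetical column name
            "QA": [{
                "q": "Does this video contain hate speech? Answer yes or no.",
                "a": "yes" if row["label"].lower().startswith("hate") else "no",
            }],
        })

with open("hatemm_instruction_data.json", "w") as f:
    json.dump(samples, f, indent=2)
```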
For pretrained Video-LLaMA checkpoints, please refer to the Video-LLaMA repository.
To finetune the audio and video branches using these pretrained checkpoints, configure the checkpoint paths and hyperparameters inside audiobranch_stage2_finetune.yaml and visionbranch_stage2_finetune.yaml, then run the finetuning commands below.
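If you prefer to update the config files programmatically rather than by hand, the sketch below uses PyYAML; the key names (model, llama_model, ckpt) are assumptions based on Video-LLaMA-style configs, so inspect the actual YAML structure first.

```python
# Minimal sketch: point a finetuning config at the downloaded checkpoints.
# The key names below ("model", "llama_model", "ckpt") are assumptions --
# print the loaded dict and adapt to the keys actually present in the file.
import yaml

cfg_path = "train_configs/visionbranch_stage2_finetune.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

print(list(cfg.keys()))  # inspect the real top-level structure first

cfg["model"]["llama_model"] = "ckpt/Llama-2-7b-chat-hf"    # hypothetical key
cfg["model"]["ckpt"] = "ckpt/pretrained_visionbranch.pth"  # hypothetical key/path

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```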
conda activate hatellama
# Finetune the Vision-language branch
torchrun --nproc_per_node=4 train.py --cfg-path ./train_configs/visionbranch_stage2_finetune.yaml
# Finetune the Audio-language branch
torchrun --nproc_per_node=4 train.py --cfg-path ./train_configs/audiobranch_stage2_finetune.yaml
Finetuning was performed on 4 RTX8000 GPUs.
To evaluate the model's performance on the test sets:
python inference.py --gpu-id=0 --cfg-path="eval_configs/video_llama_eval_withaudio_stage3.yaml" --ckpt_root="output/"
To compute accuracy and F1 score:
unzip Results.npz
python compute_metrics.py whole_results.npy
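For reference, the metric computation reduces to comparing predicted labels with ground truth; the sketch below assumes whole_results.npy stores (ground-truth, prediction) label pairs, which is an assumption about the file layout, and compute_metrics.py remains the authoritative script.

```python
# Minimal sketch of the accuracy / F1 computation. The layout of
# whole_results.npy (assumed: one [ground_truth, prediction] pair per row)
# is an assumption; compute_metrics.py defines the real format.
import sys

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

results = np.load(sys.argv[1], allow_pickle=True)
labels = results[:, 0].astype(int)
preds = results[:, 1].astype(int)

print(f"Accuracy: {accuracy_score(labels, preds):.4f}")
print(f"Macro F1: {f1_score(labels, preds, average='macro'):.4f}")
```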
We also provide the code to crawl the BitChute platform and curate the benchmark in /benchmark.
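As a rough illustration of the crawling step (not necessarily the approach used in /benchmark), BitChute videos can be fetched from a list of URLs with yt-dlp, which supports the platform; the URL list file and output template below are assumptions.

```python
# Illustrative sketch only: download a list of BitChute video URLs with yt-dlp.
# This is not necessarily how the scripts in /benchmark work; the URL list
# file name and output template are assumptions.
from yt_dlp import YoutubeDL  # pip install yt-dlp

with open("benchmark_urls.txt") as f:  # hypothetical file: one video URL per line
    urls = [line.strip() for line in f if line.strip()]

opts = {
    "outtmpl": "benchmark_videos/%(id)s.%(ext)s",  # save as <video_id>.<ext>
    "ignoreerrors": True,                          # skip videos that fail to download
}
with YoutubeDL(opts) as ydl:
    ydl.download(urls)
```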
The updated code to run the HateMM baseline for our benchmark is provided in /baseline.
We are grateful for the open-source repositories that helped us build our project, in particular Video-LLaMA and ImageBind, and to the authors of the HateMM dataset.
Contributors:
- Anisha Bhatnagar ([email protected])
- Divyanshi Parashar ([email protected])
- Simran Makariye ([email protected])