Finding Scenes in Learning Videos is a project that enhances the searchability and usability of educational videos by automatically generating detailed and structured metadata. Leveraging state-of-the-art Vision Language Models (VLMs) and Large Language Models (LLMs), the pipeline extracts both visual and auditory content from YouTube videos, enabling users to search for specific scenes based on comprehensive information about the video content. This pipeline is particularly useful for educators, students, and content creators looking to streamline the process of finding specific information within long-format video content.
This repo is a fork of the original repository in cooperation with the Fraunhofer Institute.
- Automatic Metadata Generation: Creates rich, searchable metadata by processing a YouTube video link, allowing users to navigate and find specific scenes or topics within the video.
- Scene Detection: Automatically segments videos into meaningful scenes using advanced visual and auditory analysis.
- Key Frame Extraction: Identifies significant frames within each scene to optimize metadata generation, ensuring efficient and high-quality content analysis.
- Comprehensive Metadata Structure: Captures detailed information such as video descriptions, scene captions, key concepts, questions, resources, and much more.
- User-Friendly API: Allows users to input a YouTube link and receive metadata, supporting seamless integration into other applications.
- JSON Output: Returns metadata in a structured JSON format, making it easy to parse, manipulate, and use in various applications.
It is recommended to set up a virtual environment to manage dependencies. Install the Python requirements with:
pip install -r requirements.txt
Install ffmpeg with your system package manager. On Arch Linux:
sudo pacman -S ffmpeg
On Debian/Ubuntu:
sudo apt-get install ffmpeg
Install ImageMagick (Debian/Ubuntu):
sudo apt install imagemagick
Additionally, configure ImageMagick permissions:
sudo sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml
You can run the pipeline either locally or using Google Colab.
- Start the FastAPI server:
  uvicorn main:app --reload
- Submit a YouTube link: Access the API endpoint via your browser, or use tools like curl or Postman to submit a YouTube link and receive the corresponding metadata (a minimal Python client example is shown below).
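For instance, the request can also be made from Python. Note that the endpoint path and the query parameter in this sketch are assumptions for illustration only; check main.py for the actual route names and parameters.

```python
# Minimal client sketch. The "/metadata" path and the "url" query parameter
# are illustrative assumptions; the routes defined in main.py may differ.
import requests

API_BASE = "http://127.0.0.1:8000"
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"  # replace with a real video URL

response = requests.get(f"{API_BASE}/metadata", params={"url": video_url}, timeout=600)
response.raise_for_status()

metadata = response.json()  # structured JSON metadata (see the schema below)
print(metadata.get("MetaDataObject", {}).get("youtube_title"))
```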
Open the colab_pipeline.ipynb notebook in Google Colab and follow the instructions to input your YouTube video link and generate metadata.
The pipeline is designed to automatically generate metadata for educational videos, capturing essential details to enhance searchability and usability. The process follows these steps:
- Video Downloading: Utilizes YT-DLP to robustly download YouTube videos.
- Audio Transcript Extraction: Employs Pytubefix to extract audio transcripts and additional metadata from the video.
- Scene Detection: Uses PySceneDetect to segment the video into scenes, identifying significant scene boundaries through changes in visual properties.
- Key Frame Extraction: Leverages Katna to select key frames within each scene based on metrics like brightness, contrast, and color changes (these steps are sketched in code after this list).
- Metadata Generation:
  - Vision Language Models (VLMs): Processes key frames and audio transcripts to generate initial metadata.
  - Large Language Models (LLMs): Contextualizes and enriches the metadata with comprehensive descriptions for each scene and the entire video.
- JSON Metadata Output: Compiles all generated data into a structured JSON object, providing detailed metadata for each scene and the video as a whole.
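A rough sketch of the downloading, scene-detection, and key-frame steps is shown below. It uses the publicly documented APIs of yt-dlp, PySceneDetect, and Katna; the repository's own wrapper code, thresholds, frame counts, and file layout may differ.

```python
# Sketch of the download -> scene detection -> key frame steps.
# Thresholds, output paths, and the number of frames are illustrative
# choices, not the repository's actual settings.
import os

from yt_dlp import YoutubeDL
from scenedetect import detect, ContentDetector
from Katna.video import Video
from Katna.writers import KeyFrameDiskWriter

video_url = "https://www.youtube.com/watch?v=VIDEO_ID"  # replace with a real link
video_path = "downloads/video.mp4"

# 1) Download the video with yt-dlp.
with YoutubeDL({"outtmpl": video_path, "format": "mp4"}) as ydl:
    ydl.download([video_url])

# 2) Detect scene boundaries with PySceneDetect's content detector.
scenes = detect(video_path, ContentDetector(threshold=27.0))
for start, end in scenes:
    print(f"Scene {start.get_timecode()} -> {end.get_timecode()}")

# 3) Extract key frames across the video with Katna.
os.makedirs("keyframes", exist_ok=True)
Video().extract_video_keyframes(
    no_of_frames=3 * max(len(scenes), 1),  # a few frames per detected scene
    file_path=video_path,
    writer=KeyFrameDiskWriter(location="keyframes"),
)
```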
- Ease of Use: Users only need to input a YouTube link to interact with the pipeline, which is also available as an API.
- Enhanced Searchability: Meaningful scene segmentation makes each part of the video easily searchable.
- Efficient Metadata Creation: Key frame selection optimizes the metadata generation process, avoiding unnecessary processing of every frame.
- Contextual Metadata Content: Combines visual data from key frames with auditory data from transcripts to create in-depth metadata (a captioning sketch follows this list).
- Structured Output Format: The JSON metadata format facilitates integration into other applications and platforms.
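To illustrate the contextual-metadata point above, the sketch below captions a single key frame with the Idefics2-8B checkpoint via Hugging Face transformers, passing a transcript snippet as context. The key frame path, prompt wording, and generation settings are illustrative assumptions; the repository's actual prompting, batching, and field extraction are not shown here.

```python
# Minimal illustration: caption one key frame with Idefics2-8B, passing a
# transcript snippet as additional context. Paths, prompt wording, and
# generation settings are illustrative, not the repository's configuration.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

frame = load_image("keyframes/scene_03_frame_1.jpg")  # hypothetical key frame path
transcript_snippet = "we now apply the chain rule to the loss function"

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": (
            "This frame comes from a learning video. The spoken words around it were: "
            f"'{transcript_snippet}'. Describe what the scene teaches in one or two sentences."
        )},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[frame], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated caption is decoded.
caption = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```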
The metadata is structured in JSON format, organized to provide rich information for each video and scene.
{
"MetaDataObject": {
"youtube_title": "str",
"youtube_description": "str",
"published_date": "str",
"youtube_video_id": "str",
"youtube_thumbnail_url": "str",
"youtube_rating": "str",
"youtube_views": "str",
"youtube_age_restricted": "str",
"youtube_keywords": ["str"],
"youtube_author": "str",
"youtube_channel_id": "str",
"youtube_length": "int",
"url": "str",
"llm_description": "str",
"learning_resource_type": "str",
"intended_end_user_role": "str",
"context": "str",
"difficulty_level": "str",
"discipline": "str",
"educational_level": "str",
"target_audience_age": "str",
"typical_learning_time": "str",
"scene_objects": ["SceneObject"]
},
"SceneObject": {
"duration": "str",
"scene_start": "str",
"scene_end": "str",
"title": "str",
"caption": "str",
"key-concepts": "str",
"questions": "str",
"text": "str",
"resources": "str",
"language": "str",
"video_type": "str"
}
}
Note: Some fields may contain inaccuracies due to current limitations in the metadata generation process.
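Because the output is plain JSON, downstream code can filter scenes directly. A small example using the field names from the schema above (the file name and keyword matching are purely illustrative):

```python
# Load the generated metadata and list scenes whose caption mentions a keyword.
# Assumes the JSON was saved to metadata.json; the file name is illustrative.
import json

with open("metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

video = metadata["MetaDataObject"]
print(video["youtube_title"], "-", video["youtube_length"], "seconds")

keyword = "gradient descent"
for scene in video["scene_objects"]:
    if keyword.lower() in scene["caption"].lower():
        print(f"{scene['scene_start']} -> {scene['scene_end']}: {scene['title']}")
```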
To assess the pipeline's accuracy, a manually annotated dataset containing captions for 10 learning videos was used. The evaluation metrics include BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The model's scores are slightly lower than those of existing video captioning methods, which highlights areas for improvement in relevance and accuracy.
| Metric  | Score |
|---------|-------|
| BLEU@1  | 0.244 |
| BLEU@2  | 0.121 |
| BLEU@3  | 0.071 |
| BLEU@4  | 0.047 |
| METEOR  | 0.067 |
| ROUGE-L | 0.184 |
| CIDEr   | 0.075 |
| SPICE   | 0.071 |
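For reference, BLEU and ROUGE-L scores of this kind can be approximated with off-the-shelf libraries (nltk and rouge-score below); the project's exact evaluation scripts and tokenization may differ, and CIDEr/SPICE additionally require the pycocoevalcap toolkit.

```python
# Rough reproduction of BLEU@1-4 and ROUGE-L for one generated caption.
# Tokenization and smoothing choices are illustrative and may not match
# the evaluation behind the table above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the lecturer derives the quadratic formula on a whiteboard"
candidate = "a teacher writes the quadratic formula on a board"

ref_tokens = [reference.split()]
cand_tokens = candidate.split()
smooth = SmoothingFunction().method1

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(ref_tokens, cand_tokens, weights=weights, smoothing_function=smooth)
    print(f"BLEU@{n}: {score:.3f}")

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L:", round(rouge.score(reference, candidate)["rougeL"].fmeasure, 3))
```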
- Model Enhancement: Use larger, more specialized VLMs and LLMs for improved metadata quality.
- Fine-Tuning: Tailor models for educational video contexts to better capture nuances.
- Audio Analysis Integration: Use audio analysis to improve scene categorization and metadata relevance.
- Semantic Search Integration: Enhance content retrieval with semantic search and embedding models (illustrated with a sketch below).
- Quality Assessment: Develop methods to assess and ensure the quality of generated metadata.
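As one possible direction for the semantic-search item above, scene captions could be embedded and queried with a sentence-embedding model. The sketch below uses sentence-transformers with an off-the-shelf checkpoint; it is not part of the current pipeline.

```python
# Possible direction for semantic search over scene captions (not implemented
# in the current pipeline). Model choice and example captions are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

scene_captions = [
    "The instructor introduces matrix multiplication with a 2x2 example.",
    "A diagram shows how backpropagation updates the weights.",
    "The video summarizes the key takeaways of the lecture.",
]
caption_embeddings = model.encode(scene_captions, convert_to_tensor=True)

query = "how are the network weights updated?"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, caption_embeddings)[0]
best = int(scores.argmax())
print(f"Best matching scene: {scene_captions[best]} (score {float(scores[best]):.2f})")
```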
This project is licensed under the MIT License.
Special thanks to the following open-source tools and libraries used in this project:
- FastAPI
- YT-DLP
- Pytubefix
- PySceneDetect
- Katna
- Idefics2-8B
- Mistral-7B-Instruct-v0.3
If you use this project in your research, please cite it as follows:
@inproceedings{finding_scenes_2024,
title={Finding Scenes in Learning Videos},
author={Hamdad, Samy and Laule, Veit and Malek, Limin},
booktitle={AWT Project SS 24},
year={2024},
organization={OpenAI}
}