YouTube Audio Collector is a Python script that downloads audio from specified YouTube channels, extracts captions, and builds a dataset of audio chunks paired with their transcriptions. It is designed in particular to collect segments whose captions contain both Arabic and English text.
Features:

- Downloads audio from multiple YouTube channels
- Extracts manually created captions
- Cuts audio into chunks based on caption timing
- Creates a dataset with audio files and corresponding transcriptions
- Pushes the created dataset to Hugging Face Hub
Requirements:

- Python 3.7+
- FFmpeg (for audio processing)
Installation:

- Clone this repository:

  ```bash
  git clone https://github.com/MohamedAliRashad/youtube-audio-collector.git
  cd youtube-audio-collector
  ```
- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
- Install FFmpeg (if not already installed):
  - Ubuntu/Debian:

    ```bash
    sudo apt-get install ffmpeg
    ```

  - macOS (with Homebrew):

    ```bash
    brew install ffmpeg
    ```

  - Windows: Download from the official FFmpeg website and add it to your PATH.
Usage:

- Create a text file with YouTube channel URLs, one per line (see the example below).
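  Something like this, where the channel handles are placeholders standing in for real ones:

  ```text
  https://www.youtube.com/@channel_one
  https://www.youtube.com/@channel_two
  ```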
- Run the script with the following command:

  ```bash
  python youtube_audio_collector.py --output_dir my_audio_dir --urls_file channel_urls.txt --hub_dataset_name my-dataset --private
  ```
Arguments:

- `--output_dir`: Directory to save audio files (default: `audio`)
- `--urls_file`: File containing YouTube channel URLs (required)
- `--hub_dataset_name`: Name of the dataset on the Hugging Face Hub
- `--private`: Flag to push the dataset as private (optional)
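For reference, these flags map onto a standard `argparse` definition roughly like the following. This is a reconstruction for illustration, not the script's actual source:

```python
import argparse

# Illustrative CLI definition mirroring the documented flags.
parser = argparse.ArgumentParser(
    description="Collect captioned audio chunks from YouTube channels."
)
parser.add_argument("--output_dir", default="audio",
                    help="Directory to save audio files")
parser.add_argument("--urls_file", required=True,
                    help="File containing YouTube channel URLs")
parser.add_argument("--hub_dataset_name",
                    help="Name of the dataset on the Hugging Face Hub")
parser.add_argument("--private", action="store_true",
                    help="Push the dataset as private")
args = parser.parse_args()
```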
How it works:

- The script reads YouTube channel URLs from the specified file.
- For each video in each channel:
  - It checks for manually created captions containing both Arabic and English text (see the first sketch below).
  - If suitable captions are found, it downloads the audio.
  - The audio is then cut into chunks based on the caption timing (see the second sketch below).
  - Audio chunks and their corresponding transcriptions are saved.
- A dataset is created from the collected audio and captions.
- The dataset is pushed to the Hugging Face Hub (see the second sketch below).
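To make the caption filter concrete, here is a minimal sketch of how manual-caption detection can be done with `yt-dlp`'s Python API. The function names and the Arabic/English heuristic are illustrative, not the script's actual code:

```python
import re

from yt_dlp import YoutubeDL

ARABIC = re.compile(r"[\u0600-\u06FF]")  # main Arabic Unicode block
LATIN = re.compile(r"[A-Za-z]")

def has_manual_captions(video_url: str, lang: str = "ar") -> bool:
    """True if the video carries uploader-provided (not auto-generated) captions in `lang`."""
    with YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(video_url, download=False)
    # yt-dlp keeps manual subtitle tracks under "subtitles" and
    # ASR-generated ones under "automatic_captions".
    return lang in (info.get("subtitles") or {})

def is_mixed_ar_en(caption_text: str) -> bool:
    """True if a caption line mixes Arabic script with Latin characters."""
    return bool(ARABIC.search(caption_text)) and bool(LATIN.search(caption_text))
```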
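And a sketch of the chunking and publishing steps, assuming the caption timings have already been parsed into `(start, end, text)` tuples. `cut_chunk` and `build_and_push` are hypothetical helpers, and the 16 kHz sampling rate is an assumption, not something the script specifies:

```python
import subprocess

from datasets import Audio, Dataset

def cut_chunk(src_audio: str, start: float, end: float, dst_path: str) -> None:
    """Cut the [start, end] span (in seconds) out of src_audio with FFmpeg."""
    # Re-encoding (no "-c copy") keeps the cut aligned with caption boundaries.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_audio, "-ss", str(start), "-to", str(end), dst_path],
        check=True,
        capture_output=True,
    )

def build_and_push(chunks: list[tuple[str, str]], name: str, private: bool = False) -> None:
    """chunks: (audio_path, transcription) pairs collected by the steps above."""
    ds = Dataset.from_dict(
        {"audio": [p for p, _ in chunks], "text": [t for _, t in chunks]}
    )
    # Casting to Audio() lets `datasets` decode the files lazily on access.
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
    ds.push_to_hub(name, private=private)
```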
This script is designed for research and educational purposes. Ensure you comply with YouTube's terms of service and respect copyright laws when using this tool.
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.