This repository contains code and metadata to (re-)create the YouTube Titles and Transcripts Dataset, as described in the following paper:
Christopher Lovejoy, William Davies, Demian Till and Louis Prosser. The Truth Is In The Title? Video Title Generation as a novel training objective for video summarisation. May 2021.
Please cite the following paper in all academic work that uses this dataset:
@inproceedings{lovejoy2021videotitle,
title = {The Truth Is In The Title? Video Title Generation as a novel training objective for video summarisation},
author = {Lovejoy, Christopher and Davies, William and Till, Demian and Prosser, Louis},
year = {2021},
}
We also acknowledge earlier work (including first-time data collection) on the same data (and encourage you to do the same):
- Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson. CNN Architectures for Large-Scale Audio Classification. September 2016. https://arxiv.org/abs/1609.09430v2
The full corpus consists of 1.2 million YouTube videos with transcripts and titles. Within this, there is a subset of 17,886 videos that meet the following criteria: (i) the YouTube metadata is set to English, (ii) the summaries are human-generated and punctuated, and (iii) the title includes only ASCII characters.
Both are available for download below.
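The three subset criteria above can be sketched as a simple filter. This is a minimal illustration, not the pipeline used to build the dataset: the `VideoRecord` fields, the `in_english_subset` helper, and the punctuation check are all hypothetical, and the check for criterion (ii) is applied to the transcript text as an assumption.

```python
from dataclasses import dataclass


@dataclass
class VideoRecord:
    # Hypothetical record layout; the real dataset's field names may differ.
    video_id: str
    title: str
    transcript: str
    language: str            # language code from YouTube metadata, e.g. "en"
    is_auto_generated: bool  # True if the text was machine-generated

# Assumed set of characters that count as punctuation for criterion (ii).
PUNCTUATION = ".,!?;:"


def in_english_subset(record: VideoRecord) -> bool:
    """Return True if a record meets all three subset criteria."""
    # (i) metadata language set to English (accepts variants such as "en-GB")
    is_english = record.language.lower().startswith("en")
    # (ii) human-generated text that contains punctuation
    is_human = not record.is_auto_generated
    has_punctuation = any(ch in PUNCTUATION for ch in record.transcript)
    # (iii) title contains only ASCII characters
    ascii_title = record.title.isascii()
    return is_english and is_human and has_punctuation and ascii_title


# Made-up example records, for illustration only.
records = [
    VideoRecord("a", "How to bake bread", "First, mix the flour. Then add water.", "en", False),
    VideoRecord("b", "Cómo hacer pan", "Primero, mezcla la harina.", "es", False),
    VideoRecord("c", "Bread tips", "first mix the flour then add water", "en", True),
    VideoRecord("d", "Café tour", "We visit a cafe. It is nice.", "en-GB", False),
]

subset_ids = [r.video_id for r in records if in_english_subset(r)]
print(subset_ids)
```

Only record `a` passes: `b` is not English, `c` is machine-generated and unpunctuated, and `d` has a non-ASCII title.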
You can obtain the corpus in one of two ways:
The subset contains 17,886 videos with punctuated English transcripts and titles. This is the dataset originally used in the paper cited above.
It can be downloaded here.
This contains the full corpus of all 1.2 million YouTube videos with transcripts and titles.
It can be downloaded here.
(Option 2): Use this reproducible pipeline
If you encounter any issues, please raise them on this repository's issue tracker (https://github.com/chris-lovejoy/youtube-titles-and-transcripts/issues) or contact us via email.
At the time of release, all videos in the dataset have been made available by the original content creators under the standard YouTube License.
We are providing the contents of this repository under the Creative Commons BY-SA 4.0 (Attribution-ShareAlike) License (for data-like content) and/or the BSD-2-Clause License (for software-type content).