-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Video support for datatrove #271
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how much detail you wanted in the feedback, some stuff are nits but I included them just in case.
Main issue imo is no sharding on the reader which would be a problem if you want to scale up/use multiple cpus. It's easily fixible, basically just pass rank
and world_size
to your find_triplets function and change the return to return triplets[rank::world_size]
. This should be consistent across workers as list_files
is deterministic (it sorts the file list) so there should be no overlap between different workers
"""Filter that uses ffmpeg to detect if a video is static (frozen).""" | ||
|
||
name = "🧊 Video-Frozen-filter" | ||
_requires_dependencies = ["ffmpeg"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably be this to show users the correct package name to install
_requires_dependencies = ["ffmpeg"] | |
_requires_dependencies = [("ffmpeg", "ffmpeg-python")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You forgot to accept the actual change here I think
self.data_folder = get_datafolder(data_folder) | ||
self.paths_file = paths_file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to save self.data_folder and self.paths_file as these are already saved by the base class
"""Overrides the base run method to handle triplet statistics correctly.""" | ||
triplet_count = 0 | ||
|
||
for triplet in self.find_triplets(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're currently not sharding, which will make it hard to parallelize/easily scale up to many cpus
with self.data_folder.open(self.paths_file, "r") as f: | ||
paths = [line.strip() for line in f] | ||
else: | ||
paths = self.data_folder.list_files(recursive=self.recursive, glob_pattern="*") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you don't want to use the glob_pattern you can just pass None (or nothing as it is the default), but then you probably want to remove it from the __init__
args
|
||
for path in paths: | ||
base_name, ext = os.path.splitext(path) | ||
if ext in video_extensions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could maybe extract the triple finding thing away to the datafolder class later on (you'd pass a tuple like ((".mp4", ".avi", ".mkv", ".mov"), ".json", ".vtt")
), but it's probably ok for now
def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline: | ||
"""Overrides the base run method to handle triplet statistics correctly.""" | ||
triplet_count = 0 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows stacking multiple readers together (when you want to load data from multiple sources for example)
if data: | |
yield from data |
|
||
return document | ||
|
||
def ensure_local_copy(self, video_url: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good job using all of datatrove useful io features here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just missing the change to check the python lib dependency
"""Filter that uses ffmpeg to detect if a video is static (frozen).""" | ||
|
||
name = "🧊 Video-Frozen-filter" | ||
_requires_dependencies = ["ffmpeg"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You forgot to accept the actual change here I think
WIP