New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Video support for datatrove #271

Draft

guipenedo wants to merge 5 commits into huggingface:main from mfarre:f-mediabundle

Collaborator

guipenedo commented Aug 21, 2024

WIP


          adding videotriplet reader, frozen filter and media_experiment.py exa…

276dd6c

…mple

guipenedo commented

View reviewed changes

Collaborator Author

guipenedo left a comment

Not sure how much detail you wanted in the feedback, some stuff are nits but I included them just in case.
Main issue imo is no sharding on the reader which would be a problem if you want to scale up/use multiple cpus. It's easily fixible, basically just pass rank and world_size to your find_triplets function and change the return to return triplets[rank::world_size]. This should be consistent across workers as list_files is deterministic (it sorts the file list) so there should be no overlap between different workers

src/datatrove/pipeline/filters/video_frozen_filter.py

+                  """Filter that uses ffmpeg to detect if a video is static (frozen)."""
+                  name = "🧊 Video-Frozen-filter"
+                  _requires_dependencies = ["ffmpeg"]

Collaborator Author

guipenedo Aug 21, 2024

Should probably be this to show users the correct package name to install

Suggested change

      
                _requires_dependencies = ["ffmpeg"]
          
                _requires_dependencies = [("ffmpeg", "ffmpeg-python")]

Collaborator Author

guipenedo Aug 26, 2024

You forgot to accept the actual change here I think

src/datatrove/pipeline/readers/videotriplet.py Outdated

Comment on lines 32 to 33

		self.data_folder = get_datafolder(data_folder)
		self.paths_file = paths_file

Collaborator Author

guipenedo Aug 21, 2024

you don't need to save self.data_folder and self.paths_file as these are already saved by the base class

src/datatrove/pipeline/readers/videotriplet.py Outdated

+                      """Overrides the base run method to handle triplet statistics correctly."""
+                      triplet_count = 0
+                      for triplet in self.find_triplets():

Collaborator Author

guipenedo Aug 21, 2024

You're currently not sharding, which will make it hard to parallelize/easily scale up to many cpus

src/datatrove/pipeline/readers/videotriplet.py Outdated

+                          with self.data_folder.open(self.paths_file, "r") as f:
+                              paths = [line.strip() for line in f]
+                      else:
+                          paths = self.data_folder.list_files(recursive=self.recursive, glob_pattern="*")

Collaborator Author

guipenedo Aug 21, 2024

if you don't want to use the glob_pattern you can just pass None (or nothing as it is the default), but then you probably want to remove it from the __init__ args

src/datatrove/pipeline/readers/videotriplet.py

+                      for path in paths:
+                          base_name, ext = os.path.splitext(path)
+                          if ext in video_extensions:

Collaborator Author

guipenedo Aug 21, 2024

we could maybe extract the triple finding thing away to the datafolder class later on (you'd pass a tuple like ((".mp4", ".avi", ".mkv", ".mov"), ".json", ".vtt")), but it's probably ok for now

src/datatrove/pipeline/readers/videotriplet.py Outdated

+                  def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
+                      """Overrides the base run method to handle triplet statistics correctly."""
+                      triplet_count = 0

Collaborator Author

guipenedo Aug 21, 2024

This allows stacking multiple readers together (when you want to load data from multiple sources for example)

Suggested change

      
                    if data:
          
                        yield from data

src/datatrove/pipeline/readers/videotriplet.py


		return document

		def ensure_local_copy(self, video_url: str) -> str:

Collaborator Author

guipenedo Aug 21, 2024

good job using all of datatrove useful io features here :)

src/datatrove/pipeline/filters/video_frozen_filter.py Outdated Show resolved Hide resolved

src/datatrove/pipeline/filters/video_frozen_filter.py Outdated Show resolved Hide resolved

src/datatrove/pipeline/filters/video_frozen_filter.py Outdated Show resolved Hide resolved

mfarre added 4 commits

August 22, 2024 22:41


          addressing Guilherme's input

fe8f1e6


          ruff src fixes

4ad4cfc


          ruff on examples

b31717f


          ruff

5e56234

guipenedo commented

View reviewed changes

Collaborator Author

guipenedo left a comment

Looking good, just missing the change to check the python lib dependency

src/datatrove/pipeline/filters/video_frozen_filter.py

+                  """Filter that uses ffmpeg to detect if a video is static (frozen)."""
+                  name = "🧊 Video-Frozen-filter"
+                  _requires_dependencies = ["ffmpeg"]

Collaborator Author

guipenedo Aug 26, 2024

You forgot to accept the actual change here I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet