WebDataset.jl and comments #22

Open
tmbdev opened this issue Apr 28, 2021 · 0 comments
tmbdev commented Apr 28, 2021

PyTorch is currently rearchitecting its I/O pipelines because indexed (map-style) datasets don't scale well to large learning problems. Many pipelines will likely be based on IterableDataset instead.

The changes to PyTorch are related to our WebDataset library (github.com/tmbdev/webdataset), which demonstrably provides linearly scalable I/O for large scale deep learning.

I have recently written a first implementation of WebDataset.jl that can read the same format; it provides multithreaded I/O and decoding, as well as hooks for sharding and shuffling. It's at github.com/tmbdev/WebDataset.jl
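For readers who haven't seen the format: a WebDataset shard is an ordinary tar archive in which consecutive files sharing a basename form one training sample. The sketch below only illustrates that grouping convention using Python's standard `tarfile` module. It is not the API of either WebDataset library; `make_shard` and `iter_samples` are hypothetical helper names, and using `__key__` as the field holding the sample key is an assumption made for illustration.

```python
import io
import posixpath
import tarfile


def make_shard(files):
    """Build an in-memory tar archive from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Yield one dict per sample by grouping consecutive tar members by key.

    Assumption: the key is the member path up to the first dot of its
    basename, and the remainder is used as the field name.
    """
    current_key, sample = None, {}
    # mode="r|" reads the archive as a forward-only stream, so no seeking
    # or index is required.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, base = posixpath.split(member.name)
            stem, _, ext = base.partition(".")
            key = posixpath.join(dirname, stem) if dirname else stem
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key  # hypothetical field name for the key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


shard = make_shard({
    "0001.jpg": b"img1", "0001.cls": b"3",
    "0002.jpg": b"img2", "0002.cls": b"7",
})
for s in iter_samples(shard):
    print(s["__key__"], sorted(k for k in s if k != "__key__"))
```

Because the reader only ever moves forward through the archive, the same loop works on a pipe or network stream, which is the property that makes sharded, sequential I/O attractive at scale.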

As an aside, the use of tuples to represent training data in PyTorch (rather than structs/records/dicts) has been a perennial problem for reusing and debugging I/O pipelines, and probably not a design that should be carried over.
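A small sketch of the kind of problem meant here, using two hypothetical pipeline stages (`normalize_tuple` and `normalize_dict` are illustrative names, not functions from PyTorch or WebDataset). The tuple version depends on field position and count, so it breaks as soon as an upstream stage adds a field; the dict version addresses fields by name and passes extra fields through.

```python
def normalize_tuple(sample):
    # Positional unpacking: assumes exactly (image, label), in that order.
    image, label = sample
    return ([p / 255.0 for p in image], label)


def normalize_dict(sample):
    # Named access: only touches "image"; any other fields pass through unchanged.
    out = dict(sample)
    out["image"] = [p / 255.0 for p in sample["image"]]
    return out


# An upstream stage starts attaching a metadata field.
record = {"image": [0, 255], "label": 1, "meta": "shard-0"}
print(normalize_dict(record))

try:
    normalize_tuple(([0, 255], 1, "shard-0"))
except ValueError as err:
    print("tuple stage failed:", err)
```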

Anyway, my suggestion would be to have a look at WebDataset.jl and see whether it or parts of it are useful in the architecture of future dataloaders.

@tmbdev tmbdev changed the title some comments WebDataset.jl and comments Apr 28, 2021