PyTorch is currently rearchitecting its I/O pipelines because indexed datasets don't scale well to large learning problems. Many pipelines will likely be based on IterableDataset.
The changes to PyTorch are related to our WebDataset library (github.com/tmbdev/webdataset), which demonstrably provides linearly scalable I/O for large-scale deep learning.
I have recently written a first implementation, WebDataset.jl, that can read the same format; it provides multithreaded I/O and decoding, as well as hooks for sharding and shuffling. It is available at github.com/tmbdev/WebDataset.jl
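For readers unfamiliar with the format, here is a minimal sketch of the idea in plain Python, using only the standard library. It is not the API of WebDataset or WebDataset.jl; the function names (`make_shard`, `iter_samples`, `shard_subset`, `shuffled`) are hypothetical and exist only for illustration. It assumes that a shard is a tar archive in which consecutive files sharing a basename form one sample, and it simplifies key parsing by splitting the member name at its first period.

```python
import io
import random
import tarfile


def make_shard(samples):
    """Write samples to an in-memory tar shard (illustrative helper).

    Each sample is a dict: {"__key__": str, "<ext>": bytes, ...}.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for sample in samples:
            for ext, payload in sample.items():
                if ext == "__key__":
                    continue
                info = tarfile.TarInfo(name=f"{sample['__key__']}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Read a shard strictly sequentially, grouping files by basename."""
    current = None
    # Mode "r|" treats the input as a non-seekable stream.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Simplification: key is everything before the first period.
            key, _, ext = member.name.partition(".")
            if current is None or current["__key__"] != key:
                if current is not None:
                    yield current
                current = {"__key__": key}
            current[ext] = tar.extractfile(member).read()
    if current is not None:
        yield current


def shard_subset(shards, worker, num_workers):
    """Sharding hook: give each worker a disjoint slice of the shard list."""
    return shards[worker::num_workers]


def shuffled(samples, bufsize, rng):
    """Shuffling hook: approximate shuffle of a stream via a bounded buffer."""
    buf = []
    for sample in samples:
        buf.append(sample)
        if len(buf) >= bufsize:
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)
    yield from buf
```

A pipeline would chain these, for example `shuffled(iter_samples(f), 1000, random.Random(0))` over each shard in `shard_subset(all_shards, worker, num_workers)`. Because every read is sequential, nothing here needs random access to individual samples, which is the property that makes this style a natural fit for iterable-style loaders.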
As an aside, the use of tuples to represent training data in PyTorch (rather than structs/records/dicts) has been a perennial problem for reusing and debugging I/O pipelines, and probably not a design that should be carried over.
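To illustrate the point about tuples, here is a small sketch in plain Python (no PyTorch required). The helper names `map_field` and `to_tuple` and the field names `"png"` and `"cls"` are made up for this example. When samples are dicts, a pipeline stage addresses a field by name, so it keeps working if other fields are added, removed, or reordered, and the sample key travels along for debugging.

```python
def map_field(samples, field, fn):
    """Apply fn to one named field; all other fields pass through unchanged."""
    for sample in samples:
        out = dict(sample)
        out[field] = fn(sample[field])
        yield out


def to_tuple(samples, *fields):
    """Convert to positional tuples only at the end, if downstream code needs them."""
    for sample in samples:
        yield tuple(sample[f] for f in fields)


# Hypothetical raw samples as they might come out of a reader.
raw = [
    {"__key__": "img0001", "png": b"<image bytes>", "cls": b"3"},
    {"__key__": "img0002", "png": b"<image bytes>", "cls": b"7"},
]

# The stage names the field it touches instead of relying on position.
decoded = list(map_field(raw, "cls", lambda b: int(b.decode())))
pairs = list(to_tuple(decoded, "png", "cls"))
```

With tuple samples, the same stage would have to know that the label sits at index 1, and inserting a new field upstream would silently break it.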
Anyway, my suggestion would be to have a look at WebDataset.jl and see whether it, or parts of it, would be useful in the architecture of future dataloaders.
tmbdev changed the title from "some comments" to "WebDataset.jl and comments" on Apr 28, 2021