WebDataset.jl and comments #22

tmbdev · 2021-04-28T20:22:04Z

PyTorch is currently rearchitecting its I/O pipelines because indexed (map-style) datasets don't scale well to large learning problems. Many of the new pipelines will likely be based on IterableDataset.

The changes to PyTorch are related to our WebDataset library (github.com/tmbdev/webdataset), which demonstrably provides linearly scalable I/O for large-scale deep learning.

I have recently written a first implementation of WebDataset.jl that can read the same format; it provides multithreaded I/O and decoding, as well as hooks for sharding and shuffling. It's at github.com/tmbdev/WebDataset.jl
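To make "hooks for sharding and shuffling" concrete, here is a rough sketch in plain Python of what such hooks typically do in a streaming loader. It is not the WebDataset.jl or webdataset implementation. The names `shards_for_worker` and `shuffle_buffer` and the shard file names are assumptions made up for illustration.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shards_for_worker(shards: List[str], worker: int, num_workers: int) -> List[str]:
    """Hypothetical sharding hook: worker k reads every num_workers-th shard,
    so each shard is read by exactly one worker."""
    return shards[worker::num_workers]


def shuffle_buffer(samples: Iterable[T], bufsize: int, rng: random.Random) -> Iterator[T]:
    """Hypothetical shuffling hook: approximate shuffle of a sequential stream
    using a bounded in-memory buffer (no random access to the data needed)."""
    if bufsize < 1:
        raise ValueError("bufsize must be at least 1")
    buf: List[T] = []
    for sample in samples:
        if len(buf) < bufsize:
            # Still filling the buffer; nothing is emitted yet.
            buf.append(sample)
            continue
        # Buffer is full: emit a random buffered sample and put the new one in its slot.
        i = rng.randrange(bufsize)
        yield buf[i]
        buf[i] = sample
    # Input exhausted: drain the remaining samples in random order.
    rng.shuffle(buf)
    yield from buf


# Illustrative shard names only; real shard naming is up to the dataset author.
shards = [f"train-{i:04d}.tar" for i in range(8)]
mine = shards_for_worker(shards, worker=1, num_workers=4)
shuffled = list(shuffle_buffer(range(100), bufsize=10, rng=random.Random(0)))
```

Both hooks only need sequential reads, which is why this style fits an IterableDataset-like pipeline: every sample is emitted exactly once, and the shuffle quality is controlled by the buffer size rather than by an index.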

As an aside, the use of tuples to represent training data in PyTorch (rather than structs/records/dicts) has been a perennial problem for reusing and debugging I/O pipelines, and is probably not a design that should be carried over.
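A small illustration of the tuple problem, as a plain-Python sketch with made-up field names and placeholder byte strings (none of this is taken from PyTorch or WebDataset): adding a field to a positional sample silently changes what downstream code reads, while a keyed sample is unaffected.

```python
# Tuple-style sample: meaning is carried only by position.
def make_tuple_sample():
    return (b"<jpeg bytes>", 3)  # (image, label)


def add_mask_tuple(sample):
    image, label = sample
    # A new pipeline stage inserts a mask; label moves from index 1 to index 2.
    return (image, b"<mask bytes>", label)


def label_from_tuple(sample):
    # Written against the original (image, label) layout.
    return sample[1]


# Dict-style sample: meaning is carried by field name.
def make_dict_sample():
    return {"image": b"<jpeg bytes>", "label": 3}


def add_mask_dict(sample):
    # Adding a field does not disturb the existing ones.
    return {**sample, "mask": b"<mask bytes>"}


def label_from_dict(sample):
    return sample["label"]


wrong = label_from_tuple(add_mask_tuple(make_tuple_sample()))  # returns the mask, no error
right = label_from_dict(add_mask_dict(make_dict_sample()))     # still returns the label
```

The tuple version fails without raising anything, which is the debugging problem in practice; the dict version keeps working when stages are added or reordered.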

Anyway, my suggestion would be to have a look at WebDataset.jl and see whether it or parts of it are useful in the architecture of future dataloaders.

tmbdev changed the title ~~some comments~~ WebDataset.jl and comments Apr 28, 2021
