Use preprocessed and cached tfrecord datasets with PyTorch #280

ampersandmcd · 2024-06-24T11:38:42Z

IceNet version: 0.2.8
Python version: 3.11.7
Operating System: CentOS Linux 7 (BAS HPC)

Description

To accelerate training with PyTorch, it would be useful to reuse the preprocessed and cached tfrecord datasets we've already produced for operational TensorFlow training! Loading from these tfrecord files is (should be?) much faster than loading from NetCDF files on the fly with the generate_sample logic (e.g. DaskMultiWorkerLoader.generate_sample, defined in data/loaders/dask.py), which must perform a decent amount of computation.

What I Did

As a first go, I've implemented an IterableIceNetDataSetPyTorch class which inherits from torch.utils.data.IterableDataset, and will link it below in a pull request. The following script demonstrates its use and can easily be stepped through in a debugger. The batching logic works properly with torch.utils.data.DataLoader in the script below, but during training runs with num_workers > 1 in the DataLoader I've seen some weird behaviour (more samples generated per epoch than there should be), so there's room to improve this first implementation.

import torch
from utils import IterableIceNetDataSetPyTorch

# Dataset configuration that points at the cached tfrecord files
dataset_config = "dataset_config.exp23_south.json"

# Read the "test" split in order, without shuffling
ds = IterableIceNetDataSetPyTorch(dataset_config, "test", batch_size=4, shuffling=False)
dl = torch.utils.data.DataLoader(ds, batch_size=4, shuffle=False)

for i, batch in enumerate(dl):
    # Each batch unpacks into inputs, targets and sample weights
    x, y, sw = batch
    print(x, y, sw)
    print(i)
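
One plausible cause of the overrun with num_workers > 1 (an assumption, not something confirmed against this implementation): for an IterableDataset, each DataLoader worker process gets its own copy of the dataset and iterates it in full, so every sample is produced once per worker unless the dataset splits the stream itself. The sketch below simulates that in plain Python, with no PyTorch or tfrecord dependency. shard_for_worker and simulate_epoch are hypothetical helper names used only for illustration.

```python
from itertools import islice


def shard_for_worker(samples, worker_id, num_workers):
    """Yield every num_workers-th sample, starting at offset worker_id."""
    return islice(samples, worker_id, None, num_workers)


def simulate_epoch(n_samples, num_workers, shard):
    """Collect what num_workers independent copies of a dataset would yield."""
    seen = []
    for worker_id in range(num_workers):
        # Each worker holds its own full copy of the sample stream
        stream = iter(range(n_samples))
        if shard:
            stream = shard_for_worker(stream, worker_id, num_workers)
        seen.extend(stream)
    return sorted(seen)


unsharded = simulate_epoch(n_samples=8, num_workers=2, shard=False)
sharded = simulate_epoch(n_samples=8, num_workers=2, shard=True)
print(len(unsharded), len(sharded))  # 16 8
```

In a real dataset, worker_id and num_workers would come from torch.utils.data.get_worker_info() inside __iter__. It returns None in the main process, and otherwise an object with id and num_workers attributes. Whether sharding by sample or by tfrecord file suits this dataset better is left open here.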


JimCircadian · 2024-06-24T12:33:19Z

CC @bnubald who has done some work in this area already

ampersandmcd mentioned this issue Jun 24, 2024

Adds implementation of iterable pytorch dataset for use with tfrecords #281

Closed

ampersandmcd linked a pull request Jun 24, 2024 that will close this issue

Adds implementation of iterable pytorch dataset for use with tfrecords #281

Closed
