New dataset api (#61)
* Reworked the dataset; it now supports loading 1D, single-output numerical and analytical data.

* Moved loading into separate function.

* Extra checks in notebook

* Simplified API to use a function which returns data and grid.

* Updated with black.

* Proposed new structure for the dataset, with examples

* Did a black lint

* Refactored for use with dataloader

* Created a Loader for datasets that fit in GPU

* Small refactor

* Updated to a functional version of Dataset

* Created one loss to replace 3 separate losses

* Renamed the data functions to reflect that they are functions

* Updated subsamplers

* Tried adding the new features to the examples

* Added the notebook

* Added black newline

* Updated the dataloader

* Updated the notebooks to the new api style

* Update the 2D AD example

* Added better support for dimensionful data

* Updated notebooks

* Changed the training loop

* Updated to correctly take MSE_test per feature

* subsampling along axis

* Added dimension to the data functions

* Updated docstrings and added better checks

* removed print

* Updated the notebooks to the newest version

* Removed the legacy datasets

* Updated the documentation

* Updated the notebooks

* Fixed some black whitespace shenanigans

* Remove test notebook

* Fixed the figure in the documentation.

* Delete Dataset.ipynb

This file is not needed.

* Update examples.py

Updated the function name.

Co-authored-by: Gert-Jan <[email protected]>
Co-authored-by: Remy Kusters <[email protected]>
3 people authored Jun 2, 2021
1 parent 07934e0 commit 47c3366
Showing 22 changed files with 1,036 additions and 908 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -52,4 +52,6 @@ MANIFEST
__pycache__
src/DeePyMoD.egg/
site/
.eggs/
.eggs/
*events.out.tfevents.*
*.pt
58 changes: 58 additions & 0 deletions docs/datasets/data.md
@@ -0,0 +1,58 @@
# Datasets

## The general workflow

The custom DeePyMoD dataset and dataloaders are designed for data that typically fits in RAM/VRAM during training. If this is not your use case, they are interchangeable with the general PyTorch [Datasets](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [Dataloaders](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

For model discovery we typically want to add some noise to our dataset, normalize certain features, and place the data on the right device for optimal PyTorch performance. This can easily be done using the custom `deepymod.data.Dataset` and `deepymod.data.get_train_test_loader`. An illustration of the workflow is shown below:

![Workflow](../figures/data_workflow_for_deepymod.png)

## The dataset
The dataset needs a function that loads all the samples and returns them as a `(coordinates, data)` pair:
```python
def load_data():
    # create or load your data here
    return coordinates, data
```
Here it is important that the last axis of the data holds the features, even if there is just one. The dataset accepts data that is still dimensionful, `(t, x, y, z, number_of_features)`, as well as data that is already flattened, `(number_of_samples, number_of_features)`. After loading, the dataset applies the following functions to the samples, in order: `preprocessing`, `subsampling`, and lastly `shuffling`.
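As a concrete illustration, here is a hedged sketch of such a loading function for synthetic 1D data. The grid sizes and the decaying sine wave are invented for this example; only the `(coordinates, data)` return convention and the features-last axis come from the description above:

```python
import numpy as np

def load_data():
    # Hypothetical example: synthetic data on a (t, x) grid with one feature.
    t = np.linspace(0, 1, 20)
    x = np.linspace(-1, 1, 50)
    t_grid, x_grid = np.meshgrid(t, x, indexing="ij")
    coordinates = np.stack([t_grid, x_grid], axis=-1)  # shape (20, 50, 2)
    data = np.sin(np.pi * x_grid) * np.exp(-t_grid)    # shape (20, 50)
    return coordinates, data[..., None]                # last axis = features

coords, data = load_data()
print(coords.shape, data.shape)  # (20, 50, 2) (20, 50, 1)
```

Note that the feature axis is kept even for a single output, via `data[..., None]`.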

### Preprocessing
The preprocessing performs steps commonly used in the framework: normalizing the coordinates, normalizing the data, and adding noise to the data. These choices can be provided via a dictionary of arguments:
```python
preprocess_kwargs: dict = {
    "random_state": 42,
    "noise_level": 0.0,
    "normalize_coords": False,
    "normalize_data": False,
}
```
The preprocessing behavior can be overridden by providing custom `apply_normalize` or `apply_noise` functions, and even the shuffling via `apply_shuffle`.
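For intuition, here is a minimal sketch of what a noise hook could look like. The actual signature of `apply_noise` in DeePyMoD is not shown here, so the argument names and the choice of scaling the noise by each feature's standard deviation are assumptions:

```python
import numpy as np

def apply_noise(data, noise_level, random_state=42):
    # Assumed behavior: add Gaussian noise scaled by noise_level
    # times the standard deviation of each feature column.
    rng = np.random.default_rng(random_state)
    scale = noise_level * data.std(axis=0, keepdims=True)
    return data + scale * rng.standard_normal(data.shape)
```

With `noise_level=0.0` (the default in `preprocess_kwargs` above) the data passes through unchanged.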

### Subsampling
Sometimes we do not wish to use the whole dataset, so we can subsample it. We can keep a subset of the available time snapshots with `deepymod.data.samples.Subsample_time`, or subsample randomly with `deepymod.data.samples.Subsample_random`. Arguments for these functions are passed to the Dataset via `subsampler_kwargs`.
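The two strategies can be sketched in plain NumPy. These are hypothetical stand-ins to show the idea, not the actual `Subsample_time`/`Subsample_random` implementations:

```python
import numpy as np

def subsample_random(coords, data, number_of_samples, random_state=42):
    # Draw a random subset of rows from flattened (N, features) arrays.
    rng = np.random.default_rng(random_state)
    idx = rng.choice(coords.shape[0], size=number_of_samples, replace=False)
    return coords[idx], data[idx]

def subsample_time(coords, data, t_every):
    # Keep every t_every-th snapshot along the leading time axis
    # of grid-shaped (t, x, ..., features) arrays.
    return coords[::t_every], data[::t_every]
```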

### The resulting shape of the data
Random subsampling works best on the flattened `(number_of_samples, number_of_features)` format, while spatial subsampling works best on the grid-shaped `(t, x, y, z, number_of_features)` format, so both formats are accepted. However, since the trainer can only work with `(number_of_samples, number_of_features)`, the data is reshaped to this format once it has been preprocessed and subsampled. After this the data can be shuffled.
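The flattening step described above amounts to collapsing all grid axes into one sample axis while keeping the feature axis last; a sketch (helper name is hypothetical):

```python
import numpy as np

def flatten_samples(coords, data):
    # (t, x, y, z, features) -> (t * x * y * z, features),
    # keeping the feature axis last in both arrays.
    return (coords.reshape(-1, coords.shape[-1]),
            data.reshape(-1, data.shape[-1]))
```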

### Shuffling
If the data needs to be shuffled, pass `shuffle=True` to the dataset.

## Dataloaders
Dataloaders are used in the PyTorch framework to ensure that loading the data goes smoothly, for example with multiple workers. However, since we can typically fit the whole dataset into memory at once, the overhead of the PyTorch Dataloader is not needed. We therefore provide the `Loader`, a thin wrapper around the dataset that returns the entire dataset as a single batch.
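A minimal sketch of such a full-batch loader; the real `Loader` wraps a DeePyMoD dataset and its interface may differ:

```python
class Loader:
    """Full-batch loader sketch: yields the whole dataset as one batch."""

    def __init__(self, dataset):
        self.dataset = dataset  # expected to hold (coordinates, data)

    def __iter__(self):
        # One iteration per epoch, returning everything at once.
        yield self.dataset

    def __len__(self):
        return 1  # a single batch per epoch
```

Because there is only one batch, there is no per-batch collation or worker overhead.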

## Obtaining the dataloaders
To create a train and test split, use the function `get_train_test_loader`, which divides the dataset into two pieces and passes each directly into a loader.
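A simplified sketch of such a split. The real `get_train_test_loader` operates on a `Dataset` object and returns `Loader` instances; here the random index split is an assumption and a plain one-element list stands in for a full-batch loader:

```python
import numpy as np

def get_train_test_loader(coords, data, train_fraction=0.8, random_state=42):
    # Shuffle the sample indices, split them, and wrap each half
    # in a stand-in "loader" (a list containing one full batch).
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(coords.shape[0])
    n_train = int(train_fraction * coords.shape[0])
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    train_loader = [(coords[train_idx], data[train_idx])]
    test_loader = [(coords[test_idx], data[test_idx])]
    return train_loader, test_loader
```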


Binary file added docs/figures/data_workflow_for_deepymod.png
408 changes: 291 additions & 117 deletions examples/ODE_Example_coupled_nonlin.ipynb


276 changes: 152 additions & 124 deletions examples/PDE_2D_Advection-Diffusion.ipynb


252 changes: 102 additions & 150 deletions examples/PDE_Burgers.ipynb


195 changes: 66 additions & 129 deletions examples/PDE_KdV.ipynb


2 changes: 1 addition & 1 deletion setup.cfg
@@ -32,7 +32,7 @@ setup_requires = pyscaffold>=3.2a0,<3.3a0
# Add here dependencies of your project (semicolon/line-separated), e.g.
install_requires = numpy
torch
sklearn
scikit-learn
pysindy
natsort
tensorboard
2 changes: 1 addition & 1 deletion src/deepymod/data/__init__.py
@@ -1 +1 @@
from .base import Dataset, Dataset_2D
from deepymod.data.base import Dataset, Loader, get_train_test_loader
