* Reworking dataset; now supports loading 1D 1 out numerical and analytical.
* Moved loading into separate function.
* Extra checks in notebook
* Simplified API to use a function which returns data and grid.
* Updated with black.
* Proposed new structure for the dataset, with examples
* Did a black lint
* Refactored for use with dataloader
* Created a Loader for datasets that fit in GPU
* Small refactor
* Updated to a functional version of Dataset
* Created one loss to replace 3 separate losses
* Renamed the data functions to reflect that they are functions
* Updated subsamplers
* Tried adding the new features to the examples
* Added the notebook
* Added black newline
* Updated the dataloader
* Updated the notebooks to the new API style
* Update the 2D AD example
* Added better support for dimensionful data
* Updated notebooks
* Changed the training loop
* Updated to correctly take MSE_test per feature
* Subsampling along axis
* Added dimension to the data functions
* Updated docstrings and added better checks
* Removed print
* Updated the notebooks to the newest version
* Removed the legacy datasets
* Updated the documentation
* Updated the notebooks
* Fixed some black whitespace shenanigans
* Remove test notebook
* Fixed the figure in the documentation.
* Delete Dataset.ipynb. This file is not needed.
* Update examples.py. Updated the function name.

Co-authored-by: Gert-Jan <[email protected]>
Co-authored-by: Remy Kusters <[email protected]>
1 parent: 07934e0, commit: 47c3366. Showing 22 changed files with 1,036 additions and 908 deletions.
```diff
@@ -52,4 +52,6 @@ MANIFEST
 __pycache__
 src/DeePyMoD.egg/
 site/
-.eggs/
+.eggs/
+*events.out.tfevents.*
+*.pt
```
@@ -0,0 +1,58 @@

# Datasets

## The general workflow

The custom DeePyMoD dataset and dataloaders are designed for data that fits in RAM/VRAM during training. If that is not your use case, they are interchangeable with the generic PyTorch [Datasets](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [DataLoaders](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

For model discovery we typically want to add some noise to our dataset, normalize certain features, and make sure the data lives in the right place (e.g. on the GPU) for optimal PyTorch performance. This can easily be done using the custom `deepymod.data.Dataset` and `deepymod.data.get_train_test_loader`. An illustration of the workflow is shown below:

![Workflow](../figures/data_workflow_for_deepymod.png)

## The dataset
The dataset needs a function that loads all the samples and returns them in a (coordinates, data) format:
```python
def load_data():
    # create or load your data here
    return coordinates, data
```
The last axis of the data must be the feature axis, even if there is only a single feature. The dataset accepts data that is still dimensionful, `(t, x, y, z, number_of_features)`, as well as data that is already flattened, `(number_of_samples, number_of_features)`. After loading, the dataset applies the following steps to the freshly loaded samples: `preprocessing`, `subsampling`, and lastly `shuffling`.

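As a concrete illustration, here is a minimal sketch of such a function for a dataset with a single output feature. The analytic expression and the use of torch tensors are assumptions made for this example, not requirements stated above:

```python
import numpy as np
import torch


def load_data():
    # Hypothetical example: u(t, x) = exp(-x^2) * cos(t) on a regular grid.
    t = np.linspace(0, 1, 20)
    x = np.linspace(-3, 3, 50)
    t_grid, x_grid = np.meshgrid(t, x, indexing="ij")
    u = np.exp(-x_grid**2) * np.cos(t_grid)

    # Dimensionful format: coordinates have shape (t, x, 2) and the data
    # has shape (t, x, 1), with the last axis holding the single feature.
    coordinates = torch.from_numpy(np.stack((t_grid, x_grid), axis=-1)).float()
    data = torch.from_numpy(u[..., None]).float()
    return coordinates, data
```
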
### Preprocessing
The preprocessing performs steps commonly used in the framework: normalizing the coordinates, normalizing the data, and adding noise to the data. These choices are provided via a dictionary of arguments:
```python
preprocess_kwargs: dict = {
    "random_state": 42,         # seed, e.g. for reproducible noise
    "noise_level": 0.0,         # amount of noise to add; 0.0 adds none
    "normalize_coords": False,  # whether to normalize the coordinates
    "normalize_data": False,    # whether to normalize the data
}
```
The individual steps can be overridden by supplying custom `apply_normalize` and `apply_noise` functions, and even the way we shuffle can be customized via `apply_shuffle`.

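For instance, adding noise and normalizing the coordinates might look like the following sketch. The constructor signature (a load function plus the keyword dictionaries described on this page) is an assumption, so check the `Dataset` docstring for the exact arguments:

```python
from deepymod.data import Dataset

# Assumed constructor: the load function plus the preprocessing options.
dataset = Dataset(
    load_data,
    preprocess_kwargs={
        "random_state": 42,
        "noise_level": 0.05,
        "normalize_coords": True,
        "normalize_data": False,
    },
)
```
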
### Subsampling | ||
Sometimes we do not wish to use the whole dataset, and as such we can subsample it. Sometimes using a subset | ||
of the time snapshots available is enough, for this we can use `deepymod.data.samples.Subsample_time` or | ||
randomly with `deepymod.data.samples.Subsample_random`. You can provide the arguments for these functions via `subsampler_kwargs` to the Dataset. | ||
|
||
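A sketch of random subsampling follows; the `subsampler` keyword and the `number_of_samples` argument are illustrative guesses, so consult the subsampler docstrings for the actual names:

```python
from deepymod.data import Dataset
from deepymod.data.samples import Subsample_random

dataset = Dataset(
    load_data,
    subsampler=Subsample_random,                   # hypothetical keyword
    subsampler_kwargs={"number_of_samples": 500},  # hypothetical argument
)
```
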
### The resulting shape of the data
Random subsampling works best on the `(number_of_samples, number_of_features)` format, while spatial subsampling works best on the `(t, x, y, z, number_of_features)` format, so both formats are accepted. However, since the trainer can only work with `(number_of_samples, number_of_features)`, the data is reshaped to this format once it has been preprocessed and subsampled. After this the data can be shuffled.

### Shuffling
If the data needs to be shuffled, pass `shuffle=True` to the Dataset.

## Dataloaders
In the PyTorch framework, dataloaders ensure that loading the data goes smoothly, for example with multiple workers. However, since we can typically fit the whole dataset into memory at once, the overhead of the PyTorch DataLoader is not needed. We therefore provide the `Loader`, a thin wrapper around the dataset that returns the entire dataset as a single batch.

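A minimal sketch of what this means in practice, assuming `Loader` wraps a dataset directly and yields `(coordinates, data)` tuples (reusing `dataset` from the examples above):

```python
from deepymod.data import Loader

loader = Loader(dataset)                # no workers, no mini-batching
coordinates, data = next(iter(loader))  # the single, full batch
```
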
## Obtaining the dataloaders
To create a train and test split, we can use the function `get_train_test_loader`, which divides the dataset into two pieces and passes each directly into a `Loader`.

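Putting the pieces together, an end-to-end sketch; whether an optional split-fraction argument exists is not documented here, so only the bare call is shown:

```python
from deepymod.data import Dataset, get_train_test_loader

dataset = Dataset(load_data, preprocess_kwargs={"noise_level": 0.05}, shuffle=True)
train_loader, test_loader = get_train_test_loader(dataset)
```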
```diff
@@ -1 +1 @@
-from .base import Dataset, Dataset_2D
+from deepymod.data.base import Dataset, Loader, get_train_test_loader
```