Commit 24c59c2

change code structure

mastoffel committed Nov 29, 2024
1 parent bdeb9a5
Showing 3 changed files with 25 additions and 10 deletions.
Binary file added output.png
Binary file added paper/eval.png
35 changes: 25 additions & 10 deletions paper/paper.md
@@ -44,7 +44,7 @@ bibliography: paper.bib

# Summary

Simulations are ubiquitous in research and application, but are often too slow and computationally expensive to deeply explore the underlying system. One solution is to create efficient emulators (also known as surrogate or meta-models) to approximate simulations, but this requires substantial expertise. Here, we present AutoEmulate, a low-code, AutoML-style Python package for emulation. AutoEmulate makes it easy to fit and compare emulators, abstracting away the need for extensive machine learning (ML) experimentation. The package includes a range of emulators, from Gaussian Processes, Support Vector Machines and Gradient Boosting Models to novel, experimental deep learning emulators such as Neural Processes [@garnelo_conditional_2018]. It also implements global sensitivity analysis, a common emulator application that quantifies the relative contribution of different inputs to the output variance. In the future, with user feedback and contributions, we aim to grow AutoEmulate into an end-to-end tool for most emulation problems.

# Statement of need

@@ -54,36 +54,51 @@ AutoEmulate automates emulator building, with the goal to eventually streamline

# Pipeline

The inputs for AutoEmulate are X and y, where X is a 2D array (e.g. a NumPy array or Pandas DataFrame) with one simulation parameter per column and one set of parameter values per row, and y is an array containing the corresponding simulation outputs, which can be single- or multi-output. A dataset X, y is usually constructed by sampling the input parameters X, for example with Latin Hypercube Sampling (McKay et al., 1979), and evaluating the simulation on these inputs to obtain the outputs y. With X and y, we can create an emulator with AutoEmulate in just a few lines of code.
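As a concrete sketch of this setup, X and y could be built with SciPy's Latin Hypercube sampler and a toy simulation; the `simulate` function, bounds, and sample size below are illustrative assumptions, not part of AutoEmulate:

```python
import numpy as np
from scipy.stats import qmc

# toy simulation with two input parameters and one output (illustrative only)
def simulate(params):
    x1, x2 = params
    return np.sin(x1) + 0.5 * x2**2

# Latin Hypercube sample of 200 parameter sets, scaled to the parameter bounds
sampler = qmc.LatinHypercube(d=2, seed=0)
X = qmc.scale(sampler.random(n=200), l_bounds=[0.0, -1.0], u_bounds=[np.pi, 1.0])
y = np.array([simulate(p) for p in X])
```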

```python
from autoemulate.compare import AutoEmulate

# creating an emulator
ae = AutoEmulate()
ae.setup(X, y) # allows customising the pipeline
ae.compare() # compares emulators
```

Under the hood, AutoEmulate runs a complete ML pipeline. It splits the data into training and test sets, standardises inputs, fits a set of user-specified emulators, compares them using cross-validation, and optionally optimises hyperparameters using pre-defined search spaces. The cross-validation results can then easily be summarised and visualised.

```python
# cross-validation results
ae.summarise_cv() # cv metrics for each model
# ae.plot_cv() # visualise cv results
```
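Conceptually, the comparison step resembles the following scikit-learn loop. This is a simplified sketch under our assumptions, not AutoEmulate's actual internals, and the two candidate models shown are just a subset of the available emulator types:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor

# hold out a test set, then cross-validate each candidate emulator on the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
candidates = {
    "gp": GaussianProcessRegressor(),
    "rf": RandomForestRegressor(),
}
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_train, y_train, scoring="r2", cv=5)
    print(f"{name}: mean cv R^2 = {scores.mean():.3f}")
```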

Cross-validation metrics for each model, sorted by R²:

| Model | Short Name | RMSE | R² |
|-------|------------|------|------|
| Gaussian Process | gp | 0.1027 | 0.9851 |
| Random Forest | rf | 0.1511 | 0.9677 |
| Gradient Boosting | gb | 0.1566 | 0.9642 |
| Conditional Neural Process | cnp | 0.1915 | 0.9465 |
| Radial Basis Functions | rbf | 0.3518 | 0.7670 |
| Support Vector Machines | svm | 0.4924 | 0.6635 |
| LightGBM | lgbm | 0.6044 | 0.4930 |
| Second Order Polynomial | sop | 0.8378 | 0.0297 |

After an emulator has been chosen based on its cross-validation performance, it can be evaluated on the test set, which by default holds out 20% of the original dataset.

```python
# evaluating the emulator
emulator = ae.get_model("GaussianProcess")
ae.evaluate(emulator) # get test set scores
ae.plot_eval(emulator) # visualise test set predictions
```

![Test set predictions](eval.png)
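The reported test-set scores are standard regression metrics. As a minimal sketch of what they measure, assuming access to the held-out split as `X_test`, `y_test` (names we choose here; AutoEmulate manages the split internally):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# compare emulator predictions against held-out simulation outputs
y_pred = emulator.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE = {rmse:.4f}, R^2 = {r2:.4f}")
```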

If the test-set performance is acceptable, the emulator can be refitted on the combined training and test data before applying it. It can then serve as an efficient replacement for the original simulation, generating tens of thousands of new data points in milliseconds using `predict()`. We've also implemented global sensitivity analysis, a common use case for emulators, which decomposes the variance in the outputs into the contributions of the various simulation parameters and their interactions.

```python
emulator = ae.refit(emulator) # refit using full data
# application
emulator.predict(X) # predict outputs for new inputs
ae.sensitivity_analysis(emulator) # global SA with Sobol indices
```
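For intuition on what the Sobol-based analysis computes, here is a standalone sketch using SALib; the problem definition and sample size are our assumptions, and AutoEmulate's `sensitivity_analysis()` may use a different backend:

```python
from SALib.sample import saltelli
from SALib.analyze import sobol

# define the input space (names and bounds are illustrative assumptions)
problem = {"num_vars": 2, "names": ["x1", "x2"], "bounds": [[0.0, 3.14], [-1.0, 1.0]]}

# the fast emulator makes the many required model evaluations cheap
X_s = saltelli.sample(problem, 1024)
y_s = emulator.predict(X_s)
Si = sobol.analyze(problem, y_s)
print(Si["S1"], Si["ST"]) # first-order and total-order Sobol indices
```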
