From 6ea717dd96e22c89d0fa35de7a195b3968474870 Mon Sep 17 00:00:00 2001
From: mastoffel
Date: Mon, 2 Dec 2024 13:44:39 +0000
Subject: [PATCH] update paper

---
 paper/paper.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/paper/paper.md b/paper/paper.md
index ecd8f89a..98c531b9 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -48,9 +48,9 @@ Simulations are ubiquitous in research and application, but are often too slow a

 # Statement of need

-To understand complex real-world systems, researchers and engineers often construct computer simulations. These can be computationally expensive and take minutes, hours or even days to run. For tasks like optimisation, sensitivity analysis or uncertainty quantification where thousands or even millions of runs are needed, a solution has long been to approximate simulations with efficient emulators, which can be orders of magnitudes faster [@forrester_recent_2009, @kudela_recent_2022]. Emulation is becoming increasingly widespread, ranging from engineering [@yondo_review_2018], architecture [@westermann_surrogate_2019], biomedical [@strocci_cell_2023] and climate science [@bounceur_global_2015], to agent-based models [@angione_using_2022]. A typical emulation pipeline involves three steps: 1. Evaluating the expensive simulation at a small, strategically chosen set of inputs using techniques such as Latin Hypercube Sampling [@mckay_comparison_1979] to create a representative dataset, 2. constructing a high-accuracy emulator using that dataset, which involves model selection, hyperparameter optimisation and evaluation and 3. applying the emulator to tasks such as prediction, sensitivity analysis, or optimisation. Building an emulator in particular is a key challenge which requires substantial machine learning experimentation within an ever increasing ecosystem of models and packages. This puts a substantial burden on practitioners whose main focus is to explore the underlying system, not building the emulator.
+To understand complex real-world systems, researchers and engineers often construct computer simulations. These can be computationally expensive and take minutes, hours or even days to run. For tasks like optimisation, sensitivity analysis or uncertainty quantification, where thousands or even millions of runs are needed, a solution has long been to approximate simulations with efficient emulators, which can be orders of magnitude faster [@forrester_recent_2009; @kudela_recent_2022]. Emulation is becoming increasingly widespread, ranging from engineering [@yondo_review_2018], architecture [@westermann_surrogate_2019], biomedicine [@strocchi_cell_2023] and climate science [@bounceur_global_2015] to agent-based models [@angione_using_2022]. A typical emulation pipeline involves three steps: 1. evaluating the simulation at a small, strategically chosen set of inputs using techniques such as Latin Hypercube Sampling [@mckay_comparison_1979] to create a representative dataset, 2. constructing a high-accuracy emulator from that dataset, which involves model selection, hyperparameter optimisation and evaluation, and 3. applying the emulator to tasks such as prediction, sensitivity analysis or optimisation. Building an emulator in particular requires substantial machine learning experience and knowledge of an ever-increasing ecosystem of models and packages. This puts a substantial burden on practitioners whose main focus is exploring the underlying system, not building an emulator.
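
As a rough illustration of step 1 of this pipeline, the sketch below draws a space-filling design with Latin Hypercube Sampling and evaluates a toy simulation on it. The `simulation` function, its two inputs and their bounds are hypothetical placeholders, and `scipy.stats.qmc` is used here only as one possible sampling tool, not as part of AutoEmulate.

```python
# Sketch of step 1: a space-filling input design via Latin Hypercube Sampling,
# evaluated on a stand-in "simulation" (placeholder for an expensive model).
import numpy as np
from scipy.stats import qmc

def simulation(x):
    """Toy simulator with two inputs and one output (illustrative only)."""
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

sampler = qmc.LatinHypercube(d=2, seed=0)               # 2 input dimensions
unit_samples = sampler.random(n=100)                    # 100 points in [0, 1)^2
X = qmc.scale(unit_samples, [0.0, 0.0], [np.pi, 2.0])   # map to hypothetical parameter bounds
y = simulation(X)                                       # small, representative dataset
```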

-AutoEmulate automates emulator building, with the goal to eventually streamline the whole emulation pipeline. For people new to ML, AutoEmulate compares, optimises and evaluates a range of models to create an efficient emulator for their simulation in just a few lines of code. For experienced surrogate modellers, AutoEmulate provides a reference set of cutting-edge emulators to quickly benchmark new models against. The package includes classic emulators such as Radial Basis Functions and Gaussian Processes, established ML models like Gradient Boosting and Support Vector Machines, as well as experimental deep learning emulators such as [Conditional Neural Processes](https://yanndubs.github.io/Neural-Process-Family/text/Intro.html) [@garnelo_conditional_2018]. AutoEmulate is built to be extensible. Emulators follow the popular [scikit-learn estimator template](https://scikit-learn.org/1.5/developers/develop.html#rolling-your-own-estimator) and deep learning models are supported with little overhead through PyTorch [@paszke_pytorch_2019] with a skorch [@tietz_skorch_2017] interface. AutoEmulate fills a gap in the current landscape of surrogate modeling tools as it’s both highly accessible for newcomers while providing cutting-edge emulators for experienced surrogate modelers. In contrast, existing libraries either focus on lower level implementations of specific models, like GPflow [@matthews_gpflow_2017] and GPytorch [@gardner_gpytorch_2018], provide multiple emulators but require to manually pre-process data, compare emulators and optimise hyperparameters like SMT in Python [@saves_smt_2024] or [Surrogates.jl](https://docs.sciml.ai/Surrogates/latest/) in Julia.
+AutoEmulate automates emulator building, with the goal of eventually streamlining the whole emulation pipeline. For people new to ML, AutoEmulate compares, optimises and evaluates a range of models to create an efficient emulator for their simulation in just a few lines of code. For experienced surrogate modellers, AutoEmulate provides a reference set of cutting-edge emulators to quickly benchmark new models against. The package includes classic emulators such as Radial Basis Functions and Gaussian Processes, established ML models like Gradient Boosting and Support Vector Machines, as well as experimental deep learning emulators such as [Conditional Neural Processes](https://yanndubs.github.io/Neural-Process-Family/text/Intro.html) [@garnelo_conditional_2018]. AutoEmulate is built to be extensible: emulators follow the popular [scikit-learn estimator template](https://scikit-learn.org/1.5/developers/develop.html#rolling-your-own-estimator), and deep learning models are supported with little overhead using PyTorch [@paszke_pytorch_2019] with a skorch [@tietz_skorch_2017] interface. AutoEmulate fills a gap in the current landscape of surrogate modelling tools: it is highly accessible to newcomers while providing cutting-edge emulators for experienced surrogate modellers. In contrast, existing libraries either focus on lower-level implementations of specific models, like GPflow [@matthews_gpflow_2017] and GPyTorch [@gardner_gpytorch_2018], or provide multiple emulators but require users to manually pre-process data, compare emulators and optimise hyperparameters, like SMT in Python [@saves_smt_2024] or [Surrogates.jl](https://docs.sciml.ai/Surrogates/latest/) in Julia.
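
To sketch what following the scikit-learn estimator template means in practice, the example below implements a deliberately trivial nearest-neighbour "emulator" with scikit-learn-style `fit` and `predict` methods. It is an illustration of the template only, not a model shipped with AutoEmulate.

```python
# A minimal estimator following the scikit-learn template (illustrative only).
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class NearestNeighbourEmulator(BaseEstimator, RegressorMixin):
    """Predicts the output of the closest training point (toy example)."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)       # validate inputs, as the template expects
        self.X_, self.y_ = X, y      # fitted attributes end with an underscore
        return self                  # fit returns the estimator itself

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # index of the nearest training point for each query point
        idx = np.argmin(np.linalg.norm(self.X_[None, :, :] - X[:, None, :], axis=2), axis=1)
        return self.y_[idx]
```

Because the class follows this pattern, standard scikit-learn tooling such as cross-validation can handle it directly.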

 # Pipeline

@@ -84,15 +84,15 @@ ae.summarise_cv() # cv scores for each model
 | LightGBM                | lgbm | 0.6044 | 0.4930 |
 | Second Order Polynomial | sop  | 0.8378 | 0.0297 |

-After selecting an emulator based on its cross-validation performance, it can be evaluated on the held-out test set. AutoEmulate again calculates RMSE and R² scores, and provides several visualisations to assess the quality of predictions.
+After comparing cross-validation metrics and plots, an emulator can be selected and evaluated on the held-out test set.

 ```python
 emulator = ae.get_model("GaussianProcess")  # select fitted emulator
 ae.evaluate(emulator)                       # calculate test set scores
-ae.plot_eval(emulator)                      # visualise test set predictions
+ae.plot_eval(emulator, input_index=[0, 1])  # visualise test set predictions
 ```

-![Test set predictions](eval_2.png)
+![Test set predictions for each input](eval_2.png)

 Finally, the emulator can be refitted on the combined training and test set data before applying it. It's now ready to be used as an efficient replacement for the original simulation, and is able to generate tens of thousands of new data points in negligible time using predict(). We also implemented global sensitivity analysis as a common emulator application, which decomposes the variance in the outputs into the contributions of the various simulation parameters and their interactions.
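
To make that last step concrete, the sketch below uses a fitted emulator as a cheap stand-in for the simulation and computes Sobol sensitivity indices with SALib. This only illustrates the idea of variance decomposition and is not AutoEmulate's built-in sensitivity analysis; `emulator` is assumed to be the fitted model from the snippet above, with two inputs, a single output and hypothetical parameter bounds.

```python
# Illustration only: Sobol sensitivity analysis on emulator predictions using
# SALib (a separate library), assuming a single-output emulator with 2 inputs.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 2,
    "names": ["x1", "x2"],
    "bounds": [[0.0, np.pi], [0.0, 2.0]],   # placeholder bounds
}

X_new = saltelli.sample(problem, 4096)                # tens of thousands of input points
y_new = np.asarray(emulator.predict(X_new)).ravel()   # near-instant compared to the simulation

Si = sobol.analyze(problem, y_new)                    # decompose output variance
print(Si["S1"], Si["ST"])                             # first-order and total-order indices
```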