diff --git a/paper/paper.md b/paper/paper.md
index 956170d9..e85c563d 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -54,22 +54,21 @@ AutoEmulate automates emulator building, with the goal to eventually streamline
 
 # Pipeline
 
-The inputs for AutoEmulate are X and y, where X is a 2D array (e.g. numpy-array, Pandas DataFrame) containing one simulation parameter per column and their values in rows, and y is an array containing the corresponding simulation outputs. A dataset X, y is usually constructed by sampling input parameters X using Latin Hypercube Sampling (McKay et al., 1979) and evaluating the simulation on these inputs to obtain outputs y. With X and y, we can create an emulator with AutoEmulate in just a few lines of code.
+The inputs for AutoEmulate are X and y, where X is a 2D array (e.g. NumPy array, Pandas DataFrame) containing one simulation parameter per column and their values in rows, and y is an array containing the corresponding simulation outputs. A dataset X, y is usually constructed by sampling input parameters X using Latin Hypercube Sampling [@mckay_comparison_1979] and evaluating the simulation on these inputs to obtain outputs y. With X and y, we can create an emulator with AutoEmulate in just a few lines of code.
 
 ```python
 from autoemulate.compare import AutoEmulate
 
 ae = AutoEmulate()
-ae.setup(X, y) # allows to customise pipeline
-ae.compare() # runs the pipeline
+ae.setup(X, y)  # customise pipeline
+ae.compare()    # runs the pipeline
 ```
 
-Under the hood, AutoEmulate runs a complete ML pipeline. It splits the data into training and test sets, standardises inputs, fits a set of user-specified emulators, compares them using cross-validation and optionally optimises hyperparameters using pre-defined search spaces. The cross-validation results can then be visualised and summarised.
-.
+Under the hood, AutoEmulate runs a complete ML pipeline. It splits the data into training and test sets, standardises inputs, fits a set of user-specified emulators, compares them using cross-validation and optionally optimises hyperparameters using pre-defined search spaces. All these steps can be customised in `setup()`. After running `compare()`, the cross-validation results can be visualised and summarised.
 
 ```python
-ae.plot_cv() # visualise results
-ae.summarise_cv() # metrics for each model
+ae.plot_cv()       # visualise results
+ae.summarise_cv()  # metrics for each model
 ```
 
 : Average cross-validation scores
@@ -85,22 +84,22 @@ ae.summarise_cv() # metrics for each model
 | LightGBM                | lgbm   | 0.6044 | 0.4930 |
 | Second Order Polynomial | sop    | 0.8378 | 0.0297 |
 
-After choosing an emulator based on cross-validation metrics and visualisations, it can be evaluated on the test set, which defaults to be 20% of the original dataset.
+After choosing an emulator based on cross-validation metrics and visualisations, we can evaluate it on the test set, which defaults to 20% of the original dataset.
 
 ```python
 emulator = ae.get_model("GaussianProcess")
-ae.evaluate(emulator) # calculate test set scores
-ae.plot_eval(emulator) # visualise test set predictions
+ae.evaluate(emulator)   # calculate test set scores
+ae.plot_eval(emulator)  # visualise test set predictions
 ```
 
-![Test set predictions](eval_3.png)
+![Test set predictions](eval_2.png)
 
-If the test-set performance is acceptable, the emulator can be refitted on the combined training and test data before applying it. It's now ready to be used as an efficient replacement for the original simulation by generating tens of thousands of new data points in seconds using predict(). We have also implemented global sensitivity analysis, a common use-case for emulators, which decomposes the variance in the outputs into the contributions of the various simulation parameters and their interactions.
+If the test-set performance is acceptable, the emulator can be refitted on the combined training and test data before applying it. It's now ready to be used as an efficient replacement for the original simulation, generating tens of thousands of new data points in seconds using `predict()`. We have also implemented global sensitivity analysis, a common use case for emulators, which decomposes the variance in the outputs into the contributions of the various simulation parameters and their interactions.
 
 ```python
-emulator = ae.refit(emulator) # refit using full data
-emulator.predict(X) # efficiently generate new samples
-ae.sensitivity_analysis(emulator) # global SA with Sobol indices
+emulator = ae.refit(emulator)      # refit using full data
+emulator.predict(X)                # emulate!
+ae.sensitivity_analysis(emulator)  # global SA with Sobol indices
 ```
 
 # Acknowledgements
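To make the dataset-construction step in the first paragraph of the hunk above concrete, here is a minimal sketch of building X and y with Latin Hypercube Sampling via SciPy's `scipy.stats.qmc` module; the `simulate` function and the parameter bounds are illustrative stand-ins, not part of AutoEmulate.

```python
import numpy as np
from scipy.stats import qmc

def simulate(params):
    """Toy simulator standing in for an expensive model (illustrative only)."""
    x1, x2 = params
    return np.sin(x1) + 0.5 * x2**2

# Latin Hypercube sample of 100 points, rescaled to assumed parameter bounds
sampler = qmc.LatinHypercube(d=2, seed=0)
X = qmc.scale(sampler.random(n=100), [0.0, 0.0], [np.pi, 2.0])

# Evaluate the simulation at each sampled input to obtain the outputs
y = np.array([simulate(x) for x in X])
```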
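Since `setup()` is where the pipeline is customised, a hedged illustration of the kind of options it might expose follows; the keyword names below are hypothetical and should be checked against the AutoEmulate documentation.

```python
# Hypothetical keyword names, shown only to illustrate pipeline customisation
ae.setup(
    X, y,
    models=["GaussianProcess", "RandomForest"],  # subset of emulators to compare
    param_search=True,                           # optimise hyperparameters
    folds=5,                                     # cross-validation folds
)
```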
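For reference, the global sensitivity analysis mentioned above reports variance-based Sobol indices; the standard first-order and total-order definitions (general notation, not AutoEmulate-specific) are

$$
S_i = \frac{\operatorname{Var}\left[\mathbb{E}(Y \mid X_i)\right]}{\operatorname{Var}(Y)}, \qquad
S_{T_i} = 1 - \frac{\operatorname{Var}\left[\mathbb{E}(Y \mid X_{\sim i})\right]}{\operatorname{Var}(Y)},
$$

where $X_{\sim i}$ denotes all parameters except $X_i$; $S_i$ captures the main effect of parameter $i$, while $S_{T_i}$ additionally includes all of its interactions.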