addressed e.g.s, research list, and CES pipeline list
odunbar committed Apr 18, 2024
1 parent a580a24 commit 18d88b8
Showing 1 changed file, paper.md, with 17 additions and 9 deletions.

# Summary

A Julia language [@julia] package providing a practical and modular implementation of "Calibrate, Emulate, Sample" [@Cleary:2021], hereafter CES, an accelerated workflow for obtaining model parametric uncertainty, is presented. This is also known as Bayesian inversion or uncertainty quantification. To apply CES, one requires a computer model (written in any programming language) that depends on free parameters, a prior distribution encoding prior knowledge about the free parameters, and data with which to constrain this prior distribution. The pipeline has three stages, most easily explained in reverse:
1. the final stage, and goal of the workflow, is to draw samples (Sample) from the Bayesian posterior distribution, that is, the prior distribution conditioned on the observed data;
2. to accelerate and regularize this process we train statistical emulators to represent the user-provided parameter-to-data map (Emulate);
3. the training points for these emulators are generated by the computer model, and selected adaptively around regions of high posterior mass (Calibrate).
We describe CES as an accelerated workflow because it typically requires dramatically fewer evaluations of the computer model than applying sampling algorithms, such as Markov chain Monte Carlo (MCMC), directly. A schematic sketch of the full pipeline is given after the list of tools below.

* Calibration tools: We recommend choosing adaptive training points with ensemble Kalman methods such as ensemble Kalman inversion (EKI) [@Iglesias:2013] and its variants [@Huang:2022]; CES provides explicit utilities from the codebase EnsembleKalmanProcesses.jl [@Dunbar:2022a].
* Emulation tools: CES integrates any statistical emulator. Currently implemented are Gaussian processes (GP) [@Rasmussen:2006], provided through the packages SciKitLearn.jl [@scikit-learn] and GaussianProcesses.jl [@Fairbrother:2022], and random features [@Rahimi:2007;@Rahimi:2008;@Liu:2022], provided through [RandomFeatures.jl](https://doi.org/10.5281/zenodo.7141158), which can offer additional flexibility and scalability, particularly in higher dimensions.
* Sampling tools: The regularized and accelerated sampling problem is solved with MCMC, and CES provides variants of Random Walk Metropolis [@Metropolis:1953;@Sherlock:2010] and preconditioned Crank-Nicolson [@Cotter:2013], using APIs from [Turing.jl](https://turinglang.org/). Some regular emulator mean functions are differentiable, and including accelerations of derivative-based MCMC into CES [e.g., NUTS, @hoffman:2014; Barker, @livingstone:2022] is an active direction of work.
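
To make the workflow concrete, the following is a minimal, schematic sketch of the three stages on a toy problem. The forward map `G`, synthetic data `y`, noise covariance `Γ`, prior bounds, and numerical settings are placeholder choices rather than content from the paper, and the module paths and function names (e.g., `Emulator`, `MCMCWrapper`, `get_posterior`) follow the package documentation and may need adjusting to the installed release.

```julia
# Schematic CES sketch on a toy problem. All problem ingredients below are
# placeholders; the API names follow the CalibrateEmulateSample.jl and
# EnsembleKalmanProcesses.jl documentation and should be verified against
# the installed versions.
using LinearAlgebra, Random
using CalibrateEmulateSample
using CalibrateEmulateSample.Emulators
using CalibrateEmulateSample.MarkovChainMonteCarlo
using CalibrateEmulateSample.Utilities
using CalibrateEmulateSample.EnsembleKalmanProcesses
using CalibrateEmulateSample.EnsembleKalmanProcesses.ParameterDistributions

rng = Random.MersenneTwister(41)

# Toy forward map standing in for an expensive computer model, with a
# synthetic observation generated at "true" parameters.
G(θ) = [θ[1] + θ[2], θ[1] * θ[2]]
Γ = 0.1 * Matrix(I, 2, 2)                       # observational noise covariance
y = G([3.0, 7.0]) + sqrt(0.1) * randn(rng, 2)   # noisy synthetic data

# Prior: two bounded parameters (placeholder names, means, and bounds).
prior = combine_distributions([
    constrained_gaussian("amplitude", 2.0, 1.0, 0.0, 10.0),
    constrained_gaussian("vert_shift", 0.0, 3.0, -10.0, 10.0),
])

# -- Calibrate: ensemble Kalman inversion selects training points adaptively.
N_ensemble, N_iterations = 10, 5
initial_ensemble = construct_initial_ensemble(rng, prior, N_ensemble)
ekp = EnsembleKalmanProcess(initial_ensemble, y, Γ, Inversion(); rng = rng)
for _ in 1:N_iterations
    ϕ = get_ϕ_final(prior, ekp)                          # ensemble in physical space
    G_ens = hcat([G(ϕ[:, j]) for j in 1:N_ensemble]...)  # evaluate the model
    update_ensemble!(ekp, G_ens)
end

# -- Emulate: train a Gaussian process on the calibration input-output pairs.
input_output_pairs = get_training_points(ekp, N_iterations)
emulator = Emulator(GaussianProcess(GPJL()), input_output_pairs; obs_noise_cov = Γ)
optimize_hyperparameters!(emulator)

# -- Sample: Random Walk Metropolis MCMC on the emulated posterior.
mcmc = MCMCWrapper(RWMHSampling(), y, prior, emulator)
chain = MarkovChainMonteCarlo.sample(mcmc, 100_000)
posterior = get_posterior(mcmc, chain)
```

In practice, the computer model evaluations inside the calibration loop are the only places the expensive code is run; the emulator replaces it during sampling.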

To highlight code accessibility, we also provide a suite of detailed, scientifically inspired examples, with documentation that walks users through several use cases. These use cases not only demonstrate the capability of the CES pipeline, but also familiarize users with the typical interface and workflow.


# Statement of need

Computationally expensive computer codes for predictive modelling are ubiquitous across science and engineering disciplines. Free parameter values that exist within these modelling frameworks are typically constrained by observations to produce accurate and robust predictions about the system they are approximating numerically. In a Bayesian setting, this is viewed as evolving an initial parameter distribution (based on prior information), through the input of observed data, into a more informative, data-consistent distribution (the posterior). Unfortunately, this task is intensely computationally expensive, commonly requiring over $10^5$ evaluations of the expensive computer code (e.g., with Random Walk Metropolis), with accelerations relying on intrusive model information, such as a derivative of the parameter-to-data map. CES is able to approximate and accelerate this process in a non-intrusive fashion, requiring only on the order of $10^2$ evaluations of the original computer model. This opens the door to quantifying parametric uncertainty for a class of numerically intensive computer codes for which it was previously out of reach.
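
To fix notation for this excerpt (the symbols here are standard, not definitions quoted from the paper): writing the parameter-to-data map as $\mathcal{G}$, the observations as $y$ with Gaussian noise covariance $\Gamma$, and the prior density as $\pi_0$, the posterior density targeted by the workflow is

$$
\pi(\theta \mid y) \;\propto\; \exp\!\left(-\tfrac{1}{2}\,\big\|y - \mathcal{G}(\theta)\big\|_{\Gamma}^{2}\right)\pi_0(\theta),
\qquad \|v\|_{\Gamma}^{2} := v^{\top}\Gamma^{-1}v,
$$

and it is the evaluations of $\mathcal{G}$, replaced in CES by a trained emulator, that dominate the cost of sampling from $\pi(\theta \mid y)$.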


# State of the field

In Julia there are a few tools for performing non-accelerated uncertainty quantification, ranging from classical sensitivity analysis approaches, for example [UncertaintyQuantification.jl](https://zenodo.org/records/10149017) and GlobalSensitivity.jl [@Dixit:2022], to MCMC, for example [Mamba.jl](https://github.com/brian-j-smith/Mamba.jl) or [Turing.jl](https://turinglang.org/). For computational efficiency, ensemble methods also provide approximate sampling [e.g., the Ensemble Kalman Sampler, @Garbuno-Inigo:2020b; @Dunbar:2022a], though these only provide Gaussian approximations of the posterior.

Accelerated uncertainty quantification tools also exist for the related approach of Approximate Bayesian Computation (ABC), for example GpABC [@Tankhilevich:2020] or [ApproxBayes.jl](https://github.com/marcjwilliams1/ApproxBayes.jl?tab=readme-ov-file); these tools both approximately sample from the posterior distribution. In ABC, this approximation comes from bypassing the likelihood that is usually required in sampling methods such as MCMC; instead, the likelihood is replaced with a scalar-valued sampling objective that compares model output and data. In CES, the approximation comes from learning the parameter-to-data map; an explicit likelihood is then calculated and sampled exactly via MCMC. Some ABC algorithms also make use of statistical emulators to further accelerate sampling (GpABC). Although flexible, ABC encounters challenges due to the subjectivity of summary statistics and distance metrics, which may lead to approximation errors, particularly in high-dimensional settings [@Nott:2018]. CES is more restrictive due to its use of an explicit Gaussian likelihood, but it also leverages this structure to deal with high-dimensional data.
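
For contrast, the simplest ABC rejection scheme never evaluates a likelihood: with summary statistics $S$, a distance $d$, and a tolerance $\epsilon$ (standard textbook notation, not specific to the packages above), a prior draw $\theta \sim \pi_0$ is retained whenever

$$
d\big(S(\mathcal{G}(\theta)),\, S(y)\big) < \epsilon,
$$

whereas CES keeps the explicit Gaussian likelihood above and accelerates its evaluation by emulating $\mathcal{G}$.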

Several other tools are available in other languages for the purpose of accelerated learning of the posterior distribution or posterior sampling. Two such examples, written in Python, approximate the log-posterior distribution directly with a Gaussian process: [PyVBMC](https://github.com/acerbilab/pyvbmc) [@Huggins:2023], which additionally uses variational approximations to calculate the normalization constant, and [GPry](https://github.com/jonaselgammal/GPry) [@Gammal:2023], which iteratively trains the GP with an active training point selection algorithm. Such algorithms are distinct from CES, which approximates the parameter-to-data map with the Gaussian process and advocates ensemble Kalman methods to select training points.


A histogram of the samples from the CES algorithm is displayed in \autoref{fig:GP_2d_posterior}. We see that the posterior distribution contains the true value $(3.0, 7.0)$ with high probability.




# Research projects using the package
Some research projects that use this codebase, or modifications of it, are:
* [@Dunbar:2021]
* [@Bieli:2022]
* [@Hillier:2022]
* [@Howland:2022]
* [@Dunbar:2022b]
* [@Mansfield:2022]
* [@King:2023]

# Acknowledgements

