Commit aaf8c32 ("Update paper.md") by sgbaird, May 9, 2024. One file changed: reports/paper.md, with 4 additions and 4 deletions.

# Summary

The progress of a machine learning field is both tracked and propelled through the development of robust benchmarks. While significant progress has been made in creating standardized, easy-to-use benchmarks for molecular discovery, e.g., [@brown_guacamol_2019], this remains a challenge for solid-state materials discovery [@xie_crystal_2022; @zhao_physics_2023; @alverson_generative_2022]. To address this limitation, we propose `matbench-genmetrics`, an open-source Python library for benchmarking generative models for crystal structures. We use four evaluation metrics inspired by Guacamol [@brown_guacamol_2019] and the Crystal Diffusion Variational AutoEncoder (CDVAE) [@xie_crystal_2022]---validity, coverage, novelty, and uniqueness---to assess performance on Materials Project data splits using timeline-based cross-validation. We believe that `matbench-genmetrics` will provide the standardization and convenience required for rigorous benchmarking of crystal structure generative models. A visual overview of the `matbench-genmetrics` library is provided in \autoref{fig:summary}.

![Summary visualization of `matbench-genmetrics` to evaluate crystal generative model performance using validity, coverage, novelty, and uniqueness metrics based on calendar-time splits of experimentally determined Materials Project database entries. Validity is the comparison of distribution characteristics (space group number) between the generated materials and the training and test sets. Coverage is the number of matches between the generated structures and a held-out test set. Novelty is a comparison between the generated and training structures. Finally, uniqueness is a measure of the number of repeats within the generated structures (i.e., comparing the set of generated structures to itself). For in-depth descriptions and equations for the four metrics described above, see [https://matbench-genmetrics.readthedocs.io/en/latest/readme.html](https://matbench-genmetrics.readthedocs.io/en/latest/readme.html) and [https://matbench-genmetrics.readthedocs.io/en/latest/metrics.html](https://matbench-genmetrics.readthedocs.io/en/latest/metrics.html).\label{fig:summary}](figures/matbench-genmetrics.png)


# Statement of need

In the field of materials informatics, where materials science intersects with machine learning, benchmarks play a crucial role in assessing model performance and enabling fair comparisons among various tools and models. Typically, these benchmarks focus on evaluating the accuracy of predictive models for materials properties, utilizing well-established metrics such as mean absolute error and root-mean-square error to measure performance against actual measurements. A standard practice involves splitting the data into two parts, with one serving as training data for model development and the other as test data for assessing performance [@dunn_benchmarking_2020].
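
As a minimal, self-contained sketch of that standard predictive-benchmarking recipe (synthetic features and a linear model stand in for a real materials property dataset and model; none of this is `matbench-genmetrics` code), the split-train-score workflow looks like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data: 200 samples with 5 features and a noisy linear property.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ rng.random(5) + 0.1 * rng.standard_normal(200)

# Hold out 20% of the data as a test set; the model only sees the training split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE is the square root of MSE
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```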

However, benchmarking generative models, which aim to create entirely new data rather than focusing solely on predictive accuracy, presents unique challenges. While significant progress has been made in standardizing benchmarks for tasks like image generation and molecule synthesis, the field of crystal structure generative modeling lacks this level of standardization (this is distinct from machine learning interatomic potentials, which already have the robust and comprehensive [`matbench-discovery`](https://matbench-discovery.materialsproject.org/) [@riebesell_matbench_2024] and [Jarvis Leaderboard](https://pages.nist.gov/jarvis_leaderboard/) benchmarking frameworks [@choudhary_large_2023]). Molecular generative modeling benefits from widely adopted benchmark platforms such as Guacamol [@brown_guacamol_2019] and Moses [@polykovskiy_molecular_2020], which offer easy installation, usage guidelines, and leaderboards for tracking progress. In contrast, existing evaluations in crystal structure generative modeling, as seen in CDVAE [@xie_crystal_2022], FTCP [@ren_invertible_2022], PGCGM [@zhao_physics_2023], CubicGAN [@zhao_high-throughput_2021], and CrysTens [@alverson_generative_2022], are not standardized, can be difficult to install and apply to new models and datasets, and lack publicly accessible leaderboards. While these evaluations are valuable within their respective scopes, there is a clear need for a dedicated benchmarking platform to promote standardization and facilitate robust comparisons.

In this work, we introduce `matbench-genmetrics`, a materials benchmarking platform for crystal structure generative models. We use concepts from molecular generative modeling benchmarking to create a set of evaluation metrics---validity, coverage, novelty, and uniqueness---which are broadly defined as follows:

- Validity: a comparison of distribution characteristics (space group number) between the generated materials and the training and test sets
- Coverage: the number of matches between the generated structures and a held-out test set
- Novelty: a comparison between the generated structures and the training structures
- Uniqueness: a measure of the number of repeats within the generated structures (i.e., comparing the set of generated structures to itself)

The `matbench_genmetrics.core` namespace package provides the following features:
- `MPTSMetrics`: class for loading `mp_time_split` data, calculating time-series cross-validation metrics, and saving results
- Fixed benchmark classes for 10, 100, 1000, and 10000 generated structures
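
To make the intended usage concrete, below is a minimal sketch of driving one of the fixed benchmark classes in a time-series cross-validation loop. The import path, the `MPTSMetrics10` class, and the `folds`, `get_train_and_val_data`, `evaluate_and_record`, and `recorded_metrics` members are assumptions inferred from the feature list above rather than taken verbatim from the library, and the placeholder "generator" simply resubmits training structures.

```python
# Hedged sketch of the benchmark loop; names marked "assumed" are inferred,
# not confirmed against the matbench-genmetrics API.
from matbench_genmetrics.core.metrics import MPTSMetrics10  # assumed import path

mptm = MPTSMetrics10()  # fixed benchmark: score 10 generated structures per fold

for fold in mptm.folds:  # assumed iterable of time-series cross-validation folds
    train_structures = mptm.get_train_and_val_data(fold)  # assumed data loader

    # Placeholder "generative model": resubmit the first 10 training structures.
    # A real study would instead train a model on `train_structures` and sample
    # new pymatgen Structure objects from it.
    gen_structures = list(train_structures)[:10]

    mptm.evaluate_and_record(fold, gen_structures)  # assumed evaluation method

print(mptm.recorded_metrics)  # validity, coverage, novelty, and uniqueness per fold
```

Swapping in the 100-, 1000-, or 10000-structure benchmark classes would follow the same pattern.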

Additionally, we introduce the `matbench_genmetrics.mp_time_split` namespace package as a complement to `matbench_genmetrics.core`. It provides a standardized dataset and cross-validation splits for evaluating the four metrics described above. Time-based splits have been utilized in materials informatics model validation, such as predicting future thermoelectric materials via word embeddings [@tshitoyan_unsupervised_2019], searching for efficient solar photoabsorption materials through multi-fidelity optimization [@palizhati_agents_2022], and predicting future materials stability trends via network models [@aykol_network_2019]. Recently, Hu et al. [@zhao_physics_2023] used what they call a rediscovery metric, referred to here as a coverage metric in line with molecular benchmarking terminology, to evaluate crystal structure generative models. While they did not use time-series splitting, they showed that after generating millions of structures, only a small percentage of held-out structures had matches. These results highlight the difficulty (and robustness) of coverage tasks. By leveraging timeline metadata from the Materials Project database [@jain_commentary_2013] and creating a standard time-series split of the data, `matbench_genmetrics.mp_time_split` enables rigorous evaluation of future discovery performance.
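
For intuition on how such a match-based coverage (rediscovery) count can be computed, the sketch below tallies how many held-out structures are matched by at least one generated structure using pymatgen's `StructureMatcher`. This is a simplified illustration with default matcher tolerances, not necessarily the settings or algorithm used by `matbench-genmetrics`.

```python
from pymatgen.analysis.structure_matcher import StructureMatcher

def coverage_fraction(gen_structures, test_structures):
    """Fraction of held-out test structures matched by at least one generated structure.

    Simplified stand-in for a coverage/rediscovery metric; StructureMatcher's
    default tolerances are used here and may differ from the benchmark's settings.
    """
    matcher = StructureMatcher()
    n_matched = sum(
        any(matcher.fit(test, gen) for gen in gen_structures)
        for test in test_structures
    )
    return n_matched / len(test_structures)
```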

The `matbench_genmetrics.mp_time_split` namespace package provides the following features:
