diff --git a/_downloads/a241d10bd0160ab9f8fc556af55900ae/train.txt b/_downloads/5fdddbed2260616231dbf7b0d94bb665/train.txt
similarity index 92%
rename from _downloads/a241d10bd0160ab9f8fc556af55900ae/train.txt
rename to _downloads/5fdddbed2260616231dbf7b0d94bb665/train.txt
index ca363cb9a..892b130f7 100644
--- a/_downloads/a241d10bd0160ab9f8fc556af55900ae/train.txt
+++ b/_downloads/5fdddbed2260616231dbf7b0d94bb665/train.txt
@@ -1,17 +1,17 @@
-2024-04-13 03:34:05 (INFO): Project root: /home/runner/work/ocp/ocp
+2024-04-13 15:38:18 (INFO): Project root: /home/runner/work/ocp/ocp
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/torch/cuda/amp/grad_scaler.py:126: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
warnings.warn(
-2024-04-13 03:34:06 (WARNING): Detected old config, converting to new format. Consider updating to avoid potential incompatibilities.
-2024-04-13 03:34:06 (INFO): amp: true
+2024-04-13 15:38:19 (WARNING): Detected old config, converting to new format. Consider updating to avoid potential incompatibilities.
+2024-04-13 15:38:19 (INFO): amp: true
cmd:
- checkpoint_dir: fine-tuning/checkpoints/2024-04-13-03-33-20-ft-oxides
- commit: 6193b4d
+ checkpoint_dir: fine-tuning/checkpoints/2024-04-13-15-38-40-ft-oxides
+ commit: aa085b3
identifier: ft-oxides
- logs_dir: fine-tuning/logs/wandb/2024-04-13-03-33-20-ft-oxides
+ logs_dir: fine-tuning/logs/wandb/2024-04-13-15-38-40-ft-oxides
print_every: 10
- results_dir: fine-tuning/results/2024-04-13-03-33-20-ft-oxides
+ results_dir: fine-tuning/results/2024-04-13-15-38-40-ft-oxides
seed: 0
- timestamp_id: 2024-04-13-03-33-20-ft-oxides
+ timestamp_id: 2024-04-13-15-38-40-ft-oxides
dataset:
a2g_args:
r_energy: true
diff --git a/_images/6a185f29188599f8af7fbb8660f8e825f555e9f6202d1cde58fb3094687e12f6.png b/_images/6a185f29188599f8af7fbb8660f8e825f555e9f6202d1cde58fb3094687e12f6.png
new file mode 100644
index 000000000..0555899e7
Binary files /dev/null and b/_images/6a185f29188599f8af7fbb8660f8e825f555e9f6202d1cde58fb3094687e12f6.png differ
diff --git a/_sources/core/ase_dataset_creation.md b/_sources/core/ase_dataset_creation.md
new file mode 100644
index 000000000..8c6748692
--- /dev/null
+++ b/_sources/core/ase_dataset_creation.md
@@ -0,0 +1,90 @@
+
+# Making and using ASE datasets
+
+There are multiple ways to train and evaluate OCP models on data other than OC20 and OC22. Writing an LMDB is the most performant option. However, ASE-based dataset formats are also included as a convenience for people with existing data who simply want to try OCP tools without needing to learn about LMDBs.
+
+
+## Using an ASE Database
+
+If your data is already in an [ASE Database](https://databases.fysik.dtu.dk/ase/ase/db/db.html), no additional preprocessing is necessary before running training/prediction! Although the ASE DB backends may not be sufficiently high throughput for all use cases, they are generally considered "fast enough" to train on a reasonably-sized dataset with 1-2 GPUs or predict with a single GPU. If you want to effectively utilize more resources than this, please be aware of the potential for this bottleneck and consider writing your data to an LMDB. If your dataset is small enough to fit in CPU memory, use the `keep_in_memory: True` option to avoid this bottleneck.
+
+To use this dataset, we will just have to change our config files to use the ASE DB Dataset rather than the LMDB Dataset:
+
+```yaml
+dataset:
+  format: ase_db
+  train:
+    src: # The path/address to your ASE DB
+    connect_args:
+      # Keyword arguments for ase.db.connect()
+    select_args:
+      # Keyword arguments for ase.db.select()
+      # These can be used to query/filter the ASE DB
+    a2g_args:
+      r_energy: True
+      r_forces: True
+      # Set these if you want to train on energy/forces
+      # Energy/force information must be in the ASE DB!
+    keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets!
+    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
+  val:
+    src:
+    a2g_args:
+      r_energy: True
+      r_forces: True
+  test:
+    src:
+    a2g_args:
+      r_energy: False
+      r_forces: False
+      # It is not necessary to have energy or forces if you are just making predictions.
+```
+## Using ASE-Readable Files
+
+It is possible to train/predict directly on ASE-readable files. This is only recommended for smaller datasets, as directories of many small files do not scale efficiently on all computing infrastructures. There are two options for loading data with the ASE reader:
+
+### Single-Structure Files
+This dataset assumes a single structure will be obtained from each file:
+
+```yaml
+dataset:
+  format: ase_read
+  train:
+    src: # The folder that contains ASE-readable files
+    pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif".
+    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
+
+    ase_read_args:
+      # Keyword arguments for ase.io.read()
+    a2g_args:
+      # Include energy and forces for training purposes
+      # If True, the energy/forces must be readable from the file (ex. OUTCAR)
+      r_energy: True
+      r_forces: True
+    keep_in_memory: False
+```
+
+### Multi-structure Files
+This dataset supports reading files that each contain multiple structures (for example, an ASE .traj file). Using an index file, which tells the dataset how many structures each file contains, is recommended. Otherwise, the dataset is forced to load every file at startup and count the number of structures!
+
+```yaml
+dataset:
+  format: ase_read_multi
+  train:
+    index_file: # Filepath to an index file which contains each filename and the number of structures in each file, e.g.:
+      # /path/to/relaxation1.traj 200
+      # /path/to/relaxation2.traj 150
+      # ...
+
+    # If using an index file, the src and pattern are not necessary
+    src: # The folder that contains ASE-readable files
+    pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz".
+
+    ase_read_args:
+      # Keyword arguments for ase.io.read()
+    a2g_args:
+      # Include energy and forces for training purposes
+      r_energy: True
+      r_forces: True
+    keep_in_memory: False
+```
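Reviewer note on the index file recommended for `ase_read_multi` in `ase_dataset_creation.md`: the file can be generated once, ahead of training, so the dataset does not have to open every trajectory at startup. The following is a minimal sketch, not part of the OCP codebase. The helper names `count_with_ase`, `build_index_lines`, and `write_index_file` are hypothetical; the only library call assumed is `ase.io.read(path, index=":")`, which returns a list of all structures in a file. The output format (filename, a space, then the structure count) follows the example shown in the YAML comment.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def count_with_ase(path: str) -> int:
    """Count the structures in one file by reading every frame with ASE."""
    from ase.io import read  # imported lazily; requires ASE to be installed

    return len(read(path, index=":"))


def build_index_lines(
    paths: Iterable[str], counter: Callable[[str], int]
) -> List[str]:
    """Return one '<filename> <n_structures>' line per input file."""
    return [f"{p} {counter(p)}" for p in paths]


def write_index_file(
    index_path: str,
    paths: Iterable[str],
    counter: Callable[[str], int] = count_with_ase,
) -> None:
    """Write the index lines to index_path, one file per line."""
    lines = build_index_lines(paths, counter)
    Path(index_path).write_text("\n".join(lines) + "\n")
```

A call such as `write_index_file("index.txt", sorted(glob.glob("relaxations/*.traj")))` would then produce a file whose path can be passed as `index_file`. The counter is injectable so the line format can be checked without reading real trajectories.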
diff --git a/_sources/tutorials/fine-tuning/fine-tuning-oxides.md b/_sources/core/fine-tuning/fine-tuning-oxides.md
similarity index 98%
rename from _sources/tutorials/fine-tuning/fine-tuning-oxides.md
rename to _sources/core/fine-tuning/fine-tuning-oxides.md
index 1658b648a..702bf5123 100644
--- a/_sources/tutorials/fine-tuning/fine-tuning-oxides.md
+++ b/_sources/core/fine-tuning/fine-tuning-oxides.md
@@ -22,7 +22,7 @@ This data set shows equations of state for several oxide/polymorph combinations.
+++
-First we get the checkpoint that we want. According to the [MODELS](../../core/MODELS.md) the GemNet-OC OC20+OC22 combination has an energy MAE of 0.483 which seems like a good place to start. This model was trained on oxides.
+First we get the checkpoint that we want. According to the [MODELS](../../core/models) the GemNet-OC OC20+OC22 combination has an energy MAE of 0.483 which seems like a good place to start. This model was trained on oxides.
We get this checkpoint here.
diff --git a/_sources/tutorials/gotchas.md b/_sources/core/gotchas.md
similarity index 97%
rename from _sources/tutorials/gotchas.md
rename to _sources/core/gotchas.md
index 8bc282a15..cea3f4d41 100644
--- a/_sources/tutorials/gotchas.md
+++ b/_sources/core/gotchas.md
@@ -82,7 +82,7 @@ from ocpmodels.models.model_registry import model_name_to_local_file
checkpoint_path = model_name_to_local_file('GemNet-OC All', local_cache='/tmp/ocp_checkpoints/')
with contextlib.redirect_stdout(StringIO()) as _:
- calc = OCPCalculator(checkpoint_path=os.path.expanduser(checkpoint_path), cpu=False)
+ calc = OCPCalculator(checkpoint_path=checkpoint_path, cpu=False)
@@ -227,7 +227,7 @@ from ocpmodels.models.model_registry import model_name_to_local_file
from ocpmodels.common.relaxation.ase_utils import OCPCalculator
checkpoint_path = model_name_to_local_file('eSCN-L6-M3-Lay20 All+MD', local_cache='/tmp/ocp_checkpoints/')
-calc = OCPCalculator(checkpoint_path=os.path.expanduser(checkpoint_path), cpu=True)
+calc = OCPCalculator(checkpoint_path=checkpoint_path, cpu=True)
from ase.build import fcc111, add_adsorbate
from ase.optimize import BFGS
@@ -256,7 +256,7 @@ from ocpmodels.models.model_registry import model_name_to_local_file
checkpoint_path = model_name_to_local_file('eSCN-L6-M3-Lay20 All+MD', local_cache='/tmp/ocp_checkpoints/')
from ocpmodels.common.relaxation.ase_utils import OCPCalculator
-calc = OCPCalculator(checkpoint_path=os.path.expanduser(checkpoint_path), cpu=True)
+calc = OCPCalculator(checkpoint_path=checkpoint_path, cpu=True)
from ase.build import fcc111, add_adsorbate
from ase.optimize import BFGS
diff --git a/_sources/tutorials/advanced/mass-inference.md b/_sources/core/inference.md
similarity index 92%
rename from _sources/tutorials/advanced/mass-inference.md
rename to _sources/core/inference.md
index 3fd04a195..4ce9afd68 100644
--- a/_sources/tutorials/advanced/mass-inference.md
+++ b/_sources/core/inference.md
@@ -30,6 +30,21 @@ You can retrieve the dataset below. In this notebook we learn how to do "mass in
! ase db data.db
```
+Inference on this file will be fast if we have a GPU, but without one it could take a while. To keep things fast for the automated builds, we'll select only the first 100 structures so the example stays approachable with just a CPU.
+Comment out or skip this block to use the whole dataset!
+
+```{code-cell} ipython3
+! cp data.db full_data.db
+import ase.db
+import numpy as np
+
+with ase.db.connect('full_data.db') as full_db:
+  with ase.db.connect('data.db', append=False) as subset_db:
+    for i in range(1, 101):  # ASE DB row ids start at 1
+      subset_db.write(full_db.get_atoms(i))
+
+```
+
You have to choose a checkpoint to start with. The newer checkpoints may require too much memory for this environment.
```{code-cell} ipython3
@@ -145,7 +160,7 @@ We include this here just to show that:
```{code-cell} ipython3
from ocpmodels.common.relaxation.ase_utils import OCPCalculator
-calc = OCPCalculator(checkpoint_path=os.path.expanduser(checkpoint_path), cpu=False)
+calc = OCPCalculator(checkpoint_path=checkpoint_path, cpu=False)
```
```{code-cell} ipython3
diff --git a/_sources/core/INSTALL.md b/_sources/core/install.md
similarity index 100%
rename from _sources/core/INSTALL.md
rename to _sources/core/install.md
diff --git a/_sources/core/LICENSE.md b/_sources/core/license.md
similarity index 100%
rename from _sources/core/LICENSE.md
rename to _sources/core/license.md
diff --git a/_sources/legacy_tutorials/lmdb_dataset_creation.md b/_sources/core/lmdb_dataset_creation.md
similarity index 94%
rename from _sources/legacy_tutorials/lmdb_dataset_creation.md
rename to _sources/core/lmdb_dataset_creation.md
index 2021026c2..af9e4b041 100644
--- a/_sources/legacy_tutorials/lmdb_dataset_creation.md
+++ b/_sources/core/lmdb_dataset_creation.md
@@ -11,7 +11,9 @@ kernelspec:
name: python3
---
-### OCP LMDB Dataset Tutorial
+# Making LMDB Datasets (original format)
+
+Storing your data in an LMDB ensures very fast random read speeds for the fastest supported throughput. This is the recommended option for the majority of OCP use cases. For more information about writing your data to an LMDB, please see the [LMDB Dataset Tutorial](https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb).
This notebook provides an overview of how to create LMDB datasets to be used with the OCP repo. This tutorial is intended for those who wish to use OCP to train on their own datasets. Those interested in just using OCP data need not worry about these steps as they've been automated as part of the download script: https://github.com/Open-Catalyst-Project/ocp/blob/master/scripts/download_data.py.
diff --git a/_sources/core/MODELS.md b/_sources/core/model_checkpoints.md
similarity index 99%
rename from _sources/core/MODELS.md
rename to _sources/core/model_checkpoints.md
index cc6449a90..f4fb51642 100644
--- a/_sources/core/MODELS.md
+++ b/_sources/core/model_checkpoints.md
@@ -1,4 +1,4 @@
-# Pretrained OCP model checkpoints
+# Pretrained model checkpoints
This page summarizes all the pretrained models released as part of the [Open Catalyst Project](https://opencatalystproject.org/). All models were trained using this codebase.
diff --git a/_sources/core/FAQ.md b/_sources/core/model_faq.md
similarity index 99%
rename from _sources/core/FAQ.md
rename to _sources/core/model_faq.md
index b0f84c8e7..b0f75f5e1 100644
--- a/_sources/core/FAQ.md
+++ b/_sources/core/model_faq.md
@@ -1,4 +1,4 @@
-# Frequently Asked Questions
+# Model FAQ
If you don't find your question answered here, please feel free to [file a GitHub issue](https://github.com/open-catalyst-project/ocp/issues) or [post on the discussion board](https://discuss.opencatalystproject.org/).
diff --git a/_sources/core/TRAIN.md b/_sources/core/model_training.md
similarity index 78%
rename from _sources/core/TRAIN.md
rename to _sources/core/model_training.md
index 38b03da80..83bd33a7f 100644
--- a/_sources/core/TRAIN.md
+++ b/_sources/core/model_training.md
@@ -1,5 +1,4 @@
-# Training and evaluating models on OCP datasets
-
+# Training and evaluating custom models on OCP datasets
## Getting Started
@@ -340,98 +339,3 @@ EvalAI expects results to be structured in a specific format for a submission to
Where `file.npz` corresponds to the respective `[s2ef/is2re]_predictions.npz` files generated for the corresponding task. The final submission file will be written to `submission_file.npz` (rename accordingly). The `dataset` argument specifies which dataset is being considered — this only needs to be set for OC22 predictions because OC20 is the default.
3. Upload `submission_file.npz` to EvalAI.
-
-# Using Your Own Data
-
-There are multiple ways to train and evaluate OCP models on data other than OC20 and OC22. Writing an LMDB is the most performant option. However, ASE-based dataset formats are also included as a convenience for people with existing data who simply want to try OCP tools without needing to learn about LMDBs.
-
-This tutorial will briefly discuss the basic use of these dataset formats. For more detailed information about the ASE datasets, see the [source code and docstrings](ocpmodels/datasets/ase_datasets.py).
-
-## Writing an LMDB
-
-Storing your data in an LMDB ensures very fast random read speeds for the fastest supported throughput. This is the recommended option for the majority of OCP use cases. For more information about writing your data to an LMDB, please see the [LMDB Dataset Tutorial](https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb).
-
-## Using an ASE Database
-
-If your data is already in an [ASE Database](https://databases.fysik.dtu.dk/ase/ase/db/db.html), no additional preprocessing is necessary before running training/prediction! Although the ASE DB backends may not be sufficiently high throughput for all use cases, they are generally considered "fast enough" to train on a reasonably-sized dataset with 1-2 GPUs or predict with a single GPU. If you want to effictively utilize more resources than this, please be aware of the potential for this bottleneck and consider writing your data to an LMDB. If your dataset is small enough to fit in CPU memory, use the `keep_in_memory: True` option to avoid this bottleneck.
-
-To use this dataset, we will just have to change our config files to use the ASE DB Dataset rather than the LMDB Dataset:
-
-```yaml
-dataset:
- format: ase_db
- train:
- src: # The path/address to your ASE DB
- connect_args:
- # Keyword arguments for ase.db.connect()
- select_args:
- # Keyword arguments for ase.db.select()
- # These can be used to query/filter the ASE DB
- a2g_args:
- r_energy: True
- r_forces: True
- # Set these if you want to train on energy/forces
- # Energy/force information must be in the ASE DB!
- keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets!
- include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
- val:
- src:
- a2g_args:
- r_energy: True
- r_forces: True
- test:
- src:
- a2g_args:
- r_energy: False
- r_forces: False
- # It is not necessary to have energy or forces if you are just making predictions.
-```
-## Using ASE-Readable Files
-
-It is possible to train/predict directly on ASE-readable files. This is only recommended for smaller datasets, as directories of many small files do not scale efficiently on all computing infrastructures. There are two options for loading data with the ASE reader:
-
-### Single-Structure Files
-This dataset assumes a single structure will be obtained from each file:
-
-```yaml
-dataset:
- format: ase_read
- train:
- src: # The folder that contains ASE-readable files
- pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif".
- include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
-
- ase_read_args:
- # Keyword arguments for ase.io.read()
- a2g_args:
- # Include energy and forces for training purposes
- # If True, the energy/forces must be readable from the file (ex. OUTCAR)
- r_energy: True
- r_forces: True
- keep_in_memory: False
-```
-
-### Multi-structure Files
-This dataset supports reading files that each contain multiple structure (for example, an ASE .traj file). Using an index file, which tells the dataset how many structures each file contains, is recommended. Otherwise, the dataset is forced to load every file at startup and count the number of structures!
-
-```yaml
-dataset:
- format: ase_read_multi
- train:
- index_file: Filepath to an index file which contains each filename and the number of structures in each file. e.g.:
- /path/to/relaxation1.traj 200
- /path/to/relaxation2.traj 150
- ...
-
- # If using an index file, the src and pattern are not necessary
- src: # The folder that contains ASE-readable files
- pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz".
-
- ase_read_args:
- # Keyword arguments for ase.io.read()
- a2g_args:
- # Include energy and forces for training purposes
- r_energy: True
- r_forces: True
- keep_in_memory: False
-```
diff --git a/_sources/core/ocpapi.md b/_sources/core/ocpapi.md
new file mode 100644
index 000000000..c6cedad0a
--- /dev/null
+++ b/_sources/core/ocpapi.md
@@ -0,0 +1,235 @@
+---
+jupytext:
+ cell_metadata_filter: -all
+ formats: md:myst
+ main_language: python
+ text_representation:
+ extension: .md
+ format_name: myst
+ format_version: 0.13
+ jupytext_version: 1.16.1
+---
+
+# ocpapi
+
+[![CircleCI](https://dl.circleci.com/status-badge/img/gh/Open-Catalyst-Project/ocpapi/tree/main.svg?style=shield)](https://dl.circleci.com/status-badge/redirect/gh/Open-Catalyst-Project/ocpapi/tree/main) [![codecov](https://codecov.io/gh/Open-Catalyst-Project/ocpapi/graph/badge.svg?token=66Z7Y7QUUW)](https://codecov.io/gh/Open-Catalyst-Project/ocpapi)
+
+Python library for programmatic use of the [Open Catalyst Demo](https://open-catalyst.metademolab.com/). Users unfamiliar with the Open Catalyst Demo are encouraged to read more about it before continuing.
+
+## Installation
+
+Ensure you have Python 3.9.1 or newer, and install `ocpapi` using:
+
+```{code-cell} ipython3
+%%sh
+pip install ocpapi
+```
+
+## Quickstart
+
+The following examples are used to search for *OH binding sites on Pt surfaces. They use the `find_adsorbate_binding_sites` function, which is a high-level workflow on top of other methods included in this library. Once familiar with this routine, users are encouraged to learn about lower-level methods and features that support more advanced use cases.
+
+### Note about async methods
+
+This package relies heavily on [asyncio](https://docs.python.org/3/library/asyncio.html). The examples throughout this document can be copied to a python repl launched with:
+
+```{code-cell} ipython3
+%%sh
+python -m asyncio
+```
+
+Alternatively, an async function can be run in a script by wrapping it with [asyncio.run()](https://docs.python.org/3/library/asyncio-runner.html#asyncio.run):
+
+```{code-cell} ipython3
+import asyncio
+from ocpapi import find_adsorbate_binding_sites
+
+asyncio.run(find_adsorbate_binding_sites(...))
+```
+
+### Search over all surfaces
+
+```{code-cell} ipython3
+from ocpapi import find_adsorbate_binding_sites
+
+results = await find_adsorbate_binding_sites(
+ adsorbate="*OH",
+ bulk="mp-126",
+)
+```
+
+Users will be prompted to select one or more surfaces that should be relaxed.
+
+Input to this function includes:
+
+* The name of the adsorbate to place
+* A unique ID of the bulk structure from which surfaces will be generated
+
+This function will perform the following steps:
+
+1. Enumerate surfaces of the bulk material
+2. On each surface, enumerate initial guesses for adsorbate binding sites
+3. Run local force-based relaxations of each adsorbate placement
+
+In addition, this handles:
+
+* Retrying failed calls to the Open Catalyst Demo API
+* Retrying submission of relaxations when they are rate limited
+
+This should take 2-10 minutes to finish while tens to hundreds (depending on the number of surfaces that are selected) of individual adsorbate placements are relaxed on unique surfaces of Pt. Each of the objects in the returned list includes (among other details):
+
+* Information about the surface being searched, including its structure and Miller indices
+* The initial positions of the adsorbate before relaxation
+* The final structure after relaxation
+* The predicted energy of the final structure
+* The predicted force on each atom in the final structure
+
++++
+
+### Supported bulks and adsorbates
+
+A finite set of bulk materials and adsorbates can be referenced by ID throughout the OCP API. The lists of supported values can be viewed in two ways.
+
+1. Visit the UI at https://open-catalyst.metademolab.com/demo and explore the lists in Step 1 and Step 3.
+2. Use the low-level client that ships with this library:
+
+```{code-cell} ipython3
+from ocpapi import Client
+
+client = Client()
+
+bulks = await client.get_bulks()
+print({b.src_id: b.formula for b in bulks.bulks_supported})
+
+adsorbates = await client.get_adsorbates()
+print(adsorbates.adsorbates_supported)
+```
+
+### Persisting results
+
+**Results should be saved whenever possible in order to avoid expensive recomputation.**
+
+Assuming `results` was generated with the `find_adsorbate_binding_sites` method used above, it is an `AdsorbateBindingSites` object. This can be saved to file with:
+
+```{code-cell} ipython3
+with open("results.json", "w") as f:
+ f.write(results.to_json())
+```
+
+Similarly, results can be read back from file to an `AdsorbateBindingSites` object with:
+
+```{code-cell} ipython3
+from ocpapi import AdsorbateBindingSites
+
+with open("results.json", "r") as f:
+ results = AdsorbateBindingSites.from_json(f.read())
+```
+
+### Viewing results in the web UI
+
+Relaxation results can be viewed in a web UI. For example, https://open-catalyst.metademolab.com/results/7eaa0d63-83aa-473f-ac84-423ffd0c67f5 shows the results of relaxing *OH on a Pt (1,1,1) surface; the uuid, "7eaa0d63-83aa-473f-ac84-423ffd0c67f5", is referred to as the `system_id`.
+
+Extending the examples above, the URLs to visualize the results of relaxations on each Pt surface can be obtained with:
+
+```{code-cell} ipython3
+urls = [
+ slab.ui_url
+ for slab in results.slabs
+]
+```
+
+## Advanced usage
+
+### Changing the model type
+
+The API currently supports two models:
+* `equiformer_v2_31M_s2ef_all_md` (default): https://arxiv.org/abs/2306.12059
+* `gemnet_oc_base_s2ef_all_md`: https://arxiv.org/abs/2204.02782
+
+A specific model type can be requested with:
+
+```{code-cell} ipython3
+from ocpapi import find_adsorbate_binding_sites
+
+results = await find_adsorbate_binding_sites(
+ adsorbate="*OH",
+ bulk="mp-126",
+ model="gemnet_oc_base_s2ef_all_md",
+)
+```
+
+### Skip relaxation approval prompts
+
+Calls to `find_adsorbate_binding_sites()` will, by default, show the user all pending relaxations and ask for approval before they are submitted. In order to run the relaxations automatically without manual approval, `adslab_filter` can be set to a function that automatically approves any or all adsorbate/slab (adslab) configurations.
+
+Run relaxations for all slabs that are generated:
+
+```{code-cell} ipython3
+from ocpapi import find_adsorbate_binding_sites, keep_all_slabs
+
+results = await find_adsorbate_binding_sites(
+ adsorbate="*OH",
+ bulk="mp-126",
+ adslab_filter=keep_all_slabs(),
+)
+```
+
+Run relaxations only for slabs with Miller Indices in the input set:
+
+```{code-cell} ipython3
+from ocpapi import find_adsorbate_binding_sites, keep_slabs_with_miller_indices
+
+results = await find_adsorbate_binding_sites(
+ adsorbate="*OH",
+ bulk="mp-126",
+ adslab_filter=keep_slabs_with_miller_indices([(1, 0, 0), (1, 1, 1)]),
+)
+```
+
+### Converting to [ase.Atoms](https://wiki.fysik.dtu.dk/ase/ase/atoms.html) objects
+
+**Important! The `to_ase_atoms()` method described below will fail with an import error if [ase](https://wiki.fysik.dtu.dk/ase) is not installed.**
+
+Two classes have support for generating [ase.Atoms](https://wiki.fysik.dtu.dk/ase/ase/atoms.html) objects:
+* `ocpapi.Atoms.to_ase_atoms()`: Adds unit cell, atomic positions, and other structural information to the returned `ase.Atoms` object.
+* `ocpapi.AdsorbateSlabRelaxationResult.to_ase_atoms()`: Adds the same structure information to the `ase.Atoms` object. Also adds the predicted forces and energy of the relaxed structure, which can be accessed with the `ase.Atoms.get_potential_energy()` and `ase.Atoms.get_forces()` methods.
+
+For example, the following would generate an `ase.Atoms` object for the first relaxed adsorbate configuration on the first slab generated for *OH binding on Pt:
+
+```{code-cell} ipython3
+from ocpapi import find_adsorbate_binding_sites
+
+results = await find_adsorbate_binding_sites(
+ adsorbate="*OH",
+ bulk="mp-126",
+)
+
+ase_atoms = results.slabs[0].configs[0].to_ase_atoms()
+```
+
+### Converting to other structure formats
+
+From an `ase.Atoms` object (see previous section), it is possible to [write to other structure formats](https://wiki.fysik.dtu.dk/ase/ase/io/io.html#ase.io.write). Extending the example above, the `ase_atoms` object could be written to a [VASP POSCAR file](https://www.vasp.at/wiki/index.php/POSCAR) with:
+
+```{code-cell} ipython3
+from ase.io import write
+
+write("POSCAR", ase_atoms, "vasp")
+```
+
+## License
+
+`ocpapi` is released under the [MIT License](LICENSE).
+
+## Citing `ocpapi`
+
+If you use `ocpapi` in your research, please consider citing the [AdsorbML paper](https://www.nature.com/articles/s41524-023-01121-5) (in addition to the relevant datasets / models used):
+
+```bibtex
+@article{lan2023adsorbml,
+ title={{AdsorbML}: a leap in efficiency for adsorption energy calculations using generalizable machine learning potentials},
+ author={Lan*, Janice and Palizhati*, Aini and Shuaibi*, Muhammed and Wood*, Brandon M and Wander, Brook and Das, Abhishek and Uyttendaele, Matt and Zitnick, C Lawrence and Ulissi, Zachary W},
+ journal={npj Computational Materials},
+ year={2023},
+}
+```
diff --git a/_sources/core/papers_using_models.md b/_sources/core/papers_using_models.md
new file mode 100644
index 000000000..a59b7589f
--- /dev/null
+++ b/_sources/core/papers_using_models.md
@@ -0,0 +1,5 @@
+# Studies that have leveraged OCP models
+
+Many papers have now used the latest OCP models to accelerate screening and discovery efforts and enable new computational chemistry simulations!
+We highlight some here just to give an idea of the breadth of possibilities and how they have been used. Feel free to reach out (or submit PRs with links to your papers if you want them included)!
+
diff --git a/_sources/core/QUICKSTART.md b/_sources/core/quickstart.md
similarity index 98%
rename from _sources/core/QUICKSTART.md
rename to _sources/core/quickstart.md
index 246684d76..a7b3eea3c 100644
--- a/_sources/core/QUICKSTART.md
+++ b/_sources/core/quickstart.md
@@ -11,7 +11,7 @@ kernelspec:
name: python3
---
-Hello World with OCP models!
+Using pre-trained models in ASE
----------
1. First, install OCP in a fresh python environment using one of the approaches in [installation documentation](INSTALL).
diff --git a/_sources/tutorials/adsorbml_walkthrough.md b/_sources/tutorials/adsorbml_walkthrough.md
new file mode 100644
index 000000000..15b717ef3
--- /dev/null
+++ b/_sources/tutorials/adsorbml_walkthrough.md
@@ -0,0 +1,230 @@
+---
+jupytext:
+ text_representation:
+ extension: .md
+ format_name: myst
+ format_version: 0.13
+ jupytext_version: 1.16.1
+kernelspec:
+ display_name: Python 3 (ipykernel)
+ language: python
+ name: python3
+---
+
+# AdsorbML tutorial
+
+```{code-cell} ipython3
+from ocpmodels.common.relaxation.ase_utils import OCPCalculator
+import ase.io
+from ase.optimize import BFGS
+
+from ocdata.core import Adsorbate, AdsorbateSlabConfig, Bulk, Slab
+import os
+from glob import glob
+import pandas as pd
+from ocdata.utils import DetectTrajAnomaly
+from ocdata.utils.vasp import write_vasp_input_files
+
+# Optional - see below
+import numpy as np
+from dscribe.descriptors import SOAP
+from scipy.spatial.distance import pdist, squareform
+from x3dase.visualize import view_x3d_n
+```
+
+## Enumerate the adsorbate-slab configurations to run relaxations on
+
++++
+
+Be sure to set the path to the bulk and adsorbate pickle files in `ocdata/configs/paths.py` or pass the paths as an argument. The database pickles can be found in `ocdata/databases/pkls`. AdsorbML incorporates random placement, which is especially useful for more complicated adsorbates which may have many degrees of freedom. I have opted to sample a few random placements and a few heuristic ones. Here I am using *CO on copper (1,1,1) as an example.
+
+```{code-cell} ipython3
+bulk_src_id = "mp-30"
+adsorbate_smiles = "*CO"
+
+bulk = Bulk(bulk_src_id_from_db = bulk_src_id, bulk_db_path = "your-path-here.pkl")
+adsorbate = Adsorbate(adsorbate_smiles_from_db=adsorbate_smiles, adsorbate_db_path = "your-path-here.pkl")
+slabs = Slab.from_bulk_get_specific_millers(bulk = bulk, specific_millers=(1,1,1))
+
+# There may be multiple slabs with this miller index.
+# For demonstrative purposes we will take the first entry.
+slab = slabs[0]
+```
+
+```{code-cell} ipython3
+# Perform heuristic placements
+heuristic_adslabs = AdsorbateSlabConfig(slabs[0], adsorbate, mode="heuristic")
+
+# Perform random placements
+# (for AdsorbML we use `num_sites = 100` but we will use 4 for brevity here)
+random_adslabs = AdsorbateSlabConfig(slabs[0], adsorbate, mode="random_site_heuristic_placement", num_sites = 4)
+```
+
+## Run ML relaxations:
+
+There are 2 options for how to do this.
+ 1. Using `OCPCalculator` as the calculator within the ASE framework
+ 2. By writing objects to lmdb and relaxing them using `main.py` in the ocp repo
+
+(1) is really only adequate for small jobs and it is what I will show here, but if you plan to run many relaxations, you should definitely use (2). More details about writing lmdbs have been provided [here](https://github.com/Open-Catalyst-Project/ocp/blob/main/tutorials/lmdb_dataset_creation.ipynb) - follow the IS2RS/IS2RE instructions. More information about running relaxations once the lmdb has been written is [here](https://github.com/Open-Catalyst-Project/ocp/blob/main/TRAIN.md#initial-structure-to-relaxed-structure-is2rs).
+
+You need to provide the calculator with a path to a model checkpoint file. Checkpoints can be downloaded [here](https://github.com/Open-Catalyst-Project/ocp/blob/main/MODELS.md).
+
+```{code-cell} ipython3
+checkpoint_path = "your-path-here.pt"
+os.makedirs(f"data/{bulk}_{adsorbate}", exist_ok=True)
+
+# Define the calculator
+calc = OCPCalculator(checkpoint=checkpoint_path) # if you have a gpu, add `cpu=False` to speed up calculations
+
+adslabs = [*heuristic_adslabs.atoms_list, *random_adslabs.atoms_list]
+# Attach the calculator to each adslab and relax it
+for idx, adslab in enumerate(adslabs):
+    adslab.calc = calc
+    opt = BFGS(adslab, trajectory=f"data/{bulk}_{adsorbate}/{idx}.traj")
+    opt.run(fmax=0.05, steps=100) # For the AdsorbML results we used fmax = 0.02 and steps = 300, but we will use less strict values for brevity.
+```
+
+## Parse the trajectories and post-process
+
+As a post-processing step we check to see if:
+1. the adsorbate desorbed
+2. the adsorbate dissociated
+3. the adsorbate intercalated
+4. the surface has changed
+
+We check these because they affect our referencing scheme and may result in erroneous energies. For (4), the relaxed surface should really be supplied as well. It will be necessary when correcting the SP / RX energies later. Since we don't have it here, we will omit it, and the detector will instead compare the initial and final slab from the adsorbate-slab relaxation trajectory. If a relaxed slab is provided, the detector will compare it with the slab after the adsorbate-slab relaxation. The latter is more correct! Note: for the results in the AdsorbML paper, we did not check if the adsorbate was intercalated (`is_adsorbate_intercalated()`) because it is a new addition.
+
+```{code-cell} ipython3
+# Iterate over trajs to extract results
+results = []
+for file in glob(f"data/{bulk}_{adsorbate}/*.traj"):
+    rx_id = file.split("/")[-1].split(".")[0]
+    traj = ase.io.read(file, ":")
+
+    # Check to see if the trajectory is anomalous
+    initial_atoms = traj[0]
+    final_atoms = traj[-1]
+    atom_tags = initial_atoms.get_tags()
+    detector = DetectTrajAnomaly(initial_atoms, final_atoms, atom_tags)
+    anom = (
+        detector.is_adsorbate_dissociated()
+        or detector.is_adsorbate_desorbed()
+        or detector.has_surface_changed()
+        or detector.is_adsorbate_intercalated()
+    )
+    rx_energy = traj[-1].get_potential_energy()
+    results.append({"relaxation_idx": rx_id, "relaxed_atoms": traj[-1],
+                    "relaxed_energy_ml": rx_energy, "anomolous": anom})
+```
+
+```{code-cell} ipython3
+df = pd.DataFrame(results)
+df
+```
+
+```{code-cell} ipython3
+# Discard the anomalous relaxations
+df = df[~df.anomolous].copy().reset_index()
+```
+
+## (Optional) Deduplicate structures
+We may have enumerated very similar structures, or structures may have relaxed to the same configuration. For this reason, it is advantageous to cull systems if they are very similar. This results in only marginal improvements in the recall metrics we calculated for AdsorbML, so it wasn't implemented there. It is, however, a good way to prevent wasteful VASP calculations. You can also imagine that if we had enumerated 1000 configs per slab-adsorbate combo rather than 100 for AdsorbML, redundant systems would be more likely to reduce performance, so it's a good thing to keep in mind. This may be done by eye for a small number of systems, but with many systems it is easier to use an automated approach. Here is an example of one such approach, which uses a SOAP descriptor to find similar systems.
+
+```{code-cell} ipython3
+# Define a helper that removes near-duplicate configurations
+def deduplicate(configs_for_deduplication: list,
+                adsorbate_binding_index: int,
+                cosine_similarity = 1e-3,
+               ):
+    """
+    A function that may be used to deduplicate similar structures.
+    Among duplicate entries, the one with the lowest energy will be kept.
+
+    Args:
+        configs_for_deduplication: a list of ML relaxed adsorbate-
+            surface configurations.
+        adsorbate_binding_index: the index, within the adsorbate, of the
+            binding atom on which the SOAP descriptor is centered.
+        cosine_similarity: The cosine distance between SOAP vectors at or
+            below which configurations are considered duplicates.
+
+    Returns:
+        (list): the indices of configs which should be kept as non-duplicate
+    """
+
+    energies_for_deduplication = np.array([atoms.get_potential_energy() for atoms in configs_for_deduplication])
+    # Instantiate the soap descriptor
+    soap = SOAP(
+        species=np.unique(configs_for_deduplication[0].get_chemical_symbols()),
+        r_cut = 2.0,
+        n_max=6,
+        l_max=3,
+        periodic=True,
+    )
+    # Figure out which atom index corresponds to the adsorbate binding atom
+    # (this counts back from the end, so the adsorbate atoms, tagged 2, must come last)
+    ads_len = list(configs_for_deduplication[0].get_tags()).count(2)
+    position_idx = -1*(ads_len-adsorbate_binding_index)
+    # Iterate over the systems to get the SOAP vectors
+    soap_desc = []
+    for config in configs_for_deduplication:
+        soap_ex = soap.create(config, centers=[position_idx])
+        soap_desc.extend(soap_ex)
+
+    soap_descs = np.vstack(soap_desc)
+
+    # Use cosine distance to assess similarity
+    distance = squareform(pdist(soap_descs, metric="cosine"))
+
+    bool_matrix = np.where(distance <= cosine_similarity, 1, 0)
+    # For configs that are found to be similar, just keep the lowest energy one
+    idxs_to_keep = []
+    pass_idxs = []
+    for idx, row in enumerate(bool_matrix):
+        if idx in pass_idxs:
+            continue
+
+        elif sum(row) == 1:
+            idxs_to_keep.append(idx)
+        else:
+            same_idxs = [row_idx for row_idx, val in enumerate(row) if val == 1]
+            pass_idxs.extend(same_idxs)
+            # Pick the one with the lowest energy by ML, searching only within this group
+            lowest = int(np.argmin(energies_for_deduplication[same_idxs]))
+            idxs_to_keep.append(same_idxs[lowest])
+    return idxs_to_keep
+```
+
+```{code-cell} ipython3
+configs_for_deduplication = df.relaxed_atoms.tolist()
+idxs_to_keep = deduplicate(configs_for_deduplication, adsorbate.binding_indices[0])
+```
+
+```{code-cell} ipython3
+# Flip through your configurations to check them out (and make sure deduplication looks good)
+print(idxs_to_keep)
+view_x3d_n(configs_for_deduplication[2].repeat((2,2,1)))
+```
+
+```{code-cell} ipython3
+df = df.iloc[idxs_to_keep]
+```
+
+```{code-cell} ipython3
+low_e_values = np.round(df.sort_values(by = "relaxed_energy_ml").relaxed_energy_ml.tolist()[0:5],3)
+print(f"The lowest 5 energies are: {low_e_values}")
+df
+```
+
+## Write VASP input files
+
+This assumes you have access to VASP pseudopotentials. The default VASP flags (which are equivalent to those used to make OC20) are located in `ocdata.utils.vasp`. Alternatively, you may pass your own VASP flags to the `write_vasp_input_files` function as `vasp_flags`.
+
+```{code-cell} ipython3
+# Grab the 5 systems with the lowest energy
+configs_for_dft = df.sort_values(by = "relaxed_energy_ml").relaxed_atoms.tolist()[0:5]
+config_idxs = df.sort_values(by = "relaxed_energy_ml").relaxation_idx.tolist()[0:5]
+
+# Write the inputs (exist_ok=True lets the cell be re-run without failing)
+for idx, config in enumerate(configs_for_dft):
+    os.makedirs(f"data/{config_idxs[idx]}", exist_ok=True)
+    write_vasp_input_files(config, outdir = f"data/{config_idxs[idx]}/")
+```
diff --git a/_sources/tutorials/intro.md b/_sources/tutorials/intro.md
index 36e4a16b3..0aa78a2fe 100644
--- a/_sources/tutorials/intro.md
+++ b/_sources/tutorials/intro.md
@@ -42,7 +42,7 @@ The [Open Catalyst Project (OCP)](https://github.com/Open-Catalyst-Project) is a
### Models
-OCP provides several [models](../core/MODELS). Each model represents a different approach to featurization, and a different machine learning architecture. The models can be used for different tasks, and you will find different checkpoints associated with different datasets and tasks.
+OCP provides several [models](../core/models). Each model represents a different approach to featurization, and a different machine learning architecture. The models can be used for different tasks, and you will find different checkpoints associated with different datasets and tasks.
+++
diff --git a/autoapi/index.html b/autoapi/index.html
index dd3aa825e..919bd400d 100644
--- a/autoapi/index.html
+++ b/autoapi/index.html
@@ -62,7 +62,7 @@
-
+
@@ -179,11 +179,9 @@
There are multiple ways to train and evaluate OCP models on data other than OC20 and OC22. Writing an LMDB is the most performant option. However, ASE-based dataset formats are also included as a convenience for people with existing data who simply want to try OCP tools without needing to learn about LMDBs.
If your data is already in an ASE Database, no additional preprocessing is necessary before running training/prediction! Although the ASE DB backends may not be sufficiently high throughput for all use cases, they are generally considered “fast enough” to train on a reasonably-sized dataset with 1-2 GPUs or predict with a single GPU. If you want to effectively utilize more resources than this, please be aware of the potential for this bottleneck and consider writing your data to an LMDB. If your dataset is small enough to fit in CPU memory, use the keep_in_memory: True option to avoid this bottleneck.
+
To use this dataset, we will just have to change our config files to use the ASE DB Dataset rather than the LMDB Dataset:
+
dataset:
+  format: ase_db
+  train:
+    src: # The path/address to your ASE DB
+    connect_args:
+      # Keyword arguments for ase.db.connect()
+    select_args:
+      # Keyword arguments for ase.db.select()
+      # These can be used to query/filter the ASE DB
+    a2g_args:
+      r_energy: True
+      r_forces: True
+      # Set these if you want to train on energy/forces
+      # Energy/force information must be in the ASE DB!
+    keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets!
+    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
+  val:
+    src:
+    a2g_args:
+      r_energy: True
+      r_forces: True
+  test:
+    src:
+    a2g_args:
+      r_energy: False
+      r_forces: False
+      # It is not necessary to have energy or forces if you are just making predictions.
+
It is possible to train/predict directly on ASE-readable files. This is only recommended for smaller datasets, as directories of many small files do not scale efficiently on all computing infrastructures. There are two options for loading data with the ASE reader:
This dataset assumes a single structure will be obtained from each file:
+
dataset:
+  format: ase_read
+  train:
+    src: # The folder that contains ASE-readable files
+    pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif".
+    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
+
+    ase_read_args:
+      # Keyword arguments for ase.io.read()
+    a2g_args:
+      # Include energy and forces for training purposes
+      # If True, the energy/forces must be readable from the file (ex. OUTCAR)
+      r_energy: True
+      r_forces: True
+    keep_in_memory: False
+
This dataset supports reading files that each contain multiple structures (for example, an ASE .traj file). Using an index file, which tells the dataset how many structures each file contains, is recommended. Otherwise, the dataset is forced to load every file at startup and count the number of structures!
+
dataset:
+  format: ase_read_multi
+  train:
+    index_file: Filepath to an index file which contains each filename and the number of structures in each file. e.g.:
+      /path/to/relaxation1.traj 200
+      /path/to/relaxation2.traj 150
+      ...
+
+    # If using an index file, the src and pattern are not necessary
+    src: # The folder that contains ASE-readable files
+    pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz".
+
+    ase_read_args:
+      # Keyword arguments for ase.io.read()
+    a2g_args:
+      # Include energy and forces for training purposes
+      r_energy: True
+      r_forces: True
+    keep_in_memory: False
\ No newline at end of file
diff --git a/core/datasets/oc20.html b/core/datasets/oc20.html
index ffa308f2c..ca63a41c4 100644
--- a/core/datasets/oc20.html
+++ b/core/datasets/oc20.html
@@ -62,7 +62,7 @@
-
+
@@ -179,11 +179,9 @@
Fine tuning a model
+
First we get the checkpoint that we want. According to the MODELS page, the GemNet-OC OC20+OC22 combination has an energy MAE of 0.483, which seems like a good place to start. This model was trained on oxides.
Now since we have a file, we can find the training results in it. See train.txt. At the top, the config is printed, so we can get the checkpoint directory. I use shell commands and Python to get the line, split and strip it here.
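As a rough illustration of the "get the line, split and strip it" step, here is a minimal pure-Python sketch. The helper name `find_checkpoint_dir` is hypothetical, and the sample text only mimics the shape of the config dump at the top of train.txt; it assumes the value follows the first `checkpoint_dir:` key on the same line.

```python
def find_checkpoint_dir(log_text: str) -> str:
    """Return the value of the first `checkpoint_dir:` line in a training log."""
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("checkpoint_dir:"):
            # Split once on the first colon, then strip the surrounding whitespace
            return stripped.split(":", 1)[1].strip()
    raise ValueError("no checkpoint_dir line found in the log text")


# Sample lines shaped like the config printed at the top of train.txt
sample_log = """\
cmd:
  checkpoint_dir: fine-tuning/checkpoints/2024-04-13-15-38-40-ft-oxides
  identifier: ft-oxides
"""

checkpoint_dir = find_checkpoint_dir(sample_log)
print(checkpoint_dir)
```

With a real log, the same helper could be given the contents of train.txt, read with `open("train.txt").read()`.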
It is a good idea to redirect the output to a file. If the output gets too large here, the notebook may fail to save. Normally I would use a redirect like 2>&1, but this does not work with the main.py method. An alternative here is to open a terminal and run it there.
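Another possible way to keep large output out of the notebook is to launch the command from Python and send both streams to a file. This is only a sketch: `run_with_log` is a hypothetical helper, the command below is a stand-in rather than the real training command, and whether this behaves well with main.py has not been checked here.

```python
import subprocess
import sys


def run_with_log(command: list, log_path: str) -> int:
    """Run a command, writing its stdout and stderr to a single log file."""
    with open(log_path, "w") as log_file:
        completed = subprocess.run(
            command,
            stdout=log_file,
            stderr=subprocess.STDOUT,  # merge stderr into the same file
        )
    return completed.returncode


# Stand-in command that writes to both streams; a training command would go here
demo_command = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]
return_code = run_with_log(demo_command, "train_demo.txt")
print(return_code)
```

The log file can then be read back or searched without the output ever being stored in the notebook.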
Storing your data in an LMDB ensures very fast random read speeds for the fastest supported throughput. This is the recommended option for the majority of OCP use cases. For more information about writing your data to an LMDB, please see the LMDB Dataset Tutorial.
This notebook provides an overview of how to create LMDB datasets to be used with the OCP repo. This tutorial is intended for those who wish to use OCP to train on their own datasets. Those interested in just using OCP data need not worry about these steps as they’ve been automated as part of the download script: https://github.com/Open-Catalyst-Project/ocp/blob/master/scripts/download_data.py.
Initial Structure to Relaxed Energy/Structure (IS2RE/IS2RS) LMDBs#
IS2RE/IS2RS LMDBs utilize the SinglePointLmdb dataset. This dataset expects the data to be contained in a SINGLE LMDB file. In addition to the attributes defined by AtomsToGraph, the following attributes must be added for the IS2RE/IS2RS tasks:
pos_relaxed: Relaxed adslab positions
@@ -653,7 +655,7 @@
Initial Structure to Relaxed Energy/Structure (IS2RE/IS2RS) LMDBs
-
S2EF LMDBs utilize the TrajectoryLmdb dataset. This dataset expects a directory of LMDB files. In addition to the attributes defined by AtomsToGraph, the following attributes must be added for the S2EF task:
TrajectoryLmdbDataset supports multiple LMDB files because of the need to highly parallelize the dataset construction process. With OCP’s largest split containing 135M+ frames, parallelizing LMDB generation was necessary. If you find yourself needing to deal with very large datasets we recommend parallelizing this process.
There are multiple ways to train and evaluate OCP models on data other than OC20 and OC22. Writing an LMDB is the most performant option. However, ASE-based dataset formats are also included as a convenience for people with existing data who simply want to try OCP tools without needing to learn about LMDBs.
-
This tutorial will briefly discuss the basic use of these dataset formats. For more detailed information about the ASE datasets, see the source code and docstrings.
Storing your data in an LMDB ensures very fast random read speeds for the fastest supported throughput. This is the recommended option for the majority of OCP use cases. For more information about writing your data to an LMDB, please see the LMDB Dataset Tutorial.
If your data is already in an ASE Database, no additional preprocessing is necessary before running training/prediction! Although the ASE DB backends may not be sufficiently high throughput for all use cases, they are generally considered “fast enough” to train on a reasonably-sized dataset with 1-2 GPUs or predict with a single GPU. If you want to effectively utilize more resources than this, please be aware of the potential for this bottleneck and consider writing your data to an LMDB. If your dataset is small enough to fit in CPU memory, use the keep_in_memory: True option to avoid this bottleneck.
-
To use this dataset, we will just have to change our config files to use the ASE DB Dataset rather than the LMDB Dataset:
-
dataset:
-  format: ase_db
-  train:
-    src: # The path/address to your ASE DB
-    connect_args:
-      # Keyword arguments for ase.db.connect()
-    select_args:
-      # Keyword arguments for ase.db.select()
-      # These can be used to query/filter the ASE DB
-    a2g_args:
-      r_energy: True
-      r_forces: True
-      # Set these if you want to train on energy/forces
-      # Energy/force information must be in the ASE DB!
-    keep_in_memory: False # Keeping the dataset in memory reduces random reads and is extremely fast, but this is only feasible for relatively small datasets!
-    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
-  val:
-    src:
-    a2g_args:
-      r_energy: True
-      r_forces: True
-  test:
-    src:
-    a2g_args:
-      r_energy: False
-      r_forces: False
-      # It is not necessary to have energy or forces if you are just making predictions.
-
It is possible to train/predict directly on ASE-readable files. This is only recommended for smaller datasets, as directories of many small files do not scale efficiently on all computing infrastructures. There are two options for loading data with the ASE reader:
This dataset assumes a single structure will be obtained from each file:
-
dataset:
-  format: ase_read
-  train:
-    src: # The folder that contains ASE-readable files
-    pattern: # Pattern matching each file you want to read (e.g. "*/POSCAR"). Search recursively with two wildcards: "**/*.cif".
-    include_relaxed_energy: False # Read the last structure's energy and save as "y_relaxed" for IS2RE-Direct training
-
-    ase_read_args:
-      # Keyword arguments for ase.io.read()
-    a2g_args:
-      # Include energy and forces for training purposes
-      # If True, the energy/forces must be readable from the file (ex. OUTCAR)
-      r_energy: True
-      r_forces: True
-    keep_in_memory: False
-
This dataset supports reading files that each contain multiple structures (for example, an ASE .traj file). Using an index file, which tells the dataset how many structures each file contains, is recommended. Otherwise, the dataset is forced to load every file at startup and count the number of structures!
-
dataset:
-  format: ase_read_multi
-  train:
-    index_file: Filepath to an index file which contains each filename and the number of structures in each file. e.g.:
-      /path/to/relaxation1.traj 200
-      /path/to/relaxation2.traj 150
-      ...
-
-    # If using an index file, the src and pattern are not necessary
-    src: # The folder that contains ASE-readable files
-    pattern: # Pattern matching each file you want to read (e.g. "*.traj"). Search recursively with two wildcards: "**/*.xyz".
-
-    ase_read_args:
-      # Keyword arguments for ase.io.read()
-    a2g_args:
-      # Include energy and forces for training purposes
-      r_energy: True
-      r_forces: True
-    keep_in_memory: False
-
Python library for programmatic use of the Open Catalyst Demo. Users unfamiliar with the Open Catalyst Demo are encouraged to read more about it before continuing.
The following examples are used to search for *OH binding sites on Pt surfaces. They use the find_adsorbate_binding_sites function, which is a high-level workflow on top of other methods included in this library. Once familiar with this routine, users are encouraged to learn about lower-level methods and features that support more advanced use cases.
This package relies heavily on asyncio. The examples throughout this document can be copied to a python repl launched with:
+
+
+
%%sh
+$ python -m asyncio
+
+
+
+
+
sh: 1: $: not found
+
+
+
---------------------------------------------------------------------------
+CalledProcessError                        Traceback (most recent call last)
+Cell In[2], line 1
+----> 1 get_ipython().run_cell_magic('sh', '', '$ python -m asyncio\n')
+
+File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2541, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
+   2539 with self.builtin_trap:
+   2540     args = (magic_arg_s, cell)
+-> 2541     result = fn(*args, **kwargs)
+   2543 # The code below prevents the output from being displayed
+   2544 # when using magics with decorator @output_can_be_silenced
+   2545 # when the last Python token in the expression is a ';'.
+   2546 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):
+
+File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/IPython/core/magics/script.py:155, in ScriptMagics._make_script_magic.<locals>.named_script_magic(line, cell)
+    153 else:
+    154     line = script
+--> 155 return self.shebang(line, cell)
+
+File /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/IPython/core/magics/script.py:315, in ScriptMagics.shebang(self, line, cell)
+    310 if args.raise_error and p.returncode != 0:
+    311     # If we get here and p.returncode is still None, we must have
+    312     # killed it but not yet seen its return code. We don't wait for it,
+    313     # in case it's stuck in uninterruptible sleep. -9 = SIGKILL
+    314     rc = p.returncode or -9
+--> 315     raise CalledProcessError(rc, cell)
+
+CalledProcessError: Command 'b'$ python -m asyncio\n'' returned non-zero exit status 127.
+
+
+
+
+
Alternatively, an async function can be run in a script by wrapping it with asyncio.run():
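A minimal sketch of that pattern, using only the standard library. The coroutine `relax_placeholder` and its arguments are hypothetical stand-ins for an awaitable library call, not part of any real API; only `asyncio.run()` itself is the point here.

```python
import asyncio


async def relax_placeholder(adsorbate: str, bulk: str) -> dict:
    """Hypothetical stand-in for an awaitable library call."""
    # Yield control to the event loop, as a real network request would
    await asyncio.sleep(0)
    return {"adsorbate": adsorbate, "bulk": bulk}


async def main() -> dict:
    # Inside a coroutine, other coroutines are awaited as usual
    return await relax_placeholder("*OH", "Pt")


# asyncio.run() creates an event loop, runs main() to completion, and closes the loop
result = asyncio.run(main())
print(result)
```

In a script, the call to `asyncio.run()` would normally be the single entry point, since it cannot be called from code that is already running inside an event loop.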
Users will be prompted to select one or more surfaces that should be relaxed.
+
Input to this function includes:
+
+
The name of the adsorbate to place
+
A unique ID of the bulk structure from which surfaces will be generated
+
+
This function will perform the following steps:
+
+
Enumerate surfaces of the bulk material
+
On each surface, enumerate initial guesses for adsorbate binding sites
+
Run local force-based relaxations of each adsorbate placement
+
+
In addition, this handles:
+
+
Retrying failed calls to the Open Catalyst Demo API
+
Retrying submission of relaxations when they are rate limited
+
+
This should take 2-10 minutes to finish while tens to hundreds (depending on the number of surfaces that are selected) of individual adsorbate placements are relaxed on unique surfaces of Pt. Each of the objects in the returned list includes (among other details):
+
+
Information about the surface being searched, including its structure and Miller indices
+
The initial positions of the adsorbate before relaxation
+
The final structure after relaxation
+
The predicted energy of the final structure
+
The predicted force on each atom in the final structure
Results should be saved whenever possible in order to avoid expensive recomputation.
+
Assuming results was generated with the find_adsorbate_binding_sites method used above, it is an AdsorbateBindingSites object. This can be saved to file with:
Relaxation results can be viewed in a web UI. For example, https://open-catalyst.metademolab.com/results/7eaa0d63-83aa-473f-ac84-423ffd0c67f5 shows the results of relaxing *OH on a Pt (1,1,1) surface; the uuid, “7eaa0d63-83aa-473f-ac84-423ffd0c67f5”, is referred to as the system_id.
+
Extending the examples above, the URLs to visualize the results of relaxations on each Pt surface can be obtained with:
Calls to find_adsorbate_binding_sites() will, by default, show the user all pending relaxations and ask for approval before they are submitted. In order to run the relaxations automatically without manual approval, adslab_filter can be set to a function that automatically approves any or all adsorbate/slab (adslab) configurations.
Important! The to_ase_atoms() method described below will fail with an import error if ase is not installed.
+
Two classes have support for generating ase.Atoms objects:
+
+
ocpapi.Atoms.to_ase_atoms(): Adds unit cell, atomic positions, and other structural information to the returned ase.Atoms object.
+
ocpapi.AdsorbateSlabRelaxationResult.to_ase_atoms(): Adds the same structure information to the ase.Atoms object. Also adds the predicted forces and energy of the relaxed structure, which can be accessed with the ase.Atoms.get_potential_energy() and ase.Atoms.get_forces() methods.
+
+
For example, the following would generate an ase.Atoms object for the first relaxed adsorbate configuration on the first slab generated for *OH binding on Pt:
From an ase.Atoms object (see previous section), it is possible to write to other structure formats. Extending the example above, the ase_atoms object could be written to a VASP POSCAR file with:
If you use ocpapi in your research, please consider citing the AdsorbML paper (in addition to the relevant datasets / models used):
+
@article{lan2023adsorbml,
+title={{AdsorbML}: a leap in efficiency for adsorption energy calculations using generalizable machine learning potentials},
+author={Lan*, Janice and Palizhati*, Aini and Shuaibi*, Muhammed and Wood*, Brandon M and Wander, Brook and Das, Abhishek and Uyttendaele, Matt and Zitnick, C Lawrence and Ulissi, Zachary W},
+journal={npj Computational Materials},
+year={2023},
+}
+