implement fingerprint functionality #47
Conversation
scripts/fingerprint_snapshot.py
comp_fingerprints = cdvae_cov_comp_fingerprints(mpt.inputs)
struct_fingerprints = cdvae_cov_struct_fingerprints(mpt.inputs)

# save the fingerprints and upload to figshare
https://pypi.org/project/pystow/ might be useful for this
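A minimal sketch of how pystow could be used here; the module name and URL below are made-up placeholders, not anything from this repo:

import pystow

# reserve an app-specific cache directory, e.g. ~/.data/my-project/fingerprints
module = pystow.module("my-project", "fingerprints")

# download-and-cache a remote file on first use; subsequent calls reuse the local copy
path = module.ensure(url="https://example.com/comp_fingerprints.csv")  # hypothetical URL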
Hadn't heard of this one. Looking into it. Thanks!
Also https://github.com/cthoyt/zenodo-client/! But obviously this is for Zenodo instead of figshare. This package completely automates the update and versioning cycle for Zenodo.
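For reference, a hedged sketch modeled on the zenodo-client README (all metadata values below are placeholders):

from zenodo_client import Creator, Metadata, ensure_zenodo

# placeholder metadata describing a hypothetical fingerprint dataset
metadata = Metadata(
    title="Fingerprint snapshot",
    upload_type="dataset",
    description="Compositional and structural fingerprints",
    creators=[Creator(name="Last, First")],
)

# the first call creates the record; later calls with the same key publish a new version
res = ensure_zenodo(
    key="fingerprint-snapshot",  # local key used to track the deposition ID
    data=metadata,
    paths=["fingerprints.csv"],
    sandbox=True,  # use the Zenodo sandbox while testing
)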
@cthoyt thanks! I'll need to check out hosting datasets (instead of just GitHub repo snapshots) via Zenodo
Yeah, using GitHub for hosting big datasets (like bigger than 100 MB) isn't so great, since you go into LFS territory.
@cthoyt agreed, I've typically been using FigShare for datasets over ~10 MB, and I strongly avoid LFS (which of course has its advantages, it's just not usually a good fit for me). I've been using Zenodo to get DOIs for my GitHub repos (i.e., to DOI the code) but hadn't thought of using it directly for dataset files. Zenodo seems really nice: large dataset limits, code formatting support, etc. I think I'll try it out next time. Aside: maybe a comparable package exists for FigShare.
@@ -78,7 +86,7 @@ def cdvae_cov_struct_fingerprints(structures, verbose=False):
     my_tqdm = get_tqdm(verbose)
     struct_fps = []
     # base_10_check = [10 ** j for j in range(0, 20)]
-    for i, s in enumerate(my_tqdm(structures)):
+    for s in my_tqdm(structures):
         # if i in base_10_check == 0:
         #     logger.info(f"{time()} Struct fingerprint {i}/{len(structures)}")
         site_fps = [CrystalNNFP.featurize(s, i) for i in range(len(s))]
You might want to consider featurize_many, or just use concurrent.futures and parallelize yourself. Also, there is SiteStatsFingerprint: https://hackingmaterials.lbl.gov/matminer/matminer.featurizers.structure.html#matminer.featurizers.structure.sites.SiteStatsFingerprint
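A minimal sketch of that suggestion, assuming matminer is installed and structures is a list of pymatgen Structure objects:

from matminer.featurizers.structure import SiteStatsFingerprint

# aggregate per-site CrystalNN fingerprints into one fixed-length vector per structure
ssf = SiteStatsFingerprint.from_preset("CrystalNNFingerprint_ops")
ssf.set_n_jobs(4)  # featurize_many spreads the work over this many processes
struct_fps = ssf.featurize_many(structures, ignore_errors=True, pbar=True)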
Went with featurize_dataframe. Thanks for pointing these out! VS Code + debugging + multiprocessing don't play nicely together unless the call to the parallel code is nested inside an if __name__ == "__main__": guard, so you might see those crop up in a few places.
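For illustration, the guard pattern in question (run_fingerprinting is a hypothetical wrapper around the parallel featurization call):

def run_fingerprinting():
    ...  # call featurize_dataframe / other parallel code here

if __name__ == "__main__":
    # without this guard, worker processes spawned by multiprocessing re-import
    # the module and would re-trigger the parallel call
    run_fingerprinting()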
    verbose=False,
    **match_kwargs,
):
    if match_type == "cdvae_coverage":
If you were to add the hashes, it would be something like:

from structuregraph_helpers.cli import create_hashes_for_structure
import concurrent.futures

# get an OrderedDict of different hash "flavors" for each structure
all_hashes = []
with concurrent.futures.ProcessPoolExecutor(max_workers=n_jobs) as executor:
    for hashes, name in zip(executor.map(create_hashes_for_structure, structures), names):
        # the identifier "name" might also just be an index; however, it is good
        # to have one with this multiprocessing stuff
        hashes["name"] = name
        all_hashes.append(hashes)
For the fingerprinting, do you think this would speed things up overall (given that CrystalNN is used for the structural fingerprints atm)? For StructureMatcher, I think it makes sense.
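A hedged sketch of what parallelizing the StructureMatcher comparisons could look like (pairs is a hypothetical list of (Structure, Structure) tuples, not code from this PR):

import concurrent.futures
from pymatgen.analysis.structure_matcher import StructureMatcher

matcher = StructureMatcher()

def match_pair(pair):
    # fit() returns True if the two structures match under the default tolerances
    return matcher.fit(*pair)

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(match_pair, pairs))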
If you want to optimize for speed, I'd go with a cutoff-based method for computing the graph (e.g., CutOffDictNN with the VESTA presets). The Voronoi tessellations tend to be slow for large structures.
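A minimal sketch of that suggestion, assuming pymatgen and a Structure object named structure:

from pymatgen.analysis.local_env import CutOffDictNN
from pymatgen.analysis.graphs import StructureGraph

# cutoff-based neighbor finding with the VESTA preset; avoids Voronoi tessellation
nn = CutOffDictNN.from_preset("vesta_2019")
graph = StructureGraph.with_local_env_strategy(structure, nn)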
@kjappelbaum Planning to add you and Berend as co-authors on the figshare dataset.
For the latter, if fingerprints are not already calculated.
WIP: loading precomputed compositional and structural fingerprints
#39 (comment)