Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement fingerprint functionality #47

Merged
merged 13 commits into from
Aug 6, 2022
Merged

implement fingerprint functionality #47

merged 13 commits into from
Aug 6, 2022

Conversation

sgbaird
Copy link
Member

@sgbaird sgbaird commented Aug 5, 2022

WIP: loading precomputed compositional and structural fingerprints

#39 (comment)

comp_fingerprints = cdvae_cov_comp_fingerprints(mpt.inputs)
struct_fingerprints = cdvae_cov_struct_fingerprints(mpt.inputs)

# save the fingerprints and upload to figshare

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://pypi.org/project/pystow/ might be useful for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hadn't heard of this one. Looking into it. Thanks!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also https://github.com/cthoyt/zenodo-client/! but obv this is for zenodo instead of figshare. This package completely automates the update and versioning cycle for zenodo

Copy link
Member Author

@sgbaird sgbaird Aug 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cthoyt thanks! I'll need to check out hosting datasets (instead of just GitHub repo snapshots) via Zenodo

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using GitHub for hosting big datasets (like bigger than 100mb) isn’t so great since you go into LFS territory

Copy link
Member Author

@sgbaird sgbaird Aug 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cthoyt agreed, I've typically been using FigShare for datasets over ~10 MB, and strongly avoid LFS (which of course has its advantages, just not a good fit for me usually). I've been using Zenodo to get DOIs for my GitHub repos (i.e. DOI the code) but hadn't thought of using it directly for dataset files. Zenodo seems really nice - large dataset limits, code formatting support, etc. I think I'll try it out next time. Aside: maybe a comparable package for FigShare

@@ -78,7 +86,7 @@ def cdvae_cov_struct_fingerprints(structures, verbose=False):
my_tqdm = get_tqdm(verbose)
struct_fps = []
# base_10_check = [10 ** j for j in range(0, 20)]
for i, s in enumerate(my_tqdm(structures)):
for s in my_tqdm(structures):
# if i in base_10_check == 0:
# logger.info(f"{time()} Struct fingerprint {i}/{len(structures)}")
site_fps = [CrystalNNFP.featurize(s, i) for i in range(len(s))]
Copy link

@kjappelbaum kjappelbaum Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might want to consider featurize_many or just use concurrent.futures and parallelize yourself.
Also, there is a SiteStatsFingerprint https://hackingmaterials.lbl.gov/matminer/matminer.featurizers.structure.html#matminer.featurizers.structure.sites.SiteStatsFingerprint

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with featurize_dataframe. Thanks for pointing these out! VS Code+debugging+multiprocessing doesn't play nicely together unless the call to parallel code is nested inside a if __name__ == "__main__": statement, so you might see those crop up in a few places.

verbose=False,
**match_kwargs,
):
if match_type == "cdvae_coverage":

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you were to add the hashes, it would be something like:

from structuregraph_helpers.cli import create_hashes_for_structure
import concurrent.futures
# get an OrderedDict of different hash "flavors"

all_hashes = []

with concurrent.futures.ProcessPoolExecutor(max_workers=n_jobs) as executor: 
	for hashes, name in zip(executor.map(create_hashes_for_structure, structures), names)
		# the identifier "name" might also just be an index. However, it is good to have one 
		# with this multiprocessing stuff
		hashes['name'] = name
		all_hashes.append(hashes)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#21

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the fingerprinting, do you think this would speed things up overall? (given that CrystalNN is used for the structural fingerprints atm). For StructureMatcher, I think it makes sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to optimize for speed, I'd go with a cutoff-based method for computing the graph (e.g. the CutoffDictNN with the vesta presets). The voronoi tesselations tend to be slow for large structures

@sgbaird
Copy link
Member Author

sgbaird commented Aug 6, 2022

@kjappelbaum Planning to add you and Berend as co-authors on the figshare dataset.

@sgbaird sgbaird marked this pull request as ready for review August 6, 2022 20:02
@sgbaird sgbaird merged commit b398bf2 into main Aug 6, 2022
@sgbaird sgbaird deleted the fingerprint branch August 6, 2022 20:18
@sgbaird sgbaird mentioned this pull request Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants