implement fingerprint functionality #47
Conversation
scripts/fingerprint_snapshot.py
comp_fingerprints = cdvae_cov_comp_fingerprints(mpt.inputs)
struct_fingerprints = cdvae_cov_struct_fingerprints(mpt.inputs)

# save the fingerprints and upload to figshare
https://pypi.org/project/pystow/ might be useful for this
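A minimal sketch of how pystow could be used here; the module name and URL below are made-up placeholders, not anything from this repo:

import pystow

# reserve an app-specific cache directory, e.g. ~/.data/my-project/fingerprints
module = pystow.module("my-project", "fingerprints")

# download-and-cache a remote file on first use; subsequent calls reuse the local copy
path = module.ensure(url="https://example.com/comp_fingerprints.csv")  # hypothetical URL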
Hadn't heard of this one. Looking into it. Thanks!
Also https://github.com/cthoyt/zenodo-client/! But obviously this is for Zenodo instead of figshare. This package completely automates the update and versioning cycle for Zenodo.
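For reference, a hedged sketch modeled on the zenodo-client README (all metadata values below are placeholders):

from zenodo_client import Creator, Metadata, ensure_zenodo

# placeholder metadata describing a hypothetical fingerprint dataset
metadata = Metadata(
    title="Fingerprint snapshot",
    upload_type="dataset",
    description="Compositional and structural fingerprints",
    creators=[Creator(name="Last, First")],
)

# the first call creates the record; later calls with the same key publish a new version
res = ensure_zenodo(
    key="fingerprint-snapshot",  # local key used to track the deposition ID
    data=metadata,
    paths=["fingerprints.csv"],
    sandbox=True,  # use the Zenodo sandbox while testing
)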
@cthoyt thanks! I'll need to check out hosting datasets (instead of just GitHub repo snapshots) via Zenodo
Yeah, using GitHub for hosting big datasets (like bigger than 100 MB) isn't so great, since you go into LFS territory.
@cthoyt agreed, I've typically been using FigShare for datasets over ~10 MB, and I strongly avoid LFS (which of course has its advantages, it's just not usually a good fit for me). I've been using Zenodo to get DOIs for my GitHub repos (i.e., to DOI the code) but hadn't thought of using it directly for dataset files. Zenodo seems really nice: large dataset limits, code formatting support, etc. I think I'll try it out next time. Aside: maybe a comparable package exists for FigShare.
@@ -78,7 +86,7 @@ def cdvae_cov_struct_fingerprints(structures, verbose=False):
     my_tqdm = get_tqdm(verbose)
     struct_fps = []
     # base_10_check = [10 ** j for j in range(0, 20)]
-    for i, s in enumerate(my_tqdm(structures)):
+    for s in my_tqdm(structures):
         # if i in base_10_check == 0:
         #     logger.info(f"{time()} Struct fingerprint {i}/{len(structures)}")
         site_fps = [CrystalNNFP.featurize(s, i) for i in range(len(s))]
You might want to consider featurize_many, or just use concurrent.futures and parallelize yourself. Also, there is SiteStatsFingerprint: https://hackingmaterials.lbl.gov/matminer/matminer.featurizers.structure.html#matminer.featurizers.structure.sites.SiteStatsFingerprint
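A minimal sketch of that suggestion, assuming matminer is installed and structures is a list of pymatgen Structure objects:

from matminer.featurizers.structure import SiteStatsFingerprint

# aggregate per-site CrystalNN fingerprints into one fixed-length vector per structure
ssf = SiteStatsFingerprint.from_preset("CrystalNNFingerprint_ops")
ssf.set_n_jobs(4)  # featurize_many spreads the work over this many processes
struct_fps = ssf.featurize_many(structures, ignore_errors=True, pbar=True)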
Went with featurize_dataframe. Thanks for pointing these out! VS Code + debugging + multiprocessing don't play nicely together unless the call to the parallel code is nested inside an if __name__ == "__main__": guard, so you might see those crop up in a few places.
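For illustration, the guard pattern in question (run_fingerprinting is a hypothetical wrapper around the parallel featurization call):

def run_fingerprinting():
    ...  # call featurize_dataframe / other parallel code here

if __name__ == "__main__":
    # without this guard, worker processes spawned by multiprocessing re-import
    # the module and would re-trigger the parallel call
    run_fingerprinting()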
    verbose=False,
    **match_kwargs,
):
    if match_type == "cdvae_coverage":
If you were to add the hashes, it would be something like:

from structuregraph_helpers.cli import create_hashes_for_structure
import concurrent.futures

# get an OrderedDict of different hash "flavors" for each structure
all_hashes = []
with concurrent.futures.ProcessPoolExecutor(max_workers=n_jobs) as executor:
    for hashes, name in zip(executor.map(create_hashes_for_structure, structures), names):
        # the identifier "name" might also just be an index; however, it is good
        # to have one with this multiprocessing stuff
        hashes["name"] = name
        all_hashes.append(hashes)
For the fingerprinting, do you think this would speed things up overall (given that CrystalNN is used for the structural fingerprints atm)? For StructureMatcher, I think it makes sense.
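A hedged sketch of what parallelizing the StructureMatcher comparisons could look like (pairs is a hypothetical list of (Structure, Structure) tuples, not code from this PR):

import concurrent.futures
from pymatgen.analysis.structure_matcher import StructureMatcher

matcher = StructureMatcher()

def match_pair(pair):
    # fit() returns True if the two structures match under the default tolerances
    return matcher.fit(*pair)

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(match_pair, pairs))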
If you want to optimize for speed, I'd go with a cutoff-based method for computing the graph (e.g., CutOffDictNN with the VESTA presets). The Voronoi tessellations tend to be slow for large structures.
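A minimal sketch of that suggestion, assuming pymatgen and a Structure object named structure:

from pymatgen.analysis.local_env import CutOffDictNN
from pymatgen.analysis.graphs import StructureGraph

# cutoff-based neighbor finding with the VESTA preset; avoids Voronoi tessellation
nn = CutOffDictNN.from_preset("vesta_2019")
graph = StructureGraph.with_local_env_strategy(structure, nn)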
@kjappelbaum Planning to add you and Berend as co-authors on the figshare dataset.
For the latter, if fingerprints are not already calculated.
WIP: loading precomputed compositional and structural fingerprints
#39 (comment)