Issues related to data preprocessing of datasets #61

Open
zzutao opened this issue Jul 30, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation good first issue Good for newcomers

Comments

@zzutao

zzutao commented Jul 30, 2024

Hello.
I downloaded the dataset from https://figshare.com/articles/dataset/MPF_2021_2_8/19470599 and merged the two .p files. To prepare the data for model training, I am converting it into a list of ASE Atoms objects and then using the sevenn_graph_build command to generate a sevenn_data file. After parsing the dictionary into ['structure', 'energy', 'force', 'stress', 'id'], it is unclear to me how these fields should be used to instantiate Atoms objects. Is there any documentation or script that generates datasets that sevenn_graph_build can process?
Thank you.

@YutackPark
Member

SevenNet tries to read 'free_energy' first and, if 'free_energy' is not available, uses 'energy'.

Internally, 'free_energy' is obtained from the ASE Atoms object with the following code:

E = atoms.get_potential_energy(force_consistent=True)
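The lookup order described above can be sketched in plain Python. This is only an illustration under stated assumptions, not SevenNet's actual implementation: the `results` dict stands in for the properties stored on an ASE calculator, and `pick_energy` is a hypothetical helper name.

```python
def pick_energy(results):
    """Return 'free_energy' if present, otherwise fall back to 'energy'.

    `results` is a hypothetical stand-in for the properties stored on an
    ASE calculator. This mirrors the lookup order described in the thread,
    not SevenNet's real code.
    """
    if results.get("free_energy") is not None:
        return results["free_energy"]
    return results["energy"]


# Both keys present: the free energy is preferred.
print(pick_energy({"energy": -10.0, "free_energy": -10.2}))
# Only 'energy' present: fall back to it.
print(pick_energy({"energy": -10.0}))
```

In practice this means that a preprocessing script which only knows one energy value can store it under both keys, so the result is the same whichever key is read.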

For the MPF dataset, you can check the consistency of your preprocessing script, which converts each MPF entry to ASE Atoms, by comparing its results with an ASE Atoms instance read from a VASP OUTCAR.

This comparison makes sense because the authors of the MPF dataset say the values are raw VASP outputs. You can create an ASE Atoms instance from energy, forces, and stress like this:

from ase.atoms import Atoms
from ase.calculators.singlepoint import SinglePointCalculator

atom = Atoms(species, pos, cell=cell, pbc=True)                               
calc_results = {"energy": energy,
                "free_energy": energy,                                        
                "forces": force,                                              
                "stress": stress}
calculator = SinglePointCalculator(atom, **calc_results)
atom = calculator.get_atoms()

The MPF dataset is a special case because it does not originate from MD software.

sevenn_graph_build can work with any ASE-readable data. Here is the relevant ASE documentation: https://wiki.fysik.dtu.dk/ase/ase/io/io.html

Write your ASE Atoms objects to the 'extxyz' format. The resulting file can be passed directly to sevenn_graph_build, for instance:

sevenn_graph_build --format ase my_data.extxyz 5.0

The first positional argument is the file name, and the second is the cutoff radius of the model.

We're planning to write tutorials in pure Python! Until then, I think it is better to keep this issue open.

@YutackPark YutackPark added documentation Improvements or additions to documentation good first issue Good for newcomers labels Jul 31, 2024
@zzutao
Author

zzutao commented Aug 1, 2024

Hello!
Thank you very much for your reply! Following your guidance, I attempted to preprocess the MPF dataset. The program did not report any errors and built the sevenn_data file, but when I used that file for model training, the data looked a bit strange and the model seemed stuck in the first epoch (after about an hour of training it had only reached the second epoch).
Because the values of 'structure', 'energy', 'force', 'stress', and 'id' are all lists of length 3, I was confused, so I iterate over each list to parse it and initialize Atoms objects. Another point of confusion is that atoms.get_potential_energy(force_consistent=True) requires a calculator to be set before it can be called.
Here is the main program I wrote to parse the data:
data = merge_data('block_0.p', 'block_1.p')

atoms_list = []

for material_id, snapshots in data.items():
    # print(type(snapshots),  snapshots.keys())
    # ['structure', 'energy', 'force', 'stress', 'id']
    snapshot_ids = snapshots['id']
    stresses = snapshots['stress']
    forces = snapshots['force']
    energies = snapshots['energy']
    structures = snapshots['structure']
    # print(snapshot_ids,len(snapshot_ids), stresses,len(stresses), forces,len(forces), energies,len(energies), len(structures))
    for i in range(len(structures)):
        lattice_matrix = structures[i].lattice.matrix
        symbols = [site.specie.symbol for site in structures[i].sites]
        positions = structures[i].cart_coords
        atoms = Atoms(symbols=symbols, positions=positions, cell=lattice_matrix, pbc=True)
        # E = atoms.get_potential_energy(force_consistent=True)
        calc_results = {"energy": energies[i],
            # "free_energy": E,
            "forces": np.array(forces[i]),
            "stress": -0.1 * np.array(stresses[i])}
        calculator = SinglePointCalculator(atoms, **calc_results)
        atoms = calculator.get_atoms()
        atoms_list.append(atoms)

filename = "my_data.extxyz"
ase.io.write(filename, atoms_list, format='extxyz')

When you have time, please review and correct my code. Thank you again.

Appendix: printed output

Number of atoms in the train_set:
my_data             : {'Na': 68246, 'Cd': 16692, 'Sn': 29901, 'S': 152447, 'Li': 190588, 'Sb': 33159, 'P': 171506, 'O': 1954453, \
                       'Ca': 42760, 'Ti': 42445, 'Al': 46200, 'F': 233412, 'In': 19901, 'Br': 35584, 'Yb': 8698, 'Ir': 6556, \
                       'Cl': 88268, 'I': 42909, 'La': 25199, 'Ru': 9133, 'Fe': 75727, 'Si': 91610, 'Zr': 15633, 'Nb': 23578, \
                       'Eu': 4339, 'Cs': 18481, 'V': 56153, 'Ge': 32183, 'Cr': 34671, 'Ni': 48364, 'Bi': 34529, 'Rb': 24631, \
                       'Au': 10197, 'Mg': 36581, 'B': 72706, 'Mn': 74439, 'Cu': 47056, 'Sr': 29143, 'Te': 39865, 'Ba': 41586, \
                       'Pu': 1789, 'Pb': 14927, 'Co': 45122, 'Hf': 8235, 'Rh': 10637, 'Y': 17820, 'Ta': 15483, 'W': 23788, \
                       'Se': 70931, 'Th': 3503, 'Pa': 656, 'Hg': 12878, 'Zn': 36152, 'Mo': 33264, 'Pt': 9696, 'Pr': 10852, \
                       'Sc': 9187, 'N': 98904, 'Np': 1018, 'C': 95831, 'Be': 7669, 'K': 49168, 'Gd': 6352, 'Ag': 20087, \
                       'Pd': 12941, 'Nd': 11779, 'H': 249976, 'Tl': 14381, 'Os': 4737, 'Tm': 5687, 'Dy': 7753, 'As': 28674, \
                       'Pm': 987, 'Lu': 6125, 'Ce': 11506, 'Sm': 9906, 'Er': 8448, 'Tb': 7724, 'Ga': 22489, 'U': 9162, 'Ho': 8075, \
                       'Xe': 1246, 'Re': 7530, 'Ac': 669, 'Tc': 1722, 'He': 48, 'Kr': 169, 'Ar': 9, 'Ne': 3}
Total, label wise   : {'Na': 68246, 'Cd': 16692, 'Sn': 29901, 'S': 152447, 'Li': 190588, 'Sb': 33159, 'P': 171506, 'O': 1954453, \
                       'Ca': 42760, 'Ti': 42445, 'Al': 46200, 'F': 233412, 'In': 19901, 'Br': 35584, 'Yb': 8698, 'Ir': 6556, \
                       'Cl': 88268, 'I': 42909, 'La': 25199, 'Ru': 9133, 'Fe': 75727, 'Si': 91610, 'Zr': 15633, 'Nb': 23578, \
                       'Eu': 4339, 'Cs': 18481, 'V': 56153, 'Ge': 32183, 'Cr': 34671, 'Ni': 48364, 'Bi': 34529, 'Rb': 24631, \
                       'Au': 10197, 'Mg': 36581, 'B': 72706, 'Mn': 74439, 'Cu': 47056, 'Sr': 29143, 'Te': 39865, 'Ba': 41586, \
                       'Pu': 1789, 'Pb': 14927, 'Co': 45122, 'Hf': 8235, 'Rh': 10637, 'Y': 17820, 'Ta': 15483, 'W': 23788, \
                       'Se': 70931, 'Th': 3503, 'Pa': 656, 'Hg': 12878, 'Zn': 36152, 'Mo': 33264, 'Pt': 9696, 'Pr': 10852, \
                       'Sc': 9187, 'N': 98904, 'Np': 1018, 'C': 95831, 'Be': 7669, 'K': 49168, 'Gd': 6352, 'Ag': 20087, \
                       'Pd': 12941, 'Nd': 11779, 'H': 249976, 'Tl': 14381, 'Os': 4737, 'Tm': 5687, 'Dy': 7753, 'As': 28674, \
                       'Pm': 987, 'Lu': 6125, 'Ce': 11506, 'Sm': 9906, 'Er': 8448, 'Tb': 7724, 'Ga': 22489, 'U': 9162, 'Ho': 8075, \
                       'Xe': 1246, 'Re': 7530, 'Ac': 669, 'Tc': 1722, 'He': 48, 'Kr': 169, 'Ar': 9, 'Ne': 3}
Total               : 5065224
------------------------------------------------------------------------------------------------------------------------
Per atom energy(eV/atom) distribution:
my_data             : {'mean': '-5.975', 'std': '1.863', 'median': '-6.192', 'max': '49.575', 'min': '-28.731'}
Total               : {'mean': '-5.975', 'std': '1.863', 'median': '-6.192', 'max': '49.575', 'min': '-28.731'}
------------------------------------------------------------------------------------------------------------------------
Force(eV/Angstrom) distribution:
my_data             : {'mean': '-0.000', 'std': '3.369', 'median': '0.000', 'max': '2552.991', 'min': '-2570.567'}
Total               : {'mean': '-0.000', 'std': '3.369', 'median': '0.000', 'max': '2552.991', 'min': '-2570.567'}
------------------------------------------------------------------------------------------------------------------------
Stress(eV/Angstrom^3) distribution:
my_data             : {'mean': '1.258', 'std': '30.459', 'median': '0.000', 'max': '5474.488', 'min': '-1397.567'}
Total               : {'mean': '1.258', 'std': '30.459', 'median': '0.000', 'max': '5474.488', 'min': '-1397.567'}
------------------------------------------------------------------------------------------------------------------------
training_set size   : {'my_data': 168919}
validation_set size : {'my_data': 18768}

Calculating statistic values from dataset
Average # of neighbors: 36.994069
Use global shift, scale
shift, scale        : -5.975250, 3.368628
(1st) conv_denominator is: 36.994069
Shuffle the train data

@YutackPark
Member

Due to the fact that the values of 'structure', 'energy', 'force', 'stress', and 'id' are all lists of length 3, I am confused and will try to traverse the list to parse and initialize Atoms.

The MPF dataset samples three structures per relaxation trajectory. I recommend checking their paper for details.
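Given that layout, each material entry can be unrolled into one record per sampled structure. The following is a minimal sketch, assuming `data` maps a material id to a dict of equal-length lists as reported earlier in this thread. `iter_frames` and the toy values are hypothetical, and the strings stand in for real structure objects.

```python
def iter_frames(data):
    """Yield one record per sampled structure.

    `data` is assumed to map a material id to a dict whose 'structure',
    'energy', 'force', 'stress', and 'id' values are equal-length lists.
    """
    keys = ("structure", "energy", "force", "stress", "id")
    for material_id, snapshots in data.items():
        lengths = {len(snapshots[k]) for k in keys}
        if len(lengths) != 1:
            # Refuse to guess how to align fields of different lengths.
            raise ValueError(f"{material_id}: field lengths differ: {lengths}")
        for i in range(lengths.pop()):
            frame = {k: snapshots[k][i] for k in keys}
            frame["material_id"] = material_id
            yield frame


# Toy entry with three sampled structures (placeholder values only).
toy = {
    "mp-0": {
        "structure": ["s0", "s1", "s2"],
        "energy": [-1.0, -1.1, -1.2],
        "force": [[[0.0, 0.0, 0.0]]] * 3,
        "stress": [[[0.0] * 3] * 3] * 3,
        "id": ["mp-0-0", "mp-0-1", "mp-0-2"],
    }
}
frames = list(iter_frames(toy))
print(len(frames), frames[0]["id"])
```

Checking that all five lists have the same length before indexing catches misaligned entries early instead of silently pairing a structure with the wrong energy or forces.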

"stress": -0.1 * np.array(stresses[i])}

Instead of multiplying by -0.1, try this: stress = -1 * stress / 1602.1766208  # kbar to eV/Angstrom^3

A standard ASE Atoms instance is expected to carry its stress in eV/Angstrom^3. As I mentioned, the best way to validate the script is to compare your result with the output of ase.io.read on any VASP OUTCAR with non-zero stress. Verify your preprocessing code before moving on! Alternatively, I think it is reasonable to open an issue or discussion on the M3GNet GitHub to ask for the script that converts their structures into ASE Atoms.
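The suggested conversion can be written out as a small function. This is a sketch that assumes the MPF stresses are 3x3 tensors in kbar with VASP's sign convention. The factor 1602.1766208 is the number of kbar in 1 eV/Angstrom^3, whereas multiplying by 0.1 would convert kbar to GPa. `vasp_stress_to_ase` is a hypothetical helper name, and the input values are made up.

```python
KBAR_PER_EV_PER_ANG3 = 1602.1766208  # 1 eV/Angstrom^3 expressed in kbar


def vasp_stress_to_ase(stress_kbar):
    """Convert a stress tensor from kbar to eV/Angstrom^3 and flip its sign.

    Assumes a nested list (rows of floats) in kbar with VASP's sign
    convention, as suggested in the thread for the MPF data.
    """
    return [[-value / KBAR_PER_EV_PER_ANG3 for value in row]
            for row in stress_kbar]


# Made-up diagonal stress of +/- 16.02 kbar, roughly -/+ 0.01 eV/Angstrom^3.
stress_kbar = [[16.021766208, 0.0, 0.0],
               [0.0, -16.021766208, 0.0],
               [0.0, 0.0, 0.0]]
converted = vasp_stress_to_ase(stress_kbar)
print(converted[0][0], converted[1][1])
```

With NumPy arrays the same conversion is the one-liner quoted above, `-1 * stress / 1602.1766208`.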

The SevenNet log file looks fine. Since the dataset has ~188K structures, the slow training speed you observed is expected. Unfortunately, pre-training SevenNet-0 is a computationally demanding task. I recommend using multi-GPU training. Cheers!

@zzutao
Author

zzutao commented Aug 2, 2024

Thank you very much for your explanation and suggestions!

@thangckt
Contributor

Hi @YutackPark,
could you provide a short guide on how to handle the large JSON MPTrj dataset efficiently, or on how to convert it to ASE Atoms?

Thank you so much.

@YutackPark
Member

hi @YutackPark can you have a little guide on how to efficiently handle a large json MPTraj dataset? or how to convert it to ase atoms?

Thank you so much.

Here's the code:
https://github.com/janosh/matbench-discovery/blob/main/models/sevennet/train_sevennet/convert_mptrj_to_xyz.py

Note that while the code splits the dataset into train, valid, and test sets, SevenNet-0 used all the data in MPTrj without splitting.

Handling a large dataset is a separate problem (#88). A preprocessed graph (.sevenn_data) might not fit into memory. There is an experimental feature I'm currently working on: https://github.com/MDIL-SNU/SevenNet/tree/ase_db
It uses ase db to load atoms from disk dynamically and build the graphs.


3 participants