Issues related to data preprocessing of datasets #61

zzutao · 2024-07-30T11:57:59Z

Hello.
After downloading the dataset from https://figshare.com/articles/dataset/MPF_2021_2_8/19470599 and merging the two .p files, I want to prepare the data for model training by converting it into a list of ASE Atoms objects and then using the sevenn_graph_build command to generate a sevenn_data file. After parsing the dictionary into ['structure', 'energy', 'force', 'stress', 'id'], it is unclear to me how these fields should be used to instantiate Atoms objects. Is there any documentation or program that generates datasets that sevenn_graph_build can process?
Thank you.

YutackPark · 2024-07-31T12:07:32Z

SevenNet tries to read 'free_energy' first, and if 'free_energy' is not available, it uses 'energy'.

Internally, 'free_energy' is obtained from the ASE Atoms object with the code below:

E = atoms.get_potential_energy(force_consistent=True)
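
To make the fallback order concrete, here is a minimal sketch in plain Python. It is not SevenNet's actual implementation: `select_energy` is a hypothetical helper, and the dictionaries only stand in for the results stored on an ASE calculator. The numbers are made up.

```python
def select_energy(results):
    """Return 'free_energy' if present, otherwise fall back to 'energy'."""
    if "free_energy" in results:
        return results["free_energy"]
    return results["energy"]


# Toy values for illustration only (not real DFT data).
vasp_like = {"energy": -10.20, "free_energy": -10.25}  # both labels present
mpf_like = {"energy": -10.20}                          # only 'energy' present

picked_from_vasp = select_energy(vasp_like)  # takes 'free_energy'
picked_from_mpf = select_energy(mpf_like)    # falls back to 'energy'
```

If a dataset provides only one energy value, setting both keys to that value (as in the snippet further down) avoids relying on the fallback.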

For the MPF dataset, you can check the consistency of your preprocessing script, which converts an MPF dataset entry to ASE Atoms, by comparing its results with another ASE Atoms instance initialized from a VASP OUTCAR.

This comparison makes sense because the author of the MPF dataset says the values are raw outputs of VASP. You can create an ASE Atoms instance from energy, force, and stress like this:

from ase.atoms import Atoms
from ase.calculators.singlepoint import SinglePointCalculator

atom = Atoms(species, pos, cell=cell, pbc=True)                               
calc_results = {"energy": energy,
                "free_energy": energy,                                        
                "forces": force,                                              
                "stress": stress}
calculator = SinglePointCalculator(atom, **calc_results)
atom = calculator.get_atoms()

The MPF dataset is a special case, because it does not originate from MD software.

sevenn_graph_build can work with any ASE-readable data. Here's the relevant ASE documentation: https://wiki.fysik.dtu.dk/ase/ase/io/io.html

Write your ASE atoms object to 'extxyz' format. It can be directly passed to sevenn_graph_build, for instance:

sevenn_graph_build --format ase my_data.extxyz 5.0

The first positional argument is the file name, and the second is the cutoff radius of the model.

We're planning to write tutorials in pure Python! Until then, I think it is better to keep this issue open.

zzutao · 2024-08-01T09:05:05Z

Hello!
Thank you very much for your reply! Based on your guidance, I attempted to preprocess the MPF dataset. The program did not report any errors and built the sevenn_data file, but when I used that file for model training, the data looked a bit strange and the model seemed stuck in the first epoch. (After about an hour of training, it was on the second epoch.)
Because the values of 'structure', 'energy', 'force', 'stress', and 'id' are all lists of length 3, I was confused, so I iterate over each list to parse the entries and initialize Atoms. Another point of confusion is that atoms.get_potential_energy(force_consistent=True) requires a calculator to be set before it can be called.
Here is the main program I wrote to parse data:
`import ase.io
import numpy as np
from ase import Atoms
from ase.calculators.singlepoint import SinglePointCalculator

# merge_data is my own helper (not shown) that merges the two .p files into one dict
data = merge_data('block_0.p', 'block_1.p')

atoms_list = []

for material_id, snapshots in data.items():
    # print(type(snapshots),  snapshots.keys())
    # ['structure', 'energy', 'force', 'stress', 'id']
    snapshot_ids = snapshots['id']
    stresses = snapshots['stress']
    forces = snapshots['force']
    energies = snapshots['energy']
    structures = snapshots['structure']
    # print(snapshot_ids,len(snapshot_ids), stresses,len(stresses), forces,len(forces), energies,len(energies), len(structures))
    # Build one Atoms object per sampled structure
    for i in range(len(structures)):
        lattice_matrix = structures[i].lattice.matrix
        symbols = [site.specie.symbol for site in structures[i].sites]
        positions = structures[i].cart_coords
        atoms = Atoms(symbols=symbols, positions=positions, cell=lattice_matrix, pbc=True)
        # E = atoms.get_potential_energy(force_consistent=True)
        calc_results = {"energy": energies[i],
            # "free_energy": E,
            "forces": np.array(forces[i]),
            "stress": -0.1 * np.array(stresses[i])}
        calculator = SinglePointCalculator(atoms, **calc_results)
        atoms = calculator.get_atoms()
        atoms_list.append(atoms)

filename = "my_data.extxyz"
ase.io.write(filename, atoms_list, format='extxyz')`

When you have free time, please review and correct my code. Thank you again for your help.

Appendix: printed output

Number of atoms in the train_set:
my_data             : {'Na': 68246, 'Cd': 16692, 'Sn': 29901, 'S': 152447, 'Li': 190588, 'Sb': 33159, 'P': 171506, 'O': 1954453, \
                       'Ca': 42760, 'Ti': 42445, 'Al': 46200, 'F': 233412, 'In': 19901, 'Br': 35584, 'Yb': 8698, 'Ir': 6556, \
                       'Cl': 88268, 'I': 42909, 'La': 25199, 'Ru': 9133, 'Fe': 75727, 'Si': 91610, 'Zr': 15633, 'Nb': 23578, \
                       'Eu': 4339, 'Cs': 18481, 'V': 56153, 'Ge': 32183, 'Cr': 34671, 'Ni': 48364, 'Bi': 34529, 'Rb': 24631, \
                       'Au': 10197, 'Mg': 36581, 'B': 72706, 'Mn': 74439, 'Cu': 47056, 'Sr': 29143, 'Te': 39865, 'Ba': 41586, \
                       'Pu': 1789, 'Pb': 14927, 'Co': 45122, 'Hf': 8235, 'Rh': 10637, 'Y': 17820, 'Ta': 15483, 'W': 23788, \
                       'Se': 70931, 'Th': 3503, 'Pa': 656, 'Hg': 12878, 'Zn': 36152, 'Mo': 33264, 'Pt': 9696, 'Pr': 10852, \
                       'Sc': 9187, 'N': 98904, 'Np': 1018, 'C': 95831, 'Be': 7669, 'K': 49168, 'Gd': 6352, 'Ag': 20087, \
                       'Pd': 12941, 'Nd': 11779, 'H': 249976, 'Tl': 14381, 'Os': 4737, 'Tm': 5687, 'Dy': 7753, 'As': 28674, \
                       'Pm': 987, 'Lu': 6125, 'Ce': 11506, 'Sm': 9906, 'Er': 8448, 'Tb': 7724, 'Ga': 22489, 'U': 9162, 'Ho': 8075, \
                       'Xe': 1246, 'Re': 7530, 'Ac': 669, 'Tc': 1722, 'He': 48, 'Kr': 169, 'Ar': 9, 'Ne': 3}
Total, label wise   : {'Na': 68246, 'Cd': 16692, 'Sn': 29901, 'S': 152447, 'Li': 190588, 'Sb': 33159, 'P': 171506, 'O': 1954453, \
                       'Ca': 42760, 'Ti': 42445, 'Al': 46200, 'F': 233412, 'In': 19901, 'Br': 35584, 'Yb': 8698, 'Ir': 6556, \
                       'Cl': 88268, 'I': 42909, 'La': 25199, 'Ru': 9133, 'Fe': 75727, 'Si': 91610, 'Zr': 15633, 'Nb': 23578, \
                       'Eu': 4339, 'Cs': 18481, 'V': 56153, 'Ge': 32183, 'Cr': 34671, 'Ni': 48364, 'Bi': 34529, 'Rb': 24631, \
                       'Au': 10197, 'Mg': 36581, 'B': 72706, 'Mn': 74439, 'Cu': 47056, 'Sr': 29143, 'Te': 39865, 'Ba': 41586, \
                       'Pu': 1789, 'Pb': 14927, 'Co': 45122, 'Hf': 8235, 'Rh': 10637, 'Y': 17820, 'Ta': 15483, 'W': 23788, \
                       'Se': 70931, 'Th': 3503, 'Pa': 656, 'Hg': 12878, 'Zn': 36152, 'Mo': 33264, 'Pt': 9696, 'Pr': 10852, \
                       'Sc': 9187, 'N': 98904, 'Np': 1018, 'C': 95831, 'Be': 7669, 'K': 49168, 'Gd': 6352, 'Ag': 20087, \
                       'Pd': 12941, 'Nd': 11779, 'H': 249976, 'Tl': 14381, 'Os': 4737, 'Tm': 5687, 'Dy': 7753, 'As': 28674, \
                       'Pm': 987, 'Lu': 6125, 'Ce': 11506, 'Sm': 9906, 'Er': 8448, 'Tb': 7724, 'Ga': 22489, 'U': 9162, 'Ho': 8075, \
                       'Xe': 1246, 'Re': 7530, 'Ac': 669, 'Tc': 1722, 'He': 48, 'Kr': 169, 'Ar': 9, 'Ne': 3}
Total               : 5065224
------------------------------------------------------------------------------------------------------------------------
Per atom energy(eV/atom) distribution:
my_data             : {'mean': '-5.975', 'std': '1.863', 'median': '-6.192', 'max': '49.575', 'min': '-28.731'}
Total               : {'mean': '-5.975', 'std': '1.863', 'median': '-6.192', 'max': '49.575', 'min': '-28.731'}
------------------------------------------------------------------------------------------------------------------------
Force(eV/Angstrom) distribution:
my_data             : {'mean': '-0.000', 'std': '3.369', 'median': '0.000', 'max': '2552.991', 'min': '-2570.567'}
Total               : {'mean': '-0.000', 'std': '3.369', 'median': '0.000', 'max': '2552.991', 'min': '-2570.567'}
------------------------------------------------------------------------------------------------------------------------
Stress(eV/Angstrom^3) distribution:
my_data             : {'mean': '1.258', 'std': '30.459', 'median': '0.000', 'max': '5474.488', 'min': '-1397.567'}
Total               : {'mean': '1.258', 'std': '30.459', 'median': '0.000', 'max': '5474.488', 'min': '-1397.567'}
------------------------------------------------------------------------------------------------------------------------
training_set size   : {'my_data': 168919}
validation_set size : {'my_data': 18768}

Calculating statistic values from dataset
Average # of neighbors: 36.994069
Use global shift, scale
shift, scale        : -5.975250, 3.368628
(1st) conv_denominator is: 36.994069
Shuffle the train data

YutackPark · 2024-08-02T05:12:15Z

> Due to the fact that the values of 'structure', 'energy', 'force', 'stress', and 'id' are all lists of length 3, I am confused and will try to traverse the list to parse and initialize Atoms.

The MPF dataset samples three structures per relaxation trajectory. I recommend checking their paper for details.
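
As a sketch of how the per-trajectory lists can be flattened into one record per sampled structure, the snippet below uses plain Python on toy data. The field names come from this thread; `flatten_snapshots`, the `material_id` key, and the placeholder values are hypothetical, and real entries hold pymatgen structures rather than strings.

```python
FIELDS = ("structure", "energy", "force", "stress", "id")


def flatten_snapshots(data):
    """Yield one dict per sampled structure from a {material_id: fields} mapping."""
    for material_id, fields in data.items():
        # All field lists of one entry must line up index by index.
        lengths = {len(fields[name]) for name in FIELDS}
        if len(lengths) != 1:
            raise ValueError(f"{material_id}: field lists differ in length")
        for values in zip(*(fields[name] for name in FIELDS)):
            record = dict(zip(FIELDS, values))
            record["material_id"] = material_id
            yield record


# Toy entry with three snapshots (placeholder strings instead of structures).
toy = {
    "mp-toy": {
        "structure": ["s0", "s1", "s2"],
        "energy": [-1.0, -1.1, -1.2],
        "force": [[[0.0, 0.0, 0.0]]] * 3,
        "stress": [[[0.0] * 3] * 3] * 3,
        "id": ["mp-toy-0", "mp-toy-1", "mp-toy-2"],
    }
}
records = list(flatten_snapshots(toy))
```

The length check is there because a silent mismatch between, say, forces and structures would pair labels with the wrong geometry.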

> "stress": -0.1 * np.array(stresses[i])}

Instead of multiplying by -0.1, try this: "stress = -1 * stress / 1602.1766208 # to eV/Angstrom^3"

A standard ASE Atoms instance stores its stress in eV/Angstrom^3. As I mentioned, the best way to validate the script is to compare its result with the output of ase.io.read on any VASP OUTCAR with non-zero stress. Verify your preprocessing code before moving on! Alternatively, I think it is reasonable to open an issue or discussion in the M3GNet GitHub repository to ask for the script that converts their structures into ASE Atoms.
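
The conversion above can be sketched in plain Python as follows. It assumes the raw stress is in kBar with VASP's sign convention, which this thread suggests but which should be verified against an OUTCAR as described; `vasp_stress_to_ase` is a hypothetical helper name, and the example matrix is made up. The constant follows from 1 eV/Angstrom^3 = 160.21766208 GPa = 1602.1766208 kBar.

```python
KBAR_PER_EV_PER_A3 = 1602.1766208  # 1 eV/Angstrom^3 expressed in kBar


def vasp_stress_to_ase(stress_kbar):
    """Convert a 3x3 stress in kBar to eV/Angstrom^3 and flip its sign."""
    return [[-value / KBAR_PER_EV_PER_A3 for value in row] for row in stress_kbar]


# Made-up stress tensor in kBar, for illustration only.
example = [
    [1602.1766208, 0.0, 0.0],
    [0.0, -16.021766208, 0.0],
    [0.0, 0.0, 0.0],
]
converted = vasp_stress_to_ase(example)
# converted[0][0] is -1.0 and converted[1][1] is approximately 0.01 (eV/Angstrom^3)
```

Note that multiplying by -0.1 instead converts kBar to GPa with the sign flipped, which is off from eV/Angstrom^3 by a factor of about 160.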

The SevenNet log file looks good. Since the dataset has ~188K structures, the slow training speed you observed is expected. Sadly, pre-training SevenNet-0 is a computationally demanding task. I recommend using multi-GPU training. Cheers!

zzutao · 2024-08-02T06:59:09Z

Thank you very much for your explanation and suggestions!

thangckt · 2024-09-25T01:55:15Z

Hi @YutackPark,
could you provide a short guide on how to efficiently handle the large JSON MPTrj dataset, or on how to convert it to ASE Atoms?

Thank you so much.

YutackPark · 2024-09-25T02:26:03Z

> hi @YutackPark can you have a little guide on how to efficiently handle a large json MPTraj dataset? or how to convert it to ase atoms?
>
> Thank you so much.

Here's the code:
https://github.com/janosh/matbench-discovery/blob/main/models/sevennet/train_sevennet/convert_mptrj_to_xyz.py

Note that while the code splits the dataset into train, valid, and test sets, SevenNet-0 used all the data in MPTrj without splitting.

Handling a large dataset is another problem (#88). A preprocessed graph (.sevenn_data) might not fit into memory. There is an experimental feature I'm currently working on: https://github.com/MDIL-SNU/SevenNet/tree/ase_db
It uses ase db to dynamically load atoms from disk and build the graph.

YutackPark added the "documentation" (Improvements or additions to documentation) and "good first issue" (Good for newcomers) labels Jul 31, 2024

YutackPark mentioned this issue Sep 12, 2024

anyway to read energy, force and stress from extxyz? #87

Closed

thangckt mentioned this issue Sep 12, 2024

Add options to access the energy, force, stress in ase supported format #89

Merged

YutackPark mentioned this issue Oct 18, 2024

mismatch in tensor size #102

Closed
