Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detailed molecular basis set specification (output of a calculation, not input) #12

Open
tovrstra opened this issue Aug 18, 2017 · 38 comments

Comments

@tovrstra
Copy link
Contributor

tovrstra commented Aug 18, 2017

Rough idea

The current schema does not yet have standardized method for specifying a Gaussian (or Slater) AO basis that was used in a calculation for a given molecule. Just to be clear: I'm looking for a way to fully specify a basis set in the output of a calculation, after the calculation has been carried out. This is different from specifying a basis set in the input for a calculation, because in the output you want to know how to interpret quantities expressed in an AO basis. For that you need more information, namely:

  • Order of the basis functions within one shell.
  • Normalization and sign conventions for the basis functions.
  • The center for each shell of basis functions.

Because basis functions are not necessarily centered on nuclei, an array with centers must be included as well.

I think we can mostly JSON-ize the molden format (see [GTO], [STO] and [MO] sections of http://www.cmbi.ru.nl/molden/molden_format.html) plus some improvements:

  • Make the ordering of the basis functions within one shell (and pure versus Cartesian) explicit as follows:

    {"conventions": {
      "d": ["xx", "yy", "zz", "xy", "xz", "yz"],
      "f": ["f0c", "f1c", "f1s", "f2c", "f2s", "-f3s", "-f3c"]
    }}

    Cartesian functions always match the x*y*z* regular expression. For pure functions, the string consists of four parts: sign (optional), angular momentum (letter code), magnetic quantum number (absolute value), s or c for sine- or cosine-like. The optional sign is needed because some codes (e.g. ORCA) have unusual sign conventions, different from most other QC codes. The line for the f-functions in the example is how ORCA orders basis functions (and flips signs) in AO arrays in a Molden file. (This is different from what the Molden program actually expects.) To fix the meaning of the strings for the pure functions completely, we should write out the mathematical form in the JSON schema.

  • Allow very high angular momenta, exhausting the whole alphabet (except j, see Wikipedia, more authoritative reference always welcome): ["s", "p", "d", "f", "g", "h", "i", "k", "l", "m", "n", "o", "q", "r", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "e"].

  • Assume contractions are normalized already, i.e. do not assume that the program reading the contraction coefficients will fix the normalization for you. Instead, the program writiing out the contractions should take care of that. (The Molden format assumes contractions need to be normalized upon reading, which is not suitable for all cases.)

  • Support (very) generalized contractions, i.e. also compatible with basis sets from CP2K. See https://github.com/cp2k/cp2k/blob/master/cp2k/data/BASIS_MOLOPT (It does not get more general than that.) For every shell we should have a list of angular momenta, e.g. ["s", "p"], ["s", "s"] or ["s", "s", "s", "p", "p", "d"]. The last example is something you could encounter in a molopt basis set.

  • Keep a list of centers for the shells, which can be different from atomic positions, e.g. when using ghost atoms or doing other funny things.

  • Include pseudopotential specification? (I'm not an expert.)

  • Details for STO basis functions need to be worked out. (I'm not an expert on that either.)

Example for NH3

{"orbital_basis": {
  "type": "gto",
  "conventions": {
    "p": ["x", "y", "z"],
    "d": ["d0c", "d1c", "d1s", "d2c", "d2s"],
  },
  "shells": [
    [0, 8, ["s", "s"],
     [0.9046E+4, 0.1357E+4, 0.3093E+3, 0.8773E+2, 0.2856E+2, 0.1021E+2, 0.3838E+1, 0.7466],
     [0.6996174134E-3, 0.538605463E-2, 0.2739102119E-1, 0.103150592, 0.2785706633, 0.4482948495, 0.2780859284, 0.1543156123E-1],
     [-0.304990096E-3, -0.2408026379E-2, -0.1194444873E-1, -0.489259929E-1, -0.1344727247, -0.3151125777, -0.2428578325, 0.1094382207E+1],
    ],
    [0, 1, ["s"], [0.2248], [1.0]],
    [0, 1, ["s"], [0.6124E-1], [1.0]],
    [0, 3, ["p"],
     [0.1355E+2, 0.2917E+1, 0.7973],
     [0.5890567677E-1, 0.3204611067, 0.7530420618]],
    [0, 1, ["p"], [0.2185], [1.0]],
    [0, 1, ["p"], [0.5611E-1], [1.0]],
    [0, 1, ["d"], [0.817], [1.0]],
    [0, 1, ["d"], [0.23], [1.0]],
    [1, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [1, 1, ["s"], [0.122], [1.0]],
    [1, 1, ["s"], [0.2974E-1], [1.0]],
    [1, 1, ["p"], [0.727], [1.0]],
    [1, 1, ["p"], [0.141], [1.0]],
    [2, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [2, 1, ["s"], [0.122], [1.0]],
    [2, 1, ["s"], [0.2974E-1], [1.0]],
    [2, 1, ["p"], [0.727], [1.0]],
    [2, 1, ["p"], [0.141], [1.0]],
    [3, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [3, 1, ["s"], [0.122], [1.0]],
    [3, 1, ["s"], [0.2974E-1], [1.0]],
    [3, 1, ["p"], [0.727], [1.0]],
    [3, 1, ["p"], [0.141], [1.0]],
  ],
  "centers": [
    [-0.0140883131, 0.0845903925, 0.1037711513]
    [1.4952113836, 0.0214187375, 0.0445603623]
    [-0.5919457779, -1.6621666211, 0.5350312215]
    [-0.7075176488, 0.4654193413, -2.0214243076]
  ]
}}

Details

The shells field is a list, where each item represents one generalized contraction stored as a list with the following items:

  • center index (counting from zero?)
  • the number of primitives
  • the angular momenta for the generalized contraction
  • the Gaussian exponents
  • one or more lists of contraction coefficients. (More are present in case of generalized contractions.)

To do

  • Units
  • Support different normalization conventions. Furthermore, Cartesian functions have different normalization constants within one shell, which is causing some ambiguity.
@tovrstra
Copy link
Contributor Author

I've updated the text above. The changes were mostly related to pure functions.

@berquist
Copy link
Contributor

Allow very high angular momenta: s, p, d, f, g, h, i, j, k, l, m, n, o. (Not sure what should be done after o?)

From Wikipedia:

The letters after the f sub-shell just follow letter f in alphabetical order except the letter j and those already used.

so o, q, r, t, u, v, w, x, y, z, ...

@tovrstra
Copy link
Contributor Author

@berquist Thanks! I'll fix this (and other remarks later) in the top post.

@dgasmith
Copy link
Collaborator

@twindus @wadejong Do you guys have a link to the EMSL BSE XML spec?

@tovrstra
Copy link
Contributor Author

I think I found it: https://bse.pnl.gov/bse/docs/schemas/gaussian.xsd I'll take a look. I think this paper explains the xml spec to some extent: http://pubs.acs.org/doi/pdf/10.1021/ci600510j Are there other references?

There is also https://bse.pnl.gov/bse/docs/schemas/ecp.xsd for effective core potentials.

@gkc1000
Copy link

gkc1000 commented Aug 18, 2017

Core potentials as well?

@tovrstra
Copy link
Contributor Author

@gkc1000 Core potentials are not as urgent because they are not needed to visualize orbitals or to compute densities. They could be useful for more specialized post-processing, e.g. follow-up calculations with other levels of theory

@dgasmith
Copy link
Collaborator

The EMSL BSE XML has core potentials AFAIK. If its not in the document its in another.

@wadejong
Copy link
Collaborator

wadejong commented Aug 18, 2017 via email

@tovrstra
Copy link
Contributor Author

@wadejong : are these discussion done on an online forum, so we can follow up on them?

@wadejong
Copy link
Collaborator

wadejong commented Aug 18, 2017 via email

@tovrstra
Copy link
Contributor Author

Thanks for the info. I looked at the EMSL schema for basis sets. These are my main thoughts. I hope it can be of use:

  • the EMSL scheme It serves a different purpose, namely to describe a basis set for a list of elements in the periodic table. It is not meant to describe molecular basis sets. Still there is a lot of overlap, so we could have some useful exchange of ideas.

  • The support for generalized contractions is rather limited in the EMSL XML scheme. The very general contractions as in CP2K are apparently not supported.

  • The EMSL scheme does not say anything about ordering of basis functions within one shell because it is not relevant in that context.

  • The EMSL scheme does not enforce any kind of normalization conventions for the contractions. (We ran into examples of unnormalized contractions at some point: Horton integrals are normalized-weirdly.  theochem/horton#39.) We really need this to correctly interpret orbitals and other data. Enforcing normalization conventions is not very important for the EMSL scheme.

As it is, the EMSL scheme is not a one-stop solution here for two reasons: some features are missing and it is not meant to be used for describing molecular basis sets.

@wadejong
Copy link
Collaborator

wadejong commented Aug 18, 2017 via email

@tovrstra
Copy link
Contributor Author

@wadejong Could you post the Gaussian basis JSON format please? No rush. Thanks!

@wadejong
Copy link
Collaborator

wadejong commented Aug 22, 2017 via email

@wadejong
Copy link
Collaborator

wadejong commented Aug 23, 2017 via email

@tovrstra
Copy link
Contributor Author

@wadejong Thanks!

It seems that these JSON files have a purpose comparable to the EMSL XML scheme, i.e. describing basis functions for atoms. This is typically the input one would give to a QC code, which is somewhat different from the problem I'm trying to address, how to represent the molecular basis set in the output of a QC calculation. (See comment above on XML scheme.)

After going through the https://github.com/MolSSI/QC_JSON_Schema/commit/ce3230167f4cc1276f1ff1128ff9c17b9759cdea, I wondered if it is preferable to work with dictionaries to represent the shells? I'm the idea above, I've just used a list in which the order of the elements fixes their meaning. That is not as self-explaining but it makes the JSON more compact.

@tovrstra tovrstra changed the title Detailed molecular basis set specification Detailed molecular basis set specification (output of a calculation, not input) Aug 23, 2017
@FarnazH
Copy link

FarnazH commented Apr 7, 2019

@tovrstra I mihgt be missing sth, but didn't we decide to represent shells as a list of namedtuple('Shell', ['icenter', 'iatom', 'angmoms', 'exponents', 'contractions'])?

@cryos
Copy link
Collaborator

cryos commented Apr 8, 2019

I have been working on this in Chemical JSON, and use a number of related arrays to arrange the data in linear arrays following the large array convention we decided to adopt. We have a basisSet object, and that has arrays for coefficients, primitivesPerShell, shellToAtomMap, shellTypes. There is an orbitals object too that has energies, moCoefficients (for alpha and beta optionally), occupations, and then we have an object for vibrations too. The example file has a perhaps too large molecule but a working implementation.

We have been able to use this along with coordinate sets and multiple MO coefficient arrays to animate a series of calculations where multiple calculations were performed on a system as it bonded.

@tovrstra
Copy link
Contributor Author

tovrstra commented Apr 8, 2019

@cryos Thanks for the information! I have a few small questions:

Is there a detailed description of this JSON format, e.g. also including normalization conventions of primitives and contracted basis functions?

Would you have a smaller example too? (E.g. for testing purposes.)

Which implementations can already write Chemical JSON or read it for visualization and post-processing?

How does Chemical JSON handle generalized contractions?

@dgasmith
Copy link
Collaborator

Coming back to this, we added a basis input specification from @bennybp which is a mimic of the data structure used in the new Basis Set Exchange. As a note this has been added here. This covers all use cases extant in the BSE and should provide good coverage of what is out there (@bennybp may have a few items to add).

I believe the only item that does not exist is ordering. For ordering I am leaning a bit towards simply providing ordering arrays as per CCLib. I believe the only other viable option would be the CCA standard?

@twindus
Copy link

twindus commented Apr 23, 2019

Either of these options seems reasonable. There was a lot of support for just doing CCA at the recent ACS symposium - anecdotal, but....

@tovrstra
Copy link
Contributor Author

@dgasmith Could you post a link to the CCA standard? Thanks.

@dgasmith
Copy link
Collaborator

The CCA paper can be found here. Same ordering as FCHK/Psi4 here. I think there may be some oddities in FCHK higher order cartesian IIRC.

@tovrstra
Copy link
Contributor Author

Thanks. I'll come back to it later.

@dgasmith
Copy link
Collaborator

At the very least we could recommend the CCA ordering even if we support others. It would make things simpler in many cases.

@tovrstra
Copy link
Contributor Author

I would not mind fixing the ordering of basis functions and related conventions in QCSchema, as long as it is all clearly defined. Something like "the same as in program ABC" is not good enough because most other programs never explicitly define their conventions. Here are some examples of potential sources of confusion from the codes mentioned above (not meant as criticism, just to illustrate):

  • The CCA paper only seems to mention the order of the Cartesian primitives, not pure ones, unless I missed something.

  • The FCHK format uses a different ordering than what PSI4 uses internally (for both pure and Cartesian functions), so I'm not sure what FCHK/PSI4 ordering means.

  • Libint can be configured to work with different orderings.

  • ...

Can we include in the QCSchema documentation or definition the following information?

  • Standard ordering of Cartesian and pure basis functions within one shell, with equations to rule out any ambiguities. (For example, ORCA uses rather odd sign conventions for pure functions for m>=3. Such aspects should be fixed by convention.)

  • Standard normalization of primitives, with equations.

  • How to handle unnormalized contractions? (As far as I remember, not all basis sets on EMSL have normalized contractions. Most programs renormalize them when setting up the basis set.)

@dgasmith
Copy link
Collaborator

dgasmith commented May 7, 2019

@twindus Do you have the CCA order for spherical basis sets? I think its 0, -/+ 1, ...

On that particular topic there is the fun question of what to do with p orbitals as the spherical transform does not effect the orbitals, just the order. Most programs do not bother with the transform.

I am leaning a bit back towards saying that the AO quantities must be in CCA order. It puts slightly more onus on the generation of the schema data, but makes everything downstream much easier to handle. We can supply translation routines in QCElemental/QCEngine that can handle this.


  • We can generate pure/cartesian orderings no problem. @tovrstra, can you expand a bit on the ORCA sign convention issue?
  • Do you have the normalization equations sitting around? I poked through libint's docs, but didn't find anything.
  • Does unnormalized contractions hurt us? The program should write out the basis and orbitals in whatever normalization was used internally and if they do/do not convert it should be encoded I believe.

@tovrstra
Copy link
Contributor Author

tovrstra commented May 7, 2019

Regarding ORCA, it could be specifically an issue in their routines to write out Molden files. When we read in Molden files with apparently incorrectly normalized orbitals, we run a series of checks to detect common deviations from the original file format. Orca has a sign change in pure AO functions from m>=3 on, which could (or not) reflect their internal conventions. Our settings for reading ORCA Molden files can be found here: https://github.com/theochem/iodata/blob/master/iodata/formats/molden.py#L393

@dgasmith
Copy link
Collaborator

dgasmith commented Jun 4, 2019

From discussion with @wadejong and @cryos this week at Berkeley I think we are leaning towards having purely the CCA standard within the schema. This puts some additional effort upfront, but ultimately makes things far simpler.

Looking at the CCA standard (https://onlinelibrary.wiley.com/doi/epdf/10.1002/jcc.20815) the spherical ordering is L, ..., 0, ..., -L. However, the libint documentation appears to describe the CCA ordering as -L, ..., 0, ..., L.

@evaleev @twindus Is there a misunderstanding here?

@evaleev
Copy link

evaleev commented Jun 4, 2019

the CCA specification of the solid harmonics ordering in our paper is incomplete and thus I cannot conclude whether what libint implemented is “correct” or “not”, e.g. it does not specify what m_l means.

according to this libint implements what the de-facto CCA standard defining code (MPQC2) defined: https://github.com/ValeevGroup/mpqc/blame/19ad332791d5966c98420d926364d68aeec859ae/src/lib/chemistry/qc/basis/transform.cc#L101

so I think this is what the de-facto CCA order is, since the specification in the paper is (unfortunately) incomplete

@tovrstra
Copy link
Contributor Author

tovrstra commented Jun 5, 2019

@dgasmith @evaleev Thanks for sharing the info.

I agree, even when stating the ordering is m_l = -L ... L or m_l = L ... -L, it remains unclear what this actually means. For real (not complex) solid harmonics, the meaning of the sign of m_l is rather arbitrary. The only way to make it unambiguous is to just include the equations of the real solid spherical harmonics in the schema documentation. While doing so, one could also include equations for Cartesian ones and the normalization conventions, just to nail it down.

@dgasmith
Copy link
Collaborator

dgasmith commented Jun 5, 2019

@evaleev Can we borrow these definitions from your living wikipedia page? @tovrstra could you work on writing up the format if this is the case?

@evaleev
Copy link

evaleev commented Jun 5, 2019 via email

@tovrstra
Copy link
Contributor Author

tovrstra commented Jun 6, 2019

I wouldn't mind and attribution is obviously not a problem. I'm not sure what is meant by "living wikipedia page"? Could someone post a link? Thanks.

@dgasmith
Copy link
Collaborator

dgasmith commented Jun 6, 2019

That appears to be a autocorrection issue "libint wikipedia page", as seen here.

@evaleev We have had a minimal version 1 out which carries energy-based quantities and some properties-- we took some time to explore how that works in practice. You should be seeing a larger push from MolSSI to get more quantities into the schema.

@loriab
Copy link
Collaborator

loriab commented Oct 3, 2019

@devinamatthews contributes a nice snippet for mapping basis function indices from Psi4 to CFOUR (C1 only).

    bas = wfn.basisset()

    with open("GENBAS", "w") as genbas:
        genbas.write(bas.genbas())

    maxl = bas.max_am()
    map = []
    lmap = [[0],
            [1,2,0],
            [0,4,1,3,2],
            [1,2,0,5,4,6,3],
            [0,4,1,7,6,3,8,5,2],
            [1,2,3,5,8,10,7,6,0,9,4],
            [11,4,9,7,10,3,12,5,8,0,6,2,1]]

    for i in range(mol.natom()):
        for l in range(maxl+1):
            for li in range(2*l+1):
                for j in range(bas.nshell()):
                    s = bas.shell(j)
                    if bas.shell_to_center(j) == i and s.am == l:
                        k = bas.shell_to_basis_function(j)
                        if k > 6:
                            m = l - li//2
                            k += 2*m + li%2
                        else:
                            k += lmap[l][li]
                        map.append(k)

@devinamatthews
Copy link

Thanks @loriab. While this is an example of a one-off mapping (and it was hard enough to figure out even in C1 with no reorientation), it seems like using the MOLDEN format or a JSON format as discussed here could make this much more portable.

Basically what I mean is that if someone were to write a tool that took two MOLDEN/JSON descriptions and generated the transformation matrix from one to the other, then reading in orbitals from another program would be as easy as reversing the MOLDEN-writer code and doing a matrix multiply. Such a tool would have to handle:

  1. Rotation and possible inversion between the molecular frames.
  2. Reordering of the atoms, contracted functions, and primitives.
  3. Reuse of primitives between contracted functions. For example, CFOUR's cc-pVDZ basis for O has 9 primitives with non-zero coefficients for the 1s and 2s functions, where the 9th primitive is shared with the polarization function. Other programs only use the first 8 primitives for the 1s and 2s functions. You get the same energy but it's a different basis.
  4. (optional maybe) Transformation between Cartesians and spherical harmonics

Maybe this is as easy as just evaluating a bunch of overlap integrals?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests