Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Schrodinger solvation shell extraction #23

Open
wants to merge 122 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 25 commits
Commits
Show all changes
122 commits
Select commit Hold shift + click to select a range
c431bea
automated solvation shell extraction workflow
Apr 22, 2024
9c75284
begin addressing Daniel comments
May 15, 2024
ef3e241
operate on pairwise distances, account for PBCs
May 15, 2024
f092257
switch to kabsch rmsd
May 15, 2024
14749ae
add pdb file example
May 15, 2024
4280670
add dependencies to setup
May 15, 2024
09031e2
rename from rmse to rmsd
May 15, 2024
70de4e1
random seed for rmsd set instead of always starting at 0, remove reor…
May 15, 2024
3868637
start adding support for multiatom solutes, not working yet
May 17, 2024
6b8e9a7
fix star import, add al_cl04 test file
May 17, 2024
88832da
update extraction script to be the non-working Al_Cl04 one
May 17, 2024
0af4830
correct charge assignment
May 20, 2024
006c1ee
initial biolip_extraction
levineds May 31, 2024
53f52f9
metal-organics bug
levineds May 31, 2024
869b81a
minor changes
levineds May 31, 2024
bdcb588
refactor make_gaps_gly
levineds Jun 4, 2024
3334e95
Merge remote-tracking branch 'origin/main'
levineds Jun 4, 2024
4a3fcc8
Add memory estimate function
levineds Jun 4, 2024
3e5ce54
docstrings
levineds Jun 4, 2024
5e5fff3
merge conflicts
levineds Jun 4, 2024
6d327fc
Refactor
levineds Jun 5, 2024
e3182e6
docstrings
levineds Jun 5, 2024
2401268
black
levineds Jun 5, 2024
5dcc01c
isort
levineds Jun 5, 2024
adc3ebe
missing docstring
levineds Jun 5, 2024
9abc9cf
Schrodinger is Python 3.8
levineds Jun 5, 2024
8536dd7
filter to minimal pockets
levineds Jun 6, 2024
a280702
make more efficient/pythonic
levineds Jun 6, 2024
7edbc07
black
levineds Jun 6, 2024
ea015e7
store biolip_df as pickle
levineds Jun 6, 2024
7b800d6
cleanup
levineds Jun 6, 2024
68fa23c
black
levineds Jun 6, 2024
d46a506
use FAIR cluster Schrodinger
levineds Jun 12, 2024
4c5603e
disable ligand size limit
levineds Jun 12, 2024
d852562
move error message
levineds Jun 14, 2024
ce26492
alternate multiprocessing
levineds Jun 14, 2024
923168c
correct cif outname error
levineds Jun 14, 2024
c460cb7
try multiple times to download if needed
levineds Jun 14, 2024
a26d9bb
formatting
levineds Jun 14, 2024
42536c3
change for smaller sampling
levineds Jun 14, 2024
bad07e7
add error checking around mutation
levineds Jun 14, 2024
2e72c49
protection for SMARTS evaluation
levineds Jun 15, 2024
901567b
BioLiP errors
levineds Jun 15, 2024
db18286
remove duplicate "BS"
levineds Jun 17, 2024
8c2e2e6
add timeout to architector structure generation
levineds Jun 18, 2024
cdfbb02
formatting
levineds Jun 18, 2024
f36e450
improve xyz generation
levineds Jun 18, 2024
46df040
add slightly more error checking
levineds Jun 20, 2024
301c199
more fixes
levineds Jun 25, 2024
b56de8d
function already exists of course
levineds Jun 26, 2024
30ed02c
trying to handle multi-atom solutes, committing before migrating to s…
Jun 27, 2024
1c7fd1a
start schrodinger workflow
Jul 1, 2024
d65860f
can extract shells, now need to loop over solutes, apply pbcs and add…
Jul 1, 2024
ed22d93
loop over solutes
Jul 1, 2024
82fff88
fix loop nesting
Jul 1, 2024
1ef6e1b
add partial charge assignment
Jul 1, 2024
4a55633
fix pbc wrapping index error
Jul 1, 2024
66b4afd
append central atom to shell if not already present, have fully runni…
Jul 2, 2024
ac25b01
cleanup files
Jul 2, 2024
46aab0d
ignore readme
Jul 2, 2024
af2245b
begin solvent-solvent stuff, not working
Jul 2, 2024
e0479be
revert to working solvent-solvent interactions, need to implement PBCs
Jul 2, 2024
9bc0a70
uncomment solvent part
Jul 2, 2024
e31bf22
switch to superimpose version of rmsd
Jul 2, 2024
3ee9aa0
make size checking caps hard
Jul 2, 2024
8e2a1e0
address some more pr comments - more accurate comments, restructuring…
Jul 2, 2024
b2b836d
option to skip solvent centered shells
Jul 2, 2024
cb1139c
remove equals from dir name
Jul 2, 2024
f7bd798
assign partial charges once outside loop
Jul 2, 2024
68dabb2
partial charge assignment once up front
Jul 2, 2024
ca95322
remove size filtering function
Jul 2, 2024
eb12dd8
more explicit version of expand shells function
Jul 2, 2024
f432765
fix central solute molecule bug
Jul 2, 2024
53b6101
simplify filter solute function
Jul 2, 2024
1af70a4
filter after batches are made
levineds Jul 3, 2024
4d9f53b
label ligands after coordinating groups
levineds Jul 3, 2024
aa7d024
turn off shell expansion if there are no solvents
Jul 4, 2024
34d2f21
lognormal sampling of max size
Jul 4, 2024
a51ea9b
option to subsample solutes/solvents per frame
Jul 4, 2024
23e320a
fix chain-labeling/angle adjustment bug
levineds Jul 5, 2024
c0f36e8
add cleanup script for pdb pockets
levineds Jul 5, 2024
b487187
cleanup cleanup.py
levineds Jul 5, 2024
b24eac1
add combine_pockets.py to merge certain pockets
levineds Jul 5, 2024
d413430
flake8
levineds Jul 5, 2024
39abfb0
remove correct directory
levineds Jul 5, 2024
66fcc21
updates
levineds Jul 8, 2024
c48f21e
Take PDBs first over CIFs
levineds Jul 8, 2024
3e65d58
speedups for are_conformers
levineds Jul 9, 2024
0e54a94
cleanup
levineds Jul 9, 2024
6720a55
Merge remote-tracking branch 'refs/remotes/origin/main'
levineds Jul 9, 2024
3966421
some logging, restrict to 100 frames for now
levineds Jul 10, 2024
c877a8b
basic docstrings
levineds Jul 10, 2024
2963480
refactor into separate functions, re-use code
levineds Jul 10, 2024
f225f37
refactor bug
levineds Jul 10, 2024
aff9ec2
make solute-free assertion a warning instead
Jul 10, 2024
9657238
add max frames argument
Jul 10, 2024
e0f2321
arg bug
Jul 10, 2024
dd8e52d
update cleanup printing
levineds Jul 15, 2024
7a05a75
delete old files after combining pockets
levineds Jul 15, 2024
23763fb
some more cleanup tools
levineds Jul 16, 2024
02ce480
there are nitriles...
levineds Jul 17, 2024
bd6aa7d
threshold was a little too tight
levineds Jul 18, 2024
44a56ea
also remove totally overlapping atoms (this may be addressed by chang…
levineds Jul 18, 2024
9bcc118
oops
levineds Jul 18, 2024
53e4901
deprotonate carboxylic acids always
levineds Jul 19, 2024
bef0a7e
add more cleanup steps that probably aren't needed
levineds Jul 19, 2024
d19e4a7
partition off nuclear option in cleanup.py
levineds Jul 19, 2024
60f627e
don't try and redownload if we already have it
levineds Jul 19, 2024
47d9295
oops
levineds Jul 19, 2024
514ea62
make sure COOH is deprotonated
levineds Jul 19, 2024
7b1d14a
add files for running MD with Desmond
levineds Jul 19, 2024
07b391d
use absolute path
levineds Jul 20, 2024
abf0607
ensure ligand chain is present
levineds Jul 22, 2024
6128728
Use chain to identify solvent vs. solute
levineds Jul 26, 2024
b3eaf6c
Merge remote-tracking branch 'refs/remotes/origin/main'
levineds Jul 31, 2024
65b9c0b
Sample OOD solvation shells for 10% of inputs
levineds Jul 31, 2024
2cc5793
merge main
levineds Aug 13, 2024
a477ada
Revert "merge main"
levineds Aug 13, 2024
abe6e40
Reapply "merge main"
levineds Aug 13, 2024
23bcacf
Revert "merge main"
levineds Aug 13, 2024
654a8a6
Reapply "merge main"
levineds Aug 13, 2024
a9b5124
refactor and add some comments
Aug 23, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
.pdb filter=lfs diff=lfs merge=lfs -text
*.pdb filter=lfs diff=lfs merge=lfs -text
18 changes: 18 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ checkpoints
results
logs
*.traj
*.pdb
experimental

# Byte-compiled / optimized / DLL files
Expand Down Expand Up @@ -111,3 +112,20 @@ Local

# VS Code
.vscode/

# testfiles
testfiles/
testfiles_old/
*mae

# Schrodinger stuff
maestro_package/

# Jupyter Notebook
.ipynb_checkpoints/
*ipynb

# Misc
notes.txt


1 change: 1 addition & 0 deletions electrolytes/.gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*pdb filter=lfs diff=lfs merge=lfs -text
Empty file removed electrolytes/README.md
Empty file.
8 changes: 8 additions & 0 deletions electrolytes/run_extraction.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
#!/bin/bash
#TODO: can we automatically extract the names of all the solute atoms from the PDB file so we don't have to re-run this command for each solute?


$SCHRODINGER/run python3 solvation_shell_extract.py --input_dir 'testfiles/1' \
--save_dir 'results' \
--system_name 'Li_BF4'

266 changes: 266 additions & 0 deletions electrolytes/solvation_shell_extract.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
from typing import List
import logging

logging.basicConfig(level=logging.INFO)

import random
import os
from tqdm import tqdm
import json
import argparse
import numpy as np

from schrodinger.structure import StructureReader
from schrodinger.structutils import analyze
from solvation_shell_utils import (
expand_shell,
renumber_molecules_to_match,
filter_by_rmsd,
filter_shells_with_solute_atoms,
)
from utils import validate_metadata_file
from schrodinger.comparison import are_conformers
from schrodinger.application.jaguar.utils import group_with_comparison
from schrodinger.structutils import rmsd


def extract_solvation_shells(
input_dir: str,
save_dir: str,
system_name: str,
solute_radii: List[float],
solvent_radii: List[float],
top_n: int,
):
"""
Given a MD trajectory in a PDB file, perform a solvation analysis
on the specified solute to extract the first solvation shell.

Args:
input_dir: Path to 1) the PDB file containing the MD trajectory (system_output.pdb) and 2) a metadata file (system_metadata.json)
save_dir: Directory in which to save extracted solvation shells.
system_name: Name of the system - used for naming the save directory.
solute_radii: List of shell radii to extract around solutes.
solvent_radii: List of shell radii to extract around solvents.
top_n: Number of snapshots to extract per topology.
"""

# Read a structure and metadata file
# TODO: add charges
logging.info("Reading structure and metadata files")

# Read metadata
with open(os.path.join(input_dir, "metadata_system.json")) as f:
metadata = json.load(f)

validate_metadata_file(metadata)

partial_charges = np.array(metadata["partial_charges"])

solutes = {
species: res
for res, species, type in zip(
metadata["residue"], metadata["species"], metadata["solute_or_solvent"]
)
if type == "solute"
} # {species: residue name} mapping
solute_resnames = list(solutes.values())

solvents = {
species: res
for res, species, type in zip(
metadata["residue"], metadata["species"], metadata["solute_or_solvent"]
)
if type == "solvent"
} # {species: residue name} mapping
jeevster marked this conversation as resolved.
Show resolved Hide resolved
solvent_resnames = list(solvents.values())
jeevster marked this conversation as resolved.
Show resolved Hide resolved

# Read a structure
structures = StructureReader(os.path.join(input_dir, "system_output.pdb"))

# For each solute: extract shells around the solute of some heuristic radii and bin by composition/graph hash
# Choose the N most diverse in each bin

for species, residue in solutes.items():
logging.info(f"Extracting solvation shells around {species}")
for radius in solute_radii:
logging.info(f"Radius = {radius} A")
expanded_shells = []
for i, st in tqdm(enumerate(structures)): # loop over timesteps
# assign partial charges to atoms
for at, charge in zip(st.atom, partial_charges):
at.partial_charge = charge
jeevster marked this conversation as resolved.
Show resolved Hide resolved

if i > 10: # TODO: fix this
break
jeevster marked this conversation as resolved.
Show resolved Hide resolved

# extract all solute molecules
solute_molecules = [
res for res in st.residue if res.pdbres.strip() == f"{residue}"
jeevster marked this conversation as resolved.
Show resolved Hide resolved
]
central_solute_nums = [mol.molecule_number for mol in solute_molecules]
# Extract solvation shells
shells = [
set(
analyze.evaluate_asl(
st, f"fillres within {radius} mol {mol_num}"
)
)
for mol_num in central_solute_nums
]

# Now expand the shells
for shell, central_solute in zip(shells, central_solute_nums):
expanded_shell = expand_shell(
st,
shell,
central_solute,
radius,
solute_resnames,
max_shell_size=200,
)
expanded_shells.append(expanded_shell)

# Now compare the expanded shells and group them by similarity
# we will get lists of lists of shells where each list of structures are conformers of each other
logging.info("Grouping solvation shells into conformers")
# TODO: speed this up
grouped_shells = group_with_comparison(expanded_shells, are_conformers)
jeevster marked this conversation as resolved.
Show resolved Hide resolved

# Now ensure that topologically related atoms are equivalently numbered (up to molecular symmetry)
grouped_shells = [
renumber_molecules_to_match(items) for items in grouped_shells
]

# Now extract the top N most diverse shells from each group
logging.info(f"Extracting top {top_n} most diverse shells from each group")
final_shells = []
# example grouping - set of structures
for shell_group in tqdm(grouped_shells):
filtered = filter_by_rmsd(shell_group, n=top_n)
final_shells.extend(filtered)

# Save the final shells
logging.info(f"Saving final shells")
save_path = os.path.join(save_dir, system_name, species, f"radius={radius}")
jeevster marked this conversation as resolved.
Show resolved Hide resolved
os.makedirs(save_path, exist_ok=True)
for i, st in enumerate(final_shells):
# TODO: seems like this is saving an extra line at the end of the xyz files
st.write(os.path.join(save_path, f"shell_{i}.xyz"))

# Now repeat for solvents to capture solvent-solvent interactions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since so much of this code is identical with the above loop, it would be better to make this a function and call it twice with different arguments.

for species, residue in solvents.items():
logging.info(f"Extracting shells around {species}")
for radius in solvent_radii:
logging.info(f"Radius = {radius} A")
filtered_shells = []
for i, st in tqdm(enumerate(structures)): # loop over timesteps
# assign partial charges to atoms
for at, charge in zip(st.atom, partial_charges):
at.partial_charge = charge
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this important for some reason?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that the final outputted files needed to include the partial charge information since it's needed for DFT, so I added it here. But it looks like it's not actually getting written to the output .xyz files properly (they still only include positions). Do you have a suggestion on how to do this? Maybe I need to write to PDB format to incorporate charge info?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final structures need the total charge and spin, not partial charges.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these partial charges actually floats or are they formal charge? If you add all the charges to at.formal_charge then at the end when you have extracted things, you can do st.formal_charge to get the sum of formal charges on all the atoms.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The partial charges are floats. But since we only need total charge for DFT, can we do the process you just described? I.e assign all partial charges, and get st.partial_charge at the end to get the sum across all atoms in a shell?


if i > 10: # TODO: fix this
break

# extract all solvent molecules
solvent_molecules = [
res for res in st.residue if res.pdbres.strip() == f"{residue}"
jeevster marked this conversation as resolved.
Show resolved Hide resolved
]
central_solvent_nums = [
mol.molecule_number for mol in solvent_molecules
]

# Extract solvation shells
shells = [
set(
analyze.evaluate_asl(
st, f"fillres within {radius} mol {mol_num}"
)
)
for mol_num in central_solvent_nums
]

# Only keep the shells that have no solute atoms
filtered_shells.extend(
filter_shells_with_solute_atoms(shells, st, solute_resnames)
)

# Choose a random subset of shells
assert len(filtered_shells) > 0, "No solute-free shells found for solvent"
random.shuffle(filtered_shells)
filtered_shells = filtered_shells[:1000]

# Now compare the expanded shells and group them by similarity
# we will get lists of lists of shells where each list of structures are conformers of each other
logging.info("Grouping solvation shells into conformers")
# TODO: speed this up
grouped_shells = group_with_comparison(filtered_shells, are_conformers)

# Now ensure that topologically related atoms are equivalently numbered (up to molecular symmetry)
grouped_shells = [
renumber_molecules_to_match(items) for items in grouped_shells
]

# Now extract the top N most diverse shells from each group
logging.info(f"Extracting top {top_n} most diverse shells from each group")
final_shells = []
# example grouping - set of structures
for shell_group in tqdm(grouped_shells):
filtered = filter_by_rmsd(shell_group, n=top_n)
final_shells.extend(filtered)

# Save the final shells
logging.info(f"Saving final shells")
save_path = os.path.join(save_dir, system_name, species, f"radius={radius}")
os.makedirs(save_path, exist_ok=True)
for i, st in enumerate(final_shells):
# TODO: seems like this is saving an extra line at the end of the xyz files
st.write(os.path.join(save_path, f"shell_{i}.xyz"))


if __name__ == "__main__":
logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))
random.seed(10)
jeevster marked this conversation as resolved.
Show resolved Hide resolved

parser = argparse.ArgumentParser()
parser.add_argument(
"--input_dir",
type=str,
help="Path containing PDB trajectory and LAMMPS data files",
)
parser.add_argument("--save_dir", type=str, help="Path to save xyz files")
parser.add_argument(
"--system_name", type=str, help="Name of system used for directory naming"
)

parser.add_argument(
"--solute_radii",
type=list,
default=[3],
help="List of shell radii to extract around solutes",
)

parser.add_argument(
"--solvent_radii",
type=list,
default=[3],
help="List of shell radii to extract around solvents",
)

parser.add_argument(
"--top_n",
type=int,
default=20,
help="Number of most diverse shells to extract per topology",
)

args = parser.parse_args()

extract_solvation_shells(
args.input_dir,
args.save_dir,
args.system_name,
args.solute_radii,
args.solvent_radii,
args.top_n,
)
Loading