-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mapping multi chain components #47
Conversation
Hello @RiesBen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-09-26 02:53:50 UTC |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #47 +/- ##
==========================================
+ Coverage 96.60% 97.53% +0.93%
==========================================
Files 13 13
Lines 618 649 +31
==========================================
+ Hits 597 633 +36
+ Misses 21 16 -5 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
suggesting implementation for _split_protein_component_chains
I've also realized that the way we are splitting by chains might not doing what we want, I think we would like to go with a similar approach to what the For example, the structure of TYK2 in PLB repo has two waters (6 atoms in total), and if you check this approach you have something like the following: In [4]: tyk2_comp = ProteinComponent.from_pdb_file(f"{tyk2_basepath}/protein.pdb")
In [5]: tyk2_rdmol = tyk2_comp.to_rdkit()
In [6]: tyk2_rdmol.GetNumAtoms()
Out[6]: 4658
In [7]: mapper = KartografAtomMapper(atom_map_hydrogens=True)
In [8]: chains = mapper._split_protein_component_chains(tyk2_comp)
In [9]: chains
Out[9]: [ProteinComponent(name=0_A), ProteinComponent(name=1_A)]
In [10]: chains[1].to_rdkit().GetNumAtoms()
Out[10]: 4
In [11]: chains[0].to_rdkit().GetNumAtoms()
Out[11]: 4652 So there are some missing atoms in the waters when using this function to split the components by chain. |
It seems that there is a different behavior for |
src/kartograf/atom_mapper.py
Outdated
for mapping_obj in largest_mappings: | ||
start_a = int(mapping_obj.componentA.name.split("_")[-1]) | ||
start_b = int(mapping_obj.componentB.name.split("_")[-1]) | ||
shifted_map = {a_idx + start_a: b_idx + start_b for a_idx, b_idx in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to note here that we have a little bit of a footgun here, when we shift the indices and we update the dictionary it is possible that some of the indices get overwritten and that means that probably something went wrong.
I don't know what's a good solution for this, but maybe we should think about having yet another class that handles this itself, maybe inheriting from dict
and throwing an exception when a __setitem__
overwrites something that already exists. Just a guess at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even at the expense of non-fancy code & extra cost, it might be good to do a check on the indices and make sure that there's no dupllicates. In its simplest form, just a loop where you create the two lists, turn them into sets and see if the length changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An initial review / discussion points.
src/kartograf/atom_mapper.py
Outdated
atom_index = atom.GetIdx() | ||
if not (atom_index in index_tuple): | ||
remove_indices.append(atom_index) | ||
# Need to remove separately https://github.com/rdkit/rdkit/issues/1366 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct in understanding that this is because the atom ids get re-assigned on the fly?
From some pen and paper playing around, I think this is should work in all cases - are you reasonably confident of this too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's my understanding. It is happening on the fly, so the iterator gets invalidated and the behavior is undefined.
src/kartograf/atom_mapper.py
Outdated
for atom_idx in sorted(remove_indices, reverse=True): | ||
edit_rdmol_frag.RemoveAtom(atom_idx) | ||
# Create component with the remaining molecule | ||
frag_rdmol = edit_rdmol_frag.GetMol() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do anything about bond orders? I.e. do we know if removing the atoms also re-adjusts the bonds in the molecule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a test that checks the bond orders fror the components being returned would be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good check to do. Yes.
src/kartograf/atom_mapper.py
Outdated
for mapping_obj in largest_mappings: | ||
start_a = int(mapping_obj.componentA.name.split("_")[-1]) | ||
start_b = int(mapping_obj.componentB.name.split("_")[-1]) | ||
shifted_map = {a_idx + start_a: b_idx + start_b for a_idx, b_idx in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even at the expense of non-fancy code & extra cost, it might be good to do a check on the indices and make sure that there's no dupllicates. In its simplest form, just a loop where you create the two lists, turn them into sets and see if the length changed?
@RiesBen - having @hannahbaumann @jthorton take over the review of this PR might be a good handover exercise. This seems like the type of thing that would expose folks to most of the Kartograf functionality. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great so far, the only blocking change would be to separate out the protein protein specific logic into its own function the other feedback should be considered optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Really love the performance improvements and cleaner code. I added a few comments that I think we should address.
I just realized that we haven't really dealt with the case where an user tries to do a mapping with mixed components (such as a mapping between a ProteinComponent
and a SmallMoleculeComponent
). This has the potential to face combinatorial explosion. Maybe we should just support mappings between the same types of components and give the users a helpful error otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please tell me if I'm throwing a wrench into things a bit too much. I just wonder if we can streamline a lot of this.
One last thing to check is whether we want to enforce that both components are of the same type when we try to create the mapping as users have probably made a mistake if they want to map a SMC to a PC as they should not by mutating that many atoms? |
Yeah I think it's reasonable to expect the two input molecules to be of the same type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions / suggestions and then I think we're good to go.
raise ValueError(f"The components {A} and {B} were not of the same type, please check the inputs.") | ||
# 1. identify Component Chains if present | ||
component_a_chains = KartografAtomMapper._split_component_molecules(A) | ||
component_b_chains = KartografAtomMapper._split_component_molecules(B) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when the length of these chains is not equal? i.e. should we guard against that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes probably but this would also stop the case when one has more waters (or some other molecule) than the other which should probably still work, but maybe its simpler if we just ensure they are the same length for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial reaction is "in those case I would expect those bits to be mapped as appearing/disappearing".
I recognise that more discussion is probably necessary - should we maybe add the length check for now, open up an issue to review it, and have a discussing at a protocol devs meeting to see what folks would like to get from this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is to block having different numbers of components for now and we will come back to this in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 00709d5
# Conflicts: # src/kartograf/tests/conftest.py # src/kartograf/tests/test_atom_mapper.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's punt the "this could lead to clashes in keys/values" to another issue.
"mapping": largest_overlap_map | ||
} | ||
# At the end of the loop mapping_obj should have the largest map overlap | ||
largest_mappings.append(mapping_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this could allow you to have non-unique keys/values.
This PR tries to solve the raised issue with multi chain components.
see #46