Mapping multi chain components #47

Merged
merged 23 commits into from
Nov 25, 2024

Conversation

RiesBen
Contributor

@RiesBen RiesBen commented Aug 28, 2024

This PR tries to solve the raised issue with multi chain components.
see #46

@RiesBen RiesBen linked an issue Aug 28, 2024 that may be closed by this pull request
@pep8speaks

pep8speaks commented Aug 28, 2024

Hello @RiesBen! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 857:80: E501 line too long (103 > 79 characters)
Line 859:80: E501 line too long (92 > 79 characters)
Line 860:80: E501 line too long (93 > 79 characters)
Line 863:80: E501 line too long (98 > 79 characters)
Line 877:80: E501 line too long (82 > 79 characters)
Line 883:80: E501 line too long (86 > 79 characters)
Line 884:80: E501 line too long (86 > 79 characters)
Line 917:80: E501 line too long (82 > 79 characters)
Line 919:80: E501 line too long (95 > 79 characters)
Line 927:80: E501 line too long (92 > 79 characters)
Line 928:80: E501 line too long (104 > 79 characters)
Line 929:80: E501 line too long (88 > 79 characters)
Line 932:80: E501 line too long (85 > 79 characters)
Line 937:80: E501 line too long (83 > 79 characters)

Line 116:80: E501 line too long (89 > 79 characters)
Line 123:80: E501 line too long (84 > 79 characters)
Line 125:80: E501 line too long (90 > 79 characters)

Line 285:80: E501 line too long (94 > 79 characters)
Line 290:80: E501 line too long (100 > 79 characters)
Line 300:80: E501 line too long (97 > 79 characters)
Line 303:80: E501 line too long (84 > 79 characters)
Line 307:80: E501 line too long (95 > 79 characters)
Line 308:1: W293 blank line contains whitespace
Line 315:80: E501 line too long (93 > 79 characters)
Line 327:80: E501 line too long (96 > 79 characters)
Line 335:80: E501 line too long (97 > 79 characters)

Comment last updated at 2024-09-26 02:53:50 UTC


codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.53%. Comparing base (8ebfea7) to head (00709d5).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   96.60%   97.53%   +0.93%     
==========================================
  Files          13       13              
  Lines         618      649      +31     
==========================================
+ Hits          597      633      +36     
+ Misses         21       16       -5     


RiesBen and others added 2 commits August 29, 2024 09:02
suggesting implementation for _split_protein_component_chains
@ijpulidos
Contributor

ijpulidos commented Sep 20, 2024

I've also realized that the way we are splitting by chains might not be doing what we want. I think we should take an approach similar to what gufe.ProteinComponent._from_openMMPDBFile does, but using the chain atoms instead of the topology atoms.

For example, the structure of TYK2 in PLB repo has two waters (6 atoms in total), and if you check this approach you have something like the following:

In [4]: tyk2_comp = ProteinComponent.from_pdb_file(f"{tyk2_basepath}/protein.pdb")

In [5]: tyk2_rdmol = tyk2_comp.to_rdkit()

In [6]: tyk2_rdmol.GetNumAtoms()
Out[6]: 4658

In [7]: mapper = KartografAtomMapper(atom_map_hydrogens=True)

In [8]: chains = mapper._split_protein_component_chains(tyk2_comp)

In [9]: chains
Out[9]: [ProteinComponent(name=0_A), ProteinComponent(name=1_A)]

In [10]: chains[1].to_rdkit().GetNumAtoms()
Out[10]: 4

In [11]: chains[0].to_rdkit().GetNumAtoms()
Out[11]: 4652

So there are some missing atoms in the waters when using this function to split the components by chain.
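
A quick way to see the size of the loss is to check that atom counts are conserved across the split. The sketch below is plain Python and uses only the numbers from the session above; `missing_atoms` is a hypothetical helper for illustration, not part of Kartograf or gufe.

```python
def missing_atoms(total_atoms: int, chain_atom_counts: list[int]) -> int:
    """Return how many atoms of the parent are absent from the split chains."""
    return total_atoms - sum(chain_atom_counts)


# Numbers taken from the IPython session above:
# the full component has 4658 atoms, the two chains have 4652 and 4.
lost = missing_atoms(4658, [4652, 4])
print(lost)  # 2
```

A check like this could serve as a cheap regression test for any chain-splitting routine, since a correct split should report zero missing atoms.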

@ijpulidos
Contributor

It seems that importlib.resources.files behaves differently in Python 3.9, which is why the tests are failing. I couldn't spot anything about this in the changelog for 3.10, though.

for mapping_obj in largest_mappings:
    start_a = int(mapping_obj.componentA.name.split("_")[-1])
    start_b = int(mapping_obj.componentB.name.split("_")[-1])
    shifted_map = {a_idx + start_a: b_idx + start_b for a_idx, b_idx in
Contributor

Just to note that we have a bit of a footgun here: when we shift the indices and update the dictionary, some of the indices can get overwritten, which probably means that something went wrong.

I don't know what's a good solution for this, but maybe we should think about having yet another class that handles this itself, maybe inheriting from dict and throwing an exception when a __setitem__ overwrites something that already exists. Just a guess at this.
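
A minimal sketch of the dict-subclass idea, assuming a hypothetical `NoOverwriteDict` class (the name and the choice of `ValueError` are illustrative, not something from this PR). Note that `dict.update` does not call an overridden `__setitem__`, so the sketch overrides `update` as well.

```python
class NoOverwriteDict(dict):
    """dict that refuses to silently replace an existing key (hypothetical helper)."""

    def __setitem__(self, key, value):
        if key in self:
            raise ValueError(f"index {key} is already mapped to {self[key]}")
        super().__setitem__(key, value)

    def update(self, *args, **kwargs):
        # dict.update bypasses __setitem__, so route every pair through it.
        for key, value in dict(*args, **kwargs).items():
            self[key] = value


merged = NoOverwriteDict()
merged.update({0: 0, 1: 2})  # mapping of a first chain, no offset
merged.update({5: 4, 6: 5})  # mapping of a second chain, already shifted
try:
    merged.update({1: 9})  # collides with an index from the first chain
except ValueError as err:
    print("collision detected:", err)
```

This only guards the keys (the A side of the mapping); duplicate values on the B side would still pass unnoticed.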

Member

Even at the expense of non-fancy code and extra cost, it might be good to check the indices and make sure that there are no duplicates. In its simplest form, just a loop where you create the two lists, turn them into sets, and see if the length changed?
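
The list-versus-set comparison could look roughly like this. It is a sketch only: `assert_unique_indices` is a hypothetical name, and it assumes the shifted index pairs are collected in a list before being merged into a dict, since a dict would already have dropped duplicate keys.

```python
def assert_unique_indices(mapping_pairs):
    """Raise if any A index or any B index occurs more than once."""
    a_indices = [a for a, _ in mapping_pairs]
    b_indices = [b for _, b in mapping_pairs]
    # A set drops repeats, so a shorter set means at least one duplicate.
    if len(set(a_indices)) != len(a_indices):
        raise ValueError("duplicate indices on the A side of the mapping")
    if len(set(b_indices)) != len(b_indices):
        raise ValueError("duplicate indices on the B side of the mapping")


good_pairs = [(0, 0), (1, 2), (5, 4)]
assert_unique_indices(good_pairs)  # passes silently

bad_pairs = [(0, 0), (1, 2), (1, 4)]  # A index 1 appears twice
try:
    assert_unique_indices(bad_pairs)
except ValueError as err:
    print(err)
```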

@ijpulidos ijpulidos changed the title [WIP] Mapping multi chain components Mapping multi chain components Sep 26, 2024
@ijpulidos ijpulidos mentioned this pull request Sep 26, 2024
Member

@IAlibay IAlibay left a comment

An initial review / discussion points.

atom_index = atom.GetIdx()
if not (atom_index in index_tuple):
    remove_indices.append(atom_index)
# Need to remove separately https://github.com/rdkit/rdkit/issues/1366
Member

Am I correct in understanding that this is because the atom ids get re-assigned on the fly?

From some pen and paper playing around, I think this should work in all cases. Are you reasonably confident of this too?

Contributor

Yes, that's my understanding. It is happening on the fly, so the iterator gets invalidated and the behavior is undefined.
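
The same index-shifting effect can be shown with a plain Python list, used here only as an analogy for an editable molecule (no RDKit involved; `remove_positions` is a hypothetical helper). Deleting the highest index first means the positions still to be deleted never move.

```python
def remove_positions(items, indices):
    """Delete the given positions from a copy of items, highest index first."""
    result = list(items)
    for idx in sorted(set(indices), reverse=True):
        # Everything below idx keeps its position, so later deletions stay valid.
        del result[idx]
    return result


atoms = ["N", "CA", "C", "O", "HOH"]
print(remove_positions(atoms, [1, 3]))  # ['N', 'C', 'HOH']
```

Deleting in ascending order instead would remove "CA" first, shift everything after it down by one, and then delete "HOH" rather than "O".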

for atom_idx in sorted(remove_indices, reverse=True):
    edit_rdmol_frag.RemoveAtom(atom_idx)
# Create component with the remaining molecule
frag_rdmol = edit_rdmol_frag.GetMol()
Member

Do we need to do anything about bond orders? I.e. do we know if removing the atoms also re-adjusts the bonds in the molecule?

Member

Maybe a test that checks the bond orders for the components being returned would be useful?

Contributor

That's a good check to do. Yes.

@IAlibay
Member

IAlibay commented Oct 22, 2024

@RiesBen - having @hannahbaumann @jthorton take over the review of this PR might be a good handover exercise. This seems like the type of thing that would expose folks to most of the Kartograf functionality.

Contributor

@jthorton jthorton left a comment

Looks great so far. The only blocking change would be to separate out the protein-protein specific logic into its own function; the other feedback should be considered optional.

Contributor

@ijpulidos ijpulidos left a comment

Great work! Really love the performance improvements and cleaner code. I added a few comments that I think we should address.

I just realized that we haven't really dealt with the case where a user tries to do a mapping with mixed components (such as a mapping between a ProteinComponent and a SmallMoleculeComponent). This could lead to a combinatorial explosion. Maybe we should just support mappings between components of the same type and give users a helpful error otherwise.

Member

@IAlibay IAlibay left a comment

Please tell me if I'm throwing a wrench into things a bit too much. I just wonder if we can streamline a lot of this.

@jthorton
Contributor

One last thing to check: do we want to enforce that both components are of the same type when we create the mapping? Users have probably made a mistake if they try to map an SMC to a PC, as they should not be mutating that many atoms.

@IAlibay
Member

IAlibay commented Nov 12, 2024

One last thing to check: do we want to enforce that both components are of the same type when we create the mapping? Users have probably made a mistake if they try to map an SMC to a PC, as they should not be mutating that many atoms.

Yeah I think it's reasonable to expect the two input molecules to be of the same type.
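
A minimal sketch of such a same-type guard, under stated assumptions: the two classes are hypothetical stand-ins rather than the real gufe components, and `require_same_type` is an illustrative name, not the function added in this PR.

```python
class FakeProtein:
    """Stand-in for a protein component (hypothetical, not the gufe class)."""


class FakeSmallMolecule:
    """Stand-in for a small molecule component (hypothetical)."""


def require_same_type(component_a, component_b):
    """Raise a helpful error when the two inputs are different component types."""
    if type(component_a) is not type(component_b):
        raise ValueError(
            f"Cannot map {type(component_a).__name__} onto "
            f"{type(component_b).__name__}: both components must be of the same type."
        )


require_same_type(FakeProtein(), FakeProtein())  # passes silently
try:
    require_same_type(FakeProtein(), FakeSmallMolecule())
except ValueError as err:
    print(err)
```

Comparing with `type(...) is not type(...)` rejects subclass pairs as well; an `isinstance`-based check would be the looser alternative if subclasses should be allowed to map onto their parents.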

Contributor

@ijpulidos ijpulidos left a comment

Great job! LGTM.

Member

@IAlibay IAlibay left a comment

Two questions / suggestions and then I think we're good to go.

raise ValueError(f"The components {A} and {B} were not of the same type, please check the inputs.")
# 1. identify Component Chains if present
component_a_chains = KartografAtomMapper._split_component_molecules(A)
component_b_chains = KartografAtomMapper._split_component_molecules(B)
Member

What happens when the length of these chains is not equal? i.e. should we guard against that case?

Contributor

Yes, probably, but this would also block the case where one component has more waters (or some other molecule) than the other, which should probably still work. Maybe it's simpler if we just ensure they are the same length for now?

Member

@IAlibay IAlibay Nov 18, 2024

My initial reaction is "in those cases I would expect those bits to be mapped as appearing/disappearing".

I recognise that more discussion is probably necessary. Should we maybe add the length check for now, open an issue to review it, and have a discussion at a protocol devs meeting to see what folks would like to get from this?

Contributor

The plan is to block having different numbers of components for now and we will come back to this in future.

Contributor

done in 00709d5
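
For readers following the thread, a length guard of the kind discussed might look like the sketch below. This is an assumption-laden illustration, not the code from that commit: `require_same_chain_count` and the error wording are hypothetical, and the chains are represented as plain lists.

```python
def require_same_chain_count(chains_a, chains_b):
    """Raise if the two components were split into different numbers of chains."""
    if len(chains_a) != len(chains_b):
        raise ValueError(
            f"Components split into {len(chains_a)} and {len(chains_b)} chains; "
            "mapping components with different numbers of chains is not supported."
        )


# Hypothetical example: one component carries an extra water.
chains_a = ["protein", "water", "water"]
chains_b = ["protein", "water"]
try:
    require_same_chain_count(chains_a, chains_b)
except ValueError as err:
    print(err)
```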

Member

@IAlibay IAlibay left a comment

Let's punt the "this could lead to clashes in keys/values" to another issue.

"mapping": largest_overlap_map
}
# At the end of the loop mapping_obj should have the largest map overlap
largest_mappings.append(mapping_obj)
Member

Technically this could allow you to have non-unique keys/values.

@jthorton jthorton self-requested a review November 25, 2024 11:12
@jthorton jthorton merged commit 4532c95 into main Nov 25, 2024
7 checks passed
@jthorton jthorton deleted the 46-mapping-multimer-protein-components branch November 25, 2024 11:12
Successfully merging this pull request may close these issues.

Mapping multimer protein components
5 participants