Pipeline Interface for Prediction tools #204

christopher-mohr · 2023-07-24T06:29:26Z

christopher-mohr
Jul 24, 2023
Maintainer

This discussion is about a potential interface for prediction tools within the pipeline and the changes it would require, such as the associated modules.

jonasscheid · 2023-07-24T08:14:59Z

jonasscheid
Jul 24, 2023
Collaborator

Here is my draft of how the MHC-binding structure could look:
It should be a subworkflow that replaces the function make_predictions_from_peptides in epaa.py.
That subworkflow takes as input a TSV file containing a column sequence (and additional metadata). Depending on the tools specified by the user, one or more predictors can be used.

- A prepare module that extracts only the sequence information and checks the supported min/max peptide length and the supported alleles of each predictor by loading a predictor-specific config file where this information is stored. The HLA nomenclature is also adjusted to the predictor-specific nomenclature. For this we could simply use mhcgnomes, or implement a parsing function. The module then generates a specific input file for each predictor. This can be handled with a Python script in a very simple class structure. (container: either mhcgnomes or pandas)
- MHCnuggets cannot be called from the command line, so we need a Python script to execute it.
- MHCflurry can be called via CLI by first downloading the models and then running the prediction.
- NetMHC(II)pan models are loaded in a separate module (since it's quite complex) and can then be called via CLI in a second module.
- Syfpeithi needs to be executed via a Python script loading the respective epytope classes.
- Finally, the subworkflow ends with a module that harmonises the prediction outputs and reannotates the input metadata for each peptide. I would opt for a wide format here, where all peptide predictions of the given alleles are collapsed into one row (we could also provide an option for long format, or vice versa).
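The prepare step described above can be sketched roughly as follows. This is only an illustration using the standard library: the `PredictorConfig` class, the `prepare_inputs` function, the predictor names and the length/allele limits are all hypothetical placeholders, not the real limits of any tool. In the draft, these values would be read from the predictor-specific config files.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictorConfig:
    """Hypothetical per-predictor limits; real values would come from a config file."""
    name: str
    min_length: int
    max_length: int
    supported_alleles: frozenset


def prepare_inputs(tsv_text, alleles, configs):
    """Return {predictor name: (peptides, alleles)} restricted to what each predictor supports."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Keep only the sequence column; deduplicate while preserving input order.
    sequences = list(dict.fromkeys(row["sequence"] for row in reader))
    prepared = {}
    for cfg in configs:
        peptides = [s for s in sequences if cfg.min_length <= len(s) <= cfg.max_length]
        usable = [a for a in alleles if a in cfg.supported_alleles]
        prepared[cfg.name] = (peptides, usable)
    return prepared


# Illustrative values only, not the real limits of any prediction tool.
CONFIGS = [
    PredictorConfig("predictor_a", 8, 11, frozenset({"HLA-A*02:01", "HLA-B*07:02"})),
    PredictorConfig("predictor_b", 9, 9, frozenset({"HLA-A*02:01"})),
]

EXAMPLE_TSV = "sequence\tsource\nSIINFEKL\tsample1\nGILGFVFTL\tsample1\nGILGFVFTL\tsample2\n"

prepared = prepare_inputs(EXAMPLE_TSV, ["HLA-A*02:01", "HLA-B*07:02"], CONFIGS)
```

Each predictor module would then only receive peptides and alleles it can handle, and the dropped metadata columns would be joined back in by the final harmonising module.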

christopher-mohr · 2023-07-25T11:10:06Z

christopher-mohr
Jul 25, 2023
Maintainer Author

Thanks for sharing the draft. A couple of thoughts on the information you provided:

In my opinion, the most important part is the one on the prediction methods and we should have (nf-core) modules for them. Local modules are fine for me as a start.

- mhcnuggets: we can also execute the Python code directly in the module (no need for a separate script)
- mhcflurry: should be a relatively simple module
- NetMHCpan: rather complex, and hard to say anything with the level of detail provided. I would say the loading-models part is already there, or what do you mean by that in detail?

These prediction tool modules can, for example, also be developed completely independently of the other functionality.

I am still not sure about the added benefit of using

> mhcgnomes, or implement a parsing function

when we have allele-handling functionality provided by epytope. Maybe you can add some more information why this is necessary.

Regarding the output, I assume you mean the format as we have it now? I think we should aim for that to not have different output formats.

In general, I still see difficulties when everything is done in a "string/table"-based way, especially with respect to maintaining metadata. However, for peptides it's a smaller issue than for variants and proteins.

1 reply

jonasscheid Jul 25, 2023
Collaborator

> mhcnuggets: we can also execute the Python code directly in the module (no need for a separate script)

Agreed ✅

> NetMHCpan: rather complex, and hard to say anything with the level of detail provided. I would say the loading-models part is already there, or what do you mean by that in detail?

Yes, I think we can use the loading module as is 👍🏼 Would you also vote for a subworkflow for NetMHCpan, as @marissaDubbelaar suggested, that contains both loading the models and executing the predictor?

> when we have allele-handling functionality provided by epytope. Maybe you can add some more information why this is necessary.

If we add novel predictors (that might not be executable in epytope because of diverging dependencies), we would not need to add an interface in epytope and implement all the required prediction functionality, which we won't use anymore based on this draft. That would also be easier for newcomers.

> Regarding the output, I assume you mean the format as we have it now? I think we should aim for that to not have different output formats.

Yes exactly, I would argue against the current output. If you think of having multiple predictors in long-format (-> 1 peptide, multiple rows per predictor) and the peptide comes with a lot of metadata attached to it, you would annotate that metadata to multiple rows and accumulate a lot of duplication in your results. With wide-format you would not have this duplication.
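To make the duplication argument concrete, here is a small sketch that collapses long-format rows into one wide row per peptide. The column names (`ms_score`, `rank`), the predictor names, the made-up values and the `<predictor>_<allele>_<metric>` naming scheme are assumptions for illustration, not the pipeline's actual output format.

```python
# Long format: the peptide metadata (here `ms_score`) is repeated on every row.
# All values are made up for the example.
LONG_ROWS = [
    {"sequence": "GILGFVFTL", "ms_score": 0.98, "predictor": "predictor_a", "allele": "HLA-A*02:01", "rank": 0.1},
    {"sequence": "GILGFVFTL", "ms_score": 0.98, "predictor": "predictor_b", "allele": "HLA-A*02:01", "rank": 0.3},
    {"sequence": "SIINFEKL", "ms_score": 0.75, "predictor": "predictor_a", "allele": "HLA-A*02:01", "rank": 12.0},
    {"sequence": "SIINFEKL", "ms_score": 0.75, "predictor": "predictor_b", "allele": "HLA-A*02:01", "rank": 15.5},
]


def long_to_wide(rows, metadata=("ms_score",), value="rank"):
    """Collapse long-format rows into one dict per peptide sequence."""
    wide = {}
    for row in rows:
        if row["sequence"] not in wide:
            # Metadata is copied once per peptide instead of once per prediction.
            wide[row["sequence"]] = {"sequence": row["sequence"], **{m: row[m] for m in metadata}}
        column = f"{row['predictor']}_{row['allele']}_{value}"
        wide[row["sequence"]][column] = row[value]
    return list(wide.values())


wide_rows = long_to_wide(LONG_ROWS)
```

With more than 20 metadata columns per peptide, the long form repeats all of them once per predictor and allele, while the wide form stores them once and grows in columns instead.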

christopher-mohr · 2023-07-25T15:47:01Z

christopher-mohr
Jul 25, 2023
Maintainer Author

> Yes, I think we can use the loading module as is 👍🏼 Would you also vote for a subworkflow for NetMHCpan, as @marissaDubbelaar suggested, that contains both loading the models and executing the predictor?

Not sure if we need a subworkflow for what is more or less two module calls, but it's okay for me.

> If we add novel predictors (that might not be executable in epytope because of diverging dependencies), we would not need to add an interface in epytope and implement all the required prediction functionality, which we won't use anymore based on this draft. That would also be easier for newcomers.

For new methods (with new requirements) I see your point. My guess would be, though, that we do not have that much variety when it comes to the required allele notation.

> Yes exactly, I would argue against the current output. If you think of having multiple predictors in long-format (-> 1 peptide, multiple rows per predictor) and the peptide comes with a lot of metadata attached to it, you would annotate that metadata to multiple rows and accumulate a lot of duplication in your results. With wide-format you would not have this duplication.

Then I did not understand it correctly. :)
We had such a format a couple of years back but changed it at some point. I am not sure if the duplication is an issue. I would definitely prefer >> rows over >> columns, so I am not a fan of 3 * #alleles * #prediction methods columns. I would also say it's easier to filter for the predictions of specific tools, for example, which is not (easily) possible anymore with the other format without reformatting. What exactly would be your concern about the duplication? Is there any use case that is affected by it?

jonasscheid · 2023-07-26T05:55:59Z

jonasscheid
Jul 26, 2023
Collaborator

> For new methods (with new requirements) I see your point. My guess would be, though, that we do not have that much variety when it comes to the required allele notation.

Ok, so if you also think that is not an issue, we could try mhcgnomes for the draft? It is quite handy when it comes to different conversions for class 2 (I did not check mouse in detail, but it is also supported). It also comes with pandas in the container 🙌🏼
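For comparison, the "implement a parsing function" alternative could look roughly like the sketch below, for classical HLA class I alleles only. The function name `convert_allele`, the style names and the regular expression are assumptions made for illustration. Which notation each predictor expects would need to be checked against that tool's documentation, and class II or non-human alleles are the cases where a dedicated library such as mhcgnomes is more convenient.

```python
import re

# Accepts e.g. "HLA-A*02:01", "HLA-A02:01", "A*02:01" or "A0201" (class I only).
_CLASS_I = re.compile(r"^(?:HLA-)?(?P<gene>[ABCEG])\*?(?P<group>\d{2}):?(?P<protein>\d{2,3})$")

# Hypothetical output styles; map each predictor to one of them in its config.
STYLES = {
    "star_colon": "HLA-{gene}*{group}:{protein}",
    "colon": "HLA-{gene}{group}:{protein}",
    "compact": "{gene}{group}{protein}",
}


def convert_allele(name, style):
    """Normalise a class I allele name and render it in the requested style."""
    match = _CLASS_I.match(name.strip().upper())
    if match is None:
        raise ValueError(f"Unsupported allele name: {name!r}")
    return STYLES[style].format(**match.groupdict())


converted = [convert_allele(a, "colon") for a in ("HLA-A*02:01", "B0702")]
```

A hand-written parser like this stays small only as long as the set of supported notations is small, which is the trade-off being discussed here.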

> What exactly would be your concern about the duplication? Is there any use case that is affected by it?

An ad-hoc use case is our mass spec data, where each peptide is annotated with >20 columns of search scores and quality measurements. I think wide format can be a bit ugly, yes, but it is a bit easier to read because you don't need to consider x rows for x predictors to understand what is a binder across all predictors (imo). You could also easily compute summary columns (e.g. a consensus of all given predictors).
Btw: Isn't it "only" #alleles * #prediction methods 🤔?
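A consensus column of the kind mentioned above could be computed from a wide-format row as sketched below. The column names, the 2.0 rank cutoff and the majority rule are illustrative assumptions, not settings taken from the pipeline.

```python
def add_consensus(row, rank_columns, threshold=2.0):
    """Return a copy of `row` with summary columns over all predictors."""
    # A predictor "votes" binder when its rank is at or below the threshold.
    votes = [row[col] <= threshold for col in rank_columns if row.get(col) is not None]
    summary = dict(row)
    summary["n_predictions"] = len(votes)
    summary["n_binder_calls"] = sum(votes)
    # Majority rule: more than half of the available predictions call a binder.
    summary["consensus_binder"] = sum(votes) * 2 > len(votes)
    return summary


# Made-up example row with one rank column per (hypothetical) predictor.
RANK_COLUMNS = ["predictor_a_rank", "predictor_b_rank", "predictor_c_rank"]
example = {"sequence": "GILGFVFTL", "predictor_a_rank": 0.1, "predictor_b_rank": 0.3, "predictor_c_rank": 5.0}
result = add_consensus(example, RANK_COLUMNS)
```

In long format the same summary would require grouping rows by peptide first, which is the readability point being made.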

christopher-mohr · 2023-07-26T06:44:59Z

christopher-mohr
Jul 26, 2023
Maintainer Author

> Ok, so if you also think that is not an issue, we could try mhcgnomes for the draft? It is quite handy when it comes to different conversions for class 2 (I did not check mouse in detail, but it is also supported). It also comes with pandas in the container 🙌🏼

✅ But for me this is also an independent development unit that is not necessarily bound to the story of having the prediction tool interface within the pipeline. So this could, for example, be done in a separate task/PR in order to keep the number of changes and newly introduced features in a single PR lower.

> An ad-hoc use case is our mass spec data, where each peptide is annotated with >20 columns of search scores and quality measurements. I think wide format can be a bit ugly, yes, but it is a bit easier to read because you don't need to consider x rows for x predictors to understand what is a binder across all predictors (imo). You could also easily compute summary columns (e.g. a consensus of all given predictors). Btw: Isn't it "only" #alleles * #prediction methods 🤔?

I think we even changed it back (to the long format) based on feedback from people who were doing the peptide selection in Excel. My suggestion would be, as you also mentioned earlier, to have a switch that one can use to switch to wide format (--wide_output_format or something like that). It should be *3 because of the columns "affinity/score/binder". :)
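The column count and the proposed switch can be illustrated with a short sketch. The function `output_columns`, its `wide` argument (standing in for something like `--wide_output_format`) and the column naming are hypothetical; only the three metrics affinity/score/binder come from the discussion above.

```python
from itertools import product

METRICS = ("affinity", "score", "binder")


def output_columns(alleles, methods, wide=False):
    """Return the prediction column names for long or wide output."""
    if not wide:
        # Long format: one row per peptide/allele/method, fixed set of columns.
        return ["allele", "method", *METRICS]
    # Wide format: one column per allele/method/metric combination.
    return [
        f"{method}_{allele}_{metric}"
        for allele, method, metric in product(alleles, methods, METRICS)
    ]


alleles = ["HLA-A*02:01", "HLA-B*07:02"]
methods = ["predictor_a", "predictor_b"]
wide_cols = output_columns(alleles, methods, wide=True)
long_cols = output_columns(alleles, methods)
```

For two alleles and two methods this gives 3 * 2 * 2 = 12 prediction columns in wide format, against a fixed five columns in long format.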

jonasscheid · 2023-07-26T09:00:04Z

jonasscheid
Jul 26, 2023
Collaborator

> I think we even changed it back (to the long format) based on feedback from people who were doing the peptide selection in Excel. My suggestion would be, as you also mentioned earlier, to have a switch that one can use to switch to wide format (--wide_output_format or something like that).

✅ Maybe a quick survey in the nf-core Slack channel might help define the default one?

> It should be *3 because of the columns "affinity/score/binder". :)

Ah yes, sorry! Another cosmetic request/suggestion I would throw out here to reduce the number of (unnecessary) columns: Should we keep the boolean "binder" column? Also: Should we keep the affinities in the final output, given that everyone has been using the rank metric for quite some time? If not, we could still save the affinity scores in the module output of each predictor but leave them out of the final harmonised results. That might spare the user overwhelming detail, and if they are still interested in the affinities they could look them up in the "raw" prediction results.

christopher-mohr · 2023-07-26T16:08:36Z

christopher-mohr
Jul 26, 2023
Maintainer Author

> ✅ Maybe a quick survey in the nf-core Slack channel might help define the default one?

Sure, one vote per lab? 😄

> Ah yes, sorry! Another cosmetic request/suggestion I would throw out here to reduce the number of (unnecessary) columns: Should we keep the boolean "binder" column? Also: Should we keep the affinities in the final output, given that everyone has been using the rank metric for quite some time? If not, we could still save the affinity scores in the module output of each predictor but leave them out of the final harmonised results. That might spare the user overwhelming detail, and if they are still interested in the affinities they could look them up in the "raw" prediction results.

Hm, I currently don't see any problem with the binder column, just advantages if people look at the results via Excel etc. and want a quick solution for filtering. Isn't rank currently provided as score? In my opinion it still makes sense to keep both affinity + score/rank and present that information to the user. I see that it might not be needed in some use cases, but one does not have to use it, which is less effort than looking it up in a different file or joining tables.

4 replies

jonasscheid Jul 26, 2023
Collaborator

I see your point, but what are the use cases for 6 boolean columns? Maybe you want only one column comprising any(allele score passes the threshold?), but maybe not even that. In RNAseq / differentialabundance there is also no boolean column for DE genes; you just filter by the known threshold (padj < 0.05 etc.).

Also, when we talk about use cases: you would only define binders based on one metric (almost exclusively affinity rank/percentile), or what would be the benefit of using rank + affinity?

Just bringing this up because I think it makes sense to reduce the result size to a minimum to make it more readable and less confusing for the user.

christopher-mohr Jul 27, 2023
Maintainer Author

The use case is to be able to filter on a per-allele basis as well (i.e. only show pep1 if it is a binder for allele x). Not really comparable to RNAseq in any way, in my opinion. As mentioned in my earlier post, affinity could still be interesting for people (e.g. what is the affinity of peptides predicted as binders vs. the affinity of peptides predicted as non-binders, ...). Therefore, I would still provide this information. I guess it should be no big deal for you to remove these columns afterwards.

Besides, I don't really understand why this is such an important topic to you and it is not even slightly related to the story of having a "Pipeline Interface for Prediction Tools".

jonasscheid Jul 27, 2023
Collaborator

It's not important, I just want a clear definition of what should be implemented to rule out more points of discussion later on during implementation, since it was requested to display my suggestions in detail. But then let's keep the result structure as-is and start implementing 👍🏼

christopher-mohr Jul 27, 2023
Maintainer Author

It would be good to have issues for the individual parts (e.g. mhcnuggets module), that can then also be linked to the individual PRs.
