Ag-866/Ag-1227 Process SRM DE data #90

Merged
merged 11 commits into from
Nov 3, 2023

Contributor

@jaclynbeck-sage jaclynbeck-sage commented Sep 30, 2023

This PR pre-processes the SRM DE data, adds it to the gene_info and proteomics_distribution transforms, and adds a config entry so that SRM data is processed the same way as LFQ and TMT data (no custom transform needed).

For pre-processing: the notebook that does this pre-processing is here. It combines two SRM proteomics DE files into one dataframe and reformats them to mimic the LFQ and TMT file formats. The input data contains only gene names, so UniProt and Ensembl IDs are queried for each gene and added. The p-values are corrected for multiple testing, then columns are renamed and reordered to match the LFQ and TMT data. At that point, adding this data to the transforms was simple because it is identical in format to the other two proteomics files.
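
As a rough illustration of the correction and reformatting steps described above, here is a minimal standard-library sketch. It assumes a Benjamini-Hochberg correction (the comment does not say which method the notebook uses), and the column names in the usage note are hypothetical rather than taken from the LFQ/TMT files.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used here only as a common example of a
    multiple-testing correction; the notebook may use another method.
    """
    n = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def rename_and_reorder(row, column_map, output_order):
    """Rename the keys of one row, then emit them in the target column order."""
    renamed = {column_map.get(name, name): value for name, value in row.items()}
    return {name: renamed[name] for name in output_order}
```

For example, `rename_and_reorder({"PValue": 0.01, "Gene": "GENE_A"}, {"Gene": "hgnc_symbol", "PValue": "pval"}, ["hgnc_symbol", "pval"])` returns the row with the hypothetical target names in the requested order.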

I confirmed that the new proteomics_distribution_data.json has a single JSON entry added for SRM data, and that the only differences in the new gene_info.json file are that several genes have flipped is_any_protein_changed_in_ad_brain and protein_brain_change_studied from false to true due to significance in the SRM data for that gene.
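
A check like the one described above could be sketched as follows. This is not the comparison that was actually run: it assumes both gene_info.json versions have been loaded (for example with `json.load`) into lists of dicts, and the key name `ensembl_gene_id` is an assumption.

```python
def changed_fields(old_records, new_records, key="ensembl_gene_id"):
    """Map each gene key to {field: (old_value, new_value)} for differing fields.

    Genes that appear only in `new_records` are skipped; `key` is a
    hypothetical identifier field name.
    """
    old_by_key = {record[key]: record for record in old_records}
    changes = {}
    for record in new_records:
        old = old_by_key.get(record[key])
        if old is None:
            continue  # gene not present in the old file
        diff = {
            field: (old.get(field), value)
            for field, value in record.items()
            if old.get(field) != value
        }
        if diff:
            changes[record[key]] = diff
    return changes
```

If the only fields that ever appear in the result are is_any_protein_changed_in_ad_brain and protein_brain_change_studied, each going from false to true, the output matches the description above.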

@jaclynbeck-sage jaclynbeck-sage marked this pull request as ready for review September 30, 2023 02:09
@JessterB
Contributor

JessterB commented Oct 6, 2023

@jaclynbeck-sage We do actually use those ci values in the GCT proteomics circle overlay plot. If we can't get actual ci values for this data somehow, setting them to the l2fc value won't work.

  1. Can you think of a way to generate or otherwise acquire ci values for this data?
  2. If not, is setting the ci values to 0 an option?

@jaclynbeck-sage
Contributor Author

Good catch. We don't have the raw data but apparently the CI can be calculated from the p-value so I'll do that! Sorry, I didn't look into that originally.

Also I just realized I did not add SRM to our proteomics tests so I will add those as well.
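
For context, the textbook way to back a confidence interval out of a p-value is sketched below. It assumes a two-sided p-value from an approximately normal test statistic, so it is only an approximation and not necessarily what the notebook does; as noted later in this thread, the multi-group ANOVA case turned out to need partial re-processing of the raw data. The function name is hypothetical.

```python
from statistics import NormalDist


def ci_from_p_value(estimate, p_value, level=0.95):
    """Approximate a confidence interval for `estimate` from a two-sided p-value.

    Assumption: the test statistic is approximately standard normal,
    so |estimate| / standard_error equals the z-score implied by p_value.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")
    std_normal = NormalDist()
    # z-score that would produce this two-sided p-value.
    z_observed = std_normal.inv_cdf(1.0 - p_value / 2.0)
    standard_error = abs(estimate) / z_observed
    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_critical = std_normal.inv_cdf(0.5 + level / 2.0)
    margin = z_critical * standard_error
    return estimate - margin, estimate + margin
```

A quick sanity check on the logic: when `p_value` equals `1 - level`, one end of the interval lands on zero, which is what a result exactly at the significance threshold should give.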

@jaclynbeck-sage
Contributor Author

@JessterB I am still working on getting SRM into the testing suite, but I've updated the SRM data to have confidence intervals. This turned out to be non-trivial for an ANOVA on multiple groups, and I ended up having to partially re-process the raw data. The log2-fold-change values and p-values are identical to the data in the original DE tables, but now there are confidence intervals too.

This brings up a thought -- how much of this processing actually needs to happen in this pre-processing notebook and how much should instead be in a separate repository or gist, like the LFQ and TMT processing? I don't know how much it matters that we're doing (partial) DE analysis inside a notebook in the ADT repository.

The only part of this process that is specifically ADT pipeline-related that actually needs to be in a preprocessing notebook is the UniProt/Ensembl ID lookup (due to potential failure/hanging when making external API requests). The rest is just data rearrangement and math.
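
To illustrate why the lookup is the fragile step, here is a minimal sketch of wrapping an external ID query so that failures are retried and recorded instead of stopping the run. Everything here is hypothetical: `fetch` stands in for whatever function queries UniProt or Ensembl, and it is assumed to raise (for example on its own request timeout) rather than hang indefinitely.

```python
import time


def lookup_with_retry(fetch, gene, retries=3, backoff_seconds=1.0, sleep=time.sleep):
    """Call fetch(gene), retrying on any exception; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(gene)
        except Exception:
            # Wait longer after each failure, but not after the last attempt.
            if attempt < retries - 1:
                sleep(backoff_seconds * 2 ** attempt)
    return None


def lookup_all(fetch, genes, **kwargs):
    """Return (results, failed) so failed genes can be re-queried later."""
    results, failed = {}, []
    for gene in genes:
        ids = lookup_with_retry(fetch, gene, **kwargs)
        if ids is None:
            failed.append(gene)
        else:
            results[gene] = ids
    return results, failed
```

Keeping the list of failed genes separate means the rest of the processing can proceed from cached results while only the failed lookups are repeated.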

I could see, for example, the UniProt -> Ensembl ID mapping being done in this notebook, and a separate repo/gist re-processes the DE data, and then a transform combines the two. On the other hand, leaving it the way it currently works requires no extra work and no extra transforms.

For reference, a human-readable version of the notebook is here.

@jaclynbeck-sage
Contributor Author

@JessterB ready for re-review! SRM has been added to the integration testing, and confidence intervals are real values.

As per our discussion yesterday, we'll be leaving all the data processing as-is in the notebook until/unless we need to restructure stuff.

Contributor

@JessterB JessterB left a comment

lgtm!

@JessterB JessterB merged commit c7d64fa into dev Nov 3, 2023
7 checks passed
@JessterB JessterB deleted the AG-866_reprocess_srm_data branch November 3, 2023 21:00
2 participants