Ag-866/Ag-1227 Process SRM DE data #90
Conversation
…to AG-866_reprocess_srm_data
… data to gene_info, proteomics, and proteomics_distribution transforms
@jaclynbeck-sage We do actually use those ci values in the GCT proteomics circle overlay plot. If we can't get actual ci values for this data somehow, setting them to the l2fc value won't work.
Good catch. We don't have the raw data, but apparently the CI can be calculated from the p-value, so I'll do that! Sorry, I didn't look into that originally. I also just realized I did not add SRM to our proteomics tests, so I will add those as well.
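For anyone curious, back-calculating a CI from an effect size and its two-sided p-value is a standard trick (assuming an approximately normal test statistic): recover the standard error as `|l2fc| / z_p`, then build the interval. A minimal sketch, not the notebook's exact code, and the function name is hypothetical:

```python
from statistics import NormalDist

def ci_from_p(l2fc: float, p: float, conf: float = 0.95) -> tuple[float, float]:
    """Recover a confidence interval from a log2-fold-change and its
    two-sided p-value, assuming the test statistic is ~normal."""
    nd = NormalDist()
    z_p = nd.inv_cdf(1 - p / 2)              # z-score implied by the p-value
    se = abs(l2fc) / z_p                      # implied standard error
    z_conf = nd.inv_cdf(1 - (1 - conf) / 2)  # e.g. ~1.96 for a 95% CI
    return (l2fc - z_conf * se, l2fc + z_conf * se)

lo, hi = ci_from_p(l2fc=0.8, p=0.01)
```

Note this only works cleanly for a single two-group comparison; for an ANOVA across multiple groups the per-contrast standard errors aren't recoverable from one omnibus p-value, which is why the raw data had to be partially re-processed (see below).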
…tervals and add more documentation
…d instead added the mapping file to Synapse
@JessterB I am still working on getting SRM into the testing suite, but I've updated the SRM data to have confidence intervals. This turned out to be non-trivial for an ANOVA on multiple groups, and I ended up having to partially re-process the raw data. The log2-fold-change values and p-values are identical to the data in the original DE tables, but now there are confidence intervals too.

This brings up a thought: how much of this processing actually needs to happen in this pre-processing notebook, and how much should instead be in a separate repository or gist, like the LFQ and TMT processing? I don't know how much it matters that we're doing (partial) DE analysis inside a notebook in the ADT repository. The only part of this process that is specifically ADT pipeline-related and actually needs to be in a pre-processing notebook is the UniProt/Ensembl ID lookup (due to potential failure/hanging when making external API requests). The rest is just data rearrangement and math. I could see, for example, the UniProt -> Ensembl ID mapping being done in this notebook, a separate repo/gist re-processing the DE data, and then a transform combining the two. On the other hand, leaving it the way it currently works requires no extra work and no extra transforms.

For reference, a human-readable version of the notebook is here.
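To illustrate the "transform combines the two" option: the API-dependent artifact would just be a gene -> ID mapping, the gist would produce the DE statistics, and the transform reduces to a join. A toy sketch with made-up column names (the real transform would presumably use dataframes):

```python
# Hypothetical shape of the two inputs: one mapping artifact from the
# notebook (the only API-dependent step), one DE table from a gist/repo.
id_map = {"APOE": ("P02649", "ENSG00000130203")}  # gene -> (UniProt, Ensembl)
de_rows = [{"hgnc_symbol": "APOE", "log2_fc": 0.8, "ci_l": 0.19, "ci_r": 1.41}]

# The "combining" transform is then a simple keyed join.
combined = [
    {**row,
     "uniprotid": id_map[row["hgnc_symbol"]][0],
     "ensembl_gene_id": id_map[row["hgnc_symbol"]][1]}
    for row in de_rows
    if row["hgnc_symbol"] in id_map  # drop genes with no mapping
]
```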
@JessterB ready for re-review! SRM has been added to the integration testing, and confidence intervals are real values. As per our discussion yesterday, we'll be leaving all the data processing as-is in the notebook until/unless we need to restructure stuff.
lgtm!
This PR pre-processes the SRM DE data and adds it to the `gene_info` and `proteomics_distribution` transforms, and adds an entry for SRM data to be processed (with no custom transform needed) in the config, the same way LFQ and TMT data are processed.

For pre-processing: the notebook that does this pre-processing is here. The notebook combines two SRM proteomics DE files into one dataframe and reformats them to mimic the LFQ and TMT file formats. The input data only contains gene names, so UniProt and Ensembl IDs are queried for each gene and added. p-values in the data are corrected for multiple testing, then columns are renamed and re-ordered to match LFQ and TMT data. At that point, it was really simple to add this data to the transforms because it's identical in format to the other two proteomics files.
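On the multiple-testing step: the usual choice here is a Benjamini-Hochberg FDR adjustment (the notebook may use a different method; this is just an illustration). A self-contained sketch of the adjustment:

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment (pure-Python sketch).

    Adjusted p_i = p_i * n / rank_i, made monotone by walking from the
    largest p-value down and carrying the running minimum."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * n / rank)  # enforce monotonicity, cap at 1
        adjusted[i] = prev
    return adjusted

bh_adjust([0.01, 0.04, 0.03, 0.5])
```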
I confirmed that the new `proteomics_distribution_data.json` has a single JSON entry added for SRM data, and that the only differences in the new `gene_info.json` file are that several genes have flipped `is_any_protein_changed_in_ad_brain` and `protein_brain_change_studied` from `false` to `true` due to significance in the SRM data for that gene.