-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add new Ion-mobility peak picking algorithm #647
base: devel
Are you sure you want to change the base?
Conversation
Add a new function for ion-mobility peak picking, compatible with `MsExperiment`'s `Spectra` only (`TimsTofBackend`), with a corresponding CentWaveParamIM class: - Create `do_findChromPeaks_IM_centWave`, the API for the peakpicking itself - Create a `IMParam`class that inherits from `Param` and that will serve as superclass for all ion-mobility peak picking Params. - Create a `CentWaveParamIM` that inherits from both `CentWaveParam` and `IMParam` - Add a function dispatch point in .mse_find_chrom_peaks_sample - Create documentation for both API and param class
For better readability and indexing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Roger! looks good, but in order to give better advise I need to better understand what the other algorithms you are thinking to implement do.
For just this algorithm I would suggest to put most of the processing directly into .mse_find_chrom_peaks_chunk
, i.e. put the code do create the matrix
p
there into a function that takes also param
as input. For IM param classes it could then do the collapsing by frame and return directly the p
that can be used for the original do_...
. Then, have another function that takes the result from the chrom peak detection and post-processes that. That function could also take param
as input, and, if no IM param is provided simply returns the peak matrix. Otherwise, it performs this additional peak detection that you have in your do_findChromPeaks*
for IM data.
Happy to discuss :)
R/do_findChromPeaks-functions.R
Outdated
|
||
|
||
## Merging frames into scans and Summarize across IM dimension | ||
message("Collapsing data over IM dimension... ", appendLF = F) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a comment: I would avoid using these message
calls - I try to reduce them as I found them pretty annoying if you have many files. Instead, we're now using a progressbar for the complete processing. that looks somewhat cleaner.
#' @author Roger Gine, Johannes Rainer | ||
#' | ||
#' @importFrom Spectra peaksData rtime combineSpectra mz | ||
do_findChromPeaks_IM_centWave <- function(spec, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there other ways/algorithms to perform peak detection on IM data that do not first collapse the data and then expand it again like you do here?
If not I would suggest to split the functionality into 3 functions:
- collapse
peaksData
by frame - do peak detection using the original detection algorithm.
- have a function that takes the original
peaksData
and the detected chrom peaks matrix from 2) as input and post processes the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, yes, there are other ways to perform peak-picking that directly use the "full" data without collapsing (for instance, you could do a 2D-CWT where you change the scales in both RT and IM dimensions and find local maxima; or any peakpicking algorithm such as those used for GCxGC-MS, where you have a similar situation). I haven't looked deeply into such methods, but we should accomodate for them too, just in case
Still, splitting the functionality in do_findChromPeaks_IM_centWave
seems good, since those functions would be reusable for other algorithms, etc. Specifically, if you agree, I'll do the following:
- Call
.mse_find_chrom_peaks_sample
with the IM-collapsedSpectra
object and the paramas(param, CentWaveParam")
(since itIMCentWaveParam
inherits from it) -> That function will take care of extracting the peaksData, rt, valsPerSpect, etc., rundo_findChromPeaks_centWave
, and return the peak matrix. - Encapsulate the post processing (ie. resolving across IM dimension) in another function
Sounds good? 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for my late reply!
splitting functionality is always good, as you say, enables reuse - and makes the code easier to read. so, yes, it sounds good.
so, if I understand:
- if
param
inheritsIMCentWaveParam
collapse thepeaksData
by frame. - pass the data to
do_findChromPeaks_centWave
for peak detection (sinceIMCentWaveParam
inheritsCentWaveParam
) - if
param
inheritsIMCentWaveParam
post process the detected peaks (resolve across IM dimension) and return results.
if the functions get to large, you could also consider implementing a .im_mse_find_chrom_peaks_sample
that is called instead of .mse_find_chrom_peaks_sample
if param
inherits from an IM
param object... not sure if that would simplify integration of additional/other IM peak detection methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I've encapsulated all this into do_findChromPeaks_IM_centWave
, so it's called after dispatching the function corresponding to the param
:
## Merging all frames from the same scans to summarize across IM dimension
scans_summarized <-
Spectra::combineSpectra(
spec,
f = as.factor(spec$frameId),
intensityFun = base::sum,
weighted = TRUE,
ppm = ppmMerging
)
Spectra::centroided(scans_summarized) <- TRUE
## 1D Peak-picking on summarized data
peaks <- .mse_find_chrom_peaks_sample(scans_summarized,
msLevel = 1L,
param = CentWaveParam(ppm = ppm, peakwidth = peakwidth,
snthresh = snthresh, prefilter = prefilter,
mzCenterFun = mzCenterFun, integrate = integrate,
mzdiff = mzdiff, fitgauss = fitgauss, noise = noise,
verboseColumns = verboseColumns, roiList = roiList,
firstBaselineCheck = firstBaselineCheck,
roiScales = roiScales,
extendLengthMSW = extendLengthMSW))
## 1D Peak-picking, for each individual peak, to resolve across the IM dimension
.do_resolve_IM_peaks_CWT(spec, peaks, binWidthIM)
It's basically the all steps you mentioned (collapse, CentWave and resolve), but called from a "lower" function call level, so everything upstream is more tidy. What I like about this is that all Centwave-specific checks and data extraction (mz, int, valsPerSpect, etc.) are handled by .mse_find_chrom_peaks_sample
, so we are reusing already-existing code.
I'll commit the proposed refactor so you can take a closer look by yourself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent. Let me know when you're OK/ready from your side. I would then like to try the code in action and tinker myself a bit to see if/where we could improve/optimize.
For that I would create yet another branch to play with the code and ask for your feedback on the merge.
Related to that: could you please provide a short code snipped with the example how to perform the analysis (I guess I got already a file from you on which I can test...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely! If you agree, perhaps we should use (for testing) the MsBackendMemory
or MsBackendDataFrame
to create a manageable toy example from the current read-only MsBackendTimsTof
(for instance, subsetting the RT)
I'm still figuring out how to properly centroid the data (the format has some problems that makes the current peakPicks
function in Spectra
ineffective, see below), but still, the analysis would go somewhat like this:
library(opentimsr)
library(MsBackendTimsTof)
library(MsExperiment)
library(Spectra)
library(xcms)
library(magrittr)
## Setting up Bruker SDK to read the raw data directly (if you have it already, just point to the corresponding file)
so_folder <- tempdir()
so_file <- download_bruker_proprietary_code(so_folder, method = "wget")
setup_bruker_so(so_file)
## Set up valid MsExperiment with the subsetted spectra, then detect IM peaks
be <- backendInitialize(MsBackendTimsTof(), "./path_to_your_file_folder.d")
spec <- Spectra(be) %>%
filterRt(., c(350, 365)) %>%
filterMsLevel(.. 1) %>%
setBackend(., MsBackendMemory())
exp <- MsExperiment()
spectra(exp) <- spec
sampleData(exp) <- DataFrame(
raw_file = normalizePath("./path_to_your_file_folder.d")
)
exp <- linkSampleData(exp, with = "sampleData.raw_file = spectra.dataOrigin")
exp <- findChromPeaks(exp, IMCentWaveParam())
You can use the TIMS-TOF data file I sent you a while back, just bear in mind it's in a zero-less profile mode (working on fixing that) and the chromatographic peaks are usually very short (<6-10s)
R/do_findChromPeaks-functions.R
Outdated
.extract_mobilogram <- function(pdata, peak, rt, im, binWidthIM = 0.01){ | ||
rtr <- c(peak["rtmin"], peak["rtmax"]) | ||
mzr <- c(peak["mzmin"], peak["mzmax"]) | ||
keep <- dplyr::between(rt, rtr[1], rtr[2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using MsCoreUtils::between
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also tried it, but the performance was much better for dplyr::between
, around 2 times faster
Still, since rt is a sorted vector (by default I think, Spectra wouldn't allow it otherwise), using base::findInterval
would be even faster. Here's a representative benchmark:
library(microbenchmark)
rt <- rep(1:2000, each = 1000) ## Big vector, like the ones you'd have for IM scans
rtint <- c(230, 260)
microbenchmark(MsCoreUtils::between(rt, rtint),
dplyr::between(rt, rtint[1], rtint[2]),
findInterval(rtint, rt), times = 1000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow - indeed. note that rtime
does not necessarily need to be ordered - Spectra
is relatively relaxed there (while mz
values indeed are required to be sorted). I'll open an issue in MsCoreUtils
- no need to have a between
function there if there is a more powerful in dplyr
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I got it the other way around with the rtime
and mz
, I thought rtime
also had to be ordered: then, could we just order the spectra by rtime
first so we can use the faster function findInterval
?
Something like this, at some point:
spec <- spec[order(rtime(spec))]
I like the fact that the MsCoreUtils::between
uses a numeric(2)
as input and sorts it before using it (it's neat and avoids silly mistakes). I'll comment that on the corresponding issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it would be best to start using the MsCoreUtils::between
again - Sebastian did some improvement in a C implementation of the code. So it should now be comparable (and eventually even faster) than dplyr::between
.
Regarding findInterval
and ordering the rtime
- you're right, rts should be ordered, but I would like to avoid checking for the order or enforcing an order... better to have the code as independent of possible from potential pre-requisites (that we would then always need to check if they are true within the function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good! I'll revert back to MsCoreUtils::between
then 👍
next | ||
} | ||
bounds <- .split_mobilogram(mobilogram) | ||
new_peaks <- data.frame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using data.frame
here might have an impact on performance. I think matrix
might be faster, but that's something we can check later.
Hi Johannes, thanks for the code review!
For now: I'll make the amendments we discussed above, profile the algorithm performance, and create some unit tests for all new functions |
Perfect! Thanks @RogerGinBer for the updates! And yes, we need to keep track of the IM dimension data for the identified peaks - that information could either go directly (as a new column) into the |
Good! I think I'll keep the IM ranges in the |
- Remove `message` calls from within the function - Split functionality in 3 parts: 1. collapse/merge IM scans, 2. call CentWave on collapsed data, 3. post-process peak matrix to resolve IM peaks. - Add termination `CWT` to `.do_resolve_IM_peaks` and `split_mobilogram` to differenciate them if more methods are added
Something else: could you please already start to add unit tests? it's usually better to implement them right away for each function you write. |
Add the dispatch point to `.mse_find_chrom_peaks_chunk` Until now the IM peakpicking was only available through .mse_find_chrom_peaks_sample
Yep, you're right, I completely forgot about that! 🙏 Also, I found out that the raw IM data I was reading is in profile mode but does not have any flanking zeroes, so the |
Thanks for the update - regarding the test/toy data set: I think a |
Since we use `combineSpectra` to merge mobility scans into a frame-level scan, the reported mz value *shrinks* to the weigthed average, causing problems downstream when trying to calculate the mobilograms. Therefore, we are expanding by `ppmMerging` the peak mz range after CentWave peak detection Also, add checks for number of rows of output
Add a compact IM data file for running tests Add tests for `do_findChromPeaks_IM_centWave`, internals `.extract_mobilogram` and `.split_mobilogram_CWT`, as well as for the parameter class `IMCentWaveParam`
Hi there! I've added a very compact IM dataset that contains a mz-rt region where an intense peak and its M1 isotope can be found. I've added unit tests for I partly solved the problem I had in rformassspectrometry/Spectra#257 by extending the m/z range of the detected peaks to compensate for the fact that a) individual datapoints' m/z fluctuate, and b) I will now be experimenting with the |
Simplify algorithm for split_mobilgram_CWT: instead of a parametric CWT, use a much simpler/robust `localMaxima` approach to detect apexes and `descentMinTol` to determine the regions. Add start-apex-end information in output vector. Change tests accordingly. - Also, fix bug in `extract_mobilogram`, where bin required the y values to be ordered (mentioned in rformassspectrometry/MsCoreUtils#108)
Hi, xcms now has included the Spectra changes, and this PR should now go towards xcms devel branch. |
Thanks for the heads-up @sneumann! I've updated the branch to devel (https://github.blog/2016-08-15-change-the-base-branch-of-a-pull-request/) |
This PR is an initiative to add support for ion-mobility (IM) mass spectrometry data processing to the
xcms
spectra
branch.This is still a work in progress and any contribution or suggestion is more than welcome
Up until now, the following elements have been implemented:
Add a new function for ion-mobility peak picking, compatible with
MsExperiment
'sSpectra
only (TimsTofBackend
), with a corresponding CentWaveParamIM class:Create
do_findChromPeaks_IM_centWave
, the API for the peakpicking itselfCreate a
IMParam
class that inherits fromParam
and that will serve as superclass for all ion-mobility peak picking Params.Create a
CentWaveParamIM
that inherits from bothCentWaveParam
andIMParam
Add a function dispatch point in .mse_find_chrom_peaks_sample
Create documentation for both API and param class
There are still many more things to work on: