[WIP] Add new Ion-mobility peak picking algorithm #647

RogerGinBer · 2022-12-02T17:51:04Z

This PR is an initiative to add support for ion-mobility (IM) mass spectrometry data processing to the xcms spectra branch.
This is still a work in progress and any contribution or suggestion is more than welcome

Up until now, the following elements have been implemented:

Add a new function for ion-mobility peak picking, compatible with MsExperiment's Spectra only (TimsTofBackend), with a corresponding CentWaveParamIM class:

Create do_findChromPeaks_IM_centWave, the API for the peakpicking itself
Create a IMParamclass that inherits from Param and that will serve as superclass for all ion-mobility peak picking Params.
Create a CentWaveParamIM that inherits from both CentWaveParam and IMParam
Add a function dispatch point in .mse_find_chrom_peaks_sample
Create documentation for both API and param class

There are still many more things to work on:

Unit tests,
Retention-time peak alignment for IM peaks
IM peak grouping into features
(...)

Add a new function for ion-mobility peak picking, compatible with `MsExperiment`'s `Spectra` only (`TimsTofBackend`), with a corresponding CentWaveParamIM class: - Create `do_findChromPeaks_IM_centWave`, the API for the peakpicking itself - Create a `IMParam`class that inherits from `Param` and that will serve as superclass for all ion-mobility peak picking Params. - Create a `CentWaveParamIM` that inherits from both `CentWaveParam` and `IMParam` - Add a function dispatch point in .mse_find_chrom_peaks_sample - Create documentation for both API and param class

For better readability and indexing

jorainer

Thanks Roger! looks good, but in order to give better advise I need to better understand what the other algorithms you are thinking to implement do.

For just this algorithm I would suggest to put most of the processing directly into .mse_find_chrom_peaks_chunk, i.e. put the code do create the matrix p there into a function that takes also param as input. For IM param classes it could then do the collapsing by frame and return directly the p that can be used for the original do_.... Then, have another function that takes the result from the chrom peak detection and post-processes that. That function could also take param as input, and, if no IM param is provided simply returns the peak matrix. Otherwise, it performs this additional peak detection that you have in your do_findChromPeaks* for IM data.

Happy to discuss :)

jorainer · 2022-12-07T10:37:23Z

R/do_findChromPeaks-functions.R

+
+
+    ## Merging frames into scans and Summarize across IM dimension
+    message("Collapsing data over IM dimension... ", appendLF = F)


just a comment: I would avoid using these message calls - I try to reduce them as I found them pretty annoying if you have many files. Instead, we're now using a progressbar for the complete processing. that looks somewhat cleaner.

jorainer · 2022-12-07T11:03:08Z

R/do_findChromPeaks-functions.R

+#' @author Roger Gine, Johannes Rainer
+#' 
+#' @importFrom Spectra peaksData rtime combineSpectra mz
+do_findChromPeaks_IM_centWave <- function(spec,


Are there other ways/algorithms to perform peak detection on IM data that do not first collapse the data and then expand it again like you do here?

If not I would suggest to split the functionality into 3 functions:

collapse peaksData by frame

do peak detection using the original detection algorithm.

have a function that takes the original peaksData and the detected chrom peaks matrix from 2) as input and post processes the data.

In principle, yes, there are other ways to perform peak-picking that directly use the "full" data without collapsing (for instance, you could do a 2D-CWT where you change the scales in both RT and IM dimensions and find local maxima; or any peakpicking algorithm such as those used for GCxGC-MS, where you have a similar situation). I haven't looked deeply into such methods, but we should accomodate for them too, just in case

Still, splitting the functionality in do_findChromPeaks_IM_centWave seems good, since those functions would be reusable for other algorithms, etc. Specifically, if you agree, I'll do the following:

Call .mse_find_chrom_peaks_sample with the IM-collapsed Spectra object and the param as(param, CentWaveParam") (since it IMCentWaveParam inherits from it) -> That function will take care of extracting the peaksData, rt, valsPerSpect, etc., run do_findChromPeaks_centWave, and return the peak matrix.

Encapsulate the post processing (ie. resolving across IM dimension) in another function

Sounds good? 👍

sorry for my late reply!

splitting functionality is always good, as you say, enables reuse - and makes the code easier to read. so, yes, it sounds good.

so, if I understand:

if param inherits IMCentWaveParam collapse the peaksData by frame.

pass the data to do_findChromPeaks_centWave for peak detection (since IMCentWaveParam inherits CentWaveParam)

if param inherits IMCentWaveParam post process the detected peaks (resolve across IM dimension) and return results.

if the functions get to large, you could also consider implementing a .im_mse_find_chrom_peaks_sample that is called instead of .mse_find_chrom_peaks_sample if param inherits from an IM param object... not sure if that would simplify integration of additional/other IM peak detection methods.

For now I've encapsulated all this into do_findChromPeaks_IM_centWave, so it's called after dispatching the function corresponding to the param:

## Merging all frames from the same scans to summarize across IM dimension scans_summarized <- Spectra::combineSpectra( spec, f = as.factor(spec$frameId), intensityFun = base::sum, weighted = TRUE, ppm = ppmMerging ) Spectra::centroided(scans_summarized) <- TRUE ## 1D Peak-picking on summarized data peaks <- .mse_find_chrom_peaks_sample(scans_summarized, msLevel = 1L, param = CentWaveParam(ppm = ppm, peakwidth = peakwidth, snthresh = snthresh, prefilter = prefilter, mzCenterFun = mzCenterFun, integrate = integrate, mzdiff = mzdiff, fitgauss = fitgauss, noise = noise, verboseColumns = verboseColumns, roiList = roiList, firstBaselineCheck = firstBaselineCheck, roiScales = roiScales, extendLengthMSW = extendLengthMSW)) ## 1D Peak-picking, for each individual peak, to resolve across the IM dimension .do_resolve_IM_peaks_CWT(spec, peaks, binWidthIM)

It's basically the all steps you mentioned (collapse, CentWave and resolve), but called from a "lower" function call level, so everything upstream is more tidy. What I like about this is that all Centwave-specific checks and data extraction (mz, int, valsPerSpect, etc.) are handled by .mse_find_chrom_peaks_sample, so we are reusing already-existing code.

I'll commit the proposed refactor so you can take a closer look by yourself

Excellent. Let me know when you're OK/ready from your side. I would then like to try the code in action and tinker myself a bit to see if/where we could improve/optimize.

For that I would create yet another branch to play with the code and ask for your feedback on the merge.

Related to that: could you please provide a short code snipped with the example how to perform the analysis (I guess I got already a file from you on which I can test...)?

Definitely! If you agree, perhaps we should use (for testing) the MsBackendMemory or MsBackendDataFrame to create a manageable toy example from the current read-only MsBackendTimsTof (for instance, subsetting the RT)

I'm still figuring out how to properly centroid the data (the format has some problems that makes the current peakPicks function in Spectra ineffective, see below), but still, the analysis would go somewhat like this:

library(opentimsr) library(MsBackendTimsTof) library(MsExperiment) library(Spectra) library(xcms) library(magrittr) ## Setting up Bruker SDK to read the raw data directly (if you have it already, just point to the corresponding file) so_folder <- tempdir() so_file <- download_bruker_proprietary_code(so_folder, method = "wget") setup_bruker_so(so_file) ## Set up valid MsExperiment with the subsetted spectra, then detect IM peaks be <- backendInitialize(MsBackendTimsTof(), "./path_to_your_file_folder.d") spec <- Spectra(be) %>% filterRt(., c(350, 365)) %>% filterMsLevel(.. 1) %>% setBackend(., MsBackendMemory()) exp <- MsExperiment() spectra(exp) <- spec sampleData(exp) <- DataFrame( raw_file = normalizePath("./path_to_your_file_folder.d") ) exp <- linkSampleData(exp, with = "sampleData.raw_file = spectra.dataOrigin") exp <- findChromPeaks(exp, IMCentWaveParam())

You can use the TIMS-TOF data file I sent you a while back, just bear in mind it's in a zero-less profile mode (working on fixing that) and the chromatographic peaks are usually very short (<6-10s)

jorainer · 2022-12-07T11:04:26Z

R/do_findChromPeaks-functions.R

+.extract_mobilogram <- function(pdata, peak, rt, im, binWidthIM = 0.01){
+    rtr <- c(peak["rtmin"], peak["rtmax"])
+    mzr <- c(peak["mzmin"], peak["mzmax"])
+    keep <- dplyr::between(rt, rtr[1], rtr[2])


Why not using MsCoreUtils::between instead?

I also tried it, but the performance was much better for dplyr::between, around 2 times faster
Still, since rt is a sorted vector (by default I think, Spectra wouldn't allow it otherwise), using base::findInterval would be even faster. Here's a representative benchmark:

library(microbenchmark) rt <- rep(1:2000, each = 1000) ## Big vector, like the ones you'd have for IM scans rtint <- c(230, 260) microbenchmark(MsCoreUtils::between(rt, rtint), dplyr::between(rt, rtint[1], rtint[2]), findInterval(rtint, rt), times = 1000)

wow - indeed. note that rtime does not necessarily need to be ordered - Spectra is relatively relaxed there (while mz values indeed are required to be sorted). I'll open an issue in MsCoreUtils - no need to have a between function there if there is a more powerful in dplyr...

I see, I got it the other way around with the rtime and mz, I thought rtime also had to be ordered: then, could we just order the spectra by rtime first so we can use the faster function findInterval?
Something like this, at some point:
spec <- spec[order(rtime(spec))]

I like the fact that the MsCoreUtils::between uses a numeric(2) as input and sorts it before using it (it's neat and avoids silly mistakes). I'll comment that on the corresponding issue

Actually, I think it would be best to start using the MsCoreUtils::between again - Sebastian did some improvement in a C implementation of the code. So it should now be comparable (and eventually even faster) than dplyr::between.

Regarding findInterval and ordering the rtime - you're right, rts should be ordered, but I would like to avoid checking for the order or enforcing an order... better to have the code as independent of possible from potential pre-requisites (that we would then always need to check if they are true within the function).

Seems good! I'll revert back to MsCoreUtils::between then 👍

jorainer · 2022-12-07T11:05:47Z

R/do_findChromPeaks-functions.R

+            next
+        }
+        bounds <- .split_mobilogram(mobilogram)
+        new_peaks <- data.frame(


I think using data.frame here might have an impact on performance. I think matrix might be faster, but that's something we can check later.

RogerGinBer · 2022-12-07T22:56:17Z

Hi Johannes, thanks for the code review!
About the algorithms:

Regarding peak-picking: as I commented above, this sequential 1D + 1D peak-picking (ie. first in the RT domain, and then, peak-wise, in the IM domain) is not the only possible strategy. One could also want to do a direct 2D peak-picking (for instance, here's a python package that does it https://deimos.readthedocs.io/en/latest/user_guide/peak_detection.html) without collapsing the IM data at all. Since both seem to be valid, we should try to be flexible
Regarding other processing steps (alignment, grouping, etc.): we still have to reach that stage, but I'm sure we'll need to make some adjustments to current functions to keep track of the IM dimension data and avoid mixing up the peaks

For now: I'll make the amendments we discussed above, profile the algorithm performance, and create some unit tests for all new functions

jorainer · 2022-12-13T14:06:47Z

Perfect! Thanks @RogerGinBer for the updates! And yes, we need to keep track of the IM dimension data for the identified peaks - that information could either go directly (as a new column) into the chromPeaks matrix or alternatively to the chromPeakData data.frame.

RogerGinBer · 2022-12-13T18:16:43Z

Good! I think I'll keep the IM ranges in the chromPeaks matrix for now, since its something "essential" that defines the peaks
I'll keep you updated by posting new commits and ping you when I need some feedback 👍

- Remove `message` calls from within the function - Split functionality in 3 parts: 1. collapse/merge IM scans, 2. call CentWave on collapsed data, 3. post-process peak matrix to resolve IM peaks. - Add termination `CWT` to `.do_resolve_IM_peaks` and `split_mobilogram` to differenciate them if more methods are added

jorainer · 2022-12-21T09:55:42Z

Something else: could you please already start to add unit tests? it's usually better to implement them right away for each function you write.

Add the dispatch point to `.mse_find_chrom_peaks_chunk` Until now the IM peakpicking was only available through .mse_find_chrom_peaks_sample

RogerGinBer · 2022-12-21T21:29:49Z

Yep, you're right, I completely forgot about that! 🙏
Until now I've been running some tests locally, since I didn't know that you could change the backend from the read-only MsBackendTimsTof to something more portable like MsBackendMemory. My question is: how do I add the IM data for the tests to the package? Do I generate a very limited Spectra subset from the file I'm using? How small of an object (kB?) are we talking about?

Also, I found out that the raw IM data I was reading is in profile mode but does not have any flanking zeroes, so the localMaxima approach of Spectra::pickPeaks sadly does not work, but I have an idea for a workaround. I'll open an issue over at Spectra to describe the problem and ping here

jorainer · 2022-12-23T10:03:23Z

Thanks for the update - regarding the test/toy data set: I think a Spectra with MsBackendMemory would be ideal. As you mentioned, size should be small, maybe limit to a small m/z range and retention time window for a region where you expect to see some peaks/signals?

Since we use `combineSpectra` to merge mobility scans into a frame-level scan, the reported mz value *shrinks* to the weigthed average, causing problems downstream when trying to calculate the mobilograms. Therefore, we are expanding by `ppmMerging` the peak mz range after CentWave peak detection Also, add checks for number of rows of output

Add a compact IM data file for running tests Add tests for `do_findChromPeaks_IM_centWave`, internals `.extract_mobilogram` and `.split_mobilogram_CWT`, as well as for the parameter class `IMCentWaveParam`

RogerGinBer · 2023-01-03T22:38:42Z

Hi there!

I've added a very compact IM dataset that contains a mz-rt region where an intense peak and its M1 isotope can be found. I've added unit tests for do_findChromPeaks_IM_centWave and its internals for calculating and splitting mobilograms, as well as for the IMCentWaveParam class.

I partly solved the problem I had in rformassspectrometry/Spectra#257 by extending the m/z range of the detected peaks to compensate for the fact that a) individual datapoints' m/z fluctuate, and b) combineSpectra groups similar m/z values into a single one. Without this correction, I had a problem where the reported mz range for a peak didn't match with any of the original m/z values in the non-combined Spectra.

I will now be experimenting with the do_findChromPeaks_IM_centWave function to make sure it's working properly, but at least now we have some proper unit tests set up 👍

Simplify algorithm for split_mobilgram_CWT: instead of a parametric CWT, use a much simpler/robust `localMaxima` approach to detect apexes and `descentMinTol` to determine the regions. Add start-apex-end information in output vector. Change tests accordingly. - Also, fix bug in `extract_mobilogram`, where bin required the y values to be ordered (mentioned in rformassspectrometry/MsCoreUtils#108)

sneumann · 2023-07-10T15:07:28Z

Hi, xcms now has included the Spectra changes, and this PR should now go towards xcms devel branch.
Norman and I actually just checked, and there would be no merge conflicts.
I don't know if one can change the target branch for a PR easily, otherwise feel free to close and re-open agains devel.
Yours,
Steffen

RogerGinBer · 2023-07-11T08:30:24Z

Thanks for the heads-up @sneumann! I've updated the branch to devel (https://github.blog/2016-08-15-change-the-base-branch-of-a-pull-request/)

RogerGinBer added 2 commits December 2, 2022 10:38

Change CentWaveParamIM to IMCentWaveParam

92eeec1

For better readability and indexing

jorainer reviewed Dec 7, 2022

View reviewed changes

feat: Add IM peak-picking dispatch point by chunks

216210f

Add the dispatch point to `.mse_find_chrom_peaks_chunk` Until now the IM peakpicking was only available through .mse_find_chrom_peaks_sample

RogerGinBer mentioned this pull request Dec 21, 2022

pickPeaks fails to centroid if no flanking zeroes are present rformassspectrometry/Spectra#257

Open

RogerGinBer added 3 commits January 3, 2023 23:20

feat: add getters and setters for IMCentwaveParam class

ce8daa4

test: add unit tests and sample IM data

f8f4826

Add a compact IM data file for running tests Add tests for `do_findChromPeaks_IM_centWave`, internals `.extract_mobilogram` and `.split_mobilogram_CWT`, as well as for the parameter class `IMCentWaveParam`

RogerGinBer mentioned this pull request Jan 4, 2023

bin: y argument must be an ordered vector rformassspectrometry/MsCoreUtils#108

Closed

RogerGinBer changed the base branch from spectra to devel July 11, 2023 08:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add new Ion-mobility peak picking algorithm #647

[WIP] Add new Ion-mobility peak picking algorithm #647

RogerGinBer commented Dec 2, 2022

jorainer left a comment

jorainer Dec 7, 2022

jorainer Dec 7, 2022

RogerGinBer Dec 7, 2022

jorainer Dec 13, 2022

RogerGinBer Dec 13, 2022

jorainer Dec 21, 2022

RogerGinBer Dec 21, 2022

jorainer Dec 7, 2022

RogerGinBer Dec 7, 2022

jorainer Dec 13, 2022

RogerGinBer Dec 13, 2022

jorainer Dec 21, 2022

RogerGinBer Dec 21, 2022

jorainer Dec 7, 2022

RogerGinBer commented Dec 7, 2022

jorainer commented Dec 13, 2022

RogerGinBer commented Dec 13, 2022

jorainer commented Dec 21, 2022

RogerGinBer commented Dec 21, 2022

jorainer commented Dec 23, 2022

RogerGinBer commented Jan 3, 2023

sneumann commented Jul 10, 2023

RogerGinBer commented Jul 11, 2023



		## Merging frames into scans and Summarize across IM dimension
		message("Collapsing data over IM dimension... ", appendLF = F)

[WIP] Add new Ion-mobility peak picking algorithm #647

Are you sure you want to change the base?

[WIP] Add new Ion-mobility peak picking algorithm #647

Conversation

RogerGinBer commented Dec 2, 2022

jorainer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RogerGinBer commented Dec 7, 2022

jorainer commented Dec 13, 2022

RogerGinBer commented Dec 13, 2022

jorainer commented Dec 21, 2022

RogerGinBer commented Dec 21, 2022

jorainer commented Dec 23, 2022

RogerGinBer commented Jan 3, 2023

sneumann commented Jul 10, 2023

RogerGinBer commented Jul 11, 2023