Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add new Ion-mobility peak picking algorithm #647

Draft
wants to merge 8 commits into
base: devel
Choose a base branch
from

Conversation

RogerGinBer
Copy link

This PR is an initiative to add support for ion-mobility (IM) mass spectrometry data processing to the xcms spectra branch.
This is still a work in progress and any contribution or suggestion is more than welcome

Up until now, the following elements have been implemented:

Add a new function for ion-mobility peak picking, compatible with MsExperiment's Spectra only (TimsTofBackend), with a corresponding CentWaveParamIM class:

  • Create do_findChromPeaks_IM_centWave, the API for the peakpicking itself

  • Create a IMParamclass that inherits from Param and that will serve as superclass for all ion-mobility peak picking Params.

  • Create a CentWaveParamIM that inherits from both CentWaveParam and IMParam

  • Add a function dispatch point in .mse_find_chrom_peaks_sample

  • Create documentation for both API and param class

There are still many more things to work on:

  • Unit tests,
  • Retention-time peak alignment for IM peaks
  • IM peak grouping into features
  • (...)

Add a new function for ion-mobility peak picking, compatible with `MsExperiment`'s `Spectra` only (`TimsTofBackend`), with a corresponding CentWaveParamIM class:

- Create `do_findChromPeaks_IM_centWave`, the API for the peakpicking itself

- Create a `IMParam`class that inherits from `Param` and that will serve as superclass for all ion-mobility peak picking Params.

- Create a `CentWaveParamIM` that inherits from both `CentWaveParam` and `IMParam`

- Add a function dispatch point in .mse_find_chrom_peaks_sample

- Create documentation for both API and param class
For better readability and indexing
Copy link
Collaborator

@jorainer jorainer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Roger! looks good, but in order to give better advise I need to better understand what the other algorithms you are thinking to implement do.

For just this algorithm I would suggest to put most of the processing directly into .mse_find_chrom_peaks_chunk, i.e. put the code do create the matrix p there into a function that takes also param as input. For IM param classes it could then do the collapsing by frame and return directly the p that can be used for the original do_.... Then, have another function that takes the result from the chrom peak detection and post-processes that. That function could also take param as input, and, if no IM param is provided simply returns the peak matrix. Otherwise, it performs this additional peak detection that you have in your do_findChromPeaks* for IM data.

Happy to discuss :)



## Merging frames into scans and Summarize across IM dimension
message("Collapsing data over IM dimension... ", appendLF = F)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a comment: I would avoid using these message calls - I try to reduce them as I found them pretty annoying if you have many files. Instead, we're now using a progressbar for the complete processing. that looks somewhat cleaner.

#' @author Roger Gine, Johannes Rainer
#'
#' @importFrom Spectra peaksData rtime combineSpectra mz
do_findChromPeaks_IM_centWave <- function(spec,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other ways/algorithms to perform peak detection on IM data that do not first collapse the data and then expand it again like you do here?

If not I would suggest to split the functionality into 3 functions:

  1. collapse peaksData by frame
  2. do peak detection using the original detection algorithm.
  3. have a function that takes the original peaksData and the detected chrom peaks matrix from 2) as input and post processes the data.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, yes, there are other ways to perform peak-picking that directly use the "full" data without collapsing (for instance, you could do a 2D-CWT where you change the scales in both RT and IM dimensions and find local maxima; or any peakpicking algorithm such as those used for GCxGC-MS, where you have a similar situation). I haven't looked deeply into such methods, but we should accomodate for them too, just in case

Still, splitting the functionality in do_findChromPeaks_IM_centWave seems good, since those functions would be reusable for other algorithms, etc. Specifically, if you agree, I'll do the following:

  • Call .mse_find_chrom_peaks_sample with the IM-collapsed Spectra object and the param as(param, CentWaveParam") (since it IMCentWaveParam inherits from it) -> That function will take care of extracting the peaksData, rt, valsPerSpect, etc., run do_findChromPeaks_centWave, and return the peak matrix.
  • Encapsulate the post processing (ie. resolving across IM dimension) in another function

Sounds good? 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for my late reply!

splitting functionality is always good, as you say, enables reuse - and makes the code easier to read. so, yes, it sounds good.

so, if I understand:

  • if param inherits IMCentWaveParam collapse the peaksData by frame.
  • pass the data to do_findChromPeaks_centWave for peak detection (since IMCentWaveParam inherits CentWaveParam)
  • if param inherits IMCentWaveParam post process the detected peaks (resolve across IM dimension) and return results.

if the functions get to large, you could also consider implementing a .im_mse_find_chrom_peaks_sample that is called instead of .mse_find_chrom_peaks_sample if param inherits from an IM param object... not sure if that would simplify integration of additional/other IM peak detection methods.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've encapsulated all this into do_findChromPeaks_IM_centWave, so it's called after dispatching the function corresponding to the param:

    ## Merging all frames from the same scans to summarize across IM dimension
    scans_summarized <-
        Spectra::combineSpectra(
            spec,
            f = as.factor(spec$frameId),
            intensityFun = base::sum,
            weighted = TRUE,
            ppm = ppmMerging
        )
    Spectra::centroided(scans_summarized) <- TRUE
    
    ## 1D Peak-picking on summarized data
    peaks <- .mse_find_chrom_peaks_sample(scans_summarized,
                                          msLevel = 1L,
                                          param = CentWaveParam(ppm = ppm, peakwidth = peakwidth,
                                                        snthresh = snthresh, prefilter = prefilter,
                                                        mzCenterFun = mzCenterFun, integrate = integrate,
                                                        mzdiff = mzdiff, fitgauss = fitgauss, noise = noise,
                                                        verboseColumns = verboseColumns, roiList = roiList,
                                                        firstBaselineCheck = firstBaselineCheck,
                                                        roiScales = roiScales,
                                                        extendLengthMSW = extendLengthMSW))
    
    ## 1D Peak-picking, for each individual peak, to resolve across the IM dimension
    .do_resolve_IM_peaks_CWT(spec, peaks, binWidthIM)

It's basically the all steps you mentioned (collapse, CentWave and resolve), but called from a "lower" function call level, so everything upstream is more tidy. What I like about this is that all Centwave-specific checks and data extraction (mz, int, valsPerSpect, etc.) are handled by .mse_find_chrom_peaks_sample, so we are reusing already-existing code.

I'll commit the proposed refactor so you can take a closer look by yourself

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent. Let me know when you're OK/ready from your side. I would then like to try the code in action and tinker myself a bit to see if/where we could improve/optimize.

For that I would create yet another branch to play with the code and ask for your feedback on the merge.

Related to that: could you please provide a short code snipped with the example how to perform the analysis (I guess I got already a file from you on which I can test...)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely! If you agree, perhaps we should use (for testing) the MsBackendMemory or MsBackendDataFrame to create a manageable toy example from the current read-only MsBackendTimsTof (for instance, subsetting the RT)

I'm still figuring out how to properly centroid the data (the format has some problems that makes the current peakPicks function in Spectra ineffective, see below), but still, the analysis would go somewhat like this:

library(opentimsr)
library(MsBackendTimsTof)
library(MsExperiment)
library(Spectra)
library(xcms)
library(magrittr)

## Setting up Bruker SDK to read the raw data directly (if you have it already, just point to the corresponding file)
so_folder <- tempdir()
so_file <- download_bruker_proprietary_code(so_folder, method = "wget")
setup_bruker_so(so_file)

## Set up valid MsExperiment with the subsetted spectra, then detect IM peaks
be <- backendInitialize(MsBackendTimsTof(), "./path_to_your_file_folder.d")
spec <- Spectra(be) %>%
    filterRt(., c(350, 365)) %>% 
    filterMsLevel(.. 1) %>% 
    setBackend(., MsBackendMemory())

exp <- MsExperiment()
spectra(exp) <- spec
sampleData(exp) <- DataFrame(
    raw_file = normalizePath("./path_to_your_file_folder.d")
)
exp <- linkSampleData(exp, with = "sampleData.raw_file = spectra.dataOrigin")

exp <- findChromPeaks(exp, IMCentWaveParam())

You can use the TIMS-TOF data file I sent you a while back, just bear in mind it's in a zero-less profile mode (working on fixing that) and the chromatographic peaks are usually very short (<6-10s)

.extract_mobilogram <- function(pdata, peak, rt, im, binWidthIM = 0.01){
rtr <- c(peak["rtmin"], peak["rtmax"])
mzr <- c(peak["mzmin"], peak["mzmax"])
keep <- dplyr::between(rt, rtr[1], rtr[2])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using MsCoreUtils::between instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried it, but the performance was much better for dplyr::between, around 2 times faster
Still, since rt is a sorted vector (by default I think, Spectra wouldn't allow it otherwise), using base::findInterval would be even faster. Here's a representative benchmark:

library(microbenchmark)
rt <- rep(1:2000, each = 1000)  ## Big vector, like the ones you'd have for IM scans
rtint <- c(230, 260)
microbenchmark(MsCoreUtils::between(rt, rtint),
                            dplyr::between(rt, rtint[1], rtint[2]),
                            findInterval(rtint, rt), times = 1000)

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow - indeed. note that rtime does not necessarily need to be ordered - Spectra is relatively relaxed there (while mz values indeed are required to be sorted). I'll open an issue in MsCoreUtils - no need to have a between function there if there is a more powerful in dplyr...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I got it the other way around with the rtime and mz, I thought rtime also had to be ordered: then, could we just order the spectra by rtime first so we can use the faster function findInterval?
Something like this, at some point:
spec <- spec[order(rtime(spec))]

I like the fact that the MsCoreUtils::between uses a numeric(2) as input and sorts it before using it (it's neat and avoids silly mistakes). I'll comment that on the corresponding issue

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it would be best to start using the MsCoreUtils::between again - Sebastian did some improvement in a C implementation of the code. So it should now be comparable (and eventually even faster) than dplyr::between.

Regarding findInterval and ordering the rtime - you're right, rts should be ordered, but I would like to avoid checking for the order or enforcing an order... better to have the code as independent of possible from potential pre-requisites (that we would then always need to check if they are true within the function).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good! I'll revert back to MsCoreUtils::between then 👍

next
}
bounds <- .split_mobilogram(mobilogram)
new_peaks <- data.frame(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using data.frame here might have an impact on performance. I think matrix might be faster, but that's something we can check later.

@RogerGinBer
Copy link
Author

Hi Johannes, thanks for the code review!
About the algorithms:

  • Regarding peak-picking: as I commented above, this sequential 1D + 1D peak-picking (ie. first in the RT domain, and then, peak-wise, in the IM domain) is not the only possible strategy. One could also want to do a direct 2D peak-picking (for instance, here's a python package that does it https://deimos.readthedocs.io/en/latest/user_guide/peak_detection.html) without collapsing the IM data at all. Since both seem to be valid, we should try to be flexible
  • Regarding other processing steps (alignment, grouping, etc.): we still have to reach that stage, but I'm sure we'll need to make some adjustments to current functions to keep track of the IM dimension data and avoid mixing up the peaks

For now: I'll make the amendments we discussed above, profile the algorithm performance, and create some unit tests for all new functions

@jorainer
Copy link
Collaborator

Perfect! Thanks @RogerGinBer for the updates! And yes, we need to keep track of the IM dimension data for the identified peaks - that information could either go directly (as a new column) into the chromPeaks matrix or alternatively to the chromPeakData data.frame.

@RogerGinBer
Copy link
Author

Good! I think I'll keep the IM ranges in the chromPeaks matrix for now, since its something "essential" that defines the peaks
I'll keep you updated by posting new commits and ping you when I need some feedback 👍

- Remove `message` calls from within the function
- Split functionality in 3 parts:
1. collapse/merge IM scans,
2. call CentWave on collapsed data,
3. post-process peak matrix to resolve IM peaks.
- Add termination `CWT` to `.do_resolve_IM_peaks` and  `split_mobilogram` to differenciate them if more methods are added
@jorainer
Copy link
Collaborator

Something else: could you please already start to add unit tests? it's usually better to implement them right away for each function you write.

Add the dispatch point to `.mse_find_chrom_peaks_chunk`
Until now the IM peakpicking was only available through .mse_find_chrom_peaks_sample
@RogerGinBer
Copy link
Author

Yep, you're right, I completely forgot about that! 🙏
Until now I've been running some tests locally, since I didn't know that you could change the backend from the read-only MsBackendTimsTof to something more portable like MsBackendMemory. My question is: how do I add the IM data for the tests to the package? Do I generate a very limited Spectra subset from the file I'm using? How small of an object (kB?) are we talking about?

Also, I found out that the raw IM data I was reading is in profile mode but does not have any flanking zeroes, so the localMaxima approach of Spectra::pickPeaks sadly does not work, but I have an idea for a workaround. I'll open an issue over at Spectra to describe the problem and ping here

@jorainer
Copy link
Collaborator

Thanks for the update - regarding the test/toy data set: I think a Spectra with MsBackendMemory would be ideal. As you mentioned, size should be small, maybe limit to a small m/z range and retention time window for a region where you expect to see some peaks/signals?

Since we use `combineSpectra` to merge mobility scans into a frame-level scan, the reported mz value *shrinks* to the weigthed average, causing problems downstream when trying to calculate the mobilograms. Therefore, we are expanding by `ppmMerging` the peak mz range after CentWave peak detection

Also, add checks for number of rows of output
Add a compact IM data file for running tests
Add tests for `do_findChromPeaks_IM_centWave`, internals `.extract_mobilogram` and `.split_mobilogram_CWT`, as well as for the parameter class `IMCentWaveParam`
@RogerGinBer
Copy link
Author

Hi there!

I've added a very compact IM dataset that contains a mz-rt region where an intense peak and its M1 isotope can be found. I've added unit tests for do_findChromPeaks_IM_centWave and its internals for calculating and splitting mobilograms, as well as for the IMCentWaveParam class.

I partly solved the problem I had in rformassspectrometry/Spectra#257 by extending the m/z range of the detected peaks to compensate for the fact that a) individual datapoints' m/z fluctuate, and b) combineSpectra groups similar m/z values into a single one. Without this correction, I had a problem where the reported mz range for a peak didn't match with any of the original m/z values in the non-combined Spectra.

I will now be experimenting with the do_findChromPeaks_IM_centWave function to make sure it's working properly, but at least now we have some proper unit tests set up 👍

Simplify algorithm for split_mobilgram_CWT: instead of a parametric CWT, use a much simpler/robust `localMaxima` approach to detect apexes and `descentMinTol` to determine the regions. Add start-apex-end information in output vector. Change tests accordingly.

- Also, fix bug in `extract_mobilogram`, where bin required the y values to be ordered (mentioned in rformassspectrometry/MsCoreUtils#108)
@sneumann
Copy link
Owner

Hi, xcms now has included the Spectra changes, and this PR should now go towards xcms devel branch.
Norman and I actually just checked, and there would be no merge conflicts.
I don't know if one can change the target branch for a PR easily, otherwise feel free to close and re-open agains devel.
Yours,
Steffen

@RogerGinBer RogerGinBer changed the base branch from spectra to devel July 11, 2023 08:22
@RogerGinBer
Copy link
Author

Thanks for the heads-up @sneumann! I've updated the branch to devel (https://github.blog/2016-08-15-change-the-base-branch-of-a-pull-request/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants