Sebastian Paez
version = 0.4.0
This repository implements a very simple LGBM model to predict ion mobility from peptides.
There are two main ways to interact with flimsay
, one is using python
and the other one is using the python api directly.
$ pip install flimsay
$ flimsay fill_blib mylibrary.blib # This will add ion mobility data to a .blib file.
! flimsay fill_blib --help
Usage: flimsay fill_blib [OPTIONS] BLIB OUT_BLIB
Add ion mobility prediction to a .blib file.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --overwrite Whether to overwrite output file, if it exists │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────╯
from flimsay.model import FlimsayModel
model_instance = FlimsayModel()
model_instance.predict_peptide("MYPEPTIDEK", charge=2)
{'ccs': array([363.36245907]), 'one_over_k0': array([0.92423264])}
import pandas as pd
from flimsay.features import add_features, FEATURE_COLUMNS
df = pd.DataFrame({
"Stripped_Seqs": ["LESLIEK", "LESLIE", "LESLKIE"]
})
df = add_features(
df,
stripped_sequence_name="Stripped_Seqs",
calc_masses=True,
default_charge=2,
)
df
2023-08-04 01:23:45.132 | WARNING | flimsay.features:add_features:163 - Charge not provided, using default charge of 2
Stripped_Seqs | StrippedPeptide | PepLength | NumBulky | NumTiny | NumProlines | NumGlycines | NumSerines | NumPos | PosIndexL | PosIndexR | NumNeg | NegIndexL | NegIndexR | Mass | PrecursorCharge | PrecursorMz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LESLIEK | LESLIEK | 7 | 3 | 1 | 0 | 0 | 1 | 1 | 0.857143 | 0.000000 | 2 | 0.142857 | 0.142857 | 830.474934 | 2 | 416.245292 |
1 | LESLIE | LESLIE | 6 | 3 | 1 | 0 | 0 | 1 | 0 | 1.000000 | 1.000000 | 2 | 0.166667 | 0.000000 | 702.379971 | 2 | 352.197811 |
2 | LESLKIE | LESLKIE | 7 | 3 | 1 | 0 | 0 | 1 | 1 | 0.571429 | 0.285714 | 2 | 0.142857 | 0.000000 | 830.474934 | 2 | 416.245292 |
model_instance.predict(df[FEATURE_COLUMNS])
{'ccs': array([315.32424627, 306.70134752, 314.87268797]),
'one_over_k0': array([0.78718781, 0.72658194, 0.78525451])}
from flimsay.model import FlimsayModel
model_instance = FlimsayModel()
%timeit model_instance.predict_peptide("MYPEPTIDEK", charge=3)
122 µs ± 521 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In my laptop that takes 133 microseconds per peptide, or roughly 7,500 peptides per second.
# Lets make a dataset of 1M peptides to test
import random
import pandas as pd
from flimsay.features import calc_mass, mass_to_mz, add_features
random.seed(42)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
charges = [2,3,4]
seqs = [random.sample(AMINO_ACIDS, 10) for _ in range(1_000_000)]
charges = [random.sample(charges, 1)[0] for _ in range(1_000_000)]
seqs = ["".join(seq) for seq in seqs]
masses = [calc_mass(x) for x in seqs]
mzs = [mass_to_mz(m, c) for m, c in zip(masses, charges)]
df = pd.DataFrame({
"Stripped_Seqs": seqs,
"PrecursorCharge": charges,
"Mass": masses,
"PrecursorMz": mzs})
df = add_features(df, stripped_sequence_name="Stripped_Seqs")
# Now we get to run the prediction!
%timeit model_instance.predict(df)
6.39 s ± 152 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In my system every million peptides is predicted in 8.86 seconds, that is 113,000 per second.
There is a fair amount of data on CCS and ion mobility of peptides but only very few models actually use features that are directly interpretable.
In addition, having a simpler model allows faster predictions in systems that are not equiped with GPUs.
Therefore, this project aims to create a freely available, easy to use, interpretable and fast model to predict ion mobility and collisional cross-section for peptides.
The features used for prediction are meant to be simple and their implementation can be found here: flimsy/features.py
from flimsay.features import FEATURE_COLUMN_DESCRIPTIONS
for k,v in FEATURE_COLUMN_DESCRIPTIONS.items():
print(f">>> The Feature '{k}' is defined as: \n\t{v}")
>>> The Feature 'PrecursorMz' is defined as:
Measured precursor m/z
>>> The Feature 'Mass' is defined as:
Measured precursor mass (Da)
>>> The Feature 'PrecursorCharge' is defined as:
Measured precursor charge, from the isotope envelope
>>> The Feature 'PepLength' is defined as:
Length of the peptide sequence in amino acids
>>> The Feature 'NumBulky' is defined as:
Number of bulky amino acids (LVIFWY)
>>> The Feature 'NumTiny' is defined as:
Number of tiny amino acids (AS)
>>> The Feature 'NumProlines' is defined as:
Number of proline residues
>>> The Feature 'NumGlycines' is defined as:
Number of glycine residues
>>> The Feature 'NumSerines' is defined as:
Number of serine residues
>>> The Feature 'NumPos' is defined as:
Number of positive amino acids (KRH)
>>> The Feature 'PosIndexL' is defined as:
Relative position of the first positive amino acid (KRH)
>>> The Feature 'PosIndexR' is defined as:
Relative position of the last positive amino acid (KRH)
>>> The Feature 'NumNeg' is defined as:
Number of negative amino acids (DE)
>>> The Feature 'NegIndexL' is defined as:
Relative position of the first negative amino acid (DE)
>>> The Feature 'NegIndexR' is defined as:
Relative position of the last negative amino acid (DE)
Currently the training logic is handled using DVC (https://dvc.org).
git clone {this repo}
cd flimsay/train
dvc repro
Running this should automatically download the data, trian the models, calculate and update the metrics.
The current version of this repo uses predominantly the data from: - Meier, F., Köhler, N.D., Brunner, AD. et al. Deep learning the collisional cross sections of the peptide universe from a million experimental values. Nat Commun 12, 1185 (2021). https://doi.org/10.1038/s41467-021-21352-8