Inconsistent metrics result between eval and denovo mode, and possible reason #122

irleader · 2023-01-17T12:24:54Z

This is a follow-up of issue #111, thanks to all the information provided, I am able to find a possible bug and its reason.

I did experiment with weights "pretrained_excl_yeast.ckpt", deepnovo original raw yeast dataset, and Casanovo version 3.1.0.

In eval mode (results displayed on console):
aa_precision: 0.7057459339992785
aa_recall: 0.7071716086435299
pep_recall: 0.5046985050978404

In denovo mode (results manually calculated from output mztab file):
aa_precision: 0.6827751198617065
aa_recall: 0.6843725335842246
pep_recall: 0.5046985051027288

There is difference between results from eval and denovo mode (aa_precision, aa_recall), so I studies model.py and its pytorch lightning implementation.

I first excluded the possibility that this may be caused by predictions without '$' or with multiple '$$' (it turns out there are no such predictions), because this line of code removes the first position of prediction without checking whether it is '$', I am referring to v3.2.0, the code in v3.1.0 is similar.

Then I checked validation part of the code, and find the inconsistence might be caused by the way pytorch lightning accumulates metrics from each step and calculates per-epoch metrics. So I manually accumulates the prediction data from each step and calculate metrics for one epoch, and find it is the same as the results from denovo mode, which verifies my guess. For manual calculation, I use methods from evaluate.py.

Manual calculation from validation_step() output:
aa_precision: 0.6827751198617065
aa_recall: 0.6843725335842246
pep_recall: 0.5046985051027288

In order to further verify, I tried with deepnovo raw ricebean dataset, and see the same pattern.

In eval mode (results displayed on console):
aa_precision: 0.5835929214378351
aa_recall: 0.583695772587527
pep_recall: 0.4223428193208169

In denovo mode (results manually calculated from output mztab file):
aa_precision: 0.576437557764845
aa_recall: 0.5758105380102586
pep_recall: 0.4223428193248386

Manual calculation from validation_step() output:
aa_precision: 0.576437557764845
aa_recall: 0.5758105380102586
pep_recall: 0.4223428193248386

So the conclusion is, pytorch lightning can accumulate pep_recall from each step and calculate it correctly for each epoch, but it may fail the same job for aa_precision and aa_recall.

Ps. I again would like to request numerical values of Casanovo's AA/peptide precision/recall for updated 9-species benchmark (Figure S5 from preprint). Thanks a lot.

melihyilmaz · 2023-01-18T20:18:00Z

Hi,

Thanks for pointing out the pytorch lightening bug relating to metric accumulation over steps, looks like that's reason behind discrepancy in results, and also the duplicate stop token checking - this shouldn't be happening in the current implementation but we still added a check with the recent PR #123 to catch these.

Also sorry for the delay but you can find the Casanovo peptide/AA precision values at coverage=1 for the updated 9-species benchmark below (corresponding to Figure S2 and S5 from preprint):

mouse
peptide precision=0.546022191276223
AA precision=0.8246832939458176

ricebean
peptide precision=0.762752583426578
AA precision=0.8981027063629017

human
peptide precision=0.7320067005545287
AA precision=0.8993333233520767

tomato
peptide precision=0.7904856253285351
AA precision=0.909782514214078

clambacteria
peptide precision=0.6535406820543996
AA precision=0.868507208789983

honeybee
peptide precision=0.6471156883806154
AA precision=0.8149989493510443

mmazei
peptide precision=0.7260779989677485
AA precision=0.8701884249037223

bacillus
peptide precision=0.7817952157663147
AA precision=0.8807972308692891

yeast
peptide precision=0.8383814663340254
AA precision=0.915618641976876

irleader · 2023-01-19T14:19:54Z

Thank you very much for the values!

For preprint Figure S2 and S5 only:

I assume the results are from Casanovo v3.2.0 trained with MASSIVEKB dataset and tested on each of the revised 9 species, with beam=10 and precursor window size=30ppm as mentioned in the preprint?
For Casanovo bm (Casanovo v2.1.1) in Figure S2 and S5, is it also trained with MASSIVEKB dataset, or it is trained with the other 8 species in the revised benchmark?
MASSIVEKB dataset has all 7 modifications, while current release of revised 9 species benchmark doesn't contain the 4 N-terminal modifications yet (Modifications and their representations in revised 9-species benchmark #125). For these benchmark results, do you use exactly the same data as in this link without N-terminal modifications?

Thanks!

melihyilmaz · 2023-01-19T19:06:11Z

Yes
Casanovo_bm was trained on the older version of the benchmark, i.e. trained on 8 species from the older benchmark and tested on the remaining 9th species on the original, similar to DeepNovo.
The benchmark dataset these results were obtained from includes N-terminal modifications (see Figure S4 for results when a subset of PTMs that DeepNovo and Novor couldn't predict were excluded from the benchmark). See the update on Modifications and their representations in revised 9-species benchmark #125 and check out the regenerated benchmark once it's available.

irleader · 2023-01-26T14:32:27Z

A similar situation has been discover in train mode, the "AA precision AA recall Peptide recall" of validation dataset displayed on console is also much higher than the actual accuracy, which might also be caused by pytorch-lightning codes in model.py. I would also suggest not use these values, I have compare them with manually calculated AA precision and peptide recall values from output mztab files, these displayed values are consistently higher. They can only be used as an indicator of when to stop training but not accurate performance of benchmark dataset.

melihyilmaz · 2023-01-26T19:28:38Z

Thanks for pointing this out - we're also relying on manual calculation of precision values when reporting results

wsnoble · 2023-01-31T17:47:01Z

Notes from our discussion just now:

One hypothesis (from Wout) is that this may be due to a mismatch in the batch size, and it throws the calculation off.

Another is that it has to do with how Pytorch Lightning aggregates results at the level of epoch. But if that's the case, then others would have run into this as well. Googling did not turn anything up.

irleader · 2023-02-01T02:20:51Z

Thanks for letting me know!

wsnoble · 2023-02-01T03:20:43Z

I am reopening this because we are still hoping to track down this discrepancy. It might not be immediate, though.

bittremieux · 2023-10-23T07:20:19Z

Although the values seem different, this should also have been impacted by #252. We should check again after merging #253.

Lilferrit · 2024-07-11T17:50:17Z

I wrote a script to calculate the AA and Peptide precision of the Casanovo PSMs, and I am seeing a minor discrepancy between the output of my script and Casanovo's evaluate mode. I ran the script and evaluate mode on the sample data, which has no skipped spectra during sequencing, so the error is unlikely to be originating from that. From what I can tell the discrepancy is either from an error in my script or we're still seeing batch effects in eval mode. Here's the output from my script and eval mode:

(casanovo_env) C:\Users\gavin\OneDrive\Documents\casanovo>casanovo evaluate sample_data\sample_preprocessed_spectra.mgf -c data/casanovo.yaml
WARNING: Dataloader multiprocessing is currently not supported on Windows or MacOS; using only a single thread.
Seed set to 454
INFO: Model weights file C:\Users\gavin\AppData\Local\casanovo\casanovo_v420_v4_2_0.ckpt retrieved from local cache
INFO: Casanovo version 4.1.1.dev15+gcedfaa7.d20240625
INFO: Sequencing and evaluating peptides from:
INFO:   sample_data\sample_preprocessed_spectra.mgf
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
sample_data\sample_preprocessed_spectra.mgf: 128spectra [00:00, 3208.59spectra/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Validation DataLoader 0:   0%|                                                                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]WARNING: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
WARNING: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
WARNING: D:\anaconda3\envs\casanovo_env\lib\site-packages\lightning\pytorch\utilities\data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

WARNING: D:\anaconda3\envs\casanovo_env\lib\site-packages\lightning\pytorch\utilities\data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Validation DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  0.27it/s]INFO: Step       Train loss      Valid loss      Peptide precision       AA precision
INFO: 0 nan     0.376451        0.492188        0.745707
Validation DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  0.26it/s] 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃         Validate metric         ┃          DataLoader 0           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│   AA precision at coverage=1    │       0.7457072734832764        │
│ Peptide precision at coverage=1 │            0.4921875            │
│          valid_CELoss           │        0.376451313495636        │
└─────────────────────────────────┴─────────────────────────────────┘
INFO: ======= End of Run Report =======
INFO: Run Start Time: 24/07/11 10:45:17
INFO: Run End Time: 24/07/11 10:45:37
INFO: Time Elapsed: 0:00:19.993431
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo evaluate sample_data\sample_preprocessed_spectra.mgf -c data/casanovo.yaml
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Max GPU Memory Utilization: 381 MiB

(casanovo_env) C:\Users\gavin\OneDrive\Documents\casanovo>python val_bug_test.py
AA Precision: 0.7432542927167355
Pep Precision: 0.48437499996215816

Also, here is the test script:

import depthcharge.masses
import casanovo
import depthcharge
import pyteomics.mztab

import casanovo.denovo
import casanovo.denovo.evaluate


def process_mgf_file(mgf_path):
    with open(mgf_path) as f:
        return [
            line[4:].strip() for line in f if line.startswith("SEQ=")
        ]


def main():
    mz_tab = pyteomics.mztab.MzTab("foo.mzTab")
    output = mz_tab.spectrum_match_table["sequence"].tolist()
    ground_truth = process_mgf_file("sample_data\sample_preprocessed_spectra.mgf")
    peptide_masses = depthcharge.masses.PeptideMass()

    aa_precision, _, pep_precision = casanovo.denovo.evaluate.aa_match_metrics(
        *casanovo.denovo.evaluate.aa_match_batch(
            ground_truth,
            output,
            peptide_masses.masses
        )
    )

    print(f"AA Precision: {aa_precision}")
    print(f"Pep Precision: {pep_precision}")

if __name__ == "__main__":
    main()

bittremieux · 2024-07-12T08:58:58Z

Your script is correct, so I think it's still an issue with aggregating the results across multiple batches.

Maybe this can be relatively easily checked by running evaluation with a single batch only (by expanding the batch size and/or reducing the number of spectra). Although, come to think of it, the batch size for inference should standard be 1024, so maybe that's already the case. Could you double-check this?

wsnoble · 2024-07-12T13:42:22Z

I haven't looked at the code, but doesn't it seem a bit crazy to try to compute these metrics in a batchwise fashion? Why not just aggregate a score vector and a label vector and then do a single call to the metric calculation at the end?

bittremieux · 2024-07-12T16:05:58Z

Now it's calculated at the end of each validation batch, where the loss is also calculated. Otherwise you have to keep all of the predictions and true labels in memory across the whole run, which can easily be on the order of millions. So in terms of code it's not really crazy but actually a natural way to do it. But maybe leading to slightly incorrect results then.

Lilferrit · 2024-07-12T16:31:40Z

I checked the default config and batch size for inference does indeed default to 1024. As for the batch effects, I may be misunderstanding the amino acid and peptide precision scores but wouldn't be possible to keep a count of the number of correct amino acids and peptides, as well as the total number of amino acids and peptides, and use this values to do the precision scores calculation at the end of an evaluation run?

bittremieux · 2024-07-12T16:38:59Z

We should first verify what the actual problem is. This batched data aggregation was only my hypothesis, we haven't isolated the issue yet. You could debug the code I referred to a bit more to see what the values there are and try to find the point where they deviate.

Lilferrit · 2024-07-12T18:16:22Z

After some further investigation it turns out that this issue persists even in the case where there's only one validation step. I'm still trying to track down the exact discrepancy though.

Lilferrit · 2024-08-12T19:39:00Z

The tentative implementation of #355 calculates the evaluation metrics in a way that is roughly equivalent to the test script. I'll reexamine this once #355 has been merged in.

bittremieux mentioned this issue Jan 17, 2023

Check for stop token #123

Merged

bittremieux added the bug Something isn't working label Jan 17, 2023

irleader closed this as completed Feb 1, 2023

wsnoble reopened this Feb 1, 2023

Lilferrit self-assigned this Jul 3, 2024

bittremieux closed this as completed Oct 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent metrics result between eval and denovo mode, and possible reason #122

Inconsistent metrics result between eval and denovo mode, and possible reason #122

irleader commented Jan 17, 2023

melihyilmaz commented Jan 18, 2023

irleader commented Jan 19, 2023

melihyilmaz commented Jan 19, 2023

irleader commented Jan 26, 2023

melihyilmaz commented Jan 26, 2023

wsnoble commented Jan 31, 2023

irleader commented Feb 1, 2023

wsnoble commented Feb 1, 2023

bittremieux commented Oct 23, 2023

Lilferrit commented Jul 11, 2024 •

edited

Loading

bittremieux commented Jul 12, 2024

wsnoble commented Jul 12, 2024

bittremieux commented Jul 12, 2024

Lilferrit commented Jul 12, 2024 •

edited

Loading

bittremieux commented Jul 12, 2024

Lilferrit commented Jul 12, 2024

Lilferrit commented Aug 12, 2024

Inconsistent metrics result between eval and denovo mode, and possible reason #122

Inconsistent metrics result between eval and denovo mode, and possible reason #122

Comments

irleader commented Jan 17, 2023

melihyilmaz commented Jan 18, 2023

irleader commented Jan 19, 2023

melihyilmaz commented Jan 19, 2023

irleader commented Jan 26, 2023

melihyilmaz commented Jan 26, 2023

wsnoble commented Jan 31, 2023

irleader commented Feb 1, 2023

wsnoble commented Feb 1, 2023

bittremieux commented Oct 23, 2023

Lilferrit commented Jul 11, 2024 • edited Loading

bittremieux commented Jul 12, 2024

wsnoble commented Jul 12, 2024

bittremieux commented Jul 12, 2024

Lilferrit commented Jul 12, 2024 • edited Loading

bittremieux commented Jul 12, 2024

Lilferrit commented Jul 12, 2024

Lilferrit commented Aug 12, 2024

Lilferrit commented Jul 11, 2024 •

edited

Loading

Lilferrit commented Jul 12, 2024 •

edited

Loading