
Inconsistent metrics result between eval and denovo mode, and possible reason #122

Closed
irleader opened this issue Jan 17, 2023 · 17 comments
Labels: bug Something isn't working

@irleader

This is a follow-up to issue #111. Thanks to all the information provided, I was able to find a possible bug and its cause.

I experimented with the "pretrained_excl_yeast.ckpt" weights, the original DeepNovo raw yeast dataset, and Casanovo version 3.1.0.

In eval mode (results displayed on console):
aa_precision: 0.7057459339992785
aa_recall: 0.7071716086435299

pep_recall: 0.5046985050978404

In denovo mode (results manually calculated from output mztab file):
aa_precision: 0.6827751198617065
aa_recall: 0.6843725335842246

pep_recall: 0.5046985051027288

There is a difference between the results from eval and denovo mode (aa_precision, aa_recall), so I studied model.py and its PyTorch Lightning implementation.

I first excluded the possibility that this is caused by predictions without '$' or with multiple '$$' (it turns out there are no such predictions), because this line of code removes the first position of a prediction without checking whether it is '$'. I am referring to v3.2.0; the code in v3.1.0 is similar.
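For illustration, a check along these lines (a hypothetical sketch, not the actual model.py code) would avoid stripping a token that is not the stop token:

# Hypothetical sketch, not the actual model.py code: only strip the first
# position of a decoded prediction when it really is the stop token '$'.
STOP_TOKEN = "$"

def strip_stop_token(tokens):
    """Remove a leading stop token from a decoded prediction, if present."""
    if tokens and tokens[0] == STOP_TOKEN:
        return tokens[1:]
    return tokens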

Then I checked the validation part of the code and found that the inconsistency might be caused by the way PyTorch Lightning accumulates metrics from each step and calculates per-epoch metrics. So I manually accumulated the prediction data from each step, calculated the metrics for one epoch, and found that they are the same as the results from denovo mode, which confirms my guess. For the manual calculation, I used the methods from evaluate.py.

Manual calculation from validation_step() output:
aa_precision: 0.6827751198617065
aa_recall: 0.6843725335842246

pep_recall: 0.5046985051027288

To verify further, I tried the original DeepNovo raw ricebean dataset and saw the same pattern.

In eval mode (results displayed on console):
aa_precision: 0.5835929214378351
aa_recall: 0.583695772587527

pep_recall: 0.4223428193208169

In denovo mode (results manually calculated from output mztab file):
aa_precision: 0.576437557764845
aa_recall: 0.5758105380102586

pep_recall: 0.4223428193248386

Manual calculation from validation_step() output:
aa_precision: 0.576437557764845
aa_recall: 0.5758105380102586

pep_recall: 0.4223428193248386

So the conclusion is: PyTorch Lightning can accumulate pep_recall from each step and calculate it correctly for each epoch, but it may fail to do the same for aa_precision and aa_recall.
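To illustrate the suspected mechanism (a toy example of my own, not Casanovo or Lightning code): when a per-step metric is logged, Lightning averages the step values over the epoch, at best weighting by batch size. For pep_recall the per-batch denominator is the number of spectra, so this average reproduces the pooled value, but for aa_precision and aa_recall the denominator is the number of amino acids, which varies independently of the batch size:

# Toy example of the suspected aggregation problem; not Casanovo code.
# Two validation batches with the same number of spectra, but different
# numbers of (matched, predicted) amino acids.
batches = [
    (90, 100),   # batch 1: 90 matched AAs out of 100 predicted
    (10, 1000),  # batch 2: 10 matched AAs out of 1000 predicted
]

# Pooled epoch-level precision: sum the counts first, then divide.
pooled = sum(m for m, _ in batches) / sum(n for _, n in batches)

# Unweighted mean of the per-step precisions, as obtained when averaging
# a per-step logged metric over the epoch.
per_step_mean = sum(m / n for m, n in batches) / len(batches)

print(f"pooled aa_precision:     {pooled:.4f}")         # 0.0909
print(f"mean of per-step values: {per_step_mean:.4f}")  # 0.4550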

P.S. I would again like to request the numerical values of Casanovo's AA/peptide precision/recall for the updated 9-species benchmark (Figure S5 from the preprint). Thanks a lot.

@bittremieux bittremieux added the bug Something isn't working label Jan 17, 2023
@melihyilmaz
Collaborator

Hi,

Thanks for pointing out the PyTorch Lightning bug related to metric accumulation over steps; it looks like that's the reason behind the discrepancy in results. As for the duplicate stop token check: this shouldn't happen in the current implementation, but we still added a check in the recent PR #123 to catch such cases.

Also, sorry for the delay, but you can find the Casanovo peptide/AA precision values at coverage=1 for the updated 9-species benchmark below (corresponding to Figures S2 and S5 from the preprint):

species        peptide precision     AA precision
mouse          0.546022191276223     0.8246832939458176
ricebean       0.762752583426578     0.8981027063629017
human          0.7320067005545287    0.8993333233520767
tomato         0.7904856253285351    0.909782514214078
clambacteria   0.6535406820543996    0.868507208789983
honeybee       0.6471156883806154    0.8149989493510443
mmazei         0.7260779989677485    0.8701884249037223
bacillus       0.7817952157663147    0.8807972308692891
yeast          0.8383814663340254    0.915618641976876

@irleader
Author

Thank you very much for the values!

For preprint Figure S2 and S5 only:

  1. I assume the results are from Casanovo v3.2.0, trained on the MASSIVEKB dataset and tested on each of the revised 9 species, with beam=10 and a precursor window size of 30 ppm, as mentioned in the preprint?

  2. For Casanovo bm (Casanovo v2.1.1) in Figures S2 and S5, is it also trained on the MASSIVEKB dataset, or is it trained on the other 8 species in the revised benchmark?

  3. The MASSIVEKB dataset has all 7 modifications, while the current release of the revised 9-species benchmark doesn't contain the 4 N-terminal modifications yet (Modifications and their representations in revised 9-species benchmark #125). For these benchmark results, did you use exactly the same data as in this link, without the N-terminal modifications?

Thanks!

@melihyilmaz
Collaborator

  1. Yes
  2. Casanovo_bm was trained on the older version of the benchmark, i.e., trained on 8 species from the older benchmark and tested on the remaining ninth species of the original, similar to DeepNovo.
  3. The benchmark dataset these results were obtained from includes the N-terminal modifications (see Figure S4 for results when a subset of PTMs that DeepNovo and Novor couldn't predict was excluded from the benchmark). See the update on Modifications and their representations in revised 9-species benchmark #125 and check out the regenerated benchmark once it's available.

@irleader
Author

A similar situation has been discovered in train mode: the "AA precision", "AA recall", and "Peptide recall" values for the validation dataset displayed on the console are also much higher than the actual accuracy, which might likewise be caused by the PyTorch Lightning code in model.py. I would suggest not using these values. I have compared them with manually calculated AA precision and peptide recall values from the output mztab files, and the displayed values are consistently higher. They can only be used as an indicator of when to stop training, not as an accurate measure of performance on a benchmark dataset.

@melihyilmaz
Collaborator

Thanks for pointing this out - we're also relying on manually calculated precision values when reporting results.

@wsnoble
Contributor

wsnoble commented Jan 31, 2023

Notes from our discussion just now:

One hypothesis (from Wout) is that this may be due to a mismatch in the batch size that throws the calculation off.

Another is that it has to do with how PyTorch Lightning aggregates results at the epoch level. But if that were the case, others would have run into this as well, and Googling did not turn anything up.

@irleader
Author

irleader commented Feb 1, 2023

Thanks for letting me know!

@irleader irleader closed this as completed Feb 1, 2023
@wsnoble wsnoble reopened this Feb 1, 2023
@wsnoble
Contributor

wsnoble commented Feb 1, 2023

I am reopening this because we are still hoping to track down this discrepancy. It might not be immediate, though.

@bittremieux
Collaborator

Although the values seem different, this should also have been impacted by #252. We should check again after merging #253.

@Lilferrit Lilferrit self-assigned this Jul 3, 2024
@Lilferrit
Contributor

Lilferrit commented Jul 11, 2024

I wrote a script to calculate the AA and peptide precision of the Casanovo PSMs, and I am seeing a minor discrepancy between the output of my script and Casanovo's evaluate mode. I ran the script and evaluate mode on the sample data, which has no spectra skipped during sequencing, so the error is unlikely to originate from that. From what I can tell, the discrepancy comes either from an error in my script or from batch effects that we're still seeing in eval mode. Here's the output from my script and from eval mode:

(casanovo_env) C:\Users\gavin\OneDrive\Documents\casanovo>casanovo evaluate sample_data\sample_preprocessed_spectra.mgf -c data/casanovo.yaml
WARNING: Dataloader multiprocessing is currently not supported on Windows or MacOS; using only a single thread.
Seed set to 454
INFO: Model weights file C:\Users\gavin\AppData\Local\casanovo\casanovo_v420_v4_2_0.ckpt retrieved from local cache
INFO: Casanovo version 4.1.1.dev15+gcedfaa7.d20240625
INFO: Sequencing and evaluating peptides from:
INFO:   sample_data\sample_preprocessed_spectra.mgf
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
sample_data\sample_preprocessed_spectra.mgf: 128spectra [00:00, 3208.59spectra/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Validation DataLoader 0:   0%|                                                                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]WARNING: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
WARNING: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
WARNING: D:\anaconda3\envs\casanovo_env\lib\site-packages\lightning\pytorch\utilities\data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

WARNING: D:\anaconda3\envs\casanovo_env\lib\site-packages\lightning\pytorch\utilities\data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Validation DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  0.27it/s]INFO: Step       Train loss      Valid loss      Peptide precision       AA precision
INFO: 0 nan     0.376451        0.492188        0.745707
Validation DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  0.26it/s] 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃         Validate metric         ┃          DataLoader 0           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│   AA precision at coverage=1    │       0.7457072734832764        │
│ Peptide precision at coverage=1 │            0.4921875            │
│          valid_CELoss           │        0.376451313495636        │
└─────────────────────────────────┴─────────────────────────────────┘
INFO: ======= End of Run Report =======
INFO: Run Start Time: 24/07/11 10:45:17
INFO: Run End Time: 24/07/11 10:45:37
INFO: Time Elapsed: 0:00:19.993431
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo evaluate sample_data\sample_preprocessed_spectra.mgf -c data/casanovo.yaml
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Max GPU Memory Utilization: 381 MiB

(casanovo_env) C:\Users\gavin\OneDrive\Documents\casanovo>python val_bug_test.py
AA Precision: 0.7432542927167355
Pep Precision: 0.48437499996215816

Also, here is the test script:

import depthcharge.masses
import pyteomics.mztab

import casanovo.denovo.evaluate


def process_mgf_file(mgf_path):
    # Extract the ground truth peptide sequences from the MGF file.
    with open(mgf_path) as f:
        return [
            line[4:].strip() for line in f if line.startswith("SEQ=")
        ]


def main():
    # Read the predicted peptide sequences from the Casanovo output mzTab.
    mz_tab = pyteomics.mztab.MzTab("foo.mzTab")
    output = mz_tab.spectrum_match_table["sequence"].tolist()
    ground_truth = process_mgf_file(r"sample_data\sample_preprocessed_spectra.mgf")
    peptide_masses = depthcharge.masses.PeptideMass()

    # Match the predictions against the ground truth and compute the metrics.
    aa_precision, _, pep_precision = casanovo.denovo.evaluate.aa_match_metrics(
        *casanovo.denovo.evaluate.aa_match_batch(
            ground_truth,
            output,
            peptide_masses.masses,
        )
    )

    print(f"AA Precision: {aa_precision}")
    print(f"Pep Precision: {pep_precision}")


if __name__ == "__main__":
    main()

@bittremieux
Collaborator

Your script is correct, so I think it's still an issue with aggregating the results across multiple batches.

Maybe this can be checked relatively easily by running the evaluation with a single batch only (by increasing the batch size and/or reducing the number of spectra). Although, come to think of it, the batch size for inference should be 1024 by default, so maybe that's already the case. Could you double-check this?

@wsnoble
Contributor

wsnoble commented Jul 12, 2024

I haven't looked at the code, but doesn't it seem a bit crazy to try to compute these metrics in a batchwise fashion? Why not just aggregate a score vector and a label vector and then do a single call to the metric calculation at the end?

@bittremieux
Collaborator

Currently, it's calculated at the end of each validation batch, where the loss is also calculated. Otherwise, you have to keep all of the predictions and true labels in memory across the whole run, which can easily be on the order of millions. So in terms of code it's not really crazy, but actually a natural way to do it; it may just lead to slightly incorrect results.

@Lilferrit
Contributor

Lilferrit commented Jul 12, 2024

I checked the default config, and the batch size for inference does indeed default to 1024. As for the batch effects, I may be misunderstanding the amino acid and peptide precision scores, but wouldn't it be possible to keep a count of the number of correct amino acids and peptides, as well as the total number of amino acids and peptides, and use these values to calculate the precision scores at the end of an evaluation run? A sketch of the idea is below.
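A minimal sketch of that counting approach (hypothetical names, not the actual Casanovo implementation; the per-batch counts could come from the existing aa_match_batch output):

# Hypothetical sketch of count-based metric accumulation, not actual
# Casanovo code: keep running totals across validation steps and compute
# the ratios only once, at the end of the evaluation run.
class PrecisionAccumulator:
    def __init__(self):
        self.n_aa_correct = 0
        self.n_aa_total = 0
        self.n_pep_correct = 0
        self.n_pep_total = 0

    def update(self, aa_correct, aa_total, pep_correct, pep_total):
        # Add the counts from one validation batch.
        self.n_aa_correct += aa_correct
        self.n_aa_total += aa_total
        self.n_pep_correct += pep_correct
        self.n_pep_total += pep_total

    def compute(self):
        # Pooled precision over all batches; guard against empty input.
        aa_precision = self.n_aa_correct / max(self.n_aa_total, 1)
        pep_precision = self.n_pep_correct / max(self.n_pep_total, 1)
        return aa_precision, pep_precision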

@bittremieux
Collaborator

We should first verify what the actual problem is. The batched data aggregation was only my hypothesis; we haven't isolated the issue yet. You could debug the code I referred to a bit more to see what the values are at each point and find where they deviate.

@Lilferrit
Contributor

After some further investigation, it turns out that this issue persists even when there's only one validation step. I'm still trying to track down the exact discrepancy, though.

@Lilferrit
Contributor

The tentative implementation of #355 calculates the evaluation metrics in a way that is roughly equivalent to the test script. I'll reexamine this once #355 has been merged in.
