
Casanovo's 'sequence' mode crashes when n_beams=10 #271

Closed
cfmelend opened this issue Dec 19, 2023 · 4 comments · Fixed by #306
Labels
bug Something isn't working

@cfmelend
Contributor

I'm experiencing the following error when running Casanovo's dev branch in sequence mode with n_beams=10 (this occurs with predict_batch_size >= 4):

File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/bin/casanovo", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/proj/cas_revisions/casanovo/casanovo.py", line 142, in sequence
    runner.predict(peak_path, output)
  File "/net/noble/vol1/home/cfmelend/proj/cas_revisions/casanovo/denovo/model_runner.py", line 163, in predict
    self.trainer.predict(self.model, self.loaders.test_dataloader())
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 864, in predict
    return call._call_and_handle_interrupt(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 903, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
    return self.predict_loop.run()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/loops/prediction_loop.py", line 122, in run
    self._predict_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/loops/prediction_loop.py", line 250, in _predict_step
    predictions = call._call_strategy_hook(trainer, "predict_step", *step_args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/miniconda3/envs/cas_rev/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 429, in predict_step
    return self.lightning_module.predict_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/proj/cas_revisions/casanovo/denovo/model.py", line 823, in predict_step
    self.forward(batch[0], batch[1]),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/proj/cas_revisions/casanovo/denovo/model.py", line 199, in forward
    return self.beam_search_decode(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/noble/vol1/home/cfmelend/proj/cas_revisions/casanovo/denovo/model.py", line 271, in beam_search_decode
    self._cache_finished_beams(
  File "/net/noble/vol1/home/cfmelend/proj/cas_revisions/casanovo/denovo/model.py", line 538, in _cache_finished_beams
    heapadd(
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Predicting DataLoader 0:  58%|█████▊    | 1725/2952 [3:15:59<2:19:24,  0.15it/s]

This error doesn't show up for n_beams < 10 or for n_beams > 10. I've attached the batch (32 spectra) that consistently triggers this error:
single_problem_batch.txt
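For context, this ValueError is NumPy's standard complaint when an array ends up in a boolean context. A minimal sketch of how it can surface from a heap of (score, per-amino-acid scores) tuples — the names here are illustrative, not Casanovo's actual internals:

```python
import heapq

import numpy as np

# Hypothetical cache of finished beams: (peptide_score, aa_scores) tuples.
heap = []
heapq.heappush(heap, (0.0, np.array([0.1, 0.2, 0.3])))

try:
    # Tuple comparison only falls through to the second element when the
    # first elements are equal. Comparing two NumPy arrays then produces a
    # boolean array, and calling bool() on it raises the ValueError.
    heapq.heappush(heap, (0.0, np.array([0.4, 0.5, 0.6])))
except ValueError as err:
    print(err)
```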

@cfmelend cfmelend added the bug Something isn't working label Dec 19, 2023
@bittremieux bittremieux self-assigned this Dec 21, 2023
@bittremieux
Collaborator

I can't reproduce it with Casanovo v4.0.1. Can you try again with the latest release and check whether there are any differences between my log/output and the results you obtain?

out.log
out.mztab.txt

@cfmelend
Contributor Author

cfmelend commented Jan 8, 2024

Running inference on the single problematic batch with the default settings from Casanovo v4.0.1 and the new default ckpt file, I'm unable to reproduce the bug.

@Peggyying

I'm experiencing the same error when running Casanovo's evaluate mode with n_beams >= 3:

Seed set to 454
INFO: Casanovo version 4.1.0
INFO: Sequencing and evaluating peptides from:
INFO:   /my/data/cross.cat.test.mgf
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
/my/data/cross.cat.test.mgf: 27142spectra [00:07, 3869.78spectra/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Validation DataLoader 0:   0%|                                                                                     | 0/27 [00:00<?, ?it/s]WARNING: /anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1024. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

WARNING: /anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 1024. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Validation DataLoader 0:  30%|██████████████████████▊                                                      | 8/27 [06:41<15:53,  0.02it/s]Traceback (most recent call last):
  File "/anaconda3/envs/casanovo_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda3/envs/casanovo_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/my/casanovo-4.1.0/casanovo/casanovo.py", line 492, in <module>
    main()
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/my/casanovo-4.1.0/casanovo/casanovo.py", line 174, in evaluate
    runner.evaluate(annotated_peak_path)
  File "/my/casanovo-4.1.0/casanovo/denovo/model_runner.py", line 133, in evaluate
    self.trainer.validate(self.model, self.loaders.test_dataloader())
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 645, in validate
    return call._call_and_handle_interrupt(
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 685, in _validate_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1028, in _run_stage
    return self._evaluation_loop.run()
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 391, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 403, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/my/casanovo-4.1.0/casanovo/denovo/model.py", line 767, in validation_step
    for spectrum_preds in self.forward(batch[0], batch[1]):
  File "/my/casanovo-4.1.0/casanovo/denovo/model.py", line 200, in forward
    return self.beam_search_decode(
  File "/my/casanovo-4.1.0/casanovo/denovo/model.py", line 272, in beam_search_decode
    self._cache_finished_beams(
  File "/my/casanovo-4.1.0/casanovo/denovo/model.py", line 539, in _cache_finished_beams
    heapadd(
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Validation DataLoader 0:  30%|██▉       | 8/27 [07:08<16:57,  0.02it/s]

The log file is casanovo-4.1.0.log.
The error doesn't show up for n_beams=1 or 2.

@bittremieux bittremieux reopened this Feb 25, 2024
@bittremieux
Collaborator

bittremieux commented Feb 25, 2024

Ok, I think I figured out what the issue is. Here we cache finished beams based on their scores. The problem arises in the edge case where a cached beam and a beam to be added have identical scores. In that case, heappush falls back to the second (and subsequent) tuple elements to determine the order of entries in the heap. Here, the second element is an array (of amino acid scores), and ordering two NumPy arrays leads to the "ambiguous truth value" error. This occurs when using a higher number of beams because then there will presumably be beams cached with score 0, whereas real scores are extremely unlikely to be identical.

Tricky bug; I don't have a good solution off the top of my head. We should discuss how to address this.
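I don't know what fix #306 ultimately settled on, but one common remedy from the heapq documentation is to insert a monotonically increasing counter between the score and the array, so that score ties are always broken before the arrays are ever compared. A sketch with illustrative names:

```python
import heapq
import itertools

import numpy as np

# Hypothetical fix sketch: a unique, ever-increasing tie-breaker as the
# second tuple element. Equal scores are resolved by the counter, so the
# NumPy arrays never participate in the comparison.
counter = itertools.count()
heap = []
heapq.heappush(heap, (0.0, next(counter), np.array([0.1, 0.2, 0.3])))
heapq.heappush(heap, (0.0, next(counter), np.array([0.4, 0.5, 0.6])))  # no ValueError

score, _, aa_scores = heapq.heappop(heap)
```

As a side effect, the counter also makes ordering among equal-score beams deterministic (insertion order), which avoids any dependence on array contents.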
