Enhanced Sequencing Run Logging #343

Lilferrit · 2024-06-21T21:20:06Z

Implemented additional logging functionality to provide an enhanced end-of-run log. This takes the form of an end-of-sequencing-run report; an example is shown below:

INFO: ======= Sequencing Run Report =======
INFO: Sequencing Run Start Timestamp: 1719004519s
INFO: Sequencing Run End Timestamp: 1719004539s
INFO: Time Elapsed: 20s
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo sequence sample_data\sample_preprocessed_spectra.mgf -o foo.mztab
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Sequencing run date: 06/21/24 14:15:39
INFO: Sequenced 128 spectra
INFO: Sequence Score CMF: {0.0: 96, 0.5: 96, 0.9: 66, 0.95: 60, 0.99: 8}
INFO: Max Sequence Length: 17
INFO: Min Sequence Length: 6
INFO: Max GPU Memory Utilization: 400mb

All the data needed to generate this report is recorded on the fly as inference is conducted (i.e., no file parsing). To facilitate this, the data submodule was extended with infrastructure for runtime logging. This will hopefully make it relatively easy to further extend logging and other I/O functionality in the future.

bittremieux · 2024-06-24T19:42:10Z

Changed base to dev, make sure to reflect that locally.

Lilferrit · 2024-06-24T23:32:43Z

Added more information to the end-of-run report, namely the number of skipped spectra. Unfortunately, the model never sees the spectra that are skipped due to pre-check errors, such as invalid precursor info, so these aren't reflected in the end-of-run report. They do, however, appear at other points in the log. A sample log is available below:

INFO: ======= Sequencing Run Report =======
INFO: Sequencing Run Start Timestamp: 1719271695s
INFO: Sequencing Run End Timestamp: 1719271714s
INFO: Time Elapsed: 18s
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo sequence sample_data\sample_preprocessed_spectra.mgf -o data/foo -c casanovo.yaml
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Sequencing run date: 06/24/24 16:28:34
INFO: Attempted to sequence 128 spectra
INFO: Sequenced 72 spectra
INFO: Skipped 56 spectra
INFO: Sequenced 56.25% of total spectra
INFO: Skipped 43.75% of total spectra
INFO: Score Distribution:
INFO: 54 spectra (75.00%) scored >= 0.0
INFO: 54 spectra (75.00%) scored >= 0.5
INFO: 49 spectra (68.06%) scored >= 0.9
INFO: 46 spectra (63.89%) scored >= 0.95
INFO: 8 spectra (11.11%) scored >= 0.99
INFO: Max Sequence Length: 9
INFO: Min Sequence Length: 6
INFO: Max GPU Memory Utilization: 398mb
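
For reference, the score distribution lines in the sample log can be reproduced with a few lines of Python. This is a minimal sketch, not the code in this PR: `score_distribution` and `format_distribution` are hypothetical names, and the percentages are taken relative to the number of scores passed in, which matches the sample log (54 of 72 is 75.00%).

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def score_distribution(
    scores: Sequence[float],
    thresholds: Iterable[float] = (0.0, 0.5, 0.9, 0.95, 0.99),
) -> Dict[float, Tuple[int, float]]:
    """Map each threshold to (count, percentage) of scores >= threshold."""
    n_total = len(scores)
    dist = {}
    for threshold in thresholds:
        count = sum(1 for score in scores if score >= threshold)
        # Guard against an empty run to avoid dividing by zero.
        pct = 100.0 * count / n_total if n_total else 0.0
        dist[threshold] = (count, pct)
    return dist


def format_distribution(dist: Dict[float, Tuple[int, float]]) -> List[str]:
    """Render the distribution in the style of the report lines."""
    return [
        f"{count} spectra ({pct:.2f}%) scored >= {threshold}"
        for threshold, (count, pct) in dist.items()
    ]
```

Because the counts only need the scores, they can be accumulated while predictions are produced, with no need to re-read the output file.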

bittremieux · 2024-06-26T12:02:42Z

Can you describe a bit what the logging_io and prediction_io are and how they're intended to be used? You mentioned it during the meeting yesterday, but it would be useful to have it briefly written out as well.

Lilferrit · 2024-06-26T16:36:30Z

> Can you describe a bit what the logging_io and prediction_io are and how they're intended to be used? You mentioned it during the meeting yesterday, but it would be useful to have it briefly written out as well.

For sure. prediction_io contains the definitions of the PredictionWriter interface and of PredictionMultiWriter. For the most part, all of the functions in the PredictionWriter class were already implemented by MztabWriter. The PredictionWriter interface defines three member functions, log_prediction, log_skipped_spectra, and save, all of which are optional for an implementing class. The interface is intended to provide a consistent way of writing Casanovo sequence predictions to external I/O (namely files and loggers).

PredictionMultiWriter implements the PredictionWriter interface and maintains an internal list of PredictionWriters. When one of the PredictionWriter functions is called on a PredictionMultiWriter, it calls that member function on all of the PredictionWriters in its internal list.

logger_io contains the definition of LogPredictionWriter, a PredictionWriter that is used to generate the end of sequencing run report. It maintains a table of predictions (which contains only the predicted peptide sequence and the search engine score at the time of writing), and writes to this table when log_prediction is called. When save is called it writes the end of run report to a logger object that is provided via the constructor.
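
The three classes described above can be sketched roughly as follows. This is an illustrative outline only, not the code in this PR: the method signatures, the `predictions` list, and the `n_skipped` counter are assumptions, and sequence length is approximated by string length.

```python
import logging
from typing import List, Tuple


class PredictionWriter:
    """Interface sketch: every method is a no-op unless overridden."""

    def log_prediction(self, peptide: str, score: float) -> None:
        pass

    def log_skipped_spectra(self, spectrum_id: str) -> None:
        pass

    def save(self) -> None:
        pass


class PredictionMultiWriter(PredictionWriter):
    """Forwards every call to each writer in an internal list."""

    def __init__(self, writers: List[PredictionWriter]) -> None:
        self.writers = writers

    def log_prediction(self, peptide: str, score: float) -> None:
        for writer in self.writers:
            writer.log_prediction(peptide, score)

    def log_skipped_spectra(self, spectrum_id: str) -> None:
        for writer in self.writers:
            writer.log_skipped_spectra(spectrum_id)

    def save(self) -> None:
        for writer in self.writers:
            writer.save()


class LogPredictionWriter(PredictionWriter):
    """Collects predictions and writes a summary to a logger on save."""

    def __init__(self, logger: logging.Logger) -> None:
        self.logger = logger
        self.predictions: List[Tuple[str, float]] = []
        self.n_skipped = 0  # assumed counter, not confirmed by the PR

    def log_prediction(self, peptide: str, score: float) -> None:
        self.predictions.append((peptide, score))

    def log_skipped_spectra(self, spectrum_id: str) -> None:
        self.n_skipped += 1

    def save(self) -> None:
        self.logger.info("======= Sequencing Run Report =======")
        self.logger.info("Sequenced %d spectra", len(self.predictions))
        self.logger.info("Skipped %d spectra", self.n_skipped)
        if self.predictions:
            # Simplification: string length stands in for peptide length.
            lengths = [len(peptide) for peptide, _ in self.predictions]
            self.logger.info("Max Sequence Length: %d", max(lengths))
            self.logger.info("Min Sequence Length: %d", min(lengths))
```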

bittremieux · 2024-06-28T13:31:29Z

Imo this is a bit over-engineered. Instead, we can just log directly where specific information is available using the Python loggers, rather than having to pass it all upstream to these new interfaces. Your implementation is precisely what the Python logger does after all: it provides a universal interface that multiple writers can hook into to write to different outputs. So we don't need to reinvent that.

So I'd just log the PSM statistics in the MztabWriter, which is where you have the PSMs. Skipped spectra can be logged in the reader, which is where the skipping actually happens (or will happen after the DepthCharge upgrade). (The current spectrum skipping logging is also incorrect and adds non-negligible overhead, which should be avoided for something that is only logging.) Other information can similarly be logged in the relevant locations.
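
As a minimal sketch of that point, the standard `logging` module already fans one log call out to several outputs through handlers. The logger name, the two helper functions, and the use of in-memory buffers in place of the console and a log file are all assumptions for illustration, not Casanovo code.

```python
import io
import logging

logger = logging.getLogger("casanovo_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output out of the root logger

# Two outputs attached to the same logger. In practice these could be
# logging.StreamHandler(sys.stderr) and logging.FileHandler("run.log").
console_buffer = io.StringIO()
file_buffer = io.StringIO()
for stream in (console_buffer, file_buffer):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)


def log_psm_statistics(scores):
    """Would live where the PSMs are aggregated (e.g. the output writer)."""
    logger.info("Sequenced %d spectra", len(scores))


def log_skipped_spectrum(reason):
    """Would live where spectra are actually skipped (e.g. the reader)."""
    logger.warning("Skipped spectrum: %s", reason)
```

Each call site logs what it knows locally, and every attached handler receives the message without any extra interface being passed around.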

Lilferrit · 2024-06-28T17:11:37Z

Oh ok, sounds good. Just to be clear, should all of the PSM processing (e.g. the score distribution calculation) and logging happen in the MztabWriter? Also, should skipped spectra logging just be delayed until the DepthCharge upgrade is implemented?

bittremieux · 2024-06-28T17:42:04Z

Yes, I think that's the logical place, considering that all PSMs are aggregated there.

Of course the over-engineering is my personal feeling. We could ask @wfondrie to weigh in as well.

Logging the number of skipped spectra can indeed be delayed for now.

Lilferrit · 2024-06-28T18:29:51Z

I agree that this solution is over-engineered for addressing just the post-run logging. My reasoning for implementing it this way is that having a PSM I/O interface would make it relatively easy to extend PSM I/O in general. For example, if we want to support other output formats like Parquet or CSV in the future and add a command-line option to specify multiple output formats (say you want to write the PSMs to both CSV and mzTab), this would be relatively easy to do. Likewise, if an end user wants to write PSMs to an external component like a database, they could write their own PredictionWriter module to handle this.
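
To illustrate the kind of extension meant here, a CSV writer might look like the sketch below. `CsvPredictionWriter` is hypothetical and does not exist in this PR; it only mirrors the three method names of the interface, and the column names are assumptions.

```python
import csv
import io


class CsvPredictionWriter:
    """Hypothetical writer exposing the same three methods as PredictionWriter."""

    def __init__(self, stream):
        self.stream = stream  # any writable text stream, e.g. an open file
        self.rows = []

    def log_prediction(self, peptide, score):
        self.rows.append((peptide, score))

    def log_skipped_spectra(self, spectrum_id):
        # Skipped spectra are not written to the CSV in this sketch.
        pass

    def save(self):
        writer = csv.writer(self.stream, lineterminator="\n")
        writer.writerow(["sequence", "score"])  # assumed column names
        writer.writerows(self.rows)
```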

bittremieux · 2024-06-28T18:57:46Z

Lilferrit · 2024-06-28T19:19:04Z

Fair enough, I'll get everything moved to the mzTab writer.

bittremieux

A lot of these comments are very small things that you can quickly fix.

One bigger comment is that, on reflection, the output writer is also not the best place to do this logging. Instead, a better location seems to be casanovo.py, where the sequencing is actually executed. The PSM statistics can still be retrieved from the MztabWriter through the ModelRunner.

Additionally, we could still try to generalize it a bit. The runtime, hostname, GPU memory consumption, etc. are equally relevant to a training run. So maybe add some general logging for any Casanovo execution, with a small adaptation for the sequencing run to include PSM statistics as well.
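
One possible shape for such general run logging is sketched below, assuming a context manager wrapped around any command. `log_run_report` and the logger name are hypothetical, and the GPU line is only emitted when PyTorch with CUDA is available.

```python
import logging
import socket
import sys
import time
from contextlib import contextmanager

logger = logging.getLogger("casanovo_sketch")  # hypothetical logger name


@contextmanager
def log_run_report(run_name):
    """Log run-level information once the wrapped block has finished."""
    start = time.time()
    try:
        yield
    finally:
        # Report even if the run raised, so failed runs are also documented.
        end = time.time()
        logger.info("======= %s Run Report =======", run_name)
        logger.info("Run Start Timestamp: %ds", int(start))
        logger.info("Run End Timestamp: %ds", int(end))
        logger.info("Time Elapsed: %ds", int(end - start))
        logger.info("Executed Command: %s", " ".join(sys.argv))
        logger.info("Executed on Host Machine: %s", socket.gethostname())
        try:
            import torch
        except ImportError:
            torch = None
        if torch is not None and torch.cuda.is_available():
            mem_mb = int(torch.cuda.max_memory_allocated() / 1024**2)
            logger.info("Max GPU Memory Utilization: %dmb", mem_mb)
```

A training run and a sequencing run could both be wrapped in `with log_run_report(...)`, with the sequencing command additionally logging its PSM statistics. The logger needs to be configured at `INFO` level (for example via `logging.basicConfig(level=logging.INFO)`) for the report to appear.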

casanovo/data/ms_io.py

casanovo/utils.py

Lilferrit added 13 commits June 18, 2024 15:06

implemented report_gen submodule

da45608

report_gen documentation

2d6b5c3

report_gen submodule test

28fa6c8

naming conventions

97e5bf1

naming conventions

4f635f9

PredictionWriter virtual class

aa43a8c

multi prediction writer

46bb62c

LogPredicitonWriter wip

40eecb1

implemented logger io

2d7effa

removed report gen submodule

a7beddf

logger io test

65b5a83

logging info

1f656b6

implemented end of run logging

4d2fab1

Lilferrit linked an issue Jun 21, 2024 that may be closed by this pull request

Add info to log files #295

Closed

Lilferrit and others added 8 commits June 21, 2024 14:20

Merge branch 'main' into run-report-logging

9e903e7

Generate new screengrabs with rich-codex

22f26c7

logger io test fix

2f83bb7

formatting fixes

858704e

updated screeshots

6da1219

test file formatting

bf6c20c

Restrict NumPy to pre-2.0

ed1b841

Update changelog

968f60a

bittremieux changed the base branch from main to dev June 24, 2024 19:41

Lilferrit added 5 commits June 24, 2024 14:28

PredictionMultiWriter s\erialization

0b12fb8

log writer error handling

ff37b54

reformatting

dee9bf0

Merge branch 'hotfix_numpy' of github.com:Noble-Lab/casanovo into run…

411f717

…-report-logging

verified skipped spectra counter

19d8aa8

wsnoble requested a review from bittremieux June 25, 2024 16:32

bittremieux linked an issue Jun 26, 2024 that may be closed by this pull request

Add more detailed logging to assist user debugging #244

Closed

changelog merge confict

56ef340

migrated end of run report logging functionality to ms_io

79c706e

bittremieux requested changes Jul 2, 2024

moved logging utility functions to util.py

4942a48

bittremieux requested changes Jul 4, 2024

requested changes

66860e2