Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced Sequencing Run Logging #343

Merged
merged 36 commits into from
Jul 10, 2024
Merged

Enhanced Sequencing Run Logging #343

merged 36 commits into from
Jul 10, 2024

Conversation

Lilferrit
Copy link
Contributor

Implemented additional logging functionality in order to provide an enhanced end of sequence run log. This takes the form of an end of sequencing run report, an example is shown below:

INFO: ======= Sequencing Run Report =======
INFO: Sequencing Run Start Timestamp: 1719004519s
INFO: Sequencing Run End Timestamp: 1719004539s
INFO: Time Elapsed: 20s
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo sequence sample_data\sample_preprocessed_spectra.mgf -o foo.mztab
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Sequencing run date: 06/21/24 14:15:39
INFO: Sequenced 128 spectra
INFO: Sequence Score CMF: {0.0: 96, 0.5: 96, 0.9: 66, 0.95: 60, 0.99: 8}
INFO: Max Sequence Length: 17
INFO: Min Sequence Length: 6
INFO: Max GPU Memory Utilization: 400mb

All the data to generate this report is recorded on the fly as inference is conducted (i.e. no file parsing). To facilitate this the functionality of the data submodule was extended to add infastructure for run time logging. This will hopefully make it relatively easy to further extend logging and other io functionality in the future.

@Lilferrit Lilferrit linked an issue Jun 21, 2024 that may be closed by this pull request
@bittremieux bittremieux changed the base branch from main to dev June 24, 2024 19:41
@bittremieux
Copy link
Collaborator

Changed base to dev, make sure to reflect that locally.

@Lilferrit
Copy link
Contributor Author

Lilferrit commented Jun 24, 2024

Added more information to the end of run report, namely the number of skipped spectra. Unfortunately the model never sees the spectra that are skipped due to pre-check errors such as having invalid precursor info, so these aren't reflected in the end of run report. They do however appear at other points in the log. A sample log is available below:

INFO: ======= Sequencing Run Report =======
INFO: Sequencing Run Start Timestamp: 1719271695s
INFO: Sequencing Run End Timestamp: 1719271714s
INFO: Time Elapsed: 18s
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo sequence sample_data\sample_preprocessed_spectra.mgf -o data/foo -c casanovo.yaml
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Sequencing run date: 06/24/24 16:28:34
INFO: Attempted to sequence 128 spectra
INFO: Sequenced 72 spectra
INFO: Skipped 56 spectra
INFO: Sequenced 56.25% of total spectra
INFO: Skipped 43.75% of total spectra
INFO: Score Distribution:
INFO: 54 spectra (75.00%) scored >= 0.0
INFO: 54 spectra (75.00%) scored >= 0.5
INFO: 49 spectra (68.06%) scored >= 0.9
INFO: 46 spectra (63.89%) scored >= 0.95
INFO: 8 spectra (11.11%) scored >= 0.99
INFO: Max Sequence Length: 9
INFO: Min Sequence Length: 6
INFO: Max GPU Memory Utilization: 398mb

@wsnoble wsnoble requested a review from bittremieux June 25, 2024 16:32
@bittremieux
Copy link
Collaborator

Can you describe a bit what the logging_io and prediction_io are and how they're intended to be used? You mentioned it during the meeting yesterday, but it would be useful to have it briefly written out as well.

@bittremieux bittremieux linked an issue Jun 26, 2024 that may be closed by this pull request
@Lilferrit
Copy link
Contributor Author

Can you describe a bit what the logging_io and prediction_io are and how they're intended to be used? You mentioned it during the meeting yesterday, but it would be useful to have it briefly written out as well.

For sure, prediction_io contains the definition for the PredictionWriter interface as well as the definition for PredictionMultiWriter. For the most part, all of the functions in the PredictionWriter class were already implemented by MztabWriter. The PredictonWriter interface defines three member functions, log_prediction, log_skipped_spectra, and save which are all optional to implement by a class that implements the interface. The PredictionWriter interface is intended to provide a consistent interface for writing Casanovo sequence predictions to external IO (namely files and loggers).

PredictionMultiWriter implements the PredictionWriter interface, and maintains an internal list of PredictionWriters. When one of the PredictionWriter functions is called on a PredictionMultiWriter it'll call that member function on all of the PredictionWriters in it's internal list.

logger_io contains the definition of LogPredictionWriter, a PredictionWriter that is used to generate the end of sequencing run report. It maintains a table of predictions (which contains only the predicted peptide sequence and the search engine score at the time of writing), and writes to this table when log_prediction is called. When save is called it writes the end of run report to a logger object that is provided via the constructor.

@bittremieux
Copy link
Collaborator

Imo this is a bit over-engineered. Instead, we can just log directly where specific information is available using the Python loggers, rather than having to pass it all upstream to these new interfaces. Your implementation is precisely with the Python logger does after all: having a universal interface that multiple writers can hook into and write to different outputs. So we don't need to reinvent that.

So I'd just log the PSM statistics in the MzTabWriter, which is where you have the PSMs. Skipped spectra are logged in the reader, which is where the skipping actually happens (or will happen after the DepthCharge upgrade). (The current spectrum skipping logging is also incorrect and adds non-negligible overhead, which should be avoid for only logging.) Other stuff can similarly be logged in other relevant locations, etc.

@Lilferrit
Copy link
Contributor Author

Oh ok sounds good - so just to be clear should all of the PSM processing (e.g. the score distribution calculation) and logging happen in the MzTabWriter? Also, should skipped spectra logging just be delayed until the depth charge upgrade is implemented?

@bittremieux
Copy link
Collaborator

Yes, I think that's the logical place, considering that all PSMs are aggregated there.

Of course the over-engineering is my personal feeling. We could ask @wfondrie to weigh in as well.

Logging the number of skipped spectra can indeed be delayed for now.

@Lilferrit
Copy link
Contributor Author

I agree that this solution is overengineered for addressing just the post run logging. My reasoning for implementing it this way is having a PSM io interface would make it relatively easy to extend the PSM io in general. For example if we want to support other output formats like parquet or csv in the future and add a command line option to specify multiple output formats (say you want to write the PSMs to csv and mzTab) it would be relatively easy to do this, or if an end user wants to write PSMs to an external component like a database they could write their own PredictionWriter module to handle this.

@bittremieux
Copy link
Collaborator

optimization

@Lilferrit
Copy link
Contributor Author

Fair enough, I'll get everything moved to the mzTab writer.

Copy link
Collaborator

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these comments are very small things that you can quickly fix.

One bigger comment is that on reflection the output writer is also not the best place to do this logging. Instead, a better location seems casanovo.py, where the sequencing is actually executed. The PSM statistics can still be retrieved from the MztabWritter through the ModelRunner.

Additionally, we could still try to generalize it a bit. The runtime, hostname, GPU memory consumption, etc. are equally relevant to a training run. So maybe some general logging for any Casanovo execution, with then a small adaptation for the sequencing run to include PSM statistics there as well.

casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/data/ms_io.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
casanovo/utils.py Outdated Show resolved Hide resolved
@Lilferrit Lilferrit merged commit f090e5f into dev Jul 10, 2024
7 checks passed
@Lilferrit Lilferrit deleted the run-report-logging branch July 10, 2024 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add info to log files Add more detailed logging to assist user debugging
2 participants