major improvements of mzml statistics #37
Conversation
Fixed ms2rescore bugs
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (6)
quantmsutils/psm/psm_conversion.py (1)
Lines 144-145: Consider documenting the q-value precedence. The logic for extracting `global_qvalue` from multiple sources is sound, but would benefit from a comment explaining the precedence order of these sources.

Add a comment like:

```diff
 if hit.metaValueExists("MS:1001491"):
     global_qvalue = hit.getMetaValue("MS:1001491")
+# Fall back to generic q-value if MS:1001491 is not available
 elif hit.metaValueExists("q-value"):
     global_qvalue = hit.getMetaValue("q-value")
```
quantmsutils/mzml/mzml_statistics.py (3)
Line 65: Update the docstring to reflect the function's behavior. The docstring for `batch_write_mzml` mentions returning a pandas DataFrame, but the function actually writes data to Parquet files and returns an `Optional[str]`. Updating the docstring will help maintain accurate documentation.
Lines 243-245: Ensure consistent case-insensitive file extension handling. On line 243, the code compares `Path(ms_path).suffix` directly to `".d"`. This may lead to issues on case-sensitive filesystems if the file extension is uppercase (e.g., `.D`). To handle either case consistently, convert the suffix to lowercase before comparison:

```diff
-if Path(ms_path).suffix == ".d":
+if Path(ms_path).suffix.lower() == ".d":
```
Lines 259-264: Simplify and standardize file extension handling in `_resolve_ms_path`. The `valid_extensions` set includes both lowercase and mixed-case extensions. Since the file suffixes are converted to lowercase in the list comprehension, you can streamline the `valid_extensions` set:

```diff
-valid_extensions = {'.d', '.mzml', '.mzML'}
+valid_extensions = {'.d', '.mzml'}
```

This keeps all comparisons case-insensitive and simplifies the code.
quantmsutils/rescoring/ms2rescore.py (2)
Lines 102-107: Optimize feature extraction and error handling. Currently, each rescoring feature is checked with `metaValueExists` and `None` is returned as soon as any feature is missing, which gives no detail about which feature was absent. Consider collecting all missing features and reporting them for better debugging:

```diff
 rescoring_features = {}
+missing_features = []
 for key in self.rescoring_features:
     feature_exists = peptide_hit.metaValueExists(key)
     if not feature_exists:
-        return None
+        missing_features.append(key)
     else:
         rescoring_features[key] = float(peptide_hit.getMetaValue(key))
+if missing_features:
+    logging.warning(
+        f"Missing features {missing_features} for peptide hit "
+        f"{peptide_hit.getSequence().toString()}"
+    )
+    return None
```
Lines 235-242: Adjust the log level for the skipped-PSMs message. The message about PSMs removed for lacking search engine features is currently logged as a warning. If skipping such PSMs is an expected part of the workflow, consider logging at `INFO` level to reduce unnecessary warning messages:

```diff
 if reader.skip_invalid_psm != 0:
-    logging.warning(
+    logging.info(
         f"Removed {reader.skip_invalid_psm} PSMs without search engine features!"
     )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)

- pyproject.toml (1 hunks)
- quantmsutils/__init__.py (1 hunks)
- quantmsutils/mzml/mzml_statistics.py (2 hunks)
- quantmsutils/psm/psm_conversion.py (1 hunks)
- quantmsutils/rescoring/ms2rescore.py (2 hunks)
- recipe/meta.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- pyproject.toml
- quantmsutils/__init__.py
- recipe/meta.yaml
🔇 Additional comments (1)
quantmsutils/psm/psm_conversion.py (1)
Lines 144-145: Verify q-value data type consistency. Ensure the q-value meta value maintains a data type consistent with the other `global_qvalue` sources.
✅ Verification successful
Let me gather more information about the data type handling in the codebase.
Let me check one more aspect to understand the type handling in the OpenMS interface.
Based on the gathered information, I can now provide the final response:
Q-value data type is consistently handled
The codebase shows consistent handling of q-values across different sources:
- Q-values from `getScore()` are numeric (float) by default in pyOpenMS
- Q-values from `getMetaValue()` for both "q-value" and "MS:1001491" are stored in `global_qvalue`, which is initialized as `np.nan` (numeric)
- All score-related values in the codebase are handled as floating-point numbers
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Description: Check the data types of q-values across different idXML files
# Search for q-value related code to understand type handling
rg -A 3 "q-value|MS:1001491" --type py
# Look for any type conversion or validation
ast-grep --pattern 'getMetaValue("q-value")'
ast-grep --pattern 'getMetaValue("MS:1001491")'
```
Length of output: 2712
Script:

```bash
#!/bin/bash
# Look for global_qvalue assignments and type handling
rg -A 2 "global_qvalue =" --type py
# Look for any type validation or conversion of q-values
ast-grep --pattern 'float($_)'
```
Length of output: 2835
Script:

```bash
#!/bin/bash
# Look for OpenMS type handling and initialization
rg -A 3 "import pyopenms|from pyopenms" --type py
rg -A 3 "getScore|setScore" --type py
```
Length of output: 3484
```python
    accession.decode() for accession in peptide_hit.extractProteinAccessionsSet()
],
```
Verify the need for decoding protein accessions

In the `protein_list` comprehension, `accession.decode()` is called on each accession obtained from `peptide_hit.extractProteinAccessionsSet()`. If `accession` is already a string, calling `decode()` will raise an `AttributeError`. Confirm whether `accession` is a bytes object that requires decoding; if it is already a string, remove the `decode()` call.

Apply this diff if `accession` is already a string:

```diff
 protein_list=[
-    accession.decode() for accession in peptide_hit.extractProteinAccessionsSet()
+    accession for accession in peptide_hit.extractProteinAccessionsSet()
 ],
```
Committable suggestion skipped: line range outside the PR's diff.
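If the return type genuinely varies between pyOpenMS versions, one defensive pattern is to branch on the runtime type instead of assuming bytes. This is a sketch, not the PR's code; `to_str` is a hypothetical helper name:

```python
def to_str(accession):
    """Return a str whether the input is bytes or already str."""
    # Some pyOpenMS versions return bytes from string-set accessors;
    # branching on type avoids AttributeError in either case.
    return accession.decode() if isinstance(accession, bytes) else accession

print(to_str(b"P12345"), to_str("Q67890"))  # P12345 Q67890
```

This trades a tiny isinstance check per accession for compatibility across library versions.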
```python
self.protein_ids, self.peptide_ids = self._parse_idxml()
self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])
```
Handle potential IndexError when accessing peptide IDs and hits

In the `__init__` method of `IDXMLReaderPatch`, `self.peptide_ids[0].getHits()[0]` is accessed to initialize `self.user_params_metadata` and `self.rescoring_features`. If `self.peptide_ids` is empty, or if `self.peptide_ids[0].getHits()` is empty, this will raise an `IndexError`. Consider adding a check that peptide IDs and hits are available before accessing them.

Apply this diff to handle the potential empty lists:

```diff
 def __init__(self, filename: Union[Path, str], *args, **kwargs) -> None:
     super().__init__(filename, *args, **kwargs)
     self.protein_ids, self.peptide_ids = self._parse_idxml()
+    if not self.peptide_ids or not self.peptide_ids[0].getHits():
+        raise ValueError("No peptide IDs or peptide hits found in the idXML file.")
     self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
     self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])
     self.skip_invalid_psm = 0
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
self.protein_ids, self.peptide_ids = self._parse_idxml()
if not self.peptide_ids or not self.peptide_ids[0].getHits():
    raise ValueError("No peptide IDs or peptide hits found in the idXML file.")
self.user_params_metadata = self._get_userparams_metadata(self.peptide_ids[0].getHits()[0])
self.rescoring_features = self._get_rescoring_features(self.peptide_ids[0].getHits()[0])
```
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)
Lines 278-281: Enhance the error message with available candidates. When multiple or no candidates are found, including the list of found candidates in the error message would help:

```diff
 if len(candidates) == 1:
     return candidates[0]
-raise FileNotFoundError(f"No unique file found for {ms_path}")
+if not candidates:
+    raise FileNotFoundError(f"No files found with extensions {valid_extensions} for {ms_path}")
+raise FileNotFoundError(f"Multiple matching files found for {ms_path}: {candidates}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (2 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)
Lines 184-186: LGTM! Efficient streaming implementation. The implementation correctly uses PyOpenMS's streaming mode through `MzMLFile().transform()` for memory-efficient processing of mzML files.
Line 24: Fix incorrect regular expression pattern. The pattern `r"[scan|spectrum]=(\d+)"` will not correctly match "scan=" or "spectrum=": square brackets `[]` define a character class that matches any single character within them.

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```
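The difference is easy to demonstrate in isolation (a sketch with an illustrative native ID, not the project's code). Note that a plain capturing group changes `findall` to return `(label, digits)` tuples; a non-capturing group `(?:...)` keeps the digits-only behavior of the original pattern:

```python
import re

# Buggy: a character class matching any one of s,c,a,n,|,p,e,t,r,u,m before '='
buggy = re.compile(r"[scan|spectrum]=(\d+)")
# Fixed: true alternation, non-capturing so group 1 is still the digits
fixed = re.compile(r"(?:scan|spectrum)=(\d+)")

native_id = "controllerType=0 controllerNumber=1 scan=1234"
print(fixed.findall(native_id))  # ['1234']

# The buggy class also matches unrelated keys whose last letter is in the set:
print(buggy.findall("frame=3"))  # ['3'] -- false positive
print(fixed.findall("frame=3"))  # []
```

Because the buggy pattern happens to match "scan=" and "spectrum=" as well (their final letters are in the class), the bug shows up as false positives on other keys rather than as missed matches.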
Lines 52-56: Add null check for precursors. Direct access to `spectrum.getPrecursors()[0]` can cause an IndexError if no precursors are available.

```diff
 if ms_level == 2:
-    precursor = spectrum.getPrecursors()[0]
-    charge_state = precursor.getCharge()
-    exp_mz = precursor.getMZ()
+    precursors = spectrum.getPrecursors()
+    if precursors:
+        precursor = precursors[0]
+        charge_state = precursor.getCharge()
+        exp_mz = precursor.getMZ()
+    else:
+        charge_state = None
+        exp_mz = None
```
Lines 57-63: 🛠️ Refactor suggestion — Consider streaming spectrum data instead of accumulating in memory. When `id_only` is True, all spectrum data is accumulated in `self.psm_parts` before being written at once. For large files, this could lead to memory issues. Consider writing spectrum data in batches, similar to how `batch_data` is handled:

```diff
 if self.id_only:
     scan_id = self.scan_pattern.findall(spectrum.getNativeID())[0]
-    self.psm_parts.append([
-        scan_id, ms_level,
-        mz_array.tolist(),
-        intensity_array.tolist()
-    ])
+    if len(self.psm_parts) >= self.batch_size:
+        self._write_spectrum_batch()
+    self.psm_parts.append({
+        "scan": scan_id,
+        "ms_level": ms_level,
+        "mz": mz_array.tolist(),
+        "intensity": intensity_array.tolist()
+    })
```

Likely invalid or redundant comment.
Likely invalid or redundant comment.
Lines 209-246: Improve error handling for the parquet writer. The `parquet_writer` is created outside the try block but closed in the finally block; if writer creation fails, this raises an `UnboundLocalError`.

```diff
-# Set up parquet writer
-parquet_writer = pq.ParquetWriter(
-    output_path,
-    parquet_schema,
-    compression='gzip'
-)
-
+parquet_writer = None
 try:
+    # Set up parquet writer
+    parquet_writer = pq.ParquetWriter(
+        output_path,
+        parquet_schema,
+        compression='gzip'
+    )
+
     # Stream data in batches
     for chunk in pd.read_sql_query(
         """
         SELECT
             Id,
             CASE
                 WHEN MsMsType IN (8, 9) THEN 2
                 WHEN MsMsType = 0 THEN 1
                 ELSE NULL
             END as MsMsType,
             NumPeaks,
             MaxIntensity,
             SummedIntensities,
             Time,
             Charge,
             MonoisotopicMz
         FROM frames
         """,
         conn,
         chunksize=batch_size
     ):
         # Prepare batch
         chunk['AcquisitionDateTime'] = acquisition_date_time
         # Convert to pyarrow table and write
         batch_table = pa.Table.from_pandas(chunk, schema=parquet_schema)
         parquet_writer.write_table(batch_table)
 finally:
-    parquet_writer.close()
+    if parquet_writer:
+        parquet_writer.close()
```
Likely invalid or redundant comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
quantmsutils/mzml/mzml_statistics.py (2)
Lines 139-143: Add validation for the batch_size parameter. It should be validated as a positive integer:

```diff
 @click.option(
     "--batch_size",
     type=int,
     default=10000,
+    callback=lambda ctx, param, value: value if value > 0 else
+        ctx.fail('batch_size must be positive'),
     help="Number of rows to write in each batch"
 )
```
Lines 218-249: Enhance error handling in `batch_write_bruker_d`. The function should handle potential database errors and provide more specific error messages:

```diff
 try:
     # Stream data in batches
     for chunk in pd.read_sql_query(
         """
         SELECT
             Id,
             CASE
                 WHEN MsMsType IN (8, 9) THEN 2
                 WHEN MsMsType = 0 THEN 1
                 ELSE NULL
             END as MsMsType,
             NumPeaks,
             MaxIntensity,
             SummedIntensities,
             Time,
             Charge,
             MonoisotopicMz
         FROM frames
         """,
         conn,
         chunksize=batch_size
     ):
         # Prepare batch
         chunk['AcquisitionDateTime'] = acquisition_date_time
         # Convert to pyarrow table and write
         batch_table = pa.Table.from_pandas(chunk, schema=parquet_schema)
         parquet_writer.write_table(batch_table)
+except sqlite3.Error as e:
+    print(f"Database error: {e}")
+    raise
+except Exception as e:
+    print(f"Error processing Bruker data: {e}")
+    raise
 finally:
     parquet_writer.close()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (2 hunks)
🔇 Additional comments (4)
quantmsutils/mzml/mzml_statistics.py (4)
Lines 4-11: LGTM: Required imports added for batch processing. The new imports (`Optional`, numpy, pyarrow) are appropriate for implementing batch processing and parquet file handling.
Lines 264-283: LGTM: Robust file path resolution implementation. The `_resolve_ms_path` function provides comprehensive file path resolution with proper validation of extensions and clear error messages.
Line 24: Fix incorrect regular expression pattern. The pattern `r"[scan|spectrum]=(\d+)"` is incorrect: square brackets `[]` define a character class that matches any single character within them, not the words "scan" or "spectrum".

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```
Lines 52-56: Add null check for precursors. Direct access to `spectrum.getPrecursors()[0]` could raise an IndexError if no precursors are available.

```diff
 if ms_level == 2:
+    precursors = spectrum.getPrecursors()
+    if not precursors:
+        charge_state = None
+        exp_mz = None
+    else:
-    precursor = spectrum.getPrecursors()[0]
+        precursor = precursors[0]
-    charge_state = precursor.getCharge()
-    exp_mz = precursor.getMZ()
+        charge_state = precursor.getCharge()
+        exp_mz = precursor.getMZ()
```
Likely invalid or redundant comment.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (3)
quantmsutils/mzml/mzml_statistics.py (3)
Lines 15-22: Add type hints to improve code maintainability. Consider adding type hints to the constructor parameters and class attributes for better maintainability and IDE support:

```diff
-    def __init__(self, parquet_schema, output_path, batch_size=10000, id_only=False):
+    def __init__(self, parquet_schema: pa.Schema, output_path: str, batch_size: int = 10000, id_only: bool = False):
         self.parquet_schema = parquet_schema
         self.output_path = output_path
         self.batch_size = batch_size
         self.id_only = id_only
-        self.batch_data = []
-        self.psm_parts = []
-        self.parquet_writer = None
+        self.batch_data: list[dict] = []
+        self.psm_parts: list[list] = []
+        self.parquet_writer: Optional[pq.ParquetWriter] = None
```
Lines 264-265: Improve the error message for unsupported file types. The message could be more helpful by listing the supported types:

```diff
-raise RuntimeError(f"Unsupported file type: {ms_path}")
+supported_types = [".mzml", ".d"]
+raise RuntimeError(
+    f"Unsupported file type: {ms_path}. "
+    f"Supported file types are: {', '.join(supported_types)}"
+)
```
Lines 208-212: Use a parameterized query. Even though this SQL query uses a fixed string, it is good practice to use parameterized queries consistently:

```diff
-acquisition_date_time = conn.execute(
-    "SELECT Value FROM GlobalMetadata WHERE key='AcquisitionDateTime'"
-).fetchone()[0]
+acquisition_date_time = conn.execute(
+    "SELECT Value FROM GlobalMetadata WHERE key = ?",
+    ("AcquisitionDateTime",)
+).fetchone()[0]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (2 hunks)
🔇 Additional comments (2)
quantmsutils/mzml/mzml_statistics.py (2)
Line 24: Fix the regular expression pattern for scan ID extraction. The regex `r"[scan|spectrum]=(\d+)"` is incorrect: the square brackets create a character class matching any single character from the set {s, c, a, n, |, p, e, t, r, u, m}, leading to incorrect scan ID extraction.

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```
Lines 52-56: Add null check for precursors to prevent IndexError. Direct access to `spectrum.getPrecursors()[0]` can raise an IndexError if the spectrum has no precursors.

```diff
 if ms_level == 2:
-    precursor = spectrum.getPrecursors()[0]
-    charge_state = precursor.getCharge()
-    exp_mz = precursor.getMZ()
+    precursors = spectrum.getPrecursors()
+    if precursors:
+        precursor = precursors[0]
+        charge_state = precursor.getCharge()
+        exp_mz = precursor.getMZ()
+    else:
+        charge_state = None
+        exp_mz = None
```
Likely invalid or redundant comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)
Line 4: Remove unused import. The `Dict` type hint is imported but never used in the code.

```diff
-from typing import Optional, List, Dict
+from typing import Optional, List
```

🧰 Tools
🪛 Ruff (0.8.0): 4-4: `typing.Dict` imported but unused — Remove unused import: `typing.Dict` (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)

- .gitignore (1 hunks)
- quantmsutils/mzml/mzml_statistics.py (1 hunks)

✅ Files skipped from review due to trivial changes (1)

- .gitignore
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsutils/mzml/mzml_statistics.py

4-4: `typing.Dict` imported but unused — Remove unused import: `typing.Dict` (F401)
🔇 Additional comments (3)
quantmsutils/mzml/mzml_statistics.py (3)
Lines 299-318: Well-implemented path resolution logic. The `_resolve_ms_path` function is robustly implemented, with:

- Proper path existence checking
- Case-insensitive extension handling
- Clear error messages
- Support for multiple file formats
Lines 198-210: Efficient streaming implementation. The implementation correctly uses pyopenms streaming mode with `BatchWritingConsumer`, addressing the memory efficiency concerns raised in previous reviews.
Line 29: Fix incorrect regular expression pattern. The current pattern `r"[scan|spectrum]=(\d+)"` uses square brackets, which create a character class matching any single character in "scan|spectrum"; it will not correctly match the intended patterns.

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```
Likely invalid or redundant comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
quantmsutils/mzml/mzml_statistics.py (2)
Line 4: Remove unused import. The `Dict` type hint is imported but never used in the code.

```diff
-from typing import Optional, List, Dict
+from typing import Optional, List
```

🧰 Tools
🪛 Ruff (0.8.0): 4-4: `typing.Dict` imported but unused — Remove unused import: `typing.Dict` (F401)
Lines 289-301: Enhance error handling for database operations. Add specific error handling so database failures produce more informative messages:

```diff
 try:
     # Stream data in batches
     for chunk in pd.read_sql_query(query, conn, chunksize=batch_size):
         chunk['AcquisitionDateTime'] = acquisition_date_time
         for col in schema.names:
             if col not in chunk.columns:
                 chunk[col] = None
         batch_table = pa.Table.from_pandas(chunk, schema=schema)
         parquet_writer.write_table(batch_table)
+except sqlite3.Error as e:
+    print(f"Database error: {e}")
+    raise
+except Exception as e:
+    print(f"Error processing data: {e}")
+    raise
 finally:
     parquet_writer.close()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.0)
quantmsutils/mzml/mzml_statistics.py

4-4: `typing.Dict` imported but unused — Remove unused import: `typing.Dict` (F401)
🔇 Additional comments (4)
quantmsutils/mzml/mzml_statistics.py (4)
Lines 68-69: 🛠️ Refactor suggestion — Optimize array handling to avoid unnecessary conversions. Converting numpy arrays to Python lists creates unnecessary copies of large arrays. Keep the data in numpy arrays throughout the process:

```diff
-    "mz": mz_array.tolist(),
-    "intensity": intensity_array.tolist()
+    "mz": mz_array,
+    "intensity": intensity_array
```

Likely invalid or redundant comment.
Line 31: Fix incorrect regular expression pattern. The current pattern `[scan|spectrum]=(\d+)` uses square brackets, which create a character class matching any single character in "scan|spectrum"; it will not correctly match the intended patterns.

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```
Likely invalid or redundant comment.
Lines 57-61: Add null check for precursors. Accessing `spectrum.getPrecursors()[0]` without checking whether precursors exist could raise an IndexError.

```diff
 if ms_level == 2:
+    precursors = spectrum.getPrecursors()
+    if not precursors:
+        return
-    precursor = spectrum.getPrecursors()[0]
+    precursor = precursors[0]
     charge_state = precursor.getCharge()
     exp_mz = precursor.getMZ()
```
Likely invalid or redundant comment.
Lines 261-266: 🛠️ Refactor suggestion — Improve resource management for the parquet writer. Initialize the writer within a try block to ensure proper cleanup in case of exceptions:

```diff
-parquet_writer = pq.ParquetWriter(
-    output_path,
-    schema=schema,
-    compression='gzip'
-)
+parquet_writer = None
+try:
+    parquet_writer = pq.ParquetWriter(
+        output_path,
+        schema=schema,
+        compression='gzip'
+    )
```

Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

Lines 216-229: Consider implementing memory-efficient streaming for large files. While the current implementation uses streaming for mzML files, it still accumulates data in batches before writing. For extremely large files, consider a more memory-efficient approach:

- Use a smaller default batch size for very large files
- Add memory monitoring
- Consider implementing backpressure mechanisms

Would you like me to provide an example implementation of these optimizations?
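One way to sketch the "memory monitoring" idea is to flush a batch when either a row-count or a byte-budget threshold trips, whichever comes first, sizing the pending batch by the actual peak-array payload. This is illustrative only; `MemoryAwareBatcher` and its parameters are hypothetical, not code from this PR:

```python
import numpy as np

class MemoryAwareBatcher:
    """Flush accumulated rows when either a row count or a byte budget is hit."""

    def __init__(self, flush, max_rows=10_000, max_bytes=64 * 1024 * 1024):
        self.flush = flush            # callable that writes one batch out
        self.max_rows = max_rows
        self.max_bytes = max_bytes
        self.rows = []
        self.pending_bytes = 0

    def add(self, row, mz, intensity):
        # Track the actual array payload size rather than only counting rows,
        # since MS2 peak arrays vary widely in length.
        self.rows.append((row, mz, intensity))
        self.pending_bytes += mz.nbytes + intensity.nbytes
        if len(self.rows) >= self.max_rows or self.pending_bytes >= self.max_bytes:
            self._drain()

    def _drain(self):
        if self.rows:
            self.flush(self.rows)
            self.rows = []
            self.pending_bytes = 0

    def finalize(self):
        self._drain()

batches = []
b = MemoryAwareBatcher(batches.append, max_rows=2)
b.add("scan=1", np.zeros(4), np.zeros(4))
b.add("scan=2", np.zeros(4), np.zeros(4))  # trips the row threshold
b.add("scan=3", np.zeros(4), np.zeros(4))
b.finalize()
print(len(batches))  # 2
```

The byte budget gives an upper bound on batch memory regardless of spectrum size, which is the property the row-count-only scheme lacks; true backpressure would additionally block `add` while a flush is in flight.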
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)
Line 31: Fix incorrect regular expression pattern. The pattern `[scan|spectrum]=(\d+)` is incorrect: it creates a character class matching any single character in "scan|spectrum", not the words "scan" or "spectrum".

Apply this fix:

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```
Lines 68-69: 🛠️ Refactor suggestion — Optimize array handling to avoid unnecessary copies. Converting numpy arrays to Python lists creates unnecessary copies of large arrays.

Keep the data in numpy arrays:

```diff
-    "mz": mz_array.tolist(),
-    "intensity": intensity_array.tolist()
+    "mz": mz_array,
+    "intensity": intensity_array
```
Line 129: Fix incorrect parquet writer usage. Using `self.parquet_writer` instead of `self.id_parquet_writer` for spectrum data is incorrect.

```diff
-    self.parquet_writer.write_table(spectrum_table)
+    self.id_parquet_writer.write_table(spectrum_table)
```
Lines 261-266: 🛠️ Refactor suggestion — Improve resource management for the parquet writer. Initialize the writer within a try block to ensure proper cleanup in case of exceptions:

```diff
-parquet_writer = pq.ParquetWriter(
-    output_path,
-    schema=schema,
-    compression='gzip'
-)
+parquet_writer = None
+try:
+    parquet_writer = pq.ParquetWriter(
+        output_path,
+        schema=schema,
+        compression='gzip'
+    )
     # ... existing code ...
 finally:
-    parquet_writer.close()
+    if parquet_writer is not None:
+        parquet_writer.close()
```
Lines 57-61: Add validation for precursor existence. Accessing `spectrum.getPrecursors()[0]` without checking whether precursors exist could raise an IndexError.

Add validation:

```diff
 if ms_level == 2:
+    precursors = spectrum.getPrecursors()
+    if not precursors:
+        return
-    precursor = spectrum.getPrecursors()[0]
+    precursor = precursors[0]
     charge_state = precursor.getCharge()
     exp_mz = precursor.getMZ()
```
```text
# Conflicts:
#	quantmsutils/mzml/mzml_statistics.py
```
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)
Lines 68-69: 🛠️ Refactor suggestion — Optimize array handling to avoid unnecessary conversions. Converting numpy arrays to Python lists creates unnecessary copies of potentially large arrays. Consider keeping the data in numpy arrays:

```diff
-    "mz": mz_array.tolist(),
-    "intensity": intensity_array.tolist()
+    "mz": mz_array,
+    "intensity": intensity_array
```

Likely invalid or redundant comment.
Line 31: Fix the regular expression pattern for scan/spectrum matching. The current pattern `r"[scan|spectrum]=(\d+)"` uses square brackets, which create a character class matching any single character in "scan|spectrum"; it will not correctly match the intended patterns.

```diff
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
```

Likely invalid or redundant comment.
Lines 261-266: 🛠️ Refactor suggestion — Improve resource management for the parquet writer. Initialize the writer within a context manager or try block to ensure proper cleanup in case of exceptions:

```diff
-parquet_writer = pq.ParquetWriter(
-    output_path,
-    schema=schema,
-    compression='gzip'
-)
+parquet_writer = None
+try:
+    parquet_writer = pq.ParquetWriter(
+        output_path,
+        schema=schema,
+        compression='gzip'
+    )
+except Exception as e:
+    print(f"Error initializing parquet writer: {e}")
+    raise
```

Likely invalid or redundant comment.
Lines 57-61: Add a precursor check to prevent IndexError. The code directly accesses the first precursor without checking whether any precursors exist, which could raise an IndexError.

```diff
 if ms_level == 2:
+    precursors = spectrum.getPrecursors()
+    if not precursors:
+        return
-    precursor = spectrum.getPrecursors()[0]
+    precursor = precursors[0]
     charge_state = precursor.getCharge()
     exp_mz = precursor.getMZ()
```

Likely invalid or redundant comment.
Lines 136-152: 🛠️ Refactor suggestion — Improve error handling in the finalize method. The method should ensure proper cleanup of resources even if an exception occurs:

```diff
 def finalize(self):
+    try:
         if self.batch_data:
             self._write_batch()
         if self.id_only and self.psm_parts:
             self._write_batch()
+    except Exception as e:
+        print(f"Error during finalization: {e}")
+        raise
+    finally:
         if self.parquet_writer:
             self.parquet_writer.close()
+            self.parquet_writer = None
         if self.id_parquet_writer:
             self.id_parquet_writer.close()
+            self.id_parquet_writer = None
```

Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)

Lines 293-294: Improve SQL query construction. The current approach to sanitizing column names could be more robust:

```diff
-safe_columns = [col for col in base_columns if col.replace(" ", "").isalnum()]  # Remove spaces
-query = f"""SELECT {', '.join(safe_columns)} FROM frames """
+# Whitelist of allowed SQL expressions
+allowed_expressions = {
+    "CASE WHEN MsMsType IN (8, 9) THEN 2 WHEN MsMsType = 0 THEN 1 ELSE NULL END as MsMsType",
+    "Id", "NumPeaks", "MaxIntensity", "SummedIntensities", "Time", "Charge", "MonoisotopicMz"
+}
+safe_columns = [col for col in base_columns if col in allowed_expressions]
+query = f"""SELECT {', '.join(safe_columns)} FROM frames"""
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- quantmsutils/mzml/mzml_statistics.py (1 hunks)
🔇 Additional comments (5)
quantmsutils/mzml/mzml_statistics.py (5)
Lines 76-77: 🛠️ Refactor suggestion — Optimize array handling to avoid unnecessary conversions. Converting numpy arrays to Python lists creates unnecessary copies of large arrays:

```diff
-    "mz": mz_array.tolist(),
-    "intensity": intensity_array.tolist(),
+    "mz": mz_array,
+    "intensity": intensity_array,
```

Likely invalid or redundant comment.
38-38
:
Fix incorrect regular expression pattern
The current regex pattern r"[scan|spectrum]=(\d+)"
uses square brackets []
which defines a character class matching any single character in the set. This will not correctly match "scan=" or "spectrum=".
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
Likely invalid or redundant comment.
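The difference is easy to demonstrate. Note also that the suggested fix captures the keyword as group 1 and moves the scan number to group 2; a non-capturing group is an alternative if downstream code reads group(1). The reference string below is invented for illustration:

```python
import re

buggy = re.compile(r"[scan|spectrum]=(\d+)")     # character class: any one of s,c,a,n,|,p,e,t,r,u,m
grouped = re.compile(r"(scan|spectrum)=(\d+)")   # alternation; scan number moves to group 2
noncap = re.compile(r"(?:scan|spectrum)=(\d+)")  # alternation; scan number stays in group 1

ref = "controllerType=0 controllerNumber=1 scan=1024"

# The buggy pattern happens to match here only because 'n=' precedes the digits...
print(buggy.search(ref).group(1))       # 1024
# ...but it also "matches" strings containing neither keyword:
print(buggy.search("run=77").group(1))  # 77

print(grouped.search(ref).group(2))     # 1024
print(noncap.search(ref).group(1))      # 1024
```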
238-238: Fix missing parameter in BatchWritingConsumer constructor
The constructor call is missing the required id_parquet_schema parameter.
- consumer = BatchWritingConsumer(parquet_schema, output_path, batch_size, id_only)
+ consumer = BatchWritingConsumer(parquet_schema, id_parquet_schema, output_path, batch_size, id_only)
Likely invalid or redundant comment.
64-68: Add check for empty precursors list
Accessing spectrum.getPrecursors()[0] without checking if the list is non-empty can raise an IndexError.
if ms_level == 2:
+ precursors = spectrum.getPrecursors()
+ if not precursors:
+ charge_state = None
+ exp_mz = None
+ else:
- precursor = spectrum.getPrecursors()[0]
+ precursor = precursors[0]
- charge_state = precursor.getCharge()
- exp_mz = precursor.getMZ()
+ charge_state = precursor.getCharge()
+ exp_mz = precursor.getMZ()
Likely invalid or redundant comment.
153-169: 🛠️ Refactor suggestion
Improve resource management in finalize method
The finalize method should ensure proper cleanup of resources even if an exception occurs.
def finalize(self):
+ try:
if self.batch_data:
self._write_batch()
if self.id_only and self.psm_parts:
self._write_batch()
+ except Exception as e:
+ print(f"Error during finalization: {e}")
+ raise
+ finally:
if self.parquet_writer:
- self.parquet_writer.close()
+ try:
+ self.parquet_writer.close()
+ finally:
+ self.parquet_writer = None
if self.id_parquet_writer:
- self.id_parquet_writer.close()
+ try:
+ self.id_parquet_writer.close()
+ finally:
+ self.id_parquet_writer = None
Likely invalid or redundant comment.
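The close-and-null pattern in the diff above can be shown with a dummy writer; DummyWriter is a stand-in for pyarrow.parquet.ParquetWriter, and the flush step is elided:

```python
class DummyWriter:
    """Stand-in for pyarrow.parquet.ParquetWriter; counts close() calls."""
    def __init__(self):
        self.closed = 0
    def close(self):
        self.closed += 1

class Consumer:
    def __init__(self):
        self.parquet_writer = DummyWriter()
        self.id_parquet_writer = DummyWriter()

    def finalize(self):
        # Closing in `finally` and nulling the handle makes finalize safe to
        # call more than once and robust to errors during the final flush.
        try:
            pass  # flush remaining batches here
        finally:
            if self.parquet_writer:
                self.parquet_writer.close()
                self.parquet_writer = None
            if self.id_parquet_writer:
                self.id_parquet_writer.close()
                self.id_parquet_writer = None

c = Consumer()
w = c.parquet_writer
c.finalize()
c.finalize()  # second call is a no-op: the handles were nulled
print(w.closed)  # 1
```

Without nulling the handles, a second finalize() (or a finalize() after a partial failure) would call close() on an already-closed writer.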
See my remaining change request
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
quantmsutils/mzml/mzml_statistics.py (2)
299-302: Strengthen SQL column name validation
While the current validation removes spaces, it could be more robust by using a stricter pattern for valid SQL column names.
  safe_columns = [
-     col for col in base_columns if col.replace(" ", "").isalnum()
+     col for col in base_columns
+     if all(part.isalnum() or part in ['_', ' '] for part in col)
+     and not col.startswith(('_', ' '))
  ]
348-349: Ensure case-insensitive file extension handling
The extension comparison could be made more robust by normalizing extensions to lowercase before comparison.
- valid_extensions = {".d", ".mzml", ".mzML"}
- candidates = [str(c.resolve()) for c in candidates if c.suffix.lower() in valid_extensions]
+ valid_extensions = {ext.lower() for ext in {".d", ".mzml"}}
+ candidates = [str(c.resolve()) for c in candidates if c.suffix.lower() in valid_extensions]
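A quick check of the lowercase-normalized comparison (the file names are invented for illustration):

```python
from pathlib import Path

# Normalize the allowed set once; compare suffixes case-insensitively.
valid_extensions = {ext.lower() for ext in {".d", ".mzml"}}

candidates = [Path("run1.mzML"), Path("run2.mzml"), Path("sample.d"), Path("notes.txt")]
kept = [str(c) for c in candidates if c.suffix.lower() in valid_extensions]
print(kept)  # ['run1.mzML', 'run2.mzml', 'sample.d']
```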
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
quantmsutils/mzml/mzml_statistics.py
(1 hunks)
🔇 Additional comments (4)
quantmsutils/mzml/mzml_statistics.py (4)
76-77: 🛠️ Refactor suggestion
Optimize array handling to avoid unnecessary conversions
Converting numpy arrays to Python lists creates unnecessary copies of large arrays, impacting performance.
- "mz": mz_array.tolist(),
- "intensity": intensity_array.tolist(),
+ "mz": mz_array,
+ "intensity": intensity_array,
Likely invalid or redundant comment.
38-38: Fix incorrect regular expression pattern
The current pattern r"[scan|spectrum]=(\d+)" is incorrect. Square brackets [] define a character class that matches any single character within the brackets, not alternation between words. This will fail to correctly extract scan IDs.
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
Likely invalid or redundant comment.
64-68: Add precursor existence check to prevent IndexError
Accessing spectrum.getPrecursors()[0] without checking if precursors exist can raise an IndexError.
if ms_level == 2:
- precursor = spectrum.getPrecursors()[0]
- charge_state = precursor.getCharge()
- exp_mz = precursor.getMZ()
+ precursors = spectrum.getPrecursors()
+ if precursors:
+ precursor = precursors[0]
+ charge_state = precursor.getCharge()
+ exp_mz = precursor.getMZ()
+ else:
+ charge_state = None
+ exp_mz = None
Likely invalid or redundant comment.
153-169: 🛠️ Refactor suggestion
Improve resource management in finalize method
The finalize method should ensure proper cleanup of resources even if an exception occurs.
def finalize(self):
+ try:
if self.batch_data:
self._write_batch()
if self.id_only and self.psm_parts:
self._write_batch()
+ finally:
if self.parquet_writer:
self.parquet_writer.close()
+ self.parquet_writer = None
if self.id_parquet_writer:
self.id_parquet_writer.close()
+ self.id_parquet_writer = None
Likely invalid or redundant comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
quantmsutils/mzml/mzml_statistics.py (1)
307: Use parameterized query construction for better security
While the columns are validated, constructing the SQL query using string concatenation could still be risky.
- query = f"""SELECT {', '.join(safe_columns)} FROM frames"""
+ query = f"""SELECT {', '.join('?' for _ in safe_columns)} FROM frames"""
+ safe_columns_tuple = tuple(safe_columns)
Then use the tuple with the parameterized query:
for chunk in pd.read_sql_query(query, conn, params=safe_columns_tuple, chunksize=batch_size):
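One caveat with this suggestion: DB-API placeholders bind values, not identifiers, so a ? cannot stand in for a column name — binding column names as parameters selects them as string literals. A sqlite3 sketch (toy table and data, invented for illustration) combining a column whitelist for identifiers with parameters for values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frames (Id INTEGER, Time REAL)")
conn.executemany("INSERT INTO frames VALUES (?, ?)", [(1, 0.5), (2, 1.5)])

ALLOWED = {"Id", "Time"}

def fetch(columns, min_time):
    # Identifiers: validated against a whitelist, then interpolated.
    safe = [c for c in columns if c in ALLOWED]
    query = f"SELECT {', '.join(safe)} FROM frames WHERE Time >= ?"
    # Values: always bound as parameters.
    return conn.execute(query, (min_time,)).fetchall()

print(fetch(["Id", "Time"], 1.0))  # [(2, 1.5)]

# Binding a column name as a parameter selects it as a literal string:
print(conn.execute("SELECT ? FROM frames", ("Id",)).fetchall())  # [('Id',), ('Id',)]
```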
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
quantmsutils/mzml/mzml_statistics.py
(1 hunks)
🔇 Additional comments (2)
quantmsutils/mzml/mzml_statistics.py (2)
38-38
:
Fix incorrect regular expression pattern
The pattern r"[scan|spectrum]=(\d+)"
is incorrect as square brackets []
define a character class that matches any single character within them. This will match any of the characters 's', 'c', 'a', 'n', '|', 'p', 'e', 't', 'r', 'u', 'm' followed by '='.
- self.scan_pattern = re.compile(r"[scan|spectrum]=(\d+)")
+ self.scan_pattern = re.compile(r"(scan|spectrum)=(\d+)")
Likely invalid or redundant comment.
64-68
:
Add precursor existence check to prevent IndexError
Direct access to spectrum.getPrecursors()[0]
can raise an IndexError if no precursors are available.
if ms_level == 2:
- precursor = spectrum.getPrecursors()[0]
- charge_state = precursor.getCharge()
- exp_mz = precursor.getMZ()
+ precursors = spectrum.getPrecursors()
+ if precursors:
+ precursor = precursors[0]
+ charge_state = precursor.getCharge()
+ exp_mz = precursor.getMZ()
+ else:
+ charge_state = None
+ exp_mz = None
Likely invalid or redundant comment.
Instead of loading all of the data into a pandas DataFrame at once, this PR writes the intensity values incrementally using Parquet batches.
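The batching strategy can be sketched without pyarrow: accumulate rows, flush each full batch to the writer, and flush the remainder at finalize. BatchWriter below is a simplified stand-in for the PR's BatchWritingConsumer, and the sink stands in for a pyarrow.parquet.ParquetWriter; names and batch size are illustrative:

```python
class BatchWriter:
    """Accumulate rows and flush them in fixed-size batches (stand-in for
    writing pyarrow record batches to a Parquet file)."""

    def __init__(self, sink, batch_size=10000):
        self.sink = sink          # e.g. a ParquetWriter in the real code
        self.batch_size = batch_size
        self.batch = []

    def consume(self, row):
        self.batch.append(row)
        if len(self.batch) >= self.batch_size:
            self._write_batch()

    def _write_batch(self):
        if self.batch:
            self.sink.append(list(self.batch))  # one write per full batch
            self.batch.clear()

    def finalize(self):
        self._write_batch()       # flush the partial final batch

batches = []
writer = BatchWriter(batches, batch_size=3)
for i in range(7):
    writer.consume({"scan": i})
writer.finalize()
print([len(b) for b in batches])  # [3, 3, 1]
```

Because only one batch is held in memory at a time, peak memory stays bounded by the batch size rather than by the size of the input file.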
Summary by CodeRabbit
Release Notes
New Features
Improvements
quantms-utils package updated to 0.0.16.
Chores
.gitignore updated to exclude specific test data directories.