
Improve Bruker xml parsing by using iterparse instead of parsing the complete xml file #267

Merged — 6 commits merged into main from improve_bruker_tiff on Dec 19, 2023

Conversation

@h-mayorquin (Collaborator) commented Dec 5, 2023

Currently, the Bruker parsing of streams is failing for the files from the Clandinin conversion. The part that fails is the plane extraction, as the regular expression does not match properly (the script that generates the failure is at the bottom):

Traceback (most recent call last):
  File "/home/heberto/development/roiextractors/build/dev.py", line 12, in <module>
    streams  = get_streams()
  File "/home/heberto/development/roiextractors/build/dev.py", line 11, in <lambda>
    get_streams = lambda: BrukerTiffMultiPlaneImagingExtractor.get_streams(folder_path=folder_path)
  File "/home/heberto/development/roiextractors/src/roiextractors/extractors/tiffimagingextractors/brukertiffimagingextractor.py", line 123, in get_streams
    plane_stream_names = [
  File "/home/heberto/development/roiextractors/src/roiextractors/extractors/tiffimagingextractors/brukertiffimagingextractor.py", line 124, in <listcomp>
    re.search(plane_naming_pattern, file.attrib["filename"])["stream_name"]
TypeError: 'NoneType' object is not subscriptable

unique_channel_ids = natsort.natsorted(set(channel_ids))
for channel_id, channel_name in zip(unique_channel_ids, unique_channel_names):
    plane_naming_pattern = rf"(?P<stream_name>Ch{channel_id}_\d+)"
    plane_stream_names = [
        re.search(plane_naming_pattern, file.attrib["filename"])["stream_name"]
        for file in xml_root.findall(f".//File")
    ]
    unique_plane_stream_names = natsort.natsorted(set(plane_stream_names))
    streams["plane_streams"][channel_name] = unique_plane_stream_names
return streams
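
The failure happens because re.search returns None when the pattern does not match a filename, and subscripting None raises the TypeError above. A minimal reproduction (the filename here is made up for illustration):

import re

plane_naming_pattern = r"(?P<stream_name>Ch1_\d+)"
# The pattern expects a literal "Ch1_" followed by digits somewhere in the name.
match = re.search(plane_naming_pattern, "ChanA_Cycle00001_000001.ome.tif")
print(match)  # None: there is no "Ch1_<digits>" substring in this filename
match["stream_name"]  # TypeError: 'NoneType' object is not subscriptable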

Even worse, reading large xml files with the current setup is terribly slow. Even with the failing part above commented out, this is the current reading speed for a ~30 MiB file:

(the script, included at the bottom, runs this five times)

Execution times: ['9.278', '8.280', '8.254', '8.178', '8.191']
Average time: 8.436
Standard deviation: 0.472

This PR fixes the regular expression and uses iterparse to read the file faster. It reads only the information that is needed, instead of parsing the whole file to extract the information from the first Sequence. This reduces the xml metadata extraction time drastically:

Execution times: ['0.011', '0.002', '0.002', '0.002', '0.002']
Average time: 0.004
Standard deviation: 0.004

So a thousand-fold improvement with large xml files.
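
For reference, the core of the change is the iterparse pattern sketched below (a minimal sketch, not the PR's exact code; the function name and the early-exit condition are illustrative):

import xml.etree.ElementTree as ET

def read_first_sequence(xml_file_path):
    # Stream through the document and stop as soon as the first <Sequence>
    # element has been read completely, instead of loading the whole ~30 MiB
    # file into memory with ET.parse().
    for _event, element in ET.iterparse(xml_file_path, events=("end",)):
        if element.tag == "Sequence":
            return element  # the rest of the file is never parsed
    return None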

The script:

import timeit
from pathlib import Path
from roiextractors.extractors.tiffimagingextractors.brukertiffimagingextractor import BrukerTiffMultiPlaneImagingExtractor
from statistics import mean, stdev

# Define the folder path
folder_path = Path("/media/heberto/One Touch/Clandinin-CN-data-share/brezovec_example_data/imports/20200228/fly3/func_0/TSeries-02282020-0943-001")
assert folder_path.is_dir()

# Define the function to be timed
get_streams = lambda: BrukerTiffMultiPlaneImagingExtractor.get_streams(folder_path=folder_path)
# streams  = get_streams()
# print(f"{streams=}")

# Time the execution
num_executions = 5
times = timeit.repeat(get_streams, number=1, repeat=num_executions)

# Calculate and format the results
average_time = mean(times)
std_dev_time = stdev(times)
formatted_times = [f"{t:.3f}" for t in times]

print("Execution times:", formatted_times)
print("Average time: {:.3f}".format(average_time))
print("Standard deviation: {:.3f}".format(std_dev_time))

@h-mayorquin h-mayorquin self-assigned this Dec 5, 2023
@h-mayorquin h-mayorquin marked this pull request as ready for review December 5, 2023 11:35
@h-mayorquin h-mayorquin changed the title from "Improve Bruker xml parsing by using iterparse" to "Improve Bruker xml parsing by using iterparse instead of parsing the complete xml file" Dec 5, 2023
@CodyCBakerPhD (Member)

Leaving this one to @alessandratrapani. I would ask, though, for testing of the regex: is it possible to have the header file included in the example files for the format, so that we have a diversity of examples to consider and test against?

@h-mayorquin (Collaborator, Author)

For the header, are you talking about the xml file?

Two points:

  1. It is 30 MiB. We could stub it, though, to only have a couple of sequences.
  2. It is from the Clandinin group, so we should ask.

It would probably be better to just generate a fake xml file with the same structure but new values.

@CodyCBakerPhD (Member)

> For the header, are you talking about the xml file?

Yes

> It is 30 MiB. We could stub it, though, to only have a couple of sequences.

I believe we stubbed the ones currently on the GIN as well; the structure of the XML makes it pretty easy to clip the sequence at a specific point. I assume we only need a few frames to get the point of this change across in the tests.
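
For example, clipping could look something like this (a rough sketch, not tested against a real file; the tag names follow the Bruker layout discussed above and the file paths are made up):

import xml.etree.ElementTree as ET

# Keep only the first few <Frame> children of each <Sequence> so that the
# example file stays small while preserving the overall structure.
tree = ET.parse("TSeries-example.xml")
root = tree.getroot()
for sequence in root.findall(".//Sequence"):
    for extra_frame in sequence.findall("Frame")[3:]:
        sequence.remove(extra_frame)
tree.write("TSeries-example-stubbed.xml")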

> It is from the Clandinin group, so we should ask.

Yep, definitely ask.

> It would probably be better to just generate a fake xml file with the same structure but new values.

Not sure how that's easier or better than stubbing the existing file, other than introducing additional sources of spurious errors; I'd always trust an existing file that's been clipped at a point over something generated synthetically.

@h-mayorquin (Collaborator, Author)

> Not sure how that's easier or better than stubbing the existing file, other than introducing additional sources of spurious errors; I'd always trust an existing file that's been clipped at a point over something generated synthetically.

It is faster and easier for me. I agree on the risk of making a mistake, though. Let's see what @alessandratrapani wants.

@alessandratrapani (Collaborator) left a comment

LGTM

@alessandratrapani (Collaborator)

I don't have a strong preference, but I would try to avoid generating additional errors as much as possible.

@h-mayorquin (Collaborator, Author)

@alessandratrapani Thanks for taking a look. I asked these guys already (I think you were on the email?), so if they reply I will add the stubbed version at some point.


codecov bot commented Dec 19, 2023

Codecov Report

Merging #267 (518baca) into main (2c08f65) will increase coverage by 0.12%.
The diff coverage is 96.87%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #267      +/-   ##
==========================================
+ Coverage   79.00%   79.13%   +0.12%     
==========================================
  Files          39       39              
  Lines        3030     3053      +23     
==========================================
+ Hits         2394     2416      +22     
- Misses        636      637       +1     
Flag        Coverage Δ
unittests   79.13% <96.87%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown.

Files                                                    Coverage Δ
...iffimagingextractors/brukertiffimagingextractor.py    95.70% <96.87%> (-0.01%) ⬇️

@h-mayorquin h-mayorquin merged commit 70b629d into main Dec 19, 2023
15 checks passed
@h-mayorquin h-mayorquin deleted the improve_bruker_tiff branch December 19, 2023 14:27