
Improve Bruker xml parsing by using iterparse instead of parsing the complete xml file #267

Merged — 6 commits merged into main from improve_bruker_tiff on Dec 19, 2023

Conversation

@h-mayorquin (Collaborator) commented Dec 5, 2023

Currently, the Bruker parsing of streams is failing for the files from the Clandinin conversion. The part that fails is the plane extraction, as the regular expression does not match properly (the script that generates the failure is at the bottom):

Traceback (most recent call last):
  File "/home/heberto/development/roiextractors/build/dev.py", line 12, in <module>
    streams  = get_streams()
  File "/home/heberto/development/roiextractors/build/dev.py", line 11, in <lambda>
    get_streams = lambda: BrukerTiffMultiPlaneImagingExtractor.get_streams(folder_path=folder_path)
  File "/home/heberto/development/roiextractors/src/roiextractors/extractors/tiffimagingextractors/brukertiffimagingextractor.py", line 123, in get_streams
    plane_stream_names = [
  File "/home/heberto/development/roiextractors/src/roiextractors/extractors/tiffimagingextractors/brukertiffimagingextractor.py", line 124, in <listcomp>
    re.search(plane_naming_pattern, file.attrib["filename"])["stream_name"]
TypeError: 'NoneType' object is not subscriptable

unique_channel_ids = natsort.natsorted(set(channel_ids))
for channel_id, channel_name in zip(unique_channel_ids, unique_channel_names):
    plane_naming_pattern = rf"(?P<stream_name>Ch{channel_id}_\d+)"
    plane_stream_names = [
        re.search(plane_naming_pattern, file.attrib["filename"])["stream_name"]
        for file in xml_root.findall(f".//File")
    ]
    unique_plane_stream_names = natsort.natsorted(set(plane_stream_names))
    streams["plane_streams"][channel_name] = unique_plane_stream_names
return streams
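
The failure happens because re.search returns None when the pattern does not match a filename, and subscripting None raises the TypeError above. A minimal reproduction (the filename here is made up for illustration):

import re

plane_naming_pattern = r"(?P<stream_name>Ch1_\d+)"
# The pattern expects a literal "Ch1_" followed by digits somewhere in the name.
match = re.search(plane_naming_pattern, "ChanA_Cycle00001_000001.ome.tif")
print(match)  # None: there is no "Ch1_<digits>" substring in this filename
match["stream_name"]  # TypeError: 'NoneType' object is not subscriptable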

Even worse, reading large xml files with the current setup is terribly slow. Even with the failing part above commented out, this is the current reading speed for a ~30 MiB file:

(the script, included at the bottom, runs this five times)

Execution times: ['9.278', '8.280', '8.254', '8.178', '8.191']
Average time: 8.436
Standard deviation: 0.472

This PR fixes the regular expression and uses iterparse to read the file faster. It reads only the information that is needed, instead of parsing the whole file to extract the information from the first Sequence. This reduces the xml metadata extraction time drastically:

Execution times: ['0.011', '0.002', '0.002', '0.002', '0.002']
Average time: 0.004
Standard deviation: 0.004

So a thousand-fold improvement with large xml files.
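
For reference, the core of the change is the iterparse pattern sketched below (a minimal sketch, not the PR's exact code; the function name and the early-exit condition are illustrative):

import xml.etree.ElementTree as ET

def read_first_sequence(xml_file_path):
    # Stream through the document and stop as soon as the first <Sequence>
    # element has been read completely, instead of loading the whole ~30 MiB
    # file into memory with ET.parse().
    for _event, element in ET.iterparse(xml_file_path, events=("end",)):
        if element.tag == "Sequence":
            return element  # the rest of the file is never parsed
    return None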

The script:

import timeit
from pathlib import Path
from roiextractors.extractors.tiffimagingextractors.brukertiffimagingextractor import BrukerTiffMultiPlaneImagingExtractor
from statistics import mean, stdev

# Define the folder path
folder_path = Path("/media/heberto/One Touch/Clandinin-CN-data-share/brezovec_example_data/imports/20200228/fly3/func_0/TSeries-02282020-0943-001")
assert folder_path.is_dir()

# Define the function to be timed
get_streams = lambda: BrukerTiffMultiPlaneImagingExtractor.get_streams(folder_path=folder_path)
# streams  = get_streams()
# print(f"{streams=}")

# Time the execution
num_executions = 5
times = timeit.repeat(get_streams, number=1, repeat=num_executions)

# Calculate and format the results
average_time = mean(times)
std_dev_time = stdev(times)
formatted_times = [f"{t:.3f}" for t in times]

print("Execution times:", formatted_times)
print("Average time: {:.3f}".format(average_time))
print("Standard deviation: {:.3f}".format(std_dev_time))

@h-mayorquin h-mayorquin self-assigned this Dec 5, 2023
@h-mayorquin h-mayorquin marked this pull request as ready for review December 5, 2023 11:35
@h-mayorquin h-mayorquin changed the title from "Improve Bruker xml parsing by using iterparse" to "Improve Bruker xml parsing by using iterparse instead of parsing the complete xml file" Dec 5, 2023
@CodyCBakerPhD (Member)

Leaving this one to @alessandratrapani. I would ask, though, for testing of the regex: is it possible to have the header file included in the example files for the format, so that we have a diversity of examples to consider and test against?

@h-mayorquin (Collaborator, Author)

For the header, are you talking about the xml file?

Two points:

  1. It is 30 MiB. We could stub it, though, to only have a couple of sequences.
  2. It is from the Clandinin group, so we should ask.

It would probably be better to just generate a fake xml file with the same structure but new values.

@CodyCBakerPhD (Member)

> For the header, are you talking about the xml file?

Yes

> It is 30 MiB. We could stub it, though, to only have a couple of sequences.

I believe we stubbed the ones currently on the GIN as well; the structure of the XML makes it pretty easy to clip the sequence at a specific point. I assume we only need a few frames to get the point of this change across in the tests.
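
For example, clipping could look something like this (a rough sketch, not tested against a real file; the tag names follow the Bruker layout discussed above and the file paths are made up):

import xml.etree.ElementTree as ET

# Keep only the first few <Frame> children of each <Sequence> so that the
# example file stays small while preserving the overall structure.
tree = ET.parse("TSeries-example.xml")
root = tree.getroot()
for sequence in root.findall(".//Sequence"):
    for extra_frame in sequence.findall("Frame")[3:]:
        sequence.remove(extra_frame)
tree.write("TSeries-example-stubbed.xml")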

> It is from the Clandinin group, so we should ask.

Yep, definitely ask.

> It would probably be better to just generate a fake xml file with the same structure but new values.

Not sure how that's easier or better than stubbing the existing file, other than introducing additional sources of spurious errors; I'd always trust an existing file that's been clipped at a point over something generated synthetically.

@h-mayorquin (Collaborator, Author)

> Not sure how that's easier or better than stubbing the existing file, other than introducing additional sources of spurious errors; I'd always trust an existing file that's been clipped at a point over something generated synthetically.

It is faster and easier for me. I agree on the risk of making a mistake, though. Let's see what @alessandratrapani wants.

@alessandratrapani (Collaborator) left a comment

LGTM

@alessandratrapani (Collaborator)

I don't have a strong preference, but I would try to avoid generating additional errors as much as possible.

@h-mayorquin (Collaborator, Author)

@alessandratrapani Thanks for taking a look. I asked these guys already (I think you were on the email?), so if they reply I will add the stubbed version at some point.


codecov bot commented Dec 19, 2023

Codecov Report

Merging #267 (518baca) into main (2c08f65) will increase coverage by 0.12%.
The diff coverage is 96.87%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #267      +/-   ##
==========================================
+ Coverage   79.00%   79.13%   +0.12%     
==========================================
  Files          39       39              
  Lines        3030     3053      +23     
==========================================
+ Hits         2394     2416      +22     
- Misses        636      637       +1     
Flag        Coverage Δ
unittests   79.13% <96.87%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown.

Files                                                    Coverage Δ
...iffimagingextractors/brukertiffimagingextractor.py    95.70% <96.87%> (-0.01%) ⬇️

@h-mayorquin h-mayorquin merged commit 70b629d into main Dec 19, 2023
15 checks passed
@h-mayorquin h-mayorquin deleted the improve_bruker_tiff branch December 19, 2023 14:27