Improve Bruker xml parsing by using iterparse instead of parsing the complete xml file #267
Leaving this one to @alessandratrapani. I would ask, though: for testing the regex, is it possible to include the header file in the example files for the format, so we have a diversity of examples there to consider and test against?
For the header, are you talking about the xml file? Two points:
Probably better to just generate a fake xml file with the same structure but new values.
Yes
I believe we stubbed the ones currently on the GIN as well; the structure of the XML makes it pretty easy to clip the sequence at a specific point. Only need a few frames I assume to get the novelty of this point across for the tests
Yep, definitely ask
Not sure how that's easier or better than stubbing the existing file other than introducing additional sources of spurious errors; I'd always trust an existing file that's been clipped at a point more than something generated synthetically |
It is faster and easier for me. Agree on the risk of making a mistake though. Let's see what @alessandratrapani wants.
LGTM
I don't have a strong preference, but I would try, as much as possible, to avoid introducing additional errors.
@alessandratrapani |
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #267      +/-   ##
==========================================
+ Coverage   79.00%   79.13%   +0.12%
==========================================
  Files          39       39
  Lines        3030     3053      +23
==========================================
+ Hits         2394     2416      +22
- Misses        636      637       +1
Currently the Bruker parsing of streams is failing for the files from the Clandinin conversion. The part that fails is the plane extraction, as the regular expression does not match properly (the script that reproduces the failure is at the bottom):
roiextractors/src/roiextractors/extractors/tiffimagingextractors/brukertiffimagingextractor.py
Lines 120 to 130 in b33f00c
Even worse, reading large xml files with the current setup is terribly slow. Even with the failing part above commented out, this is the current reading speed for a ~30 MiB file:
(the script, added at the bottom, does this five times)
This PR fixes the regular expression and uses iterparse to read the file faster. Instead of parsing the whole file to extract the information from the first Sequence, it reads only the information that is needed. This reduces xml metadata extraction time drastically:
So a thousand-fold improvement with large xml files.
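For readers unfamiliar with the approach, here is a minimal sketch of early-exit parsing with `xml.etree.ElementTree.iterparse`. It is not the code from this PR: the helper name `first_sequence_file_names` is hypothetical, and the element and attribute names (`Sequence`, `Frame`, `File`, `filename`) as well as the sample file names are assumptions made for illustration. The idea is to stop iterating as soon as the first `Sequence` element closes, so the rest of the file never has to be walked.

```python
import io
import xml.etree.ElementTree as ET


def first_sequence_file_names(xml_source):
    """Collect the `filename` attribute of every <File> in the first <Sequence>.

    Hypothetical helper; the tag and attribute names are assumptions.
    """
    file_names = []
    # "end" events fire once an element and all of its children are complete.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "File":
            file_names.append(elem.attrib["filename"])
        elif elem.tag == "Frame":
            # The frame's children were already handled; release them.
            elem.clear()
        elif elem.tag == "Sequence":
            # First sequence is complete: stop here instead of walking
            # the remainder of the document.
            break
    return file_names


# Tiny in-memory stand-in for a much larger xml file (made-up values).
sample = b"""<PVScan>
  <Sequence cycle="1">
    <Frame index="1"><File channel="1" filename="a_Cycle00001_Ch1_000001.ome.tif"/></Frame>
    <Frame index="2"><File channel="1" filename="a_Cycle00001_Ch1_000002.ome.tif"/></Frame>
  </Sequence>
  <Sequence cycle="2">
    <Frame index="1"><File channel="1" filename="a_Cycle00002_Ch1_000001.ome.tif"/></Frame>
  </Sequence>
</PVScan>"""

names = first_sequence_file_names(io.BytesIO(sample))
print(names)
```

Only the two file names from the first `Sequence` are collected; the second `Sequence` is never visited by the loop. `iterparse` accepts either a file name or a file object, so the same helper could be pointed at a file on disk.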
The script: