Speeding up .ang file reader with pandas #417
Comments
This is a nice speedup! Do you see the same improvement with all supported file types in orix? If so then I think it is worth considering. Another option would be to not have pandas as an explicit dependency, but use pandas if installed. I think this would be doable as it seems it would only affect the io module. @hakonanes what do you think?
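A minimal sketch of the "use pandas if installed" idea, assuming the reader only needs a 2D float array from a whitespace-delimited table with "#" comment lines. The helper name read_table is hypothetical and not part of orix.

```python
from io import StringIO

import numpy as np

try:  # pandas is treated as optional here
    import pandas as pd
except ImportError:
    pd = None


def read_table(source):
    """Read a whitespace-delimited numeric table, skipping '#' comment lines.

    Hypothetical helper for illustration only, not an orix function.
    """
    if pd is not None:
        frame = pd.read_csv(source, comment="#", header=None, sep=r"\s+")
        return frame.to_numpy(dtype=float)
    # Fallback: np.loadtxt skips '#' comments and splits on whitespace by default
    return np.loadtxt(source)


sample = "# header line\n1.0 2.0 3.0\n4.0 5.0 6.0\n"
data = read_table(StringIO(sample))
print(data.shape)
```

Either branch returns the same array for this input, so the rest of the io module would not need to know which library did the parsing.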
Thank you for looking into speeding up reading, @argerlt. Whether NumPy or pandas is fastest seems to me to depend on file size and/or machine architecture, since I find NumPy to be fastest. Here are timings for reading the AF96 dataset file Field of view 1_EBSD data_Raw.ang (seven runs of five loops each) with each reader:
# file_data = np.loadtxt(filename)
>>> np.__version__
'1.23.5'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.78 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
# file_data = pd.read_csv(filename, comment="#", header=None, sep="\s+").to_numpy()
>>> pd.__version__
'1.5.2'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.91 s ± 67 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
I suggest we continue using NumPy until more people report a speed-up > 1.7x on their machines.
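For anyone who wants to compare the two readers on their own machine without the AF96 file, a rough sketch along these lines might do. It assumes a synthetic table with one "#" header line as a stand-in for a real .ang file, so absolute timings will not match the numbers above.

```python
import os
import shutil
import tempfile
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for an .ang file: one '#' header line, then numeric columns
rng = np.random.default_rng(0)
table = rng.random((20_000, 10))
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "synthetic.ang")
np.savetxt(path, table, fmt="%.5f", header="synthetic data, not a real .ang header")


def read_numpy():
    return np.loadtxt(path)


def read_pandas():
    return pd.read_csv(path, comment="#", header=None, sep=r"\s+").to_numpy()


# Best of three single reads for each reader
t_numpy = min(timeit.repeat(read_numpy, number=1, repeat=3))
t_pandas = min(timeit.repeat(read_pandas, number=1, repeat=3))
print(f"NumPy {np.__version__}: {t_numpy:.3f} s")
print(f"pandas {pd.__version__}: {t_pandas:.3f} s")

# Both readers should produce the same array
same = np.allclose(read_numpy(), read_pandas())
shutil.rmtree(tmpdir)
```

Printing the library versions next to the timings makes results from different machines easier to compare.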
Quick update:
This completely negates the need for pandas. The only thing that separated pandas' reader from NumPy's was the use of a Cythonic reader, which NumPy now has. Side note: I've been using this speedup trick since 2017, and it's exciting to see that it's finally obsolete. I'm closing this, and just adding the comment that setting the required NumPy version to >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more.
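To see whether a given environment already has the faster reader, a check like the one below could work. It is only a sketch: version_tuple is a hypothetical helper, and it assumes the version string starts with "major.minor".

```python
import numpy as np


def version_tuple(version):
    """Return (major, minor) from a version string such as '1.23.5'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# True when the installed NumPy is at least 1.23
has_fast_loadtxt = version_tuple(np.__version__) >= (1, 23)
print(np.__version__, has_fast_loadtxt)
```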
There it is, thank you for searching NumPy's changelog for the cause of the discrepancy.
Good point. I myself regularly update my environment's packages to use the latest versions. Side note: Currently we don't have a lower bound on the NumPy version, but require Matplotlib >= 3.3, which requires NumPy >= 1.19. Requiring >= 1.23 would be too restrictive at the moment, I think, since this is the current minor release.
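This could probably be tacked on to #416, but changing lines 88 and 89 of io\plugins\ang.py from the NumPy read (np.loadtxt(filename)) to a pandas read (pd.read_csv(filename, comment="#", header=None, sep="\s+").to_numpy()) instead roughly doubles the read speed of io.load. Here is a code snippet using the AF96 datasets to show what I mean.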
Really a question of "is adding pandas to orix's dependencies worth the 2x speedup?"