Speeding up .ang file reader with pandas #417

Closed
argerlt opened this issue Dec 8, 2022 · 4 comments

Comments

@argerlt
Contributor

argerlt commented Dec 8, 2022

This could probably be tacked on to #416, but changing lines 88 and 89 of io\plugins\ang.py from:

    # Read all file data
    file_data = np.loadtxt(filename)

to instead:

    # Read all file data
    import pandas as pd
    file_data = pd.read_csv(filename, comment="#", header=None, sep=r"\s+").to_numpy()

makes the file read in io.load roughly 1.7x faster on my machine.

Here is a code snippet using the AF96 datasets to show what I mean.

import pandas as pd
import numpy as np
import time

tic = time.time()
# the numpy way
A = np.loadtxt("4D-XIII-A_cleaned.ang")
A_toc = time.time()-tic

tic = time.time()
# the pandas way, which is then converted to a numpy array
B = pd.read_csv("4D-XIII-A_cleaned.ang", comment='#', header=None, sep=r"\s+")
C = B.to_numpy()
B_toc = time.time()-tic

print(A_toc)
print(B_toc)

>>> 13.67614459991455
>>> 8.081970691680908

So it is really a question of "is adding pandas to orix's dependencies worth a roughly 1.7x speedup?"

@harripj
Collaborator

harripj commented Dec 8, 2022

This is a nice speedup! Do you see the same improvement with all supported file types in orix? If so then I think it is worth considering.

Another option would be to not have pandas as an explicit dependency, but use pandas if installed. I think this would be doable as it seems it would only affect the io module.
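A minimal sketch of the "use pandas if installed" idea, assuming a whitespace-delimited file whose header lines start with `#`. The helper name `read_ang_table` is hypothetical and is not part of orix; the two reader calls are the ones quoted earlier in this thread.

```python
import numpy as np

try:
    import pandas as pd  # optional: only used when it is installed
except ImportError:
    pd = None


def read_ang_table(filename):
    """Return the numeric block of a whitespace-delimited file as a 2D float array.

    Hypothetical helper, not orix API. Lines starting with '#' are skipped.
    """
    if pd is not None:
        # pandas path: parse with read_csv, then convert to a NumPy array
        frame = pd.read_csv(filename, comment="#", header=None, sep=r"\s+")
        return frame.to_numpy(dtype=float)
    # Fallback path: np.loadtxt skips '#' comment lines by default
    return np.loadtxt(filename, ndmin=2)
```

Both branches return a plain NumPy array, so the rest of a reader would not need to know which library did the parsing.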

@hakonanes what do you think?

@hakonanes
Member

Thank you for looking into speeding up reading, @argerlt.

Whether NumPy or pandas is faster seems to me to depend on file size and/or machine architecture, since I find NumPy to be slightly faster. Reading the AF96 dataset file Field of view 1_EBSD data_Raw.ang with %timeit (seven runs of five loops each) gives me the following results:

# file_data = np.loadtxt(filename)
>>> np.__version__
'1.23.5'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.78 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

# file_data = pd.read_csv(filename, comment="#", header=None, sep=r"\s+").to_numpy()
>>> pd.__version__
'1.5.2'
>>> %timeit -n 5 xmap = io.load("Field of view 1_EBSD data_Raw.ang")
2.91 s ± 67 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

I suggest we continue using NumPy until more people report a speed-up of more than 1.7x on their machines.
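For anyone who wants to report numbers from their own machine, here is a rough sketch of a repeatable comparison. It times both readers on a synthetic file, which is only a stand-in for a real .ang file; the function names, row count, and column count are assumptions chosen for illustration, and pandas is treated as optional.

```python
import os
import tempfile
import timeit

import numpy as np

try:
    import pandas as pd
except ImportError:  # pandas is optional for this comparison
    pd = None


def write_synthetic_file(path, n_rows=20_000, n_cols=10, seed=0):
    """Write random floats with a '#' header line (stand-in for a real .ang file)."""
    rng = np.random.default_rng(seed)
    np.savetxt(path, rng.random((n_rows, n_cols)), fmt="%.5f", header="synthetic data")


def time_readers(path, repeat=3):
    """Return the best-of-`repeat` wall time in seconds for each available reader."""
    readers = {"numpy": lambda: np.loadtxt(path)}
    if pd is not None:
        readers["pandas"] = lambda: pd.read_csv(
            path, comment="#", header=None, sep=r"\s+"
        ).to_numpy()
    return {
        name: min(timeit.repeat(reader, number=1, repeat=repeat))
        for name, reader in readers.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "synthetic.ang")
        write_synthetic_file(path)
        print("numpy", np.__version__, time_readers(path))
```

Reporting the NumPy version next to the timings matters here, since the result may differ a lot between releases.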

@argerlt
Contributor Author

argerlt commented Dec 8, 2022

Quick update:
Looking at your speeds, it surprised me just how much faster your numpy was compared to mine. Then I found this bullet point in the changelog for numpy 1.23.0:

... The highlights are:

  • Implementation of loadtxt in C, greatly improving its performance.
  • Exposing DLPack at the Python level for easy data exchange.
  • Changes to the promotion and comparisons of structured dtypes.
  • Improvements to f2py.

This completely negates the need for pandas. The only thing that separated pandas' reader from numpy's was its compiled C parser, which numpy now has as well. Side note: I've been using this speedup trick since 2017, and it's exciting to see that it's finally obsolete.

I'm closing this, and just adding the comment that setting the required numpy package to >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more.
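A small sketch of how one might check whether the installed NumPy already includes the C implementation of loadtxt. The helper `has_c_loadtxt` is hypothetical, and it assumes the version string starts with numeric `major.minor` components, as NumPy release versions do.

```python
from importlib.metadata import PackageNotFoundError, version


def has_c_loadtxt(numpy_version):
    """Return True if the version string is >= 1.23, the release that moved loadtxt to C."""
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) >= (1, 23)


try:
    installed = version("numpy")
    status = "C loadtxt" if has_c_loadtxt(installed) else "pre-1.23 loadtxt"
    print(installed, status)
except PackageNotFoundError:
    print("NumPy is not installed")
```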

@argerlt argerlt closed this as completed Dec 8, 2022
@hakonanes
Member

There it is, thank you for searching NumPy's changelog for the cause of the discrepancy.

"... adding the comment that setting the required numpy package to >= 1.23 will automatically speed up orix.io.load by roughly a factor of 2 or more."

Good point. I myself regularly update my environment's packages to use the latest versions.

Side note: Currently we don't have a lower bound on the NumPy version, but require Matplotlib >= 3.3, which requires NumPy >= 1.19. Requiring >= 1.23 would be too restrictive at the moment, I think, since this is the current minor release.
