
Slow read times with multi-time-step file using Python bindings #168

Open
rmchurch opened this issue Feb 28, 2018 · 6 comments

@rmchurch

rmchurch commented Feb 28, 2018

I have a bp file that has multiple (hundreds of) 1D arrays written every timestep, with a total of ~7000 timesteps. Using the Python bindings, reading a single variable is pretty slow:

f = ad.file(file)
key = f.var.keys()[0]
print key, f[key]
# e_radial_mom_flux_ExB_df_avg AdiosVar (varid=107, dtype=dtype('float64'), ndim=1, dims=(167L,), nsteps=6961)
%time data = f[key][...]
# CPU times: user 394 ms, sys: 942 ms, total: 1.34 s
# Wall time: 1min 3s

If I convert the file using bp2h5, the conversion takes a long time (~30 min), but reading it back is much faster:

f = h5py.File(file)
print f[key]
# <HDF5 dataset "e_radial_mom_flux_ExB_df_avg": shape (6961, 167), type "<f8">
%time data = f[key][...]
# CPU times: user 7.65 ms, sys: 1.12 ms, total: 8.77 ms
# Wall time: 8.78 ms
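
For reference, roughly the same comparison as a plain script outside IPython (a sketch; the file paths here are placeholders for the actual bp/h5 pair):

import time
import adios as ad
import h5py

key = "e_radial_mom_flux_ExB_df_avg"

t0 = time.time()
f = ad.file("xgc.oneddiag.bp")            # placeholder path
bp_data = f[key][...]
f.close()
print("bp read: %.2f s" % (time.time() - t0))

t0 = time.time()
h = h5py.File("xgc.oneddiag.h5", "r")
h5_data = h[key][...]
h.close()
print("h5 read: %.2f s" % (time.time() - t0))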

I assume this is because the h5 file stores the data as a 2D array, whereas in the original bp file the data for a single variable may not be contiguous, due to the timestepping. Is there any way to improve this situation, either by changing the way the file is written, or by changing how I read the data with Python? I often want to read in all of the data from the file, but this takes a long time, even though it's not much data.
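
One read-side workaround might be to pull the history in blocks of steps rather than with a single f[key][...] request; a minimal sketch, assuming the read(from_steps=..., nsteps=...) signature of the ADIOS 1.x numpy bindings (the block size of 500 is arbitrary, and the reshape guards against a possible leading length-1 step axis). Whether this is actually faster depends on how the bindings map the one big request onto the per-step chunks:

import numpy as np
import adios as ad

f = ad.file("xgc.oneddiag.bp")            # placeholder path
v = f["e_radial_mom_flux_ExB_df_avg"]
dims = tuple(int(d) for d in v.dims)      # (167,) in the example above

# Preallocate (nsteps, ...) and fill it block-by-block along the time axis.
data = np.empty((v.nsteps,) + dims, dtype=v.dtype)
block = 500                               # arbitrary block size
for start in range(0, v.nsteps, block):
    n = min(block, v.nsteps - start)
    chunk = v.read(from_steps=start, nsteps=n)   # assumed keyword args
    data[start:start + n] = np.asarray(chunk).reshape((n,) + dims)
f.close()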

@pnorbert
Contributor

pnorbert commented Mar 5, 2018 via email

@rmchurch
Author

rmchurch commented Mar 5, 2018

The data is on Edison, at /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp
It's a single file (yes, written by a single process), with no directories of subfiles (the total size is only about 1.5 GB). I found that on the 2nd read of the same data the read time drops to 0.5 s; I'm not sure whether this is caching done by Edison or by ADIOS.
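
A quick way to check whether the speedup survives reopening the file is to time the same read twice from one script (a sketch, reusing only the calls shown above):

import time
import adios as ad

def timed_read(path, key):
    t0 = time.time()
    f = ad.file(path)
    data = f[key][...]
    f.close()
    return time.time() - t0

path = "/scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp"
key = "e_radial_mom_flux_ExB_df_avg"
print("cold: %.2f s" % timed_read(path, key))  # first read, likely from disk
print("warm: %.2f s" % timed_read(path, key))  # repeat read, likely from the OS page cache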

@pnorbert
Contributor

pnorbert commented Mar 5, 2018 via email

@rmchurch
Author

rmchurch commented Mar 5, 2018

The HDF5 file is in the same location, so you can try it there as well (same name, just an .h5 suffix instead of .bp).

@pnorbert
Contributor

pnorbert commented Mar 5, 2018 via email

@rmchurch
Author

rmchurch commented Mar 5, 2018

I don't think so. I tried both the bp and h5 files today, after having last accessed them last week (so I assume both were out of cache by now). Both had the same read timings as before, and both showed the same characteristic that a 2nd read of the same data takes much less time (suggesting it was cached). The HDF5 data took about 100 ms on the first read, whereas the bp file took about 1 minute on the first read.
