Slow read times with multi-time-step file using Python bindings #168
Michael,
Can you make your file available for us at OLCF or NERSC?
Is this a single bp file or a directory with many subfiles? This is the
diagnostics written by a single process, right?
I just made a test file on my VM with 200 variables and 7000 steps (each
variable is a 5-by-5 2D array), and the read time is fast:
AdiosVar (varid=7, name='v001', dtype=dtype('int32'), ndim=2, dims=(5L, 5L), nsteps=7000, attrs=[])
0.37046790123
This is my Python test reader:
#!/usr/bin/python
import adios
from timeit import default_timer as timer

f = adios.file('many_vars.bp')   # open the test bp file
v = f.var['v001']                # pick one variable
print v                          # show its metadata (dims, nsteps, ...)
s = timer()
data = v.read()                  # read all 7000 steps of the variable
e = timer()
print(e - s)                     # elapsed read time in seconds
f.close()
Thanks
Norbert
On Wed, Feb 28, 2018 at 4:35 PM, Michael Churchill wrote:
I have a bp file that has multiple (~100s of) 1D arrays written every
timestep, with a total of ~7000 timesteps. Using the Python bindings,
reading a single variable is pretty slow:
import adios as ad
f = ad.file(file)        # `file` holds the path to the bp file
key = f.var.keys()[0]
print key, f[key]
e_radial_mom_flux_ExB_df_avg AdiosVar (varid=107, dtype=dtype('float64'), ndim=1, dims=(167L,), nsteps=6961)
%time data = f[key][...]
CPU times: user 394 ms, sys: 942 ms, total: 1.34 s
Wall time: 1min 3s
If I convert using bp2h5, the conversion takes a long time (~30min), but
the reading is much faster:
import h5py
f = h5py.File(file)
print f[key]
<HDF5 dataset "e_radial_mom_flux_ExB_df_avg": shape (6961, 167), type "<f8">
%time data = f[key][...]
CPU times: user 7.65 ms, sys: 1.12 ms, total: 8.77 ms
Wall time: 8.78 ms
I assume this is because the h5 file has the data in a 2d array format,
whereas in the original bp file, the data for a single variable may not be
contiguous due to the timestepping. Is there any way to improve this
situation, either by changing the way the file is written, or changing how
I read the data with Python? I often want to read in all of the data from
the file, but this takes a long time, even though it's not much data.
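For the "read everything" use case mentioned above, all variables can be pulled in one pass with the same calls already shown in this thread; a minimal sketch, assuming only adios.file, f.var, and v.read(), with Michael's file name:

#!/usr/bin/python
# Read every variable in the bp file once, into a dict keyed by name.
# Uses only the adios.file / f.var / v.read() calls shown in this thread.
import adios

f = adios.file('xgc.oneddiag.bp')
data = {}
for name in f.var.keys():
    data[name] = f.var[name].read()   # all steps of one variable
f.close()

On a cold cache, each read() pays the per-step seek cost discussed below, so this loop roughly multiplies the one-variable read time by the number of variables.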
The data is on Edison, /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp
Okay, I see. ADIOS does not cache it; it is the system that caches the data.
With the current file format and read implementation, there are 7000
consecutive seeks and reads to get the array with all steps, and this is
slow on remote disks. The next time, it's reading from the cache and is
much faster.
I wonder where the HDF5 file was when you got the data in a few
milliseconds.
On Mon, Mar 5, 2018 at 11:28 AM, Michael Churchill wrote:
The data is on Edison, /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp
It's a single file (yes, written by a single process), no directories with
subfiles (total size is only about 1.5 GB). I found that on the 2nd read of
the same data, the read time drops to 0.5 s; I'm not sure if this is caching
done by Edison or ADIOS.
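One way to separate filesystem caching from anything ADIOS does is to time the same full read twice in a single process; a minimal sketch, reusing only calls from the reader above (the variable name is from Michael's listing):

#!/usr/bin/python
# Time the same full read twice: the first read goes to the remote disk
# (one seek+read per step), the second should be served from the OS page cache.
import adios
from timeit import default_timer as timer

f = adios.file('xgc.oneddiag.bp')
v = f.var['e_radial_mom_flux_ExB_df_avg']
for attempt in (1, 2):
    s = timer()
    data = v.read()                         # all ~7000 steps of the variable
    e = timer()
    print '%d: %.3f s' % (attempt, e - s)   # expect slow, then fast
f.close()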
HDF5 is in the same location; you can try it there also (just the h5 suffix instead of bp).
I meant, was it in cache or not.
On Mon, Mar 5, 2018 at 1:01 PM, Michael Churchill wrote:
HDF5 is in the same location; you can try it there also (just the h5 suffix instead of bp).
I don't think so. I tried both the bp and h5 today, after having accessed them last week (so I assume both were out of cache by now). Both had the same read timings as before, and had the same characteristic that the 2nd read of the same data would take much less time (suggesting it was cached). The HDF5 data took about 100 ms to read on the first read, whereas the bp file took 1 minute on the first read.
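Given those timings, one practical workaround is to pay the seek cost once and convert to HDF5 directly in Python rather than via bp2h5; a sketch assuming only the adios and h5py calls already used in this thread (the output file name is illustrative):

#!/usr/bin/python
# One-time bp -> h5 conversion: read each variable once (all steps), then
# store it contiguously in HDF5 so later reads are millisecond-fast.
import adios
import h5py

fin = adios.file('xgc.oneddiag.bp')
fout = h5py.File('xgc.oneddiag.h5', 'w')   # output name is illustrative
for name in fin.var.keys():
    fout.create_dataset(name, data=fin.var[name].read())
fout.close()
fin.close()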