Slow read times with multi-time-step file using Python bindings #168
Michael,
Can you make your file available for us at OLCF or NERSC?
Is this a single bp file or a directory with many subfiles? This is the
diagnostics written by a single process, right?
I just made a test file on my VM with 200 variables and 7000 steps (each
variable is a 5-by-5 2D array), and the read time is fast:
AdiosVar (varid=7, name='v001', dtype=dtype('int32'), ndim=2, dims=(5L, 5L), nsteps=7000, attrs=[])
0.37046790123
This is my Python test reader:
#!/usr/bin/python
import adios
from timeit import default_timer as timer

f = adios.file('many_vars.bp')   # open the test bp file
v = f.var['v001']                # pick one variable
print v                          # show its metadata (dims, nsteps, ...)
s = timer()
data = v.read()                  # read all 7000 steps of the variable
e = timer()
print(e - s)                     # elapsed read time in seconds
f.close()
Thanks
Norbert
On Wed, Feb 28, 2018 at 4:35 PM, Michael Churchill wrote:
I have a bp file that has multiple (~100s of) 1D arrays written every
timestep, with a total of ~7000 timesteps. Using the Python bindings,
reading a single variable is pretty slow:
import adios as ad
f = ad.file(file)        # `file` holds the path to the bp file
key = f.var.keys()[0]
print key, f[key]
e_radial_mom_flux_ExB_df_avg AdiosVar (varid=107, dtype=dtype('float64'), ndim=1, dims=(167L,), nsteps=6961)
%time data = f[key][...]
CPU times: user 394 ms, sys: 942 ms, total: 1.34 s
Wall time: 1min 3s
If I convert using bp2h5, the conversion takes a long time (~30min), but
the reading is much faster:
import h5py
f = h5py.File(file)
print f[key]
<HDF5 dataset "e_radial_mom_flux_ExB_df_avg": shape (6961, 167), type "<f8">
%time data = f[key][...]
CPU times: user 7.65 ms, sys: 1.12 ms, total: 8.77 ms
Wall time: 8.78 ms
I assume this is because the h5 file has the data in a 2d array format,
whereas in the original bp file, the data for a single variable may not be
contiguous due to the timestepping. Is there any way to improve this
situation, either by changing the way the file is written, or changing how
I read the data with Python? I often want to read in all of the data from
the file, but this takes a long time, even though it's not much data.
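For the "read everything" use case mentioned above, all variables can be pulled in one pass with the same calls already shown in this thread; a minimal sketch, assuming only adios.file, f.var, and v.read(), with Michael's file name:

#!/usr/bin/python
# Read every variable in the bp file once, into a dict keyed by name.
# Uses only the adios.file / f.var / v.read() calls shown in this thread.
import adios

f = adios.file('xgc.oneddiag.bp')
data = {}
for name in f.var.keys():
    data[name] = f.var[name].read()   # all steps of one variable
f.close()

On a cold cache, each read() pays the per-step seek cost discussed below, so this loop roughly multiplies the one-variable read time by the number of variables.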
The data is on Edison, /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp
Okay, I see. ADIOS does not cache it; it is the system that caches the data.
With the current file format and read implementation, there are 7000
consecutive seeks and reads to get the array with all steps, and this is
slow on remote disks. The next time, it's reading from the cache and is
much faster.
I wonder where the HDF5 file was when you got the data in a few
milliseconds.
On Mon, Mar 5, 2018 at 11:28 AM, Michael Churchill wrote:
The data is on Edison, /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp
It's a single file (yes, written by a single process), no directories with
subfiles (total size is only about 1.5 GB). I found that on the 2nd read of
the same data, the read time drops to 0.5 s; I'm not sure if this is caching
done by Edison or ADIOS.
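One way to separate filesystem caching from anything ADIOS does is to time the same full read twice in a single process; a minimal sketch, reusing only calls from the reader above (the variable name is from Michael's listing):

#!/usr/bin/python
# Time the same full read twice: the first read goes to the remote disk
# (one seek+read per step), the second should be served from the OS page cache.
import adios
from timeit import default_timer as timer

f = adios.file('xgc.oneddiag.bp')
v = f.var['e_radial_mom_flux_ExB_df_avg']
for attempt in (1, 2):
    s = timer()
    data = v.read()                         # all ~7000 steps of the variable
    e = timer()
    print '%d: %.3f s' % (attempt, e - s)   # expect slow, then fast
f.close()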
HDF5 is in the same location; you can try it there also (just the h5 suffix instead of bp).
I meant, was it in cache or not.
On Mon, Mar 5, 2018 at 1:01 PM, Michael Churchill wrote:
HDF5 is in the same location; you can try it there also (just the h5 suffix instead of bp).
I don't think so. I tried both the bp and h5 today, after having accessed them last week (so I assume both were out of cache by now). Both had the same read timings as before, and had the same characteristic that the 2nd read of the same data would take much less time (suggesting it was cached). The HDF5 data took about 100 ms to read on the first read, whereas the bp file took 1 minute on the first read.
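Given those timings, one practical workaround is to pay the seek cost once and convert to HDF5 directly in Python rather than via bp2h5; a sketch assuming only the adios and h5py calls already used in this thread (the output file name is illustrative):

#!/usr/bin/python
# One-time bp -> h5 conversion: read each variable once (all steps), then
# store it contiguously in HDF5 so later reads are millisecond-fast.
import adios
import h5py

fin = adios.file('xgc.oneddiag.bp')
fout = h5py.File('xgc.oneddiag.h5', 'w')   # output name is illustrative
for name in fin.var.keys():
    fout.create_dataset(name, data=fin.var[name].read())
fout.close()
fin.close()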