Investigate why access performance isn't improved uniformly by repacking metadata #19
Comments
Hi @betolink, Repacking the file is the necessary first step, but then the instructions to use the features available in the repacked file must be passed to libhdf5. I know it can be done from h5py, but I have not yet verified whether the same is possible from xarray and h5netcdf. It probably is, because I've seen xarray code where backend storage engine options are set in the […]. The variable mean calculation example will read all the data only once for the […].
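For reference, a minimal sketch of passing those instructions from h5py, assuming h5py >= 3.9 (which added the `page_buf_size` argument) and a hypothetical repacked granule on S3; the dataset path follows ATL03's group layout:

```python
import fsspec
import h5py

# Hypothetical location of a granule repacked with paged aggregation.
url = "s3://example-bucket/ATL03_repacked.h5"

with fsspec.open(url, mode="rb", anon=True) as f:
    # page_buf_size enables libhdf5's page buffer (h5py >= 3.9), so whole
    # file pages are cached after the first read. It must be at least the
    # file page size and only works on files written or repacked with the
    # PAGE file-space strategy.
    with h5py.File(f, mode="r", page_buf_size=16 * 1024 * 1024) as h5f:
        heights = h5f["gt1l/heights/h_ph"][:1000]
```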
The curious thing is that in some instances repacked files get faster access times than their non-repacked originals without passing any special parameters to h5py or xarray.
That's probably because of the paged aggregation applied to the repacked file, which forces libhdf5 to only make S3 requests of the file's page size. Those pages then bring in much more data (likely quite a few chunks in one request) compared to the original file, where libhdf5 can make S3 requests starting from as little as 8 bytes.
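For context, paged aggregation is applied at repack time via h5repack's file-space options; a sketch of the kind of invocation involved, driven from Python (the 8 MiB page size is an illustrative assumption, not necessarily what was used for these files):

```python
import subprocess

# -S PAGE selects the PAGE file-space strategy; -G sets the file page
# size in bytes. With page buffering enabled at read time, libhdf5 then
# fetches data from the repacked file in page-size units.
subprocess.run(
    [
        "h5repack",
        "-S", "PAGE",
        "-G", str(8 * 1024 * 1024),  # 8 MiB pages (illustrative)
        "ATL03_original.h5",
        "ATL03_repacked.h5",
    ],
    check=True,
)
```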
We had a very interesting conversation/brainstorming session with @ajelenak during AGU23. He is developing tools to trace the behavior of h5py over the network: https://github.com/ajelenak/ros3vfd-log-info, which we'll use to get a better idea of how repacking and page aggregation affect file access times. I'm not sure if this tool can be used with the logs generated by fsspec.
Currently it can only parse libhdf5's ros3 driver logs. I was interested in those because they give the most accurate picture of where in a file, and how many bytes, libhdf5 is reading. An fsspec log parser can certainly be added. Do you have one to share?
Working on it! @ajelenak, fsspec logs are too verbose and I'm figuring out how we can create a filter before they get flushed, to match what this tool needs.
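One possible shape for such a filter, using only the standard logging module: keep just the byte-range fetch lines so they can later be turned into the (offset, nbytes) records the log-info tool works with. The regex and logger name below are assumptions about s3fs's debug output and may need adjusting to what the installed version actually emits:

```python
import logging
import re

# Assumed message shape: "Fetch: <key>, <start>-<end>"; adjust to the
# actual debug format of the s3fs/fsspec version in use.
RANGE_RE = re.compile(r"Fetch: (?P<key>\S+), (?P<start>\d+)-(?P<end>\d+)")

class RangeReadFilter(logging.Filter):
    """Drop every log record that is not a byte-range read."""

    def filter(self, record: logging.LogRecord) -> bool:
        return RANGE_RE.search(record.getMessage()) is not None

# Write only the matching lines to a file, before anything else is flushed.
handler = logging.FileHandler("fsspec_ranges.log")
handler.addFilter(RangeReadFilter())

s3fs_logger = logging.getLogger("s3fs")
s3fs_logger.setLevel(logging.DEBUG)
s3fs_logger.addHandler(handler)
```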
On the files we tested over Antarctica, repacking the metadata with `h5repack` didn't improve access times in a dramatic way, especially for xarray and h5py. These granules contained a lot of data and each was around 6GB, with ~7MB of metadata. They were selected and processed using this notebook.

e.g. `ATL03_20181120182818_08110112_006_02.h5`, ~7GB in size and 7MB of metadata

However, for other granules with less data, repacking represented a 10x improvement for xarray.

e.g. `ATL03_20220201060852_06261401_005_01.h5`, ~500MB in size and 3MB of metadata

After applying `h5repack` to both files, the access time to the first one is not improved for xarray, but it is improved from 1 minute to 5 seconds for the second granule. Why? A rough sketch of this timing comparison is shown below.

I'm going to repack the original files and put them on a more durable bucket, along with more examples from other NASA datasets.

Maybe @ajelenak has some clues on why this may be happening.
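A rough sketch of the timing comparison described above. The bucket URLs are placeholders (the repacked files haven't been moved to a durable bucket yet), and the group/variable choice is one ATL03 example; `anon=True` assumes a publicly readable bucket:

```python
import time

import fsspec
import xarray as xr

# Placeholder locations for the original and repacked granules.
ORIGINAL = "s3://example-bucket/ATL03_20220201060852_06261401_005_01.h5"
REPACKED = "s3://example-bucket/ATL03_20220201060852_06261401_005_01_repacked.h5"

def time_mean(url: str) -> float:
    """Time opening one photon-heights group and computing a variable mean."""
    t0 = time.perf_counter()
    with fsspec.open(url, mode="rb", anon=True) as f:
        # phony_dims lets h5netcdf cope with HDF5 datasets that lack
        # netCDF-style dimension scales, as in ATL03 granules.
        ds = xr.open_dataset(f, engine="h5netcdf",
                             group="gt1l/heights", phony_dims="sort")
        float(ds["h_ph"].mean())
    return time.perf_counter() - t0

for url in (ORIGINAL, REPACKED):
    print(url, f"{time_mean(url):.1f}s")
```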