Investigate why access performance isn't improved uniformly by repacking metadata #19

Open
4 of 7 tasks
betolink opened this issue Aug 30, 2023 · 6 comments
betolink commented Aug 30, 2023

On the files we tested over Antarctica, repacking the metadata with h5repack didn't improve access times dramatically, especially for xarray and h5py. These granules contained a lot of data: each was around 6GB, with ~7MB of metadata. They were selected and processed using this notebook

e.g. ATL03_20181120182818_08110112_006_02.h5 ~7GB in size and 7MB of metadata

Note: The S3 bucket with the original data is gone but can be easily recreated.

https://raw.githubusercontent.com/ICESAT-2HackWeek/h5cloud/1f3441190951e5a2da74611f1196a657db7035bd/notebooks/arr_mean_bar_plot.png

However, for other granules with less data, repacking yielded a 10X improvement for xarray

e.g. ATL03_20220201060852_06261401_005_01.h5 ~500MB in size and 3MB of metadata

After applying h5repack to both files, the access time for the first one is not improved for xarray, but for the second granule it drops from 1 minute to 5 seconds. Why?

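import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem()  # assumption: s3 is an s3fs filesystem with credentials from the environment
# `file` is the S3 path of the granule being tested

group = '/gt2l/heights'
variable = 'h_ph'

with s3.open(file, 'rb') as file_stream:
    ds = xr.open_dataset(file_stream, group=group, engine='h5netcdf')
    variable_mean = ds[variable].mean()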
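
For comparing the original and repacked granules, a small timing helper along these lines could wrap the read above. This is only a sketch: `time_access` and `speedup` are hypothetical names, and `read_fn` stands for any zero-argument callable that opens the file and computes the mean.

```python
import time


def time_access(read_fn, repeats=3):
    """Return the wall-clock duration in seconds of each call to read_fn."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        durations.append(time.perf_counter() - start)
    return durations


def speedup(original_seconds, repacked_seconds):
    """Ratio of the original access time to the repacked access time."""
    return original_seconds / repacked_seconds


# The smaller granule went from about 1 minute to about 5 seconds.
print(speedup(60.0, 5.0))
```

Repeating the call several times helps separate network variability from the effect of repacking.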

I'm going to repack the original files and put them on a more durable bucket, along with more examples from other NASA datasets.
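
For reference, the repacking step could be scripted roughly as follows. The thread does not say which h5repack options were used, so the paged file space strategy (`-S PAGE`) and the 4 MiB page size (`-G`) are assumptions, and the helper names are hypothetical.

```python
import subprocess


def h5repack_command(src, dst, page_size_bytes=4 * 1024 * 1024):
    """Build an h5repack command line that requests paged aggregation.

    -S PAGE selects the paged file space strategy and -G sets the page size.
    The 4 MiB default is an assumed value, not one taken from this issue.
    """
    return ["h5repack", "-S", "PAGE", "-G", str(page_size_bytes), src, dst]


def repack(src, dst, page_size_bytes=4 * 1024 * 1024):
    """Run h5repack; requires the HDF5 command-line tools on PATH."""
    subprocess.run(h5repack_command(src, dst, page_size_bytes), check=True)


print(" ".join(h5repack_command("original.h5", "repacked.h5")))
```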

Maybe @ajelenak has some clues on why this may be happening.

@ajelenak

Hi @betolink,

Repacking the file is the necessary first step, but libhdf5 must then be told to use the features available in the repacked file. I know this can be done from h5py, but I have not yet verified whether the same is possible from xarray and h5netcdf. It probably is, because I've seen xarray code where backend storage engine options are set in the open_dataset() call.
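
As a rough illustration of passing those instructions from h5py, the sketch below enables the libhdf5 page buffer through the `page_buf_size` argument of `h5py.File`. It assumes the file was repacked with a paged file space strategy and that the page size is known; the helper names and the 4 MiB page size are hypothetical, and whether the setting helps with a file-like object from fsspec would need to be measured.

```python
def page_buffer_kwargs(page_size_bytes, n_pages=16):
    """Build h5py.File keyword arguments that enable the page buffer.

    h5py expects page_buf_size to be a power of two that is at least as
    large as the file space page size, so both inputs must be powers of two.
    """
    for value in (page_size_bytes, n_pages):
        if value <= 0 or value & (value - 1):
            raise ValueError("page size and page count must be powers of two")
    return {"page_buf_size": page_size_bytes * n_pages}


def open_repacked(name_or_fileobj, page_size_bytes):
    """Open a paged HDF5 file read-only with the page buffer enabled."""
    import h5py  # imported lazily so the helper above works without h5py

    return h5py.File(name_or_fileobj, mode="r", **page_buffer_kwargs(page_size_bytes))


# Assumed 4 MiB pages, buffer large enough to hold 16 of them.
print(page_buffer_kwargs(4 * 1024 * 1024))
```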

The variable mean calculation example will read all the data only once for the /gt2l/heights/h_ph dataset and then discard it, which means that available libhdf5 caches may not help much in this use case.
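
A toy model of that point: with a least-recently-used cache, a single sequential pass over distinct chunks never hits the cache, while repeated access to a small set of chunks does. This is a simplified simulation, not a description of how libhdf5's chunk cache is actually implemented; the chunk counts and capacity are arbitrary.

```python
from collections import OrderedDict


def cache_hits(access_sequence, capacity):
    """Count cache hits for a sequence of chunk ids using an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for chunk in access_sequence:
        if chunk in cache:
            hits += 1
            cache.move_to_end(chunk)  # mark as most recently used
        else:
            cache[chunk] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used chunk
    return hits


single_pass = list(range(100))    # every chunk read exactly once, as in a mean
repeated = list(range(10)) * 10   # the same 10 chunks read 10 times

print(cache_hits(single_pass, capacity=16))  # 0
print(cache_hits(repeated, capacity=16))     # 90
```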

@betolink
Member Author

The curious thing is that in some instances repacked files are faster to access than their original, non-repacked versions without passing any special parameter to h5py or xarray.

@ajelenak

That's probably because of the paged aggregation applied to the repacked file that forces libhdf5 to only make S3 requests of the file page size. Those pages then bring much more data (likely quite a few chunks in one request) compared to the original file where libhdf5 can make S3 requests starting from as little as 8 bytes.
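
The effect can be sketched with some arithmetic: many small reads that would each become an S3 request collapse into a few page-sized requests once reads are rounded to page boundaries. The read pattern and the 4 MiB page size below are made-up values for illustration, not measurements from these granules.

```python
def pages_touched(reads, page_size):
    """Return the set of page indices covered by (offset, length) reads."""
    pages = set()
    for offset, length in reads:
        first = offset // page_size
        last = (offset + length - 1) // page_size
        pages.update(range(first, last + 1))
    return pages


PAGE_SIZE = 4 * 1024 * 1024  # assumed page size

# 1000 hypothetical 8-byte reads spread over the first ~4 MB of the file.
small_reads = [(i * 4096, 8) for i in range(1000)]

print(len(small_reads))                            # requests without paging
print(len(pages_touched(small_reads, PAGE_SIZE)))  # page-sized requests
```

In this toy case 1000 small requests fall inside a single page, so one larger request replaces them.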

asteiker mentioned this issue Nov 2, 2023

@betolink
Member Author

We had a very interesting brainstorming session with @ajelenak during AGU23. He is developing tools to trace the behavior of h5py over the network (https://github.com/ajelenak/ros3vfd-log-info), which we'll use to get a better idea of how repacking and page aggregation affect file access times. I'm not sure whether this tool can be used with the h5py -> fsspec path or only with the ros3 driver.

@ajelenak

Currently it can only parse libhdf5's ros3 driver logs. I was interested in those because they are the most accurate information about where in a file and how many bytes libhdf5 is reading. An fsspec log parser can certainly be added. Do you have one to share?

@betolink
Member Author

Working on it! @ajelenak, fsspec logs are too verbose, and I'm figuring out how we can filter them before they get flushed so that they match what this tool needs.
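
One possible shape for such a filter, using only the standard `logging` module: a `logging.Filter` attached to a handler drops every record that does not end in a byte range and keeps the parsed offsets. The message format (a trailing `start-end` pair) and the logger name are assumptions that would need to be checked against the actual fsspec/s3fs debug output; the class and function names are hypothetical.

```python
import logging
import re

# Assumed message shape: the record text ends with "<start>-<end>".
RANGE_RE = re.compile(r"(?P<start>\d+)-(?P<end>\d+)\s*$")


class ByteRangeFilter(logging.Filter):
    """Keep only records that end in a byte range and attach the offsets."""

    def filter(self, record):
        match = RANGE_RE.search(record.getMessage())
        if match is None:
            return False
        record.start = int(match["start"])
        record.end = int(match["end"])
        return True


class RangeCollector(logging.Handler):
    """Collect (start, end) pairs instead of writing log text."""

    def __init__(self):
        super().__init__()
        self.ranges = []

    def emit(self, record):
        self.ranges.append((record.start, record.end))


def capture_ranges(logger_name):
    """Attach a filtered collector to the named logger and return it."""
    handler = RangeCollector()
    handler.addFilter(ByteRangeFilter())
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return handler


collector = capture_ranges("demo.reads")  # placeholder logger name
log = logging.getLogger("demo.reads")
log.debug("Fetch: bucket/key, 0-8")       # made-up message for illustration
log.debug("opening file")
print(collector.ranges)
```

Filtering at the handler keeps the verbose records out of the output while leaving the logger itself untouched.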
