Support HSDS server with omas_h5.py function #313

Closed
wants to merge 4 commits into from


@satelite2517 satelite2517 commented Jul 24, 2024

  • Updated dict2hdf5 function to accept an hsds parameter to switch between h5py and h5pyd.
  • Modified save_omas_h5 to pass the hsds parameter to dict2hdf5.
  • Updated convertDataset to pass the hsds parameter recursively.
  • Changed load_omas_h5 to accept an hsds parameter and use h5pyd when hsds is True.

These changes enable the use of HSDS (Highly Scalable Data Service) for handling HDF5 files, so that OMAS can connect directly to an HSDS service. (HSDS is an HDF5-based data service: https://github.com/HDFGroup/hsds)
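As a rough illustration of the switch described in the list above (a sketch, not the code in this PR): h5pyd is designed to mirror the h5py API, so a boolean flag can decide which module gets imported. The helper names `hdf5_backend_name` and `import_hdf5_backend` are hypothetical and do not exist in omas.

```python
import importlib


def hdf5_backend_name(hsds=False):
    """Return the name of the module to use: h5pyd for HSDS, h5py otherwise."""
    return "h5pyd" if hsds else "h5py"


def import_hdf5_backend(hsds=False):
    """Import the selected module lazily, so h5pyd is only needed when hsds=True."""
    return importlib.import_module(hdf5_backend_name(hsds))


print(hdf5_backend_name(hsds=True))
print(hdf5_backend_name(hsds=False))
```

Importing lazily keeps h5pyd an optional dependency for users who only work with local files.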

@orso82
Member

orso82 commented Jul 24, 2024

Fantastic work Sunjae!

The 3.9 test was failing, but it was just a fluke. I re-triggered the test and it passed, no problem.

Could you please add the hsds parameter to the docstrings?
https://github.com/gafusion/omas/pull/313/files#diff-850086985605b5e6371f79e6f178ac4671f7dcf0925766cc32e807ab608bdddcL25

Also, for me to understand better, could you please give an example of how using h5pyd would work? Something like:

from omas import *
ods = ODS()
ods['equilibrium.time_slice.0.global_quantities.ip'] = 6
save_omas_h5(ods, "???", hsds=True)

ods1 = load_omas_h5("???", hsds=True)

How do you pass the HSDS server information?

Thanks!

@satelite2517
Author

satelite2517 commented Jul 24, 2024

Hello, first, please check the added parameters.

For now, the HSDS server information has to be set up before using omas.
The process for setting up the server info:

1.	Install [h5pyd](https://github.com/HDFGroup/h5pyd) with pip.
2.	At the command prompt, run hsconfigure and enter the server address, together with the username and password issued to each user; hsconfigure saves them in the config file.
3.	From then on, you can use files on the server as if they were local by using h5pyd.

The server info only needs to be provided correctly once, at the beginning, so I didn't add it to the omas function. Since HSDS does not currently offer speed optimizations such as multithreading, we plan to additionally use our own VEST module to serve dynamic and static data separately until the speed optimization is available. (That module contains the config function and quite a few other functions, so I didn't add it to omas.) See also HSDS MultiManager.
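For readers wondering how the server information can reach h5pyd other than through the config file: h5pyd.File accepts `endpoint`, `username` and `password` keyword arguments. The sketch below only gathers those values from environment-style variables. The helper `hsds_connection_kwargs` is hypothetical, and the names HS_ENDPOINT, HS_USERNAME and HS_PASSWORD are assumed to match what h5pyd itself looks up.

```python
import os


def hsds_connection_kwargs(env=None):
    """Collect HSDS connection settings into keyword arguments for h5pyd.File.

    Hypothetical helper: missing variables are simply left out, so h5pyd can
    fall back to the config file written by hsconfigure.
    """
    env = os.environ if env is None else env
    mapping = {
        "HS_ENDPOINT": "endpoint",
        "HS_USERNAME": "username",
        "HS_PASSWORD": "password",
    }
    return {kw: env[var] for var, kw in mapping.items() if env.get(var)}


# Example with made-up values (no server is contacted here)
example_env = {"HS_ENDPOINT": "http://127.0.0.0:5101", "HS_USERNAME": "user"}
kwargs = hsds_connection_kwargs(example_env)
print(kwargs)
# With h5pyd installed and a reachable server, one could then try:
#   h5pyd.File("/home/sample.h5", "r", **kwargs)
```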

Below is an example using the server we configured. (In the omas save/load functions the server address is optional, since it is already stored in the config file.)

 from omas import *
 ods = ODS()
 ods['equilibrium.time_slice.0.global_quantities.ip'] = 6
 save_omas_h5(ods, "http://127.0.0.0:5101/home/sample.h5", hsds=True) # same as save_omas_h5(ods, "/home/sample.h5", hsds=True); your server address will differ

 ods1 = load_omas_h5("/home/sample.h5", hsds=True) # or ods1 = load_omas_h5("http://127.0.0.0:5101/home/sample.h5", hsds=True)
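The two equivalent spellings in the example (full URL versus server-side path) can be related with a small helper. This is a sketch using only the standard library; `split_hsds_path` is a hypothetical name and is not part of omas or h5pyd.

```python
from urllib.parse import urlsplit


def split_hsds_path(path):
    """Split 'http://host:port/a/b.h5' into (endpoint, domain path).

    A plain server-side path such as '/a/b.h5' is returned with endpoint None,
    meaning the endpoint is expected to come from the stored configuration.
    """
    parts = urlsplit(path)
    if parts.scheme in ("http", "https") and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}", parts.path
    return None, path


print(split_hsds_path("http://127.0.0.0:5101/home/sample.h5"))
print(split_hsds_path("/home/sample.h5"))
```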

@orso82
Member

orso82 commented Jul 24, 2024

I'll let others comment on this PR, but it does look good to me. @smithsp ? @torrinba ?

Once OMAS saves the data in HDF5 hierarchical format, using HSDS seems like a great way to serve IMAS data!!!
It's simple, scalable, uses a very IMASy syntax to access data, and loads only the data that one wants. More details about HSDS.

Based on your example this should work:

from omas import *
ods = ODS()
ods['equilibrium.time_slice.0.global_quantities.ip'] = 6
save_omas_h5(ods, "http://127.0.0.0:5101/home/sample.h5", hsds=True)  # hsds=True selects h5pyd

import h5pyd as h5py
h5_file = h5py.File("http://127.0.0.0:5101/home/sample.h5", 'r')
ip_data = h5_file['equilibrium/time_slice/0/global_quantities/ip'][()]  # [()] reads the whole dataset; .value was removed in h5py 3
print(ip_data)

OMAS does not yet support dynamic loading (i.e., lazy loading) for h5 files, as it does for NetCDF, IMAS, and machine mappings. It should not be too difficult to add, though. See how it's done for NetCDF here: https://github.com/gafusion/omas/blob/master/omas/omas_nc.py#L106-L146 Once that's done, you should be able to dynamically load from a (local or remote) HDF5 file only the data that you access. Something like this:

with ods.open("http://127.0.0.0:5101/home/sample.h5"):
    print(ods['equilibrium.time_slice.0.global_quantities.ip']) # after implementing `dynamic_omas_h5` this will only load the data that is accessed, not everything in the h5 file
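The idea of loading only what is accessed can be sketched independently of OMAS. The class below is a minimal illustration and not how dynamic loading is implemented in omas_nc.py: `LazyStore` is a hypothetical name, and a plain dictionary stands in for the remote file.

```python
class LazyStore:
    """Dictionary-like wrapper that fetches each entry on first access only."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: key -> value (e.g. one remote read)
        self._cache = {}
        self.fetch_count = 0  # number of calls made to the backend

    def __getitem__(self, key):
        if key not in self._cache:
            self.fetch_count += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


# Stand-in for a remote file: in practice `fetch` would read one dataset
remote = {
    "equilibrium/time_slice/0/global_quantities/ip": 6,
    "equilibrium/time_slice/0/time": 0.1,
}
store = LazyStore(remote.__getitem__)

ip = store["equilibrium/time_slice/0/global_quantities/ip"]
ip_again = store["equilibrium/time_slice/0/global_quantities/ip"]
print(ip, store.fetch_count)
```

The second access is served from the cache, so only one backend read is made and the `time` entry is never fetched.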

@satelite2517
Author

satelite2517 commented Jul 24, 2024

Yes, I confirmed that it works well by just changing the IP address in the example to our address. Just to emphasize, the example will work only if the username and password have been set in advance using hsconfigure. Once that’s done, you can use h5pyd exactly the same way as h5py.

I checked the link you provided and it seems straightforward to implement. I will request improvements related to loading and saving speeds in h5pyd, and since they are also working on speed improvements like MultiManager, I will request modifications at that time.

Thanks

@orso82
Member

orso82 commented Jul 24, 2024

@satelite2517 can you please comment on the performance of h5pyd?

If you find that it is slow, are you sure that's not an OMAS problem? Substituting h5pyd for h5py may work, but the h5 save/load implementation in OMAS has not been done with the idea to minimize the number of calls to it, as you'd want to do if each query resulted in a remote call... There might be ways to write new OMAS save/load functions that minimize the number of h5pyd remote calls. I see for example https://docs.h5py.org/en/stable/high/group.html#h5py.Group.visititems

With this in mind, I would suggest that you first try to benchmark the HSDS performance directly using h5pyd. Maybe something along these lines (untested!):

import h5pyd

h5_file = h5pyd.File("http://127.0.0.0:5101/home/sample.h5", 'r')

# collect the names of all datasets in the file
datasets = []
def visitor_func(name, obj):
    if isinstance(obj, h5pyd.Dataset):
        datasets.append(name)
h5_file.visititems(visitor_func)

# read every dataset in full
for dataset in datasets:
    h5_file[dataset][()]

h5_file.close()
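To turn a loop like the one above into numbers, a small timing wrapper may help. This is only a sketch: `time_reads` is a hypothetical helper, and a dictionary stands in for the file so the example runs without a server.

```python
import time


def time_reads(read, names):
    """Read every name with `read` and return (total seconds, per-name seconds)."""
    per_name = {}
    start = time.perf_counter()
    for name in names:
        t0 = time.perf_counter()
        read(name)
        per_name[name] = time.perf_counter() - t0
    return time.perf_counter() - start, per_name


# Stand-in data; with a real file one would pass `lambda n: h5_file[n][()]`
fake_file = {"a/x": [1, 2, 3], "a/y": [4, 5], "b": [6]}
total, per_name = time_reads(fake_file.__getitem__, list(fake_file))
slowest = max(per_name, key=per_name.get)
print(f"{len(per_name)} reads in {total:.6f} s, slowest: {slowest}")
```

Running the same function against h5py and h5pyd handles would give directly comparable totals, and the per-name timings show whether a few large datasets or many small ones dominate.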

I also found this post that goes into details about Local HSDS performance vs local HDF5 files
https://forum.hdfgroup.org/t/local-hsds-performance-vs-local-hdf5-files/8652/3
There's something in there about setting http_compression to false resulting in much better performance.

@satelite2517
Author

satelite2517 commented Jul 25, 2024

As you noticed from the link you sent, using h5pyd to access data takes more than 20 times longer than using h5py. I tested uploading the same file using h5py in omas_h5, which took about 20 seconds, whereas using h5pyd took around 50 minutes. Thus, I concluded that it is not feasible to include static data in HSDS.

Before proceeding with this pull request, I tried to optimize the omas function to increase speed. Here are the attempts I made:

  • Using hsget without Python
  • Using REST API with GET method
  • Directly using h5pyd with a non-recursive function version or storing lists before converting to ods
  • Adding concurrent.futures -> resulted in error

Thank you for telling me about http_compression. However, our original override.yml file already has it set to false, so it might not have a significant impact.

import h5pyd

h5_file = h5pyd.File("/public/dynamic_test.h5", 'r')

datasets = []
def visitor_func(name, obj):
    print(name)
    if isinstance(obj, h5pyd.Dataset):
        datasets.append(name)

h5_file.visititems(visitor_func)

for dataset in datasets:
    h5_file[dataset][()]  # read the full dataset

h5_file.close()

datasets

I tested this function just in case, and it took an average of 2 minutes and 37 seconds. On the other hand:

import omas
filename = '/public/dynamic_test.h5'
ods = omas.ODS()

ods = omas.load_omas_h5(filename)

This function took 2 minutes and 40 seconds, so there isn't a significant difference in speed. The only options I could think of to improve speed are chunking or using MultiManager, but it seems these are not fully implemented yet, so I plan to add them in the future. If you have any other ideas, I would greatly appreciate them.

Additionally, the biggest issue is the time taken to upload to HSDS. I have been in continuous communication with the HSDS developer, and I will share part of his response:


John Readey
Thursday 3:30 PM
Thanks for the file sample - I’ll try hsloading this and see how it goes.
I do notice that the file has about 70K objects (groups or datasets). HSDS is not very efficient when loading lots of small objects. Still I can investigate speeding up hsload by using similar techniques to MultiManager..


It seems that the current HSDS system is not well prepared to handle this kind of dataset. Because of these issues, the HSDS developer asked me for a meeting to discuss system upgrades, potential solutions, and the best way to use ods with HSDS. I will update you on any progress after the meeting next Monday.

@orso82
Member

orso82 commented Jul 25, 2024

@satelite2517 could you please provide me with a copy of the h5 file you are using for your benchmarks? If it's not crazy big it would be great if you simply upload it as part of this issue.

@satelite2517
Author

I don't mind sending you the file, but GitHub issues do not support the h5 file format. Would you like me to send it to you another way?

@orso82
Member

orso82 commented Jul 25, 2024

For the record, I moved the files that you sent here:
https://www.dropbox.com/scl/fi/ks1frkbtoupavx0l7byqx/39020_16.h5?rlkey=y0zmlkynfuq2nmnlt63lb35tq&dl=0

You mentioned that using this file (I assume reading it?) took about 70 minutes on your computer, and with HSDS it took 108 minutes.

I am surprised it's taking so long. When I run this:

from omas import *
from time import *
tic=time(); ods=load_omas_h5("/Users/meneghini/Downloads/39020_16.h5"); toc=time();
print(toc-tic)

I get the data back in 24.6 seconds

How are you running your tests?

@satelite2517
Author

satelite2517 commented Jul 25, 2024

Did you load the file from your HSDS server? Loading the local file took about 20 seconds for me too. The 70 minutes I mentioned was the time to load the file from the HSDS server.

@orso82
Member

orso82 commented Jul 25, 2024

Ok. I now understand.

And what was the 108 minutes?

@satelite2517
Author

satelite2517 commented Jul 25, 2024

I sent this same file to the HSDS developer (who uses another HSDS server), and he told me that it took about 108 minutes. I may have made a slight mistake in conveying my message due to my limited English proficiency. Sorry.

@orso82
Member

orso82 commented Jul 25, 2024

This is no good :( but perhaps not surprising. Ideally one would want to be able to reduce the number of queries that are made to the server. For example, one could make one single query to request what data is available under a specific location in the data structure. With that meta-data in hand the client could then request all the data it needs in a single request. I bet this would speed up the data fetching enormously.

By the way, reading the same file in Julia with IMASDD.jl takes less than 3 seconds.
Haha, this makes me want to write our own REST API and service to serve IMAS data from HDF5 files :)

Perhaps you can ask the HSDS developer that you are in contact with if there's a way to do what I described above. 1. Retrieve metadata about the structure of the HDF5 file with one single query, and 2. request multiple nodes in the HDF5 file also with one single query.
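The two-step pattern described above (one metadata query, then one bulk data query) can be illustrated with an in-memory stand-in that counts round trips. Everything here is hypothetical: `FakeServer` and its methods are made up for the sketch and do not correspond to the HSDS REST API or to h5pyd.

```python
class FakeServer:
    """In-memory stand-in for a remote service that counts requests."""

    def __init__(self, data):
        self._data = data
        self.requests = 0

    def list_paths(self, prefix):
        """One request returning every dataset path under `prefix`."""
        self.requests += 1
        return [p for p in self._data if p.startswith(prefix)]

    def get_one(self, path):
        """One request per dataset (the pattern that is slow remotely)."""
        self.requests += 1
        return self._data[path]

    def get_many(self, paths):
        """One request returning several datasets at once."""
        self.requests += 1
        return {p: self._data[p] for p in paths}


data = {f"equilibrium/time_slice/{i}/global_quantities/ip": float(i) for i in range(100)}

# Naive pattern: one round trip per dataset
naive = FakeServer(data)
naive_values = {p: naive.get_one(p) for p in naive.list_paths("equilibrium/")}

# Batched pattern: one metadata query, then one bulk data query
batched = FakeServer(data)
batched_values = batched.get_many(batched.list_paths("equilibrium/"))

print(naive.requests, batched.requests)  # prints: 101 2
```

Both patterns return the same values, but the request count drops from one per dataset to a constant two, which is where the speed-up for a file with tens of thousands of small objects would be expected to come from.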

@satelite2517
Author

Okay, I understand what you said. I will ask whether these improvements are possible. Thanks, and I will post updates in a new issue.


This PR has not seen any activity in the past 60 days. It is now marked as stale and will be closed in 7 days if no further activity is registered.
