new I/O bottleneck as a result of reformatting #48

Open
gaelforget opened this issue May 12, 2020 · 4 comments

Comments

@gaelforget
Member

On a different channel, @hongandyan noted that

But it took ~204 seconds to load 312 monthly v4 ETAN from 312 files as opposed to 1.4 seconds to load 288 monthly v3 ETAN from 13 files. Perhaps this arrangement has more benefits to ECCO-v4.py

I have not tried it myself, but this looks like a major setback and inconvenience to users!

I see this as a separate issue from #40, but it is not unrelated because it stems from the same reformatting, which looks more and more like it's creating major problems.

The only simple solution now might be that the ECCO team at JPL & UT just adds another folder with v4r4 etc in the original nctiles formatting and file layout used in earlier releases. And then add guidelines in the READMEs to let users know which version might be best depending on whether they use ECCO-v4.py, gcmfaces, or other known software.

Linking @ifenty, @owang01, and @timothyas here as they seem likely to know who is responsible for ECCO files at JPL & UT under the relevant NASA grant (not sure I even have a copy...)

@timothyas
Member

timothyas commented May 12, 2020

Yes, this is something I've experienced with dask (a tool underlying xarray). Dask generally seems to perform better with fewer, larger files. File size doesn't matter much (probably up to some threshold that I haven't yet hit), because dask doesn't read data into memory when a "read" call is made; it just figures out how to point to the files, which can take a long time when there are many of them.

That said, perhaps it is the most flexible to continue making the product available in more files rather than fewer so that users can 1) only download the files they need and 2) not have to worry about accidentally loading a huge file into memory. What does everyone else think?

In that case, we could recommend that eccov4py users 'reformat' the data into whatever layout best suits their workflow. For instance, users could load and re-save all 2D variables in a single file, or one file per year, or however they want it. They could additionally save it to zarr, or whatever file format they prefer and are familiar with.
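
As a rough illustration of the "one file per year" idea, here is a minimal sketch in Python. It assumes monthly files whose names contain a four-digit year (for example a hypothetical `ETAN_1992_01.nc`), a time dimension named `time`, and that xarray and dask are installed; the function names, the `prefix` argument, and the output file names are made up for this example and are not part of eccov4py.

```python
import re
from collections import defaultdict
from pathlib import Path


def group_files_by_year(paths):
    """Group file paths by the first 4-digit year (19xx/20xx) in each file name."""
    groups = defaultdict(list)
    for p in map(Path, paths):
        match = re.search(r"(19|20)\d{2}", p.name)
        if match is None:
            raise ValueError(f"no year found in file name: {p.name}")
        groups[int(match.group(0))].append(p)
    # Sort years and the files within each year (assumes zero-padded months)
    return {year: sorted(files) for year, files in sorted(groups.items())}


def consolidate_by_year(paths, out_dir, prefix="ETAN", time_dim="time"):
    """Write one netCDF file per year from many monthly files.

    Hypothetical helper: file naming, `prefix`, and `time_dim` are assumptions.
    """
    import xarray as xr  # imported lazily; only this function needs xarray/dask

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for year, files in group_files_by_year(paths).items():
        # combine="nested" concatenates the files in the (sorted) order given
        with xr.open_mfdataset(
            [str(f) for f in files], combine="nested", concat_dim=time_dim
        ) as ds:
            target = out_dir / f"{prefix}_{year}.nc"
            ds.to_netcdf(target)
            written.append(target)
    return written
```

The one-off conversion still pays the cost of opening every monthly file once, but later sessions would only need to open one file per year. Replacing `ds.to_netcdf(target)` with `ds.to_zarr(...)` is one way to try the zarr route instead, assuming the zarr package is available.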

What do you think? I'm not in any decision making position and I'm not in charge of the file formats. I'm just providing (hopefully useful?!) suggestions :) I would be happy to provide some suggested lines as above for a README that @gaelforget is recommending, just let me know if you'd like that.

@hongandyan

That said, perhaps it is the most flexible to continue making the product available in more files rather than fewer so that users can 1) only download the files they need and 2) not have to worry about accidentally loading a huge file into memory. What does everyone else think?

I don't think scenario (2) will happen, because "read_nctiles" is flexible in reading time records or vertical levels. Actually, "read_nctiles" is OK for r4; the major problem is "grid_load", which can't handle the incompatibility between r3/nctiles_grid/GRID*.nc and r4/nctiles_grid/ECCO-GRID*.nc, as pointed out here: #40 (comment)

@owang01
Contributor

owang01 commented May 13, 2020

@gaelforget I submitted a pull request to the gcmfaces git for an updated read_nctiles.m that should improve the performance when reading V4r4 files. It now takes about the same time to read V4r3 or V4r4 files. See the pull at MITgcm/gcmfaces#12.

@gaelforget
Member Author

@gaelforget I submitted a pull request to the gcmfaces git for an updated read_nctiles.m that should improve the performance when reading V4r4 files. It now takes about the same time to read V4r3 or V4r4 files. See the pull at MITgcm/gcmfaces#12.

Will take a look as soon as possible & report back after I've had a chance to test on standard analysis for r2, r3, and r4. Hopefully by next week (but ...)

Thanks!!!


4 participants