new I/O bottleneck as a result of reformatting #48

gaelforget · 2020-05-12T12:16:17Z

On a different channel, @hongandyan noted that

> But it took ~204 seconds to load 312 monthly v4 ETAN from 312 files as opposed to 1.4 seconds to load 288 monthly v3 ETAN from 13 files. Perhaps this arrangement has more benefits to ECCO-v4.py

I have not tried it myself, but this looks like a major setback and inconvenience to users!

I see this as a separate issue from #40, but it is not unrelated, because it stems from the same reformatting, which looks more and more like it is creating major problems.

The only simple solution now might be that the ECCO team at JPL & UT just adds another folder with v4r4 etc in the original nctiles formatting and file layout used in earlier releases. And then add guidelines in the READMEs to let users know which version might be best depending on whether they use ECCO-v4.py, gcmfaces, or other known software.

Linking @ifenty, @owang01, and @timothyas here as they seem likely to know who is responsible for ECCO files at JPL & UT under the relevant NASA grant (not sure I even have a copy...)

timothyas · 2020-05-12T14:02:49Z

Yes, this is something I've experienced with dask (a tool underlying xarray). It seems like dask generally performs better with fewer but larger files. The size of each file doesn't matter much (probably up to a certain threshold that I haven't yet hit), because dask doesn't read things into memory when a "read" call is made. It just figures out how to point to the files, which can take a long time with tons of files.

That said, perhaps it is the most flexible to continue making the product available in more files rather than fewer so that users can 1) only download the files they need and 2) not have to worry about accidentally loading a huge file into memory. What does everyone else think?

In that case, then we could recommend that eccov4py users 'reformat' the data into a format that benefits their workflow the best. For instance, users could load and re-save all 2D variables in a single file, or one file per year, or however they want it. They could additionally save it to zarr, or whatever file format they prefer and are familiar with.
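As a rough illustration of the "load and re-save" idea above, here is a minimal sketch using xarray. The function name `consolidate`, the file names, and the `time` dimension name are all hypothetical and not taken from ECCO-v4.py or the actual v4r4 file layout; it also assumes the combined data fits in memory.

```python
from pathlib import Path

import xarray as xr


def consolidate(monthly_files, out_path, time_dim="time"):
    """Concatenate many small per-record files into a single netCDF file.

    `monthly_files` is any iterable of paths; they are sorted by name,
    so zero-padded names (e.g. ..._2010_01.nc) come out in time order.
    """
    datasets = [xr.open_dataset(p) for p in sorted(monthly_files)]
    try:
        # Stack the individual records along the (assumed) time dimension.
        combined = xr.concat(datasets, dim=time_dim)
        # Pull everything into memory before the source files are closed.
        combined.load()
        combined.to_netcdf(out_path)
    finally:
        for ds in datasets:
            ds.close()
    return Path(out_path)
```

A one-time call such as `consolidate(Path("ETAN").glob("*.nc"), "ETAN_all.nc")` (paths hypothetical) would leave a single file to open afterwards. Users who prefer zarr could swap `to_netcdf` for xarray's `to_zarr`, which additionally requires the zarr package.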

What do you think? I'm not in any decision making position and I'm not in charge of the file formats. I'm just providing (hopefully useful?!) suggestions :) I would be happy to provide some suggested lines as above for a README that @gaelforget is recommending, just let me know if you'd like that.

hongandyan · 2020-05-12T20:55:22Z

> That said, perhaps it is the most flexible to continue making the product available in more files rather than fewer so that users can 1) only download the files they need and 2) not have to worry about accidentally loading a huge file into memory. What does everyone else think?

I don't think scenario (2) will happen, because "read_nctiles" is flexible in reading time records or vertical levels.
Actually, "read_nctiles" is OK for r4; the major problem is "grid_load", which can't handle the incompatibility between r3/nctiles_grid/GRID*.nc and r4/nctiles_grid/ECCO-GRID*.nc, as pointed out here:
#40 (comment)

owang01 · 2020-05-13T04:58:43Z

@gaelforget I submitted a pull request to the gcmfaces git for an updated read_nctiles.m that should improve the performance when reading V4r4 files. It now takes about the same time to read V4r3 or V4r4 files. See the pull at MITgcm/gcmfaces#12.

gaelforget · 2020-05-13T15:48:14Z

> @gaelforget I submitted a pull request to the gcmfaces git for an updated read_nctiles.m that should improve the performance when reading V4r4 files. It now takes about the same time to read V4r3 or V4r4 files. See the pull at MITgcm/gcmfaces#12.

Will take a look as soon as possible & report back after I've had a chance to test on standard analysis for r2, r3, and r4. Hopefully by next week (but ...)

Thanks!!!

owang01 mentioned this issue May 13, 2020

need to support nctiles variant used in ECCO v4 r4 MITgcm/gcmfaces#9

Open
