
File output size and chunking #743

Closed
TonyB9000 opened this issue Jun 7, 2024 · 10 comments

Comments

@TonyB9000 commented Jun 7, 2024

It appears that CMORized output can be rendered 14% smaller while retaining BFB (bit-for-bit) identical data values. If this is true, then (modulo performance issues) CMOR should produce such output by default.

BACKGROUND:

I accidentally created several CMIP6 datasets for Omon variables, where 150 years was output to a single file. Example:

-rw-rw-r--. 1 bartoletti1 publishers 12535184888 May 16 13:15 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_000101-015012.nc

The size (12.5 GB on disk) was considered excessive, so I sought the advice of NCO developer Charlie Zender to learn whether there was a way to break this output file into smaller "20-year" segments. This was accomplished with multiple calls to "ncrcat":

ncrcat -O -d time,<start_month_offset>,<end_month_offset> <inputfile> <outputname>
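In a loop over 20-year windows this amounts to something like the following (a minimal Python sketch, assuming monthly timesteps starting at year 1; the wrapper and filename pattern are illustrative, not my actual script):

    import subprocess

    years_total = 150   # span of the source file
    seg_years = 20      # desired segment length
    infile = "so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_000101-015012.nc"

    for y0 in range(1, years_total + 1, seg_years):
        y1 = min(y0 + seg_years - 1, years_total)
        m_start = (y0 - 1) * 12   # 0-based offset of the segment's first month
        m_end = y1 * 12 - 1       # 0-based offset of the segment's last month
        outfile = f"so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_{y0:04d}01-{y1:04d}12.nc"
        subprocess.run(["ncrcat", "-O", "-d", f"time,{m_start},{m_end}",
                        infile, outfile], check=True)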

By this, I obtained:

  -rw-rw-r--. 1 bartoletti1 publishers  1429663865 May 31 14:23 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_000101-002012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  1431702449 May 31 14:24 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_002101-004012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  1432956469 May 31 14:26 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_004101-006012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  1433204736 May 31 14:28 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_006101-008012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  1432729066 May 31 14:30 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_008101-010012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  1432500799 May 31 14:32 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_010101-012012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  1432537156 May 31 14:33 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_012101-014012.nc
  -rw-rw-r--. 1 bartoletti1 publishers   716692845 May 31 14:34 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_014101-015012.nc

However, the sum of these sizes is only 10,741,987,385 bytes (86% of original size).

Alarmed that something was amiss, I set about revamping my E2C (e3sm_to_cmip) control script to segment the input by years and to cycle over calls to E2C accordingly. The result:

  -rw-rw-r--. 1 bartoletti1 publishers 1667382790 Jun  3 10:21 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_000101-002012.nc
  -rw-rw-r--. 1 bartoletti1 publishers 1670098627 Jun  3 10:48 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_002101-004012.nc
  -rw-rw-r--. 1 bartoletti1 publishers 1671936984 Jun  3 11:16 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_004101-006012.nc
  -rw-rw-r--. 1 bartoletti1 publishers 1671951047 Jun  3 11:44 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_006101-008012.nc
  -rw-rw-r--. 1 bartoletti1 publishers 1671957729 Jun  3 12:11 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_008101-010012.nc
  -rw-rw-r--. 1 bartoletti1 publishers 1672638563 Jun  3 12:38 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_010101-012012.nc
  -rw-rw-r--. 1 bartoletti1 publishers 1673099166 Jun  3 13:05 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_012101-014012.nc
  -rw-rw-r--. 1 bartoletti1 publishers  836646640 Jun  3 13:15 so_Omon_E3SM-2-1_1pctCO2_r1i1p1f1_gr_014101-015012.nc

and their sum is back to 12,535,711,546 bytes, close to the original single-file size of 12.53 GB.

I was pleasantly surprised when Charlie tested both with “ncdiff” and concluded that the files were, data-wise, BFB identical. (But are they both legitimately CMIP6-compliant?)

Using “ncks -D 2 --hdn -M -m” to expose some “hidden” metadata, the difference is revealed:

Post-CMOR:

       lat:_Storage = "contiguous" ; // char

Post-ncrcat “splitter”:

       lat:_Storage = "chunked" ; // char
       lat:_ChunkSizes = 180 ; // int
       lat:_DeflateLevel = 1 ; // int
       lat:_Shuffle = "true" ; // char

I honestly don’t know the technical differences between the storage formats, whether one allows greater compression, or whether there is a performance hit here. The post-generation “splitter” requires about 1 minute/GB (with 1 worker). But if it were possible to save ~15% in disk storage and network transfer size, this seems like something one would want to pursue, all else being equal. I just wonder whether it is possible to produce “chunked” storage natively in CMOR output, rather than post-process the files.

@matthew-mizielinski

Hi @TonyB9000,

If you have a look at the CMOR API, there is a call, https://cmor.llnl.gov/mydoc_cmor3_api/#cmor_set_deflate, that allows you to set the deflation and bit-shuffle* options for a variable before you start writing data. Note that deflation comes at a cost for users of the data: deflation above level 2 is not recommended, and I have a vague recollection that compression of coordinate variables can cause issues for users too.
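In the Python bindings that might look roughly like this (a sketch only; the table, metadata JSON, axis setup, and data array are placeholders, not a runnable recipe):

    import cmor

    cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE)
    cmor.dataset_json("input.json")      # experiment metadata (placeholder path)
    cmor.load_table("CMIP6_Omon.json")

    # ... cmor.axis() calls for time/depth/lat/lon elided; ids collected in axis_ids ...
    var_id = cmor.variable("so", "0.001", axis_ids)

    # Called after cmor.variable() but before any cmor.write();
    # arguments are (var_id, shuffle, deflate, deflate_level).
    cmor.set_deflate(var_id, 1, 1, 1)

    cmor.write(var_id, data)             # data: numpy array shaped to the axes
    cmor.close()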

I thought that this recommendation was documented somewhere, but I can only find passing references to it in a few places online. We must do better for CMIP7.

*improves the deflation performance.

@TonyB9000 (Author)

Hi @matthew-mizielinski,

Thanks for the info! I see what you mean about "documented": the API mentions only the values you can give, and nothing about the effects they might have. Perhaps I'll experiment with the time-space tradeoffs. Perhaps Charlie Zender will have some insight. I'll also poke the interwebs. Cheers!

@matthew-mizielinski

@TonyB9000,

There are plans afoot to support more of the netCDF API for reducing precision within CMOR3 as part of a 3.9.x release -- if used appropriately this will help with storage, but there is a risk of losing scientific value if too much precision is removed (consider residuals in energy/water budget calculations, for example).
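For context, the underlying netCDF feature here is quantization (added in netCDF-C 4.9). In the netCDF4-python bindings it looks roughly like this (an illustrative sketch, not CMOR code; assumes netCDF4 >= 1.6 built against netCDF-C >= 4.9):

    import numpy as np
    from netCDF4 import Dataset

    ds = Dataset("quantized.nc", "w")
    ds.createDimension("time", 12)

    # Keep ~3 significant digits; the discarded mantissa bits become
    # highly compressible. This is lossy -- hence the caveat above.
    v = ds.createVariable("so", "f4", ("time",),
                          zlib=True, complevel=1, shuffle=True,
                          significant_digits=3)
    v[:] = 30.0 + np.random.rand(12)
    ds.close()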

Please close this issue if you are happy this discussion has covered your query.

@TonyB9000 (Author)

Hi @matthew-mizielinski,

Knowing I can access the CMOR deflate and shuffle API through the e3sm_to_cmip code is sufficient! Closing the issue, and thanks!

@durack1 (Contributor) commented Jun 11, 2024

> I have a vague recollection that compression of coordinate variables can cause issues for users too.

@matthew-mizielinski thanks for raising this. FYI, we have a request for coordinate compression in #674, so if there are problems we need to consider, it'd be great to bring them up so we don't create two problems by solving one.

@durack1 (Contributor) commented Jun 11, 2024

> Thanks for the info! I see what you mean about "documented": the API mentions only the values you can give, and nothing about the effects they might have.

This is really hard to document effectively. If you have land-only data, and have your mask assigned correctly, you should get extremely good deflation stats, as ~70% of your grid is missing; similarly for sea-ice/siconc and other variables with a huge percentage of missing data. To document this correctly you'd need to capture 1) mask differences and 2) data with very large and very small ranges (e.g. ocean salinity is mostly between 30.0 and 40.0 PSS-78, whereas some other variables span orders of magnitude more values), amongst numerous other data specifics. If I remember correctly, the shift in units from CMIP5 thetao = K to CMIP6 thetao = degC saved us some space, but maybe I am remembering this the wrong way around.
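The mask effect is easy to demonstrate (an illustrative netCDF4-python sketch; the field, fill value, and 70% fraction are made up for the demo): write the same random field twice, once with ~70% of points replaced by the fill value, and compare the deflated file sizes:

    import os
    import numpy as np
    from netCDF4 import Dataset

    rng = np.random.default_rng(0)
    field = rng.normal(size=(180, 360)).astype("f4")
    masked = field.copy()
    masked[rng.random(field.shape) < 0.7] = 1.0e20   # ~70% "missing", as for land-only data

    for name, data in [("full.nc", field), ("masked.nc", masked)]:
        ds = Dataset(name, "w")
        ds.createDimension("lat", 180)
        ds.createDimension("lon", 360)
        v = ds.createVariable("v", "f4", ("lat", "lon"),
                              zlib=True, complevel=1, shuffle=True,
                              fill_value=1.0e20)
        v[:] = data
        ds.close()
        print(name, os.path.getsize(name), "bytes")   # masked.nc deflates far better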

There are also interweb resources that give you some tidbits to consider, e.g. Unidata's per-variable compression example, Unidata's generic compression advice, @czender's E3SM NCO lossy compression post, and DKRZ guidance with some xarray examples of lossy and lossless compression options.

@taylor13 (Collaborator)

If you're talking lossless compression, you'll get better compression in K (rather than C) because one of the significant figures is always either "2" (as in 290 K) or "3" (as in 302 K). In effect, the precision of your number in K is lower than the precision in C.
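This is easy to test outside netCDF (an illustrative sketch; the manual byte-shuffle mimics netCDF's shuffle filter, and the value ranges are made up): deflate the same float32 temperatures expressed in degC and in K and compare:

    import zlib
    import numpy as np

    rng = np.random.default_rng(0)
    deg_c = rng.uniform(-2.0, 35.0, size=1_000_000).astype("<f4")  # ocean-ish range
    kelvin = (deg_c + 273.15).astype("<f4")

    def shuffled_deflated_size(arr):
        # Byte-shuffle (group byte 0 of every value, then byte 1, ...) and deflate.
        shuf = arr.view(np.uint8).reshape(-1, 4).T.copy()
        return len(zlib.compress(shuf.tobytes(), level=1))

    for label, arr in [("degC", deg_c), ("K", kelvin)]:
        print(label, shuffled_deflated_size(arr), "bytes")
    # The K array compresses better: its exponent byte is constant, and its
    # leading significant figure is always 2 or 3.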

@durack1 (Contributor) commented Jun 11, 2024

> you'll get better compression in K (rather than C)

OK, so my memory was the inverse of reality... which, as I was writing it, I had wondered about; unless of course some weird mask quirk did actually lead to my inferred results. I could of course test this out for myself, but might defer that adventure to another day.

@durack1 (Contributor) commented Jun 11, 2024

> There are plans afoot to support more of the netCDF API for reducing precision within CMOR3 as part of a 3.9.x release -- if used appropriately this will help with storage, but there is a risk of losing scientific value if too much precision is removed (consider residuals in energy/water budget calculations, for example).

@matthew-mizielinski is 100% correct: buyer beware, for sure. For an investigation of what you could gain (or lose), take a peek at Klower et al., 2021 - Compressing atmospheric data into its real information content, along with Baker et al., 2016 - Evaluating lossy data compression on climate simulation data within a large ensemble.

To see how this is being done within a model, see Klower et al., 2020 - Number Formats, Error Mitigation, and Scope for 16-Bit Arithmetics in Weather and Climate Modeling Analyzed With a Shallow Water Model; Paxton et al., 2022 - Climate Modeling in Low Precision: Effects of Both Deterministic and Stochastic Rounding; and Milroy et al., 2019 - Investigating the Impact of Mixed Precision on Correctness for a Large Climate Code.

And just to put it here: Zarr compressors.

@TonyB9000 (Author)

Hi @matthew-mizielinski,

> Thanks for the info! I see what you mean about "documented": the API mentions only the values you can give, and nothing about the effects they might have. Perhaps I'll experiment with the time-space tradeoffs. Perhaps Charlie Zender will have some insight. I'll also poke the interwebs.

Addendum: the default CMOR output deflate level is 1, and is left unchanged by "set_deflate(varid, True, True, 1)". The first "True" applies "shuffle", which gave the 14% file-size reduction, but early tests indicate this incurs a 50+% performance hit. I intend to make this an E2C option, but not the default.
