
Use kerchunk to join variables and ensembles!? #8

Open
rsignell-usgs opened this issue Jun 22, 2022 · 21 comments

Comments

@rsignell-usgs
Collaborator

@peterm790, please take a look at https://discourse.pangeo.io/t/efficient-access-of-ensemble-data-on-aws/2530 and see if you think the current version of kerchunk could handle it!

@peterm790
Collaborator

@rsignell-usgs I think this could work. I will have a go at using the file names, as in #4, to append the ensemble member number to the variable names, which I think would allow creating a single virtual dataset.

@peterm790
Collaborator

peterm790 commented Jun 23, 2022

I suspect there may be a better way to do this, but for now the following will work to create a virtual dataset across all ensemble members: https://nbviewer.org/gist/peterm790/f50179e19ca54de63896d7340bd5c878

This obviously only deals with creating a virtual dataset across the ensemble, not updating it when new runs become available.

@martindurant
Member

Nice to show it can be done!
I'm pretty sure that you can replace the postprocess with coo_map={"VARNAME": ...} where you would fill in the ellipsis with a regex (re.Pattern) that gets the variable name from the filename.
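
For illustration, a pattern like the following could serve as that `re.Pattern`. The filename layout here is made up purely for the example; the real NWM paths may differ:

```python
import re

# Hypothetical filename layout, e.g. "s3://bucket/streamflow_member3.json".
# The capture group pulls the variable name out of the path.
pattern = re.compile(r".*/(\w+)_member\d+\.json")

name = pattern.match("s3://bucket/streamflow_member3.json").group(1)
print(name)  # streamflow
```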

@rsignell-usgs
Collaborator Author

@peterm790 nice work making this workflow compact and efficient!

I was wondering whether it would be possible to create an ensemble dimension using kerchunk, instead of the ensemble members being in different variables.

@martindurant
Member

would be possible to create an ensemble dimension

This would have labels rather than coordinates? I don't see why not; we would need to specify an object dtype and maybe somewhere specify the encoding for this (JSON would be fine).

@peterm790
Collaborator

@rsignell-usgs I think having the ensembles along a dimension would be neater although I'm not entirely sure if we can add a dimension using kerchunk? I could maybe modify the crs dimension to achieve this?

This will work, however:

  flist = fs2.glob('/home/peterm790/shared/users/petermarsh/ensemble_NWM/*.json')

  d_ = []
  for file in flist:
      fs = fsspec.filesystem("reference", fo=file,
                             ref_storage_args={'skip_instance_cache': True},
                             remote_protocol='s3', remote_options={'anon': False})
      d_.append(fs.get_mapper(""))

  ds_kerchunk = xr.open_mfdataset(d_, engine="zarr",
                                  backend_kwargs={'consolidated': False},
                                  chunks={}, combine='nested',
                                  concat_dim=[list(range(1, 8))])

@martindurant
Member

If xarray can concat it, kerchunk can :)

@rsignell-usgs
Collaborator Author

In xarray concat it's possible to specify a new dimension name.
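
A minimal sketch of what that looks like, with two toy "ensemble member" datasets (the data and dimension names here are invented for illustration):

```python
import numpy as np
import xarray as xr

# Two toy datasets with identical coordinates, standing in for members.
members = [xr.Dataset({"flow": ("x", np.arange(3.0) + i)}) for i in range(2)]

# Passing a name that does not yet exist creates the dimension on the fly.
ds = xr.concat(members, dim="ensemble")
print(ds.flow.shape)  # (2, 3)
```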

@martindurant
Member

I added the following test to the codebase to show that kerchunk already supports this:

def test_var_dimension(refs):
    mzz = MultiZarrToZarr([refs["simple1"], refs["simple_var1"]], remote_protocol="memory",
                          coo_map={"varname": "VARNAME", "var": "output"})
    out = mzz.translate()
    z = xr.open_dataset(
        "reference://",
        backend_kwargs={"storage_options": {"fo": out, "remote_protocol": "memory"},
                        "consolidated": False},
        engine="zarr",
        chunks={}
    )
    assert list(z) == ["output"]
    assert list(z.dims) == ["varname", "x", "y"]
    assert list(z.varname) == ["data", "datum"]
    assert z.output.shape == (2, 10, 10)
    assert (z.output.values == arr).all()

We say that dimension "varname" is created from the names of the variables in the input, and that the variable name in the output is to be "output".

@martindurant
Member

Will leave this open pending a successful example more complex than my test case.

@rsignell-usgs
Collaborator Author

Hopefully @peterm790 can try this out on the National Water Model ensemble members!

@peterm790
Collaborator

Okay, so I have had some success creating the ensemble dimension using the MultiZarrToZarr regex method:

  ex = re.compile(r'\d+(?=.json)')

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex},
                        concat_dims=['ensemble'],
                        identical_dims=['reference_time', 'time', 'feature_id'])

(Here's the full work flow: https://gist.github.com/peterm790/c8c08a1e23aacfa007844cbc5affef66)

I did, however, have to modify the regex handling in MultiZarrToZarr to use a search rather than a match: fsspec/kerchunk@a37acc0

I am not sure if that change is necessary, and it could well be due to me being quite new to regex, but perhaps an option for the user to set regex='match' or regex='search' could be a good idea?
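
For reference, the difference between the two: `re.match` anchors at the start of the string, while `re.search` scans the whole string, so a bare digits-plus-lookahead pattern only succeeds with the latter (the filename below is made up):

```python
import re

ex = re.compile(r'\d+(?=\.json)')   # digits immediately before ".json"
path = "ensemble_member_3.json"     # made-up filename for illustration

print(ex.match(path))               # None: match anchors at position 0
print(ex.search(path).group())      # "3": search scans the whole string
```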

Then a final issue: the above results in feature_id being all NaN. I will see if I can get to the bottom of why this is today.

@peterm790
Collaborator

peterm790 commented Jun 27, 2022

In fact, the above does not work, as all variables are NaN. The following will map all coordinates correctly, but the variables are still all NaN:

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex},
                        concat_dims=['feature_id', 'ensemble'],
                        identical_dims=['reference_time', 'time'])

Which is odd, as the mappings within the final JSONs (for instance ['nudge/0.0.0']) are identical between this and the initial version, which renamed the variables using postprocess.

@martindurant
Member

The following regex appears to work: r'.*?(\d+).json'.
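
A quick check that this pattern works with a plain `match`, because the lazy `.*?` prefix consumes the leading characters up to the digits (the filename is made up):

```python
import re

pat = re.compile(r'.*?(\d+).json')   # the pattern suggested above
m = pat.match("ensemble_member_12.json")   # made-up filename
print(m.group(1))  # 12
```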

Also, I see "remote_protocol": "memory"; this should still be "s3", I think, since it refers to the location of the original data files.

@martindurant
Member

By the way, you may find it helpful in debugging this kind of thing to turn on logging, e.g.,

fsspec.utils.setup_logging(logger_name="fsspec.reference")  # the reference file systems
fsspec.utils.setup_logging(logger_name="s3fs")  # S3 calls

@peterm790
Collaborator

Aah right! Sorry, forgetting to change the "remote_protocol" was a very silly error. In that case the following workflow works to create a virtual dataset from the NWM ensemble: https://nbviewer.org/gist/peterm790/6e4826ed83218c111f71c9fe95e69163

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Jun 28, 2022

Very cool Peter!

Just a note to answer an issue I saw you raised in that notebook: that you can't create the JSON for the new data without actually accessing the files, because the byte ranges change.

Unless the stored data uses no compression (e.g. NetCDF3), the byte ranges will always change, because the blocks will compress by different amounts as the values change. And even for NetCDF3, the byte ranges would likely change, because the metadata changes. Make sense?
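
The point about compressed chunk sizes can be seen with nothing but the standard library: the same number of raw bytes compresses to very different lengths depending on the values, which is why chunk offsets shift from run to run:

```python
import random
import zlib

random.seed(0)
smooth = bytes(80_000)             # a constant field compresses very well
noisy = random.randbytes(80_000)   # random values barely shrink at all

print(len(zlib.compress(smooth)))  # tiny
print(len(zlib.compress(noisy)))   # close to the raw 80,000 bytes
```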

@martindurant
Member

@rsignell-usgs : I can't remember, is CRS supposed to be considered a coordinate, or is it "identical" in all inputs? It doesn't matter that it has 300 8-byte chunks, but perhaps it's a little untidy.

@peterm790
Collaborator

@rsignell-usgs thanks. Yes, this makes sense; a pity, as it would be pretty neat to automatically generate kerchunk references for new model runs.

@martindurant CRS is identical across each ensemble member (as well as across time)

@rsignell-usgs
Collaborator Author

And yes, as @peterm790 said, CRS will always be just a single value for a variable, and usually just a single value for the entire dataset.

@peterm790
Collaborator

Not really sure where I was going with this, but just to explore removing CRS as a variable while keeping the CRS attributes, I set this up using postprocess and the kerchunk.combine.drop() method:

https://nbviewer.org/gist/peterm790/35ca1858db8981c3d4a234f2f48b5975

  def postprocess(refs):
      with open(flist[0]) as json_file:
          esri_pe_string = eval(ujson.load(json_file)['refs']['crs/.zattrs'])['esri_pe_string']
      crs_attrs = eval(refs['crs/.zattrs'])
      crs_attrs["esri_pe_string"] = esri_pe_string
      refs['crs/.zattrs'] = ujson.dumps(crs_attrs)
      return refs

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex, 'crs': ''},
                        concat_dims=['ensemble'],
                        identical_dims=['feature_id', 'reference_time', 'time'],
                        preprocess=kerchunk.combine.drop('crs'),
                        postprocess=postprocess)

  out = mzz.translate()

Again, not really sure if this is necessary, but maybe it's a good demonstration of how to remove variables, as well as how to modify attributes using postprocess.
