
Use kerchunk to join variables and ensembles!? #8

Open
rsignell-usgs opened this issue Jun 22, 2022 · 21 comments

Comments

@rsignell-usgs
Collaborator

@peterm790, please take a look at https://discourse.pangeo.io/t/efficient-access-of-ensemble-data-on-aws/2530 and see if you think the current version of kerchunk could handle it!

@peterm790
Collaborator

@rsignell-usgs I think this could work. I will have a go at using the file names, as in #4, to append the ensemble member number to the variable names, which I think would allow creating a single virtual dataset.

@peterm790
Collaborator

peterm790 commented Jun 23, 2022

I suspect there may be a better way to do this, but for now the following will work to create a virtual dataset across all ensemble members: https://nbviewer.org/gist/peterm790/f50179e19ca54de63896d7340bd5c878

This obviously only deals with creating a virtual dataset across the ensemble, not updating it when new runs become available.

@martindurant
Member

Nice to show it can be done!
I'm pretty sure that you can replace the postprocess with coo_map={"VARNAME": ...} where you would fill in the ellipsis with a regex (re.Pattern) that gets the variable name from the filename.
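
For illustration, a pattern like the following could serve as that `re.Pattern`. The filename layout here is made up purely for the example; the real NWM paths may differ:

```python
import re

# Hypothetical filename layout, e.g. "s3://bucket/streamflow_member3.json".
# The capture group pulls the variable name out of the path.
pattern = re.compile(r".*/(\w+)_member\d+\.json")

name = pattern.match("s3://bucket/streamflow_member3.json").group(1)
print(name)  # streamflow
```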

@rsignell-usgs
Collaborator Author

@peterm790 nice work making this workflow compact and efficient!

I was wondering whether it would be possible to create an ensemble dimension using kerchunk, instead of the ensemble members being in different variables.

@martindurant
Member

would be possible to create an ensemble dimension

This would have labels rather than coordinates? I don't see why not; we would need to specify an object dtype and maybe somewhere specify the encoding for this (JSON would be fine).

@peterm790
Collaborator

@rsignell-usgs I think having the ensembles along a dimension would be neater although I'm not entirely sure if we can add a dimension using kerchunk? I could maybe modify the crs dimension to achieve this?

This will work, however:

  flist = fs2.glob('/home/peterm790/shared/users/petermarsh/ensemble_NWM/*.json')

  d_ = []
  for file in flist:
      fs = fsspec.filesystem("reference", fo=file,
                             ref_storage_args={'skip_instance_cache': True},
                             remote_protocol='s3', remote_options={'anon': False})
      d_.append(fs.get_mapper(""))

  ds_kerchunk = xr.open_mfdataset(d_, engine="zarr",
                                  backend_kwargs={'consolidated': False},
                                  chunks={}, combine='nested',
                                  concat_dim=[list(range(1, 8))])

@martindurant
Member

If xarray can concat it, kerchunk can :)

@rsignell-usgs
Collaborator Author

In xarray concat it's possible to specify a new dimension name.
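
A minimal sketch of what that looks like, with two toy "ensemble member" datasets (the data and dimension names here are invented for illustration):

```python
import numpy as np
import xarray as xr

# Two toy datasets with identical coordinates, standing in for members.
members = [xr.Dataset({"flow": ("x", np.arange(3.0) + i)}) for i in range(2)]

# Passing a name that does not yet exist creates the dimension on the fly.
ds = xr.concat(members, dim="ensemble")
print(ds.flow.shape)  # (2, 3)
```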

@martindurant
Member

I added the following test to the codebase to show that kerchunk already supports this:

def test_var_dimension(refs):
    mzz = MultiZarrToZarr([refs["simple1"], refs["simple_var1"]], remote_protocol="memory",
                          coo_map={"varname": "VARNAME", "var": "output"})
    out = mzz.translate()
    z = xr.open_dataset(
        "reference://",
        backend_kwargs={"storage_options": {"fo": out, "remote_protocol": "memory"},
                        "consolidated": False},
        engine="zarr",
        chunks={}
    )
    assert list(z) == ["output"]
    assert list(z.dims) == ["varname", "x", "y"]
    assert list(z.varname) == ["data", "datum"]
    assert z.output.shape == (2, 10, 10)
    assert (z.output.values == arr).all()

We say that dimension "varname" is created from the names of the variables in the input, and that the variable name in the output is to be "output".

@martindurant
Member

Will leave this open pending a successful example more complex than my test case.

@rsignell-usgs
Collaborator Author

Hopefully @peterm790 can try this out on the National Water Model ensemble members!

@peterm790
Collaborator

Okay, so I have had some success creating the ensemble dimension using the MultiZarrToZarr regex method:

  ex = re.compile(r'\d+(?=.json)')

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex},
                        concat_dims=['ensemble'],
                        identical_dims=['reference_time', 'time', 'feature_id'])

(Here's the full work flow: https://gist.github.com/peterm790/c8c08a1e23aacfa007844cbc5affef66)

I did, however, have to modify the regex handling in MultiZarrToZarr to use a search rather than a match: fsspec/kerchunk@a37acc0

I am not sure if that change is necessary, and it could well be due to me being quite new to regex, but perhaps an option for the user to set regex='match' or regex='search' could be a good idea?
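
For reference, the difference between the two: `re.match` anchors at the start of the string, while `re.search` scans the whole string, so a bare digits-plus-lookahead pattern only succeeds with the latter (the filename below is made up):

```python
import re

ex = re.compile(r'\d+(?=\.json)')   # digits immediately before ".json"
path = "ensemble_member_3.json"     # made-up filename for illustration

print(ex.match(path))               # None: match anchors at position 0
print(ex.search(path).group())      # "3": search scans the whole string
```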

Then a final issue: the above results in feature_id being all NaN. I will see if I can get to the bottom of why this is today.

@peterm790
Collaborator

peterm790 commented Jun 27, 2022

In fact, the above does not work, as all variables are NaN. The following will map all coordinates correctly, but the variables are still all NaN:

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex},
                        concat_dims=['feature_id', 'ensemble'],
                        identical_dims=['reference_time', 'time'])

Which is odd, as the mappings within the final JSONs (for instance ['nudge/0.0.0']) are identical between this and the initial version, which renamed the variables using postprocess.

@martindurant
Member

The following regex appears to work: r'.*?(\d+).json'.
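
A quick check that this pattern works with a plain `match`, because the lazy `.*?` prefix consumes the leading characters up to the digits (the filename is made up):

```python
import re

pat = re.compile(r'.*?(\d+).json')   # the pattern suggested above
m = pat.match("ensemble_member_12.json")   # made-up filename
print(m.group(1))  # 12
```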

Also, I see "remote_protocol": "memory"; this should still be "s3", I think, since it refers to the location of the original data files.

@martindurant
Member

By the way, you may find it helpful in debugging this kind of thing to turn on logging, e.g.,

fsspec.utils.setup_logging(logger_name="fsspec.reference")  # the reference file systems
fsspec.utils.setup_logging(logger_name="s3fs")  # S3 calls

@peterm790
Collaborator

Aah right! Sorry, forgetting to change the "remote_protocol" was a very silly error. In that case the following workflow works to create a virtual dataset from the NWM ensemble: https://nbviewer.org/gist/peterm790/6e4826ed83218c111f71c9fe95e69163

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Jun 28, 2022

Very cool Peter!

Just a note to answer an issue I saw you raised in that notebook: that you can't create the JSON for the new data without actually accessing the files, because the byte ranges change.

Unless the stored data uses no compression (e.g. NetCDF3), the byte ranges will always change, because the blocks will compress by different amounts as the values change. And even for NetCDF3, the byte ranges would likely change, because the metadata changes. Make sense?
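
The point about compressed chunk sizes can be seen with nothing but the standard library: the same number of raw bytes compresses to very different lengths depending on the values, which is why chunk offsets shift from run to run:

```python
import random
import zlib

random.seed(0)
smooth = bytes(80_000)             # a constant field compresses very well
noisy = random.randbytes(80_000)   # random values barely shrink at all

print(len(zlib.compress(smooth)))  # tiny
print(len(zlib.compress(noisy)))   # close to the raw 80,000 bytes
```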

@martindurant
Member

@rsignell-usgs : I can't remember, is CRS supposed to be considered a coordinate, or is it "identical" in all inputs? It doesn't matter that it has 300 8-byte chunks, but perhaps it's a little untidy.

@peterm790
Collaborator

@rsignell-usgs thanks. Yes, this makes sense; a pity, as it would be pretty neat to automatically generate kerchunk references for new model runs.

@martindurant CRS is identical across each ensemble member (as well as across time)

@rsignell-usgs
Collaborator Author

And yes, as @peterm790 said, CRS will always be just a single value for a variable, and usually just a single value for the entire dataset.

@peterm790
Collaborator

Not really sure where I was going with this, but just to explore removing CRS as a variable while keeping the CRS attributes, I set this up using postprocess and the kerchunk.combine.drop() method:

https://nbviewer.org/gist/peterm790/35ca1858db8981c3d4a234f2f48b5975

  def postprocess(refs):
      with open(flist[0]) as json_file:
          esri_pe_string = eval(ujson.load(json_file)['refs']['crs/.zattrs'])['esri_pe_string']
      crs_attrs = eval(refs['crs/.zattrs'])
      crs_attrs["esri_pe_string"] = esri_pe_string
      refs['crs/.zattrs'] = ujson.dumps(crs_attrs)
      return refs

  mzz = MultiZarrToZarr(flist,
                        remote_protocol='s3',
                        remote_options={'anon': True},
                        coo_map={'ensemble': ex, 'crs': ''},
                        concat_dims=['ensemble'],
                        identical_dims=['feature_id', 'reference_time', 'time'],
                        preprocess=kerchunk.combine.drop('crs'),
                        postprocess=postprocess)

  out = mzz.translate()

Again, not really sure if this is necessary, but maybe it's a good demonstration of how to remove variables, as well as how to modify attributes using postprocess.
