Skip re-computing metadata cache. #243
Conversation
xref #224
@alxmrs - I think this change makes sense. To move forward, we would need two tests: …

Alex, if we can just get a test for …
I'm happy to write unit tests for this – I will push a patch for this tomorrow. I'm investigating if an additional flag is needed for this feature – should the user be able to set a …
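To make the question concrete, here is a minimal sketch of the kind of opt-in flag I have in mind. The `FilePattern`/`XarrayZarrRecipe` setup is just a toy example, and the `recompute_metadata` keyword (left commented out) is hypothetical, not part of the current API:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def make_url(time):
    # Placeholder URL scheme for illustration only.
    return f"https://example.com/data-{time:02d}.nc"

pattern = FilePattern(make_url, ConcatDim("time", keys=list(range(3)), nitems_per_file=24))

recipe = XarrayZarrRecipe(
    pattern,
    target_chunks={"time": 24},
    # Hypothetical flag (not the real API): when False, reuse any metadata
    # already present in the metadata cache instead of recomputing it.
    # recompute_metadata=False,
)
```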
@jbusecke and I are working on some derived datasets, where the inputs are already in Zarr format on the cloud. In this case, we would not be caching inputs, and therefore need some way to call … xref #224
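To make that use case concrete, here is a rough sketch of the kind of recipe we have in mind. The paths are made up, and I'm assuming `cache_inputs=False` together with `xarray_open_kwargs={"engine": "zarr"}` is the right way to open existing cloud Zarr stores directly:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

def make_path(time):
    # Illustrative paths; the real inputs are existing Zarr stores on cloud storage.
    return f"gs://example-bucket/derived/{time}.zarr"

pattern = FilePattern(make_path, ConcatDim("time", keys=["2020-01", "2020-02"]))

recipe = XarrayZarrRecipe(
    pattern,
    cache_inputs=False,                      # leave the inputs where they are
    xarray_open_kwargs={"engine": "zarr"},   # open each Zarr store directly with xarray
)
```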
My plan once this is merged is to start refactoring this recipe to have several [optional] standalone stages (#224), rather than just one big …

Here we have a choice: another option to add to the recipe config for …

This discussion reminds me a lot of pangeo-forge/roadmap#29: where in our workflow do we allocate and manage temporary storage? In pangeo-forge/roadmap#29 we discussed this for the target storage and resolved it with pangeo-forge/roadmap#34... But we have not really had that discussion for cache storage or metadata storage. Maybe this is something the recipe orchestrator could manage.

Alex, can I ask why you need to recompute the metadata? This might help figure out the best way to support that need.
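For context, the way I picture the three storage locations being wired up today is roughly the following. This is a sketch based on my reading of `pangeo_forge_recipes.storage`; class and attribute names may differ across versions, and `recipe` is assumed to be an `XarrayZarrRecipe` like the ones sketched above:

```python
import tempfile
import fsspec
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget, MetadataTarget

fs = fsspec.filesystem("file")
tmp = tempfile.mkdtemp()

# Three independent storage locations: who allocates them, and who decides
# when the metadata cache can be reused rather than recreated?
target = FSSpecTarget(fs, f"{tmp}/target.zarr")
input_cache = CacheFSSpecTarget(fs, f"{tmp}/input-cache")
metadata_cache = MetadataTarget(fs, f"{tmp}/metadata-cache")

# `recipe` is assumed to come from an earlier sketch.
recipe.target = target
recipe.input_cache = input_cache
recipe.metadata_cache = metadata_cache
```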
I hit an error unrelated to the metadata cache (I'll probably bring this up in our Monday meeting anyway; see below). I noticed in the error trace that some metadata may be off. I had iterated a bunch on the Recipe parameters as well as the structure of the input data, and since I had cached the metadata previously, I wanted to rule out the possibility that the pipeline was failing due to stale metadata. When I recomputed the metadata (by changing the path for temp storage), I noticed that the error trace changed.

To be more specific: in the first run, it reported inconsistent sequence lengths across the merge dimensions (the data is stored by month and chunked by hour; across every merge dimension, i.e. each group of variables, the number of hours per month should be the same). In the second run, the sequence lengths across the concat dimension differed (a different total number of hours per month), but they were all the same across the merge dimensions.

Edit: I should mention that I'm working with a large GRIB2 dataset.
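For what it's worth, the manual sanity check I ended up doing looks roughly like the sketch below. File names are made up; it only confirms that, for a single month, the variable groups along the merge dimension agree on the number of hourly steps:

```python
import xarray as xr

# Hypothetical local copies of one month of the GRIB2 inputs, one file per
# variable group (i.e. one entry per merge dimension key).
paths = {
    "group_a": "2020-01_group_a.grib2",
    "group_b": "2020-01_group_b.grib2",
}

lengths = {}
for group, path in paths.items():
    ds = xr.open_dataset(path, engine="cfgrib")  # requires the cfgrib package
    lengths[group] = ds.sizes["time"]

# Within a month, every merge group should report the same number of hours;
# across months (the concat dimension) the counts can legitimately differ.
assert len(set(lengths.values())) == 1, f"merge groups disagree: {lengths}"
```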
A fix for #241.