Support incremental appending #37
Appending is significantly more complicated for the case discussed in #50: variable items per input. In this case, we don't know the size of the target dataset from the outset, so we can't use simple heuristics like the one I proposed above to figure out the append region. Maybe recipes can implement their own methods for examining the target and determining which input chunks are needed? For that, it seems like the recipe would have to know more about the inputs than just a list of paths. For instance, it might have to understand time. What if …
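A hypothetical sketch of that idea, assuming the recipe can map each input to a time range; the function name and signature are illustrative, not an existing pangeo-forge-recipes API:

```python
# Hypothetical sketch only: a recipe-specific hook that inspects the target and
# decides which inputs still need processing, assuming each input can be mapped
# to a (start, end) time range (e.g. parsed from its filename).
import xarray as xr


def inputs_to_append(target_path, input_time_ranges):
    """Return the keys of inputs whose data is not yet present in the target.

    ``input_time_ranges`` maps an input key to a (start, end) pair of
    datetime-like values comparable with the target's time coordinate.
    """
    target = xr.open_zarr(target_path)
    last_written = target["time"].values.max()
    return [
        key
        for key, (start, _end) in input_time_ranges.items()
        if start > last_written
    ]
```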
Maybe instead of looking at what has been produced, we could look at what has been consumed? For instance, after running a recipe, we could store the list of input files that were processed, so that when we get a new list the next time the recipe is run, we look at the already-processed input files and restore a "resuming" state. That could be done by having a "dry run" that would run the recipe without actually producing anything. We would still have the list of all processed input files, which might be needed when we "finalize" the target.
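A rough sketch of that "record what was consumed" idea, assuming the manifest is kept as a plain JSON file alongside the target; the file name and format are illustrative:

```python
# Rough sketch, not an existing API: keep a JSON manifest of input files that
# have already been processed, and diff new runs against it.
import json
from pathlib import Path

MANIFEST = Path("target_inputs.json")  # illustrative location


def load_processed_inputs():
    """Return the set of input files recorded by previous runs."""
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()


def record_processed_inputs(inputs):
    """Add newly processed inputs to the manifest."""
    processed = load_processed_inputs() | set(inputs)
    MANIFEST.write_text(json.dumps(sorted(processed)))


# On the next run, only the new inputs would be cached and written:
# to_process = [f for f in all_inputs if f not in load_processed_inputs()]
```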
This is a good idea, David. Perhaps we could store the list of input files directly in the target dataset metadata itself (`attrs`).
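For a Zarr target, that could look something like the sketch below; the attribute name is a placeholder, not an agreed convention:

```python
# Sketch of keeping the processed-input list in the target's own metadata,
# assuming a Zarr target; the attribute name is a placeholder.
import zarr

all_inputs = ["file_2021-01.nc", "file_2021-02.nc"]  # example input list

store = zarr.open_group("target.zarr", mode="a")
already_done = set(store.attrs.get("pangeo_forge_processed_inputs", []))
new_inputs = [f for f in all_inputs if f not in already_done]

# ... cache and write only new_inputs ...

store.attrs["pangeo_forge_processed_inputs"] = sorted(already_done | set(new_inputs))
```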
I was thinking about that, but what if the target is not in the Zarr format? Do other formats all have metadata that we can use for this purpose? I don't know COG that much, but I'm not sure it does, for instance.
Sounds like the kind of thing that naturally belongs in a catalogue, perhaps backed by a db, so that updates are safe against races.
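A minimal sketch of the catalogue-backed alternative, using SQLite purely for illustration; a real deployment would presumably use a proper catalogue service:

```python
# Illustrative only: track processed inputs per recipe in a small database so
# that concurrent updates stay consistent.
import sqlite3

conn = sqlite3.connect("pangeo_forge_catalogue.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS processed_inputs ("
    "  recipe TEXT, input_url TEXT, PRIMARY KEY (recipe, input_url))"
)


def mark_processed(recipe, input_url):
    # INSERT OR IGNORE inside a transaction keeps concurrent writers consistent.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO processed_inputs VALUES (?, ?)",
            (recipe, input_url),
        )


def unprocessed(recipe, all_inputs):
    rows = conn.execute(
        "SELECT input_url FROM processed_inputs WHERE recipe = ?", (recipe,)
    ).fetchall()
    done = {url for (url,) in rows}
    return [f for f in all_inputs if f not in done]
```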
The input hashing stuff introduced by @cisaacstern in #349 should make this doable. The user story for this is being tracked in pangeo-forge/user-stories#5. Charles, would you be game for diving into this and developing a prototype?
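For context, the general idea behind input hashing can be sketched like this; illustrative only, not the actual implementation from #349:

```python
# Illustrative sketch: derive a deterministic hash for each input specification
# so a later run can recognise inputs that were already processed.
import hashlib
import json


def input_hash(input_spec):
    """Deterministically hash an input's defining metadata (URL, open kwargs, etc.)."""
    canonical = json.dumps(input_spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()


already_processed = {input_hash({"url": "s3://bucket/file_2021-01.nc"})}
all_specs = [
    {"url": "s3://bucket/file_2021-01.nc"},
    {"url": "s3://bucket/file_2021-02.nc"},
]
todo = [spec for spec in all_specs if input_hash(spec) not in already_processed]
```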
In order for this to work within …
That sounds reasonable to me. I think in general we should be injecting extra metadata into the datasets we write. Stuff like …
Addressing that as a standalone issue would be a good place to start.
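A hedged sketch of what injecting that extra metadata could look like; the attribute names are placeholders, not an agreed-upon schema:

```python
# Placeholder attribute names; the point is only that provenance metadata gets
# written into the target alongside the data.
import datetime

import xarray as xr

ds = xr.Dataset({"temp": ("time", [1.0, 2.0])})
ds.attrs.update(
    {
        "pangeo_forge:recipe_version": "0.0.0-example",
        "pangeo_forge:inputs_processed": ["file_2021-01.nc"],
        "pangeo_forge:written_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
)
ds.to_zarr("example_target.zarr", mode="w")
```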
Per conversation at today's coordination meeting, people felt it would be simpler to have a single tracking issue for appending, so closing this and directing further discussion to #447.
Currently, when a recipe is run, it will always cache all of the inputs and write all of the chunks. However, it would be nice to have an option where, if the target already exists, it only writes NEW chunks. This raises some design questions.

… the `prepare_target` stage). However, for this to work, the `iter_inputs()` and `iter_chunks()` methods need to know which inputs and chunks to process. In order to build the pipeline for execution, this information needs to already be inside the recipe object. So this implies that we need to open the target in `__post_init__`. Could this cause problems? This intersects a bit with the "versioning" question in #3.

If we agree on the answers to the questions above, I think we can move ahead with implementing incremental updates to the `NetCDFZarrSequentialRecipe` class.
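A minimal sketch of the `__post_init__` idea, assuming a Zarr target with a `time` dimension and a fixed number of items per input; the class and attribute names are illustrative, not the real `NetCDFZarrSequentialRecipe` implementation:

```python
# Illustrative only: open the target when the recipe object is constructed so
# that the execution pipeline can be built with the append information in hand.
import os
from dataclasses import dataclass, field

import xarray as xr


@dataclass
class AppendAwareRecipe:
    input_urls: list
    target_path: str
    nitems_per_input: int = 1
    _existing_items: int = field(init=False, default=0)

    def __post_init__(self):
        # This is the step in question: the target is opened at construction
        # time, so its current size is known before iter_inputs() is called.
        if os.path.exists(self.target_path):
            target = xr.open_zarr(self.target_path)
            self._existing_items = int(target.sizes.get("time", 0))

    def iter_inputs(self):
        # Skip inputs whose items are already present in the target.
        skip = self._existing_items // self.nitems_per_input
        yield from self.input_urls[skip:]
```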