
Support incremental appending #37

Closed
rabernat opened this issue Jan 22, 2021 · 9 comments
Labels
design question (A question of the design of Pangeo Forge recipes) · enhancement (Solving this requires us to enhance the recipe classes)

Comments

@rabernat
Contributor

Currently, when a recipe is run, it will always cache all of the inputs and write all of the chunks. However, it would be nice to have an option where, if the target already exists, it only writes NEW chunks. This raises some design questions.

  • Currently, the target is never read until we start to execute the recipe (not until the prepare_target stage). However, for this to work, the iter_inputs() and iter_chunks() methods need to know which inputs and chunks to process. In order to build the pipeline for execution, this information needs to already be inside the recipe object. So this implies that we need to open the target in __post_init__. Could this cause problems?
  • How do we align the recipe with the target? For the standard NetCDFZarrSequential recipe, it may be as simple as looking at the length of the sequence dimension: if the target has 100 items but the recipe has 120, we assume the last 20 need to be appended (see the sketch after this list). But are there edge cases to worry about?
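
For concreteness, a minimal sketch of that length-based heuristic, assuming a Zarr target whose sequence dimension is stored as a 1-D coordinate array (the function and argument names are hypothetical, not part of any recipe API):

    import zarr

    def guess_append_region(target_store, sequence_dim, n_recipe_items):
        # Compare what the target already holds against what the recipe describes.
        group = zarr.open_group(target_store, mode="r")
        n_target_items = group[sequence_dim].shape[0]
        if n_target_items > n_recipe_items:
            raise ValueError("Target is longer than the recipe; refusing to append.")
        # E.g. target has 100 items, recipe has 120 -> append items 100..119.
        return slice(n_target_items, n_recipe_items)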

This intersects a bit with the "versioning" question in #3.

If we agree on the answers to the questions above, I think we can move ahead with implementing incremental updates to the NetCDFZarrSequentialRecipe class.

@rabernat added the design question and enhancement labels on Jan 22, 2021
@rabernat
Contributor Author

Appending is significantly more complicated for the case discussed in #50: variable items per input. In this case, we don't know the size of the target dataset from the outset, so we can't use simple heuristics like the one I proposed above to figure out the append region.

Maybe recipes can implement their own methods for examining the target and determining which input chunks are needed? For that, it seems like the recipe would have to know more about the inputs than just a list of paths. For instance, it might have to understand time. What if input_urls were a dictionary rather than a list, and the keys held some semantic meaning that could be used to compare to the target?
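
As a rough illustration of that idea (all paths and values here are made up), keys with temporal meaning could be compared directly against the target's time coordinate:

    import pandas as pd
    import xarray as xr

    # Hypothetical: key each input URL by the time period it covers.
    input_urls = {
        pd.Timestamp("2021-01-01"): "https://example.com/data-20210101.nc",
        pd.Timestamp("2021-01-02"): "https://example.com/data-20210102.nc",
    }

    target = xr.open_zarr("target.zarr")  # assumes a Zarr target with a time coord
    existing = set(pd.to_datetime(target.time.values))
    new_inputs = {t: url for t, url in input_urls.items() if t not in existing}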

@davidbrochart
Contributor

Maybe instead of looking at what has been produced, we could look at what has been consumed? For instance, after running a recipe, we could store the list of input files that were processed, so that when we get a new list the next time the recipe is run, we can look at the already-processed input files and restore a "resuming" state. That could be done with a "dry run" that would run the recipe without actually producing anything. We would still have the full list of processed input files, which might be needed when we "finalize" the target.
I don't know where we could store the list of processed input files; probably alongside the target, which seems more natural. On the source side, tying a recipe to a target doesn't seem right.
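
A minimal sketch of that idea, assuming the list is stored as a JSON sidecar next to the target (the manifest path is made up, not a convention):

    import json
    import fsspec

    MANIFEST = "target.zarr/.pangeo_forge_inputs.json"  # hypothetical location

    def load_processed_inputs():
        # Inputs consumed by previous runs, if any.
        try:
            with fsspec.open(MANIFEST, "r") as f:
                return set(json.load(f))
        except FileNotFoundError:
            return set()

    def save_processed_inputs(urls):
        with fsspec.open(MANIFEST, "w") as f:
            json.dump(sorted(urls), f)

    # On the next run, only process inputs not seen before:
    # todo = [u for u in all_input_urls if u not in load_processed_inputs()]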

@rabernat
Contributor Author

This is a good idea, David. Perhaps we could store the list of input files directly in the target dataset metadata itself (attrs). This would be useful for incremental appending, but also for general provenance tracking.
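
Assuming a Zarr target, that could look roughly like this (the store path and attr key are illustrative, not an established convention):

    import zarr

    processed = [
        "https://example.com/data-20210101.nc",
        "https://example.com/data-20210102.nc",
    ]

    # Record consumed inputs in the target's attrs for provenance.
    group = zarr.open_group("target.zarr", mode="a")
    group.attrs["pangeo-forge:inputs"] = processed

    # A later run reads the list back to decide which inputs are new.
    seen = set(group.attrs.get("pangeo-forge:inputs", []))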

@davidbrochart
Contributor

Perhaps we could store the list of input files directly in the target dataset metadata itself (attrs).

I was thinking about that, but what if the target is not in the Zarr format? Do other formats all have metadata that we can use for this purpose? I don't know COG very well, for instance, but I'm not sure it does.

@martindurant
Contributor

martindurant commented Feb 21, 2021 via email

@rabernat
Contributor Author

rabernat commented May 5, 2022

The input hashing stuff introduced by @cisaacstern in #349 should make this doable. The user story for this is being tracked in pangeo-forge/user-stories#5.

Charles, would you be game for diving into this and developing a prototype?

@cisaacstern
Member

cisaacstern commented May 5, 2022

In order for this to work within pangeo-forge-recipes entirely (without external information from the database/orchestration layer), we'll need to leave some metadata (i.e. recipe and/or pattern hashes) in the target store. Based on reading the thread, it seems like it could be okay to put this in .zmetadata?

@rabernat
Contributor Author

rabernat commented May 5, 2022

That sounds reasonable to me. I think in general we should be injecting extra metadata into the datasets we write. Stuff like

    {
        "pangeo-forge:version": "0.6.2",
        "pangeo-forge:recipe-hash": "a1b2c3",
        "pangeo-forge:input-hash": "..."
    }

Addressing that as a standalone issue would be a good place to start.
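
For reference, attrs written to the root group end up in .zmetadata once the metadata is consolidated, so a sketch of the injection could be as simple as (hash values are the placeholders from the example above):

    import zarr

    group = zarr.open_group("target.zarr", mode="a")
    group.attrs.update({
        "pangeo-forge:version": "0.6.2",
        "pangeo-forge:recipe-hash": "a1b2c3",
        "pangeo-forge:input-hash": "...",
    })
    zarr.consolidate_metadata("target.zarr")  # writes .zmetadata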

@cisaacstern
Member

Per conversation at today's coordination meeting, people felt it would be simpler to have a single tracking issue for appending, so closing this and directing further discussion to #447.
