
Avoid monolithic pipeline state file #45

Open
sebhoerl opened this issue Jul 22, 2020 · 3 comments

@sebhoerl (Contributor)
Currently, the pipeline.json file containing the state of all stages is one big monolithic file. This causes problems, for instance when one wants to run the pipeline multiple times in parallel with different random seeds: it can lead to race conditions in which pipeline.json is updated by one process while being read by another.

Ideally, the meta information about each stage would be distributed across the relevant stage folders.
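
As a rough illustration of what that could look like, assuming each stage already owns its own cache directory (write_stage_meta / read_stage_meta are hypothetical names, not existing synpp functions): each process would only touch the files of the stage it is running, and an atomic rename keeps concurrent readers from ever seeing a half-written file.

```python
import json
import os
from pathlib import Path

def write_stage_meta(stage_cache_dir, meta):
    """Store the metadata of a single stage inside its own cache folder instead
    of appending it to a shared pipeline.json. Writing to a temporary file and
    renaming it makes the update atomic for concurrent readers."""
    target = Path(stage_cache_dir) / "stage_meta.json"
    temporary = target.parent / (target.name + ".tmp")
    with open(temporary, "w") as f:
        json.dump(meta, f, indent=2)
    os.replace(temporary, target)

def read_stage_meta(stage_cache_dir):
    """Return the metadata of a stage, or None if the stage has never been run."""
    path = Path(stage_cache_dir) / "stage_meta.json"
    if not path.exists():
        return None
    with open(path) as f:
        return json.load(f)
```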

@ainar (Contributor) commented Feb 19, 2023

I suggest storing the hash digest of the module and all its dependencies (and serializing the validation output, if needed) in the cache file and directory names, as is already done for the configuration hash digest. Devalidation (should we say invalidation?) based on the module hash digest would then be implicit, just like the configuration check.
As a consequence, the other devalidation steps, as well as the pipeline.json file, would no longer be needed, because they are all replaced by checking whether a cache file exists (a sketch of such hash-encoded cache names follows the list below):

  • "Devalidate if parent has been updated": the parent has been updated if the code or the configuration has been changed, which is already tracked by the cache existence. The only exception is if we manually require a stage (say, stage A) and then, in a second run, require a grand-descendent stage (say, C). Then, to devalidate the stage between A and C, say stage B, we could check if the stage B cache is older than the stage A cache. The idea is to propagate the devalidation if there are more stages between A and C.
  • "Devalidate if parents are not the same anymore": if the dependencies list is not the same, the module hash digest will update because it encompasses all the code of all the dependencies.
  • "Devalidate descendants of devalidated stages": this step does not rely on the metadata, but I wonder if it would still be needed. We would expect from descendants of devalidated stages a different configuration and/or code digest, so it will be devalidated by the cache checking anyways.

To handle the case of simultaneous runs with overlapping devalidated stages, we could generate a temporary cache file during execution and, before executing each stage, check whether the corresponding cache file already exists (and whether its parent caches are older) to avoid re-running a stage that a concurrent process has just finished.
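
A possible sketch of that check, assuming mtime-based freshness and per-process temporary directories (cache_is_fresh / run_stage are illustrative names only):

```python
import os
import shutil
from pathlib import Path

def cache_is_fresh(cache_dir: Path, parent_cache_dirs) -> bool:
    """A stage cache is fresh if it exists and is at least as new as every parent cache."""
    if not cache_dir.exists():
        return False
    if not all(parent.exists() for parent in parent_cache_dirs):
        return False
    cache_time = cache_dir.stat().st_mtime
    return all(parent.stat().st_mtime <= cache_time for parent in parent_cache_dirs)

def run_stage(cache_dir: Path, parent_cache_dirs, execute) -> None:
    """Execute a stage into a per-process temporary directory, then move it into
    place, so concurrent runs never observe a half-written cache."""
    if cache_is_fresh(cache_dir, parent_cache_dirs):
        return  # a concurrent process has just finished this stage
    temporary = cache_dir.parent / ("%s.tmp.%d" % (cache_dir.name, os.getpid()))
    temporary.mkdir(parents=True, exist_ok=True)
    execute(temporary)  # the stage writes its output here
    if cache_is_fresh(cache_dir, parent_cache_dirs):
        shutil.rmtree(temporary)  # another process finished first; discard ours
    else:
        if cache_dir.exists():
            shutil.rmtree(cache_dir)  # stale result from a previous run
        temporary.rename(cache_dir)
```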

What do you think?

@sebhoerl (Contributor, Author)

Yes, sounds good :)

@ainar (Contributor) commented Feb 24, 2023

I forgot about the "info" outputs. They could be stored in a separate file for each stage in the "cache" folder, along the lines of the sketch below. I will try a PoC.
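
For instance, something like this (again just a sketch; write_info / read_info and the info.json file name are hypothetical):

```python
import json
from pathlib import Path

def write_info(stage_cache_dir, info):
    """Store a stage's 'info' dictionary next to its cached output."""
    with open(Path(stage_cache_dir) / "info.json", "w") as f:
        json.dump(info, f, indent=2)

def read_info(stage_cache_dir):
    """Return the stage's 'info' dictionary, or an empty one if it was never written."""
    path = Path(stage_cache_dir) / "info.json"
    if not path.exists():
        return {}
    with open(path) as f:
        return json.load(f)
```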
