Cataloguing of individual experiments #274
Comments
I like this idea, with a few things related to the wider …
Thanks @charles-turner-1. I think the builder can be determined automatically from the … The logic for determining whether you need a new intake-esm datastore may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.
Cheers for the feedback @anton-seaice. A couple of things just to clarify:
This would only be in the case of regenerating a datastore, right? I'm assuming the …
I'm not quite sure I understand the logic here - if we regenerate the datastore every time the run is extended, then surely the datastore will stay up to date? Or is extending a model run different from the initial run in a way that makes this nontrivial?
It's made by payu as long as the option to make it is on. https://payu.readthedocs.io/en/stable/usage.html#metadata-files
I guess that may not be totally robust though - folks will use old payu versions or configurations which don't update the datastore? i.e. in this case:
are there feasible ways to confirm the integrity without re-making the whole datastore?
Ahh, gotcha. Since we're passing the outputs path to this utility function, I think it should be possible to run a subset of the datastore-building pipeline in order to work out the expected time bounds. From this, we should be able to work out whether the catalog is only indexing a subset of the model outputs. I suspect that this might be a bit slow to run if we naively index the whole thing, but we can probably make it fairly efficient if we put in some relevant information about how the outputs are structured. Do you think that would address the issue?
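For illustration, a minimal sketch of that comparison, assuming the datastore's .csv file lists every indexed file in a `path` column (as typical intake-esm datastores do) and that the outputs are netCDF files; the function name is illustrative:

```python
from pathlib import Path

import pandas as pd


def missing_from_datastore(csv_path: str, output_dir: str) -> set[str]:
    """Return output files on disk that the datastore hasn't indexed.

    Assumes the intake-esm datastore csv has a 'path' column and that
    outputs are netCDF files; both are assumptions, not guarantees.
    """
    indexed = set(pd.read_csv(csv_path)["path"])
    on_disk = {str(p) for p in Path(output_dir).rglob("*.nc")}
    return on_disk - indexed
```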
Would it be possible to do the step where it figures out what files would be indexed, and compare against what is already indexed, to know if it needs to be updated? If that was a pain, then just using modification times (i.e. is the catalogue older than the newest outputs?) might be a useful first step.
Yeah, this is what I had in mind - I think it should be possible & relatively straightforward to implement.
This strikes me as a better solution - it will certainly be faster for the user. I don't think there is any potential for false negatives here - the catalog scans & indexes files on disk, so it can only index files created prior to the catalog. Unless there are complicated things going on behind the scenes with Payu, I think this is the way to go?
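A sketch of that modification-time check, under the same caveats (the catalog.json location passed in is assumed; note that touch-ed files would make the catalog look stale, a point raised later in the thread):

```python
import os
from pathlib import Path


def catalog_is_stale(catalog_json: str, output_dir: str) -> bool:
    """True if any output file was modified after the catalog was built."""
    catalog_mtime = os.path.getmtime(catalog_json)
    return any(
        f.stat().st_mtime > catalog_mtime
        for f in Path(output_dir).rglob("*.nc")
    )
```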
@charles-turner-1, thanks for following this up. Looks good - I like that it has both a CLI and a Python module interface. If it's not too hard, @marc-white's additions sound great too. Based on our chats I think we are all thinking this, but it's not quite clear from the earlier post: we want the ability to get the archive path from payu if the user decides not to keep their data on scratch. I think there's still some clarity needed from our payu experts on how we choose to run this for some models and not others?
This was my naive thinking of a solution, but that's due to my naivety about how these catalogs are built!
Yup, I think Marc's suggestions are great & would be pretty straightforward to implement. Re paths: I'm not sure I understand Payu well enough to say anything specific, but I think we should be able to make this feature path-agnostic internally, and just let the user/Payu pass a path to it. That way, Payu can build a default catalog (if desired?), and then if the user moves the data elsewhere, they can rebuild it straightforwardly using essentially the same tools, syntax, config, etc.
Sounds great. Also, this morning we were thinking it would be great if you could add the 'template for ACCESS-OM3 evaluation metrics' to this repo; discussed here. Actually, @anton-seaice, it's already been done! Any feedback, @anton-seaice? I think once the above is done, it would be good to update it and the cosima recipes.
replied in ACCESS-NRI/access-eval-recipes#5
Sometimes people would re-run the model to add extra diagnostics. How slow would doing a checksum on every file be, to catch this case?
The question I have about this one is that sometimes people will touch every file in a folder on scratch to prevent it getting deleted, which might lead to a rebuild when it's not needed? This is probably fine really ...
Not awful, but even faster if something like binhash was used (the change detection hash …). However, I don't think we want to encourage "re-running" a model. To my mind this would just be another experiment with its own catalogue. Re-running and overwriting an existing experiment seems kinda fraught and definitely isn't a supported mode of operation.
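For context, binhash-style hashing is fast because it reads only a fixed-size chunk of each file rather than the whole thing. A minimal sketch of the idea (the chunk size and the mixing-in of file size are assumptions, not the actual cosima-cookbook implementation):

```python
import hashlib
import os

CHUNK_SIZE = 100 * 1024 * 1024  # hash at most the first 100 MiB (assumed)


def fast_file_hash(path: str) -> str:
    """Cheap change-detection hash: file size plus the first chunk of bytes.

    Catches rewritten or extended files without reading whole files; this is
    a deliberate speed trade-off, not a cryptographic integrity check.
    """
    h = hashlib.md5()
    h.update(str(os.path.getsize(path)).encode())
    with open(path, "rb") as f:
        h.update(f.read(CHUNK_SIZE))
    return h.hexdigest()
```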
Apart from this being against NCI guidelines, we really want to encourage users to utilise the excellent …
Side note: once this has been rolled out, it would be worth mentioning on the hive docs: https://access-hive.org.au/models/run-a-model/run-access-om/#access-om2-outputs
To summarise my thoughts - just using file paths to determine intake-esm datastore correctness is probably OK. Doing something more robust would be better, and if binhash is fast, then this would protect against the source data files changing (even if that's not the intended / supported use case).
Is your feature request related to a problem? Please describe.
Follows from Building intake-esm datastores in the Payu repository.
To minimise friction & get users used to the intake catalogue system, it would be good to provide a utility to generate (or open) a catalogue for an experiment just by pointing to the path that contains the outputs.
Describe the feature you'd like
Functionality to:
- generate a catalogue for an experiment just by pointing at the path containing its outputs, and
- open an existing catalogue, handling the case where the data have been moved (the catalog.json file contains an internal reference to its location & moving a catalog therefore breaks it). In such cases, it would be nice to rebuild the catalog.

This might look something like the following:
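As a hedged sketch only, the CLI might look something like the following argparse stub; the command name, flags, and behaviour are all hypothetical, not the final interface:

```python
# Hypothetical CLI sketch; the command name and flags are illustrative only.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        prog="build-esm-datastore",  # assumed command name
        description="Generate or refresh an intake-esm datastore for an experiment",
    )
    parser.add_argument("expt_dir", help="path containing the experiment outputs")
    parser.add_argument(
        "--rebuild",
        action="store_true",
        help="force a rebuild even if an up-to-date datastore exists",
    )
    args = parser.parse_args()
    # Placeholder: the real tool would invoke the datastore-building pipeline here.
    print(f"Would build a datastore for {args.expt_dir} (rebuild={args.rebuild})")


if __name__ == "__main__":
    main()
```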
From within Python, we could then have a convenience function that looks something like:
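A sketch of what that convenience function might look like, assuming the datastore's catalog.json lives alongside the outputs; the names use_datastore and build_datastore are illustrative, not the final API:

```python
import os

import intake  # with intake-esm installed, intake.open_esm_datastore is available


def build_datastore(expt_dir: str) -> None:
    """Placeholder for the datastore-building pipeline (hypothetical)."""
    raise NotImplementedError


def use_datastore(expt_dir: str, rebuild: bool = False):
    """Open the experiment's datastore, building it first if missing or stale."""
    json_path = os.path.join(expt_dir, "catalog.json")  # assumed location
    if rebuild or not os.path.exists(json_path):
        build_datastore(expt_dir)
    return intake.open_esm_datastore(json_path)


# e.g. cat = use_datastore("/scratch/tm70/my-experiment")  # hypothetical path
```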
@chrisb13 @anton-seaice are you able to confirm this is the sort of functionality we're after?