Cataloguing of individual experiments #274

Open
charles-turner-1 opened this issue Nov 27, 2024 · 16 comments
Labels: enhancement (New feature or request)
@charles-turner-1
Collaborator

Is your feature request related to a problem? Please describe.

Follows from Building intake-esm datastores in the Payu repository.

To minimise friction & get users used to the intake catalogue system, it would be good to provide a utility to generate (or open) a catalogue for an experiment just by pointing to the path that contains the outputs.

Describe the feature you'd like

Functionality to:

  1. Point to the already known directory where an experiment output is saved & automatically build an ESM-Datastore for this experiment, if no catalogue is found there.
  2. Open the catalog found there, if there already exists a catalog in the directory.
  3. If outputs are moved off scratch, the catalog will break (the catalog.json file contains an internal reference to its location, so moving a catalog breaks it). In such cases, it would be nice to rebuild the catalog.
  4. Hooks to link this functionality to. I think this would in practice just look like an entrypoint that allows us to call this functionality from a bash script.

This might look something like the following:

# Generate a datastore for an experiment which hasn't been catalogued yet
$ generate-esm-datastore --builder AccessOm2Builder --experiment-dir $DIR 

Generating esm-datastore...
Datastore successfully written to $DIR/catalog.json!

# Try to generate a datastore for the experiment which we just catalogued
$ generate-esm-datastore --builder AccessOm2Builder --experiment-dir $DIR 

esm-datastore found in $DIR, verifying datastore integrity...
Datastore integrity verified, aborting build

# Move the experiment datastore & catalogue (ie. off scratch)
$ cp -r $DIR $NEWDIR && cd $NEWDIR
$ generate-esm-datastore --builder AccessOm2Builder --experiment-dir $NEWDIR 

esm-datastore found in $NEWDIR, verifying datastore integrity...
Datastore broken due to path inconsistency, regenerating datastore...
Datastore successfully written to $NEWDIR/catalog.json!

From within Python, we could then have a convenience function that looks something like:

>>> from access_nri_intake import use_esm_datastore
>>> my_datastore_path = "/scratch/abc/xyz/etc/experiment_dir/"
# Generate a datastore for an experiment which hasn't been catalogued yet
>>> use_esm_datastore(experiment_dir = my_datastore_path)
Generating esm-datastore...
No builder supplied - please supply one of `AccessOm2Builder`,...

>>> esm_ds = use_esm_datastore(experiment_dir = my_datastore_path, builder = AccessOm2Builder)
Generating esm-datastore...
Datastore successfully written to /scratch/abc/xyz/etc/experiment_dir/catalog.json!

>>> esm_ds
$EXPERIMENT_NAME datastore with $X dataset(s) from $Y asset(s):

# Run it again on the same dir:
>>> esm_ds = use_esm_datastore(experiment_dir = my_datastore_path, builder = AccessOm2Builder)
esm-datastore found in /scratch/abc/xyz/etc/experiment_dir/, verifying datastore integrity...
Datastore integrity verified, aborting build

# Move it, run again
!cp -r $DIR $NEWDIR
>>> my_new_datastore_path = "/home/abc/xyz/etc/experiment_dir/"
>>> esm_ds = use_esm_datastore(experiment_dir = my_new_datastore_path, builder = AccessOm2Builder)
esm-datastore found in /home/abc/xyz/etc/experiment_dir/, verifying datastore integrity...
Datastore broken due to path inconsistency, regenerating datastore...
Datastore successfully written to /home/abc/xyz/etc/experiment_dir/catalog.json!

>>> esm_ds
$EXPERIMENT_NAME datastore with $X dataset(s) from $Y asset(s):
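
Under the hood, the dispatch could be a thin wrapper along these lines - a minimal sketch only, where the two helpers are hypothetical placeholders and the only real API assumed is intake's `open_esm_datastore`:

# Minimal sketch of the proposed use_esm_datastore dispatch. The helpers
# `_datastore_is_valid` and `_build_datastore` are hypothetical stand-ins for
# the real verification and builder steps.
from pathlib import Path

import intake


def _datastore_is_valid(json_path: Path) -> bool:
    """Hypothetical integrity check; candidate checks are discussed below."""
    try:
        intake.open_esm_datastore(str(json_path))
    except Exception:
        return False
    return True


def _build_datastore(builder, experiment_dir: Path) -> Path:
    """Hypothetical wrapper around the relevant builder's build/save steps."""
    raise NotImplementedError


def use_esm_datastore(experiment_dir, builder=None):
    experiment_dir = Path(experiment_dir)
    json_path = experiment_dir / "catalog.json"

    if json_path.exists():
        print(f"esm-datastore found in {experiment_dir}, verifying datastore integrity...")
        if _datastore_is_valid(json_path):
            print("Datastore integrity verified, aborting build")
            return intake.open_esm_datastore(str(json_path))
        print("Datastore broken due to path inconsistency, regenerating datastore...")

    if builder is None:
        raise ValueError("No builder supplied - please supply one, e.g. AccessOm2Builder")

    print("Generating esm-datastore...")
    json_path = _build_datastore(builder, experiment_dir)
    print(f"Datastore successfully written to {json_path}!")
    return intake.open_esm_datastore(str(json_path))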

@chrisb13 @anton-seaice are you able to confirm this is the sort of functionality we're after?

@marc-white
Collaborator

I like this idea, with a few things related to the wider access-nri-intake-catalog ecosystem:

  • We need to be super-explicit to users that running generate-esm-datastore won't add the experiment to access-nri-intake-catalog
  • On the flip side, it would be cool if part of the output of generate-esm-datastore is the instructions for getting the experiment added to access-nri-intake-catalog, including auto-generating the lines that would need to be added to the config YAML
  • Users may also appreciate it if the output of generate-esm-datastore gives them the basic Python command they need to open the datastore
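
For the last point, the basic open command the tool prints could just be the standard intake call, e.g. (the path shown is illustrative):

# Opening a generated datastore from Python
import intake

esm_ds = intake.open_esm_datastore("/path/to/experiment_dir/catalog.json")
print(esm_ds)      # summary of datasets/assets
esm_ds.df.head()   # the underlying file-level index as a pandas DataFrame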

@anton-seaice
Collaborator

Thanks @charles-turner-1

I think the builder can be determined automatically from the model field in metadata.yaml

The logic about determining if you need a new intake-esm datastore or not may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.
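
A sketch of what that lookup could look like, assuming the payu-generated metadata.yaml has a top-level `model` field; the model-name strings, the mapping, and the import path for the builders are all assumptions rather than confirmed behaviour:

# Illustrative sketch: infer the builder from metadata.yaml's `model` field.
from pathlib import Path

import yaml
from access_nri_intake.source.builders import (  # assumed import path
    AccessCm2Builder,
    AccessEsm15Builder,
    AccessOm2Builder,
    AccessOm3Builder,
)

# Hypothetical mapping from payu model names to builder classes
MODEL_TO_BUILDER = {
    "access-om2": AccessOm2Builder,
    "access-om3": AccessOm3Builder,
    "access-cm2": AccessCm2Builder,
    "access-esm1.5": AccessEsm15Builder,
}


def builder_from_metadata(experiment_dir):
    """Pick a builder based on the `model` field in the experiment's metadata.yaml."""
    metadata = yaml.safe_load((Path(experiment_dir) / "metadata.yaml").read_text())
    model = str(metadata.get("model", "")).lower()
    if model not in MODEL_TO_BUILDER:
        raise ValueError(f"No known builder for model {model!r}; please supply one explicitly")
    return MODEL_TO_BUILDER[model]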

@charles-turner-1
Collaborator Author

charles-turner-1 commented Dec 1, 2024

Cheers for the feedback @anton-seaice. Couple of things just to clarify:

I think the builder can be determined automatically from the model field in metadata.yaml

This would only be in the case of regenerating a datastore, right? I'm assuming the metadata.yaml in this instance is the one that gets created as part of the catalog? I think it should be straightforward to implement regeneration without specifying a builder.

The logic about determining if you need a new intake-esm datastore or not may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.

I'm not quite sure I understand the logic here - if we regenerate the datastore every time the run is extended, then surely the datastore will stay up to date? Or is extending a model run different from the initial run in a way that makes this nontrivial?

@anton-seaice
Collaborator

This would only be in the case of regenerating a datastore, right? I'm assuming the metadata.yaml in this instance is the one that gets created as part of the catalog?

It's made by payu as long as the option to make it is on.

https://payu.readthedocs.io/en/stable/usage.html#metadata-files

The logic about determining if you need a new intake-esm datastore or not may be as complicated as building a new datastore. For example, if the datastore is generated every time the model is run, then every time a model run is extended, the datastore ends up out of date.

I'm not quite sure I understand the logic here - if we regenerate the datastore every time the run is extended, then surely the datastore will stay up to date? Or is extending a model run different from the initial run in a way that makes this nontrivial?

I guess that may not be totally robust though - folks will use old payu versions or configurations which don't update the datastore?

i.e. in this case:

# Run it again on the same dir:
>>> esm_ds = use_esm_datastore(experiment_dir = my_datastore_path, builder = AccessOm2Builder)
esm-datastore found in /scratch/abc/xyz/etc/experiment_dir/, verifying datastore integrity...
Datastore integrity verified, aborting build

Are there feasible ways to confirm the integrity without re-making the whole datastore?
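
One cheap possibility, as a sketch only: confirm the existing catalog.json still opens and that (a sample of) the files it indexes still exist at the recorded paths. This assumes the datastore records file locations in a `path` column, which is an assumption about the layout rather than something confirmed here.

# Sketch of a lightweight integrity check that avoids a full rebuild.
from pathlib import Path

import intake


def datastore_paths_ok(json_path, sample=100):
    try:
        ds = intake.open_esm_datastore(str(json_path))
    except Exception:
        return False  # unreadable or internally inconsistent catalog.json
    paths = ds.df["path"].head(sample)  # check a sample to keep this fast
    return all(Path(p).exists() for p in paths)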

@charles-turner-1
Collaborator Author

Ahh, gotcha.

Since we're passing the outputs path to this utility function, I think it should be possible to run a subset of the datastore building pipeline in order to work out the expected time bounds. We should be able to work out if the catalog is only indexing a subset of the model outputs from this.

I suspect that this might be a bit slow to run if we naively index the whole thing, but we can probably make it fairly efficient if we put in some relevant information about how the outputs are structured.

Do you think that would address the issue?

@aidanheerdegen
Member

Since we're passing the outputs path to this utility function, I think it should be possible to run a subset of the datastore building pipeline in order to work out the expected time bounds. We should be able to work out if the catalog is only indexing a subset of the model outputs from this.

Would it be possible to do the step where it figures out what files would be indexed and compare against what is already indexed to know if it needs to be updated? If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?
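
As a sketch of the modification-time check (the `*.nc` glob and the catalogue filename are illustrative assumptions):

# Sketch: is catalog.json older than the newest output file?
from pathlib import Path


def catalogue_is_stale(experiment_dir, catalogue_name="catalog.json"):
    experiment_dir = Path(experiment_dir)
    catalogue_mtime = (experiment_dir / catalogue_name).stat().st_mtime
    newest_output = max(
        (f.stat().st_mtime for f in experiment_dir.rglob("*.nc")),
        default=0.0,
    )
    return newest_output > catalogue_mtime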

@charles-turner-1
Collaborator Author

Would it be possible to do the step where it figures out what files would be indexed and compare against what is already indexed to know if it needs to be updated?

Yeah, this is what I had in mind - I think it should be possible & relatively straightforward to implement.

If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?

This strikes me as a better solution - it will certainly be faster for the user. I don't think there is any potential for false negatives here - the catalog scans & indexes files on disk, so it can only index files created prior to the catalog.

Unless there are complicated things going on behind the scenes with Payu, I think this is the way to go?
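
The file-set comparison could be as simple as the sketch below (again assuming netCDF outputs and a `path` column of absolute paths in the datastore's index; both are illustrative choices):

# Sketch: compare what is on disk against what the datastore already indexes.
from pathlib import Path

import intake


def unindexed_outputs(experiment_dir, json_path):
    on_disk = {str(p) for p in Path(experiment_dir).rglob("*.nc")}
    # assumes absolute paths are recorded in the datastore's `path` column
    indexed = set(intake.open_esm_datastore(str(json_path)).df["path"])
    return on_disk - indexed  # non-empty => the datastore needs updating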

@chrisb13

chrisb13 commented Dec 2, 2024

@charles-turner-1, thanks for following this up. Looks good, I like that it has both a cli and python module interface. If it's not too hard, @marc-white's additions sound great too.

Based on our chats I think we are all thinking this, but it's not quite clear from the early post: we want the ability to get the archive path from payu if the user decides not to keep their data on scratch. I think there's still some clarity needed from our payu experts on how we choose to run this for some models and not others?

If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?

This was my naive thinking of a solution, but that's due to my naivety about how these catalogs are built!

@charles-turner-1
Collaborator Author

charles-turner-1 commented Dec 2, 2024

Yup, I think Marc's suggestions are great & would be pretty straightforward to implement.

Re. paths, I'm not sure I understand Payu well enough to say anything specific, but I think we should be able to make this feature path-agnostic internally, and just let the user/Payu pass a path to it. That way, Payu can build a default catalog (if desired?), and then if the user moves the data elsewhere, they can rebuild it straightforwardly using essentially the same tools, syntax, config, etc.

@chrisb13

chrisb13 commented Dec 2, 2024

Sounds great.

Also, this morning we were thinking it would be great if you could add the 'template for ACCESS-OM3 evaluation metrics' to this repo; discussed here. Actually, @anton-seaice, it's already been done!

Any feedback @anton-seaice? I think once the above is done, it would be good to update it and the COSIMA recipes.

@anton-seaice
Collaborator

Also, this morning we were thinking it would be great if you could add the 'template for ACCESS-OM3 evaluation metrics' to this repo; discussed here. Actually, @anton-seaice, it's already been done!

Any feedback @anton-seaice?

replied in ACCESS-NRI/access-eval-recipes#5

@anton-seaice
Collaborator

Would it be possible to do the step where it figures out what files would be indexed and compare against what is already indexed to know if it needs to be updated?

Yeah, this is what I had in mind - I think it should be possible & relatively straightforward to implement.

Sometimes people would re-run the model to add extra diagnostics. How slow would doing a checksum on every file be to catch this case?

If that was a pain then just using modification times, i.e. is the catalogue older than the newest outputs, might be a useful first step?

This strikes me as a better solution - it will certainly be faster for the user. I don't think there is any potential for false negatives here - the catalog scans & indexes files on disk, so it can only index files created prior to the catalog.

Unless there are complicated things going on behind the scenes with Payu, I think this is the way to go?

The question I have about this one is that sometimes people will touch every file in a folder on scratch to prevent it getting deleted, which might lead to a rebuild when it's not needed? This is probably fine really...

@aidanheerdegen
Member

Sometimes people would re-run the model to add extra diagnostics. How slow would doing a checksum on every file be to catch this case ?

Not awful, but even faster if something like binhash was used (the change detection hash payu uses in the manifest files).

However I don't think we want to encourage "re-running" a model. To my mind this would just be another experiment with its own catalogue. Re-running and overwriting an existing experiment seems kinda fraught and definitely isn't a supported mode of operation.
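
For a rough sense of scale, a generic full-file checksum pass could look like the sketch below. This is a stand-in using MD5 over whole files, not payu's binhash (whose details aren't shown here).

# Generic change-detection sketch: full-file MD5 checksums of every output.
import hashlib
from pathlib import Path


def checksum_outputs(experiment_dir, pattern="*.nc"):
    """Return {path: md5 hex digest} for every matching output file."""
    checksums = {}
    for f in sorted(Path(experiment_dir).rglob(pattern)):
        h = hashlib.md5()
        with open(f, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):  # 1 MiB chunks
                h.update(chunk)
        checksums[str(f)] = h.hexdigest()
    return checksums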

@aidanheerdegen
Member

The question I have about this one, is that sometimes people will touch every file in a folder on scratch to prevent it getting it deleted. Which might lead to a rebuild when its not needed? This is probably fine really ...

Apart from this being against NCI guidelines, we really want to encourage users to utilise the excellent sync capabilities @jo-basevi has written for payu to automatically synchronise outputs to /g/data or similar. It is pretty much the only safe way of operating on gadi with the current /scratch purge policy.

@chrisb13

chrisb13 commented Dec 3, 2024

Side note: once this has been rolled out, it would be worth mentioning in the Hive docs: https://access-hive.org.au/models/run-a-model/run-access-om/#access-om2-outputs

@anton-seaice
Collaborator

To summarise my thoughts - just using file paths to determine intake-esm datastore correctness is probably OK. Doing something more robust would be better, and if binhash is fast, then this would protect against the source data files changing (even if that's not the intended / trained use case).
