<version> #267

Open
wachsylon opened this issue Nov 14, 2017 · 23 comments

@wachsylon
Collaborator

wachsylon commented Nov 14, 2017

Hi,
in https://cmor.llnl.gov/mydoc_cmor3_c/ you recommend to set:
"output_path_template": "<mip_era><activity_id><institution_id><source_id><experiment_id><_member_id><table><variable_id><grid_label><version>"
where the version has the format "v<date>". If a user integrates CMOR into an operational workflow in which model output is produced over several days, the output will end up under different <version>s although it originates from the same model run. This has led to some confusion. How should a user find the file that contains a particular variable in this structure during the workflow? Furthermore, the directories need to be merged and a unique version assigned to the whole experiment output eventually.
I suggest three solutions:

  • Disable CMOR's construction of <version>. Whether this is acceptable depends on how often CMOR is used for small experiments that finish completely within a day.
  • Remove <version> from output_path_template only when performing operational experiment simulations, and add the subdirectory before publication. If the <version> build rule remains, I think this operational issue needs to be pointed out to users in the documentation.
  • The other (probably unrealistic) solution is to change the build rule of <version>.

What is your opinion? Did I miss something? Is there another reason for the <version> build rule?
Best regards,
Fabi

@durack1
Contributor

durack1 commented Nov 14, 2017

@wachsylon an option here would be to hardcode the version identifier, and in that case the line above would become:

 "output_path_template":"<mip_era><activity_id><institution_id><source_id><experiment_id><_member_id><table><variable_id><grid_label>v20171114"

You'd have to check that this parses the path correctly (that "v20171114" becomes a new subdir beneath <grid_label>).

When calling CMOR, you could generate your JSON input so that the tag is replaced and updated each time.
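A minimal Python sketch of regenerating the JSON input with a pinned version. The helper name `write_cmor_input` and the choice to pin the version by replacing the literal `<version>` token are assumptions for illustration, not part of CMOR. The sketch only edits the JSON file; whether CMOR treats the literal as its own subdirectory still needs the check mentioned above.

```python
import json
from pathlib import Path


def write_cmor_input(base_json, out_json, version):
    """Copy a CMOR input JSON, pinning <version> in output_path_template.

    Hypothetical helper: it only edits the JSON text and never calls CMOR.
    """
    config = json.loads(Path(base_json).read_text())
    template = config["output_path_template"]
    if "<version>" not in template:
        raise ValueError("output_path_template has no <version> token to pin")
    # Replace the placeholder with a fixed label such as "v20171114".
    config["output_path_template"] = template.replace("<version>", version)
    Path(out_json).write_text(json.dumps(config, indent=4))
    return config["output_path_template"]
```

Calling it with the same `version` for every CMOR invocation of a run would keep all files of that run under one version label, regardless of the day on which each file is written.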

@taylor13
Collaborator

taylor13 commented Nov 14, 2017

@wachsylon -- For CMIP6 the decision was made to independently assign versions at the granularity of individual "atomic datasets" (i.e., a single version number gets assigned to the full time-series of a single variable resulting from a single simulation). Thus, multiple version numbers will usually be associated with a single simulation. There is no plan to assign a single version number at a coarser granularity (e.g., all output from a single simulation). One reason for this is that if a mistake is found in a single variable, it can be withdrawn/updated without requiring republication of the rest of the output from that simulation.

ESGF requires, however, that for an individual atomic dataset, all files (making up that dataset) be put in a single directory (identified with the version number for that atomic dataset). By default this won't happen if you write some files with CMOR on one day and others on another.

To ensure all files in the atomic dataset are put in the same "version" subdirectory, please see the discussion at #210 (comment) (beginning at the July 25 entry, and reading through to the end). There you can see how to make sure all files in a single atomic dataset can be assigned the same version number even if they are written on different days.

@wachsylon
Collaborator Author

wachsylon commented Nov 17, 2017

Thanks for the quick, detailed and helpful responses. I think I will use the option @durack1 proposed. I will also discuss this next week, so I hope this issue can remain open for a few more days in case I have more questions.
Best regards,
Fabi

@durack1
Contributor

durack1 commented Nov 17, 2017

@wachsylon the issue isn't a problem with the CMOR code base; rather, it's how users are providing inputs to CMOR, so I will close it. Feel free to continue raising queries through this issue; we'll still see your questions and respond.

@durack1
Contributor

durack1 commented Apr 17, 2018

@dnadeau4 @taylor13 we've hit this problem with the input4MIPs data rewriting. It may be useful to check hard-coded versions (e.g. v20180417) and throw a warning if the date is not within some threshold of the current date (is 3 days reasonable?). I believe the relevant code is found at https://github.com/PCMDI/cmor/blob/master/Src/cmor.c#L5514-L5518

@durack1 durack1 reopened this Apr 17, 2018
@durack1 durack1 added this to the 3.5.0 milestone Apr 17, 2018
@taylor13
Collaborator

Some relevant discussion also is found at #210 (comment) (begin at the July 25 entry, and read the following 2 comments too). We could help users by implementing 5 changes:

  1. Describe how definition of "_myVersion" in the path template will override the CMOR generation of version numbers.
  2. Discourage users from overriding the "preserve" directive governing files. This will help alert them when they might need to specify a new version number. With "preserve" set, CMOR will error exit if a file already exists, so the user would know they should either create a new version label (presumably with a later date), or if they haven't yet published the old version, they might want to delete it and then rerun CMOR.
  3. When CMOR processes a path with a user-specified version number, CMOR should check that the date specified is within say three days of the actual date. If it isn't, CMOR should raise a warning:
"WARNING: The version assigned to this dataset,", _myVersion, "is a date that differs by more 
than 3 days from today's date,", todaysDate, "Normally this indicates an error in 
specifying the date, which should be corrected in CMOR_input.json.  Note that 
all files in an atomic dataset (all time-slices from a single variable produced by a 
simulation) must be assigned the same version number." 

[not sure the name of the input file is correct in the above warning message.]
  4. When CMOR processes a path with a user-specified version number, CMOR should check that the date specified is within say 4 weeks of the actual date. If it isn't, CMOR should raise an error and exit:

"ERROR: The version assigned to this dataset,", _myVersion, "must be within 4 weeks 
of today's date.  Please either rely on CMOR to generate the version number, 
or assign a version number that is within 4 weeks of today.  Note that 
all files in an atomic dataset (all time-slices from a single variable produced 
by a simulation) must be assigned the same version number.  If you wish to
append additional time-slices to existing output written more than 4 weeks
ago, you should rename the existing version subdirectory with a more recent 
date.  This will  allow CMOR to add additional files to that version without 
raising this error."
  5. Before writing a file, CMOR should check whether another version subdirectory exists in the path that is within 3 days of the current date. If it does and if it doesn't include the file about to be written, CMOR should raise a warning:
"WARNING: For the variable you are writing, an atomic dataset has already 
been defined and assigned version ", otherVersion, ".  For the file you are now 
writing, CMOR has assigned today's date for the version.  If this file should
actually be grouped with the files already written in ", otherVersion, " you  
should modify the output_path_template in the CMOR_input.json file, by 
replacing <version> with the existing subdirectory name (i.e., ", 
otherVersion, "), and rerun CMOR"
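
The two date thresholds proposed above (warn beyond 3 days, error beyond 4 weeks) could be sketched as follows. This is an illustrative Python sketch, not CMOR code: the function name `check_version_age` and its return values are made up, and treating past and future offsets alike is an assumption, since the proposal does not say how future dates should be handled.

```python
from datetime import date, datetime, timedelta

WARN_AFTER = timedelta(days=3)    # proposed warning threshold
ERROR_AFTER = timedelta(weeks=4)  # proposed error threshold


def check_version_age(version, today):
    """Return "ok", "warning" or "error" for a label like "v20180417".

    Hypothetical helper; past and future offsets are treated alike.
    """
    version_date = datetime.strptime(version, "v%Y%m%d").date()
    offset = abs(today - version_date)
    if offset > ERROR_AFTER:
        return "error"
    if offset > WARN_AFTER:
        return "warning"
    return "ok"
```

For example, `check_version_age("v20180417", date(2018, 4, 21))` falls in the warning band, because the label is four days old.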

@taylor13
Collaborator

taylor13 commented Oct 8, 2018

We have received the following email:

I got datasets from GISS that files of same variables are in different versions for example 
 
/v20180912/sfcWind_Amon_hist-sol_GISS-E2-1-G_r3i1p1f1_gn_185001-190012.nc
/v20180912/sfcWind_Amon_hist-sol_GISS-E2-1-G_r3i1p1f1_gn_190101-195012.nc
/v20180912/sfcWind_Amon_hist-sol_GISS-E2-1-G_r3i1p1f1_gn_195101-200012.nc
/v20180913/sfcWind_Amon_hist-sol_GISS-E2-1-G_r3i1p1f1_gn_200101-201412.nc
 
They are supposed to be published together.     When I run esgmapefile the only late date
 version were scanned and added into mapfile (here 20180913).   So if I publish the map file,
 I only publish  1/4 of the files.
 
According to GISS "The files have two date versions because some files are generated on
 two different days.  This will happen more often in the future because the experiments we
 going to do will generate more files and will take more than one day for CMOR (Climate
 Model Output Rewriter) to generate the files.  So, the publication software need to be able
 to handle this."
 
What can I do ?  

I think we should implement the warning and error messages suggested in #267 (comment) now to help guard against these problems.

Can the priority of this issue be raised?
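
Until such checks exist, a split like the one in the email can be detected before building the mapfile. The following is a minimal Python sketch under stated assumptions: the function name `find_split_datasets` is hypothetical, each file is assumed to sit directly inside its `vYYYYMMDD` directory, and the last `_`-separated field of the file name is assumed to be the time range (which does not hold for time-independent files).

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

VERSION_DIR = re.compile(r"v\d{8}")


def find_split_datasets(paths):
    """Map dataset id -> sorted version labels, for datasets in >1 version.

    Hypothetical helper: the dataset id is the file name without its
    trailing time-range field and extension.
    """
    versions = defaultdict(set)
    for raw in paths:
        path = PurePosixPath(raw)
        version = path.parent.name
        if not VERSION_DIR.fullmatch(version):
            continue  # file is not directly under a vYYYYMMDD directory
        dataset_id = path.stem.rsplit("_", 1)[0]  # drop the time range
        versions[dataset_id].add(version)
    return {key: sorted(found) for key, found in versions.items() if len(found) > 1}
```

Applied to the four sfcWind paths quoted above, it reports a single dataset id spread over `v20180912` and `v20180913`.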

@taylor13
Collaborator

taylor13 commented Oct 9, 2018

@mauzey1 Would you please prioritize implementing these changes? Thank you.

@taylor13
Collaborator

taylor13 commented Oct 10, 2018

I'm transferring an email thread to here:

Here are some questions I have.
 
I tried running some of the tests using CMOR_PRESERVE twice to see what happens.  Instead 
of exiting with an error, it appends the extension ‘.copy’ to the old file’s name and then creates 
a new file using the old name.  Is this the behavior we expect in the current version?  I tested 
this using the latest master on my Linux workstation.
 
For the version numbers set by the user, do those have to be valid date strings with the format 
‘v%Y%m%d’ like v20181010?  What should happen if it is not in that format?  When running 
tests, I was able to make a version directory named ‘myVersion’.
 
For the ‘within 3 days’ and ‘within 4 weeks’ rules, what should happen if a version number is 
a future date?

And my responses:

Concerning the "CMOR_PRESERVE" behavior, we should ask @dnadeau4 if he remembers what was intended. The CMOR documentation at https://cmor.llnl.gov/mydoc_cmor3_api/ says

 If the value is CMOR_PRESERVE, a new file will be created unless a file by the same name 
already exists, in which case the program will error exit.

We should either correct the documentation or change CMOR's behavior to be consistent with it. Again, we should consult @dnadeau4 because I vaguely remember someone requesting a change in behavior from the original error exiting.

Concerning a user-set version number, I think it may be o.k. if CMOR allows this when the file is being written, but when the PrePARE part of the code executes after the file is written, it should raise an error if the version number is inconsistent with the CMIP6 template (‘v%Y%m%d’). This presumes that PrePARE is run following each CMOR execution.

Concerning version dates being in the future, I think we should not allow this (i.e., we should raise an error).
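
A check of the ‘v%Y%m%d’ template and of future dates could look roughly like this in Python. It is an illustrative sketch, not PrePARE's actual implementation; the names `parse_version` and `is_future` are made up. The regular expression is there because `strptime` on its own also accepts single-digit months and days.

```python
import re
from datetime import date, datetime

VERSION_PATTERN = re.compile(r"v\d{8}")


def parse_version(version):
    """Return the date encoded in a 'vYYYYMMDD' label, or raise ValueError."""
    if not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"{version!r} does not match the v%Y%m%d template")
    # strptime raises ValueError for impossible dates such as v20181340.
    return datetime.strptime(version, "v%Y%m%d").date()


def is_future(version, today):
    """True if the version label encodes a date later than `today`."""
    return parse_version(version) > today
```

With this, `parse_version("myVersion")` raises `ValueError`, while `parse_version("v20181010")` returns the corresponding date.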

@wachsylon
Collaborator Author

Why do we need these strict rules?

I understood that version is needed for the publication because, if an error is found in the published simulation, it should be possible to publish new versions of it. With this in mind, I think the only strict rule is that the version date is not beyond the date of publication. CMOR is not able to know when the files should be published. So the only required warning is that if the publication date is in the future, the user needs to be informed that the files cannot be published until then.

Here it says that the version is "indicating approximate date of model output file". If there is a simulation done in 2017 which should be prepared for CMIP6 in 2018 with a version <v2017..>, why should CMOR exit for this? Also, there can be high resolution models whose simulations take a long time to be finalized.

When it comes to the technical implementation, I think that the proposed checks and errors require an attribute for <version> that can be checked. Right now, I believe such warnings would have to be implemented in CMOR by checking whether output_path_template is equal to the one requested for CMIP6 but with an individual <version> (tell me if I am wrong). If we assume that this is possible and CMOR gives errors for some cases,

  1. the user is misled into manipulating the DRS further until it fits.
  2. there is the question why to provide e.g. <experiment_id> in the output_path_template instead of amip because the output_path_template is checked either way.

@ehogan
Contributor

ehogan commented Oct 16, 2018

@taylor13 are you remembering the discussion we had about CMOR_PRESERVE on #246? :)

@mauzey1
Collaborator

mauzey1 commented Oct 16, 2018

When warning the user that the same file is present in another version that it is currently not writing to, should CMOR warn multiple times if the same file is found in multiple versions? Should this warning ignore version numbers that don't follow the CMIP6 template (‘v%Y%m%d’)?

So CMOR should be able to use whatever version number the user specifies but PrePARE should raise an error if a version number is not in the correct format or is a future date?

@taylor13
Collaborator

@ehogan I have now read through #246, but I'm not sure what I should be paying attention to. I can see they are somewhat related. In particular the CMOR3 documentation and error messages may be inconsistent:

We should either correct the documentation or change CMOR's behavior to be consistent with it. Again, we should consult @dnadeau4 because I vaguely remember someone requesting a change in behavior from the original error exiting.

Can you please confirm that the documentation is wrong?

@taylor13
Collaborator

@wachsylon : I didn't follow the last part of your comment ("technical implementation"). If it is important, could you please re-explain?

Regarding future dates, you've made a good case to allow future dates. I'll discuss with @doutriaux1 and @mauzey1, and then propose an algorithm.

@wachsylon
Collaborator Author

How should version be checked? There is no attribute for it. The only way is by evaluating output_path_template.

But I do not want to point this out to users because the users may be misled into changing the output_path_template completely. Those changes prevent CMOR from checking the actual setting.

@taylor13
Collaborator

The above discussion concerns adding additional files to those already written (all having been assigned a common version number). The check is to ensure that the "additional files" have the same version number as the already-written files.

I don't think CMOR should check "version" for the first file published in a series. Do you agree?

@mauzey1
Collaborator

mauzey1 commented Oct 24, 2018

Should CMOR retain the version number generated upon writing its first file, and then use that version for subsequent files while CMOR is still running?

@taylor13
Collaborator

Not necessarily. If additional files are being added to those already written, the version number should be specified by the user ( in "_myVersion" ). Then CMOR should check that this version (date) is consistent with the existing files.

If the user doesn't pass CMOR a version number, CMOR should set the version number to the current date and CMOR should perform the checks described in #267 (comment) above.

For CMIP6, PrePARE should always check that version follows the template (‘v%Y%m%d’) (and it should do this as part of the CMOR execution as well as during publication).

I now think (following above discussion) that the "within 3 days" and "within 4 weeks" rules should allow dates in the future as well as the past.

@taylor13
Collaborator

Let's hold off on coding these changes. I spoke with Denis and we might have an alternative approach.

@wachsylon
Collaborator Author

wachsylon commented Oct 25, 2018

I don't think CMOR should check "version" for the first file published in a series

I agree that CMOR should search for other versions and only tell the user if it finds some.

I cannot set an arbitrary version by specifying "_my_version":"temporary" in the cmor_dataset function, can I?

If that is the case, why not introduce a keyword such as "my_version" which, if it is specified in the cmor_dataset file, is checked with your proposed tests, while the CMOR checks are switched off if the output_path_template is changed? I understand manipulating output_path_template as a method to ignore CMIP6 requests. For example: If I change
"output_path_template":"<variable_id><grid_label><version>"
to
"output_path_template":"myVariable/myGrid/temporary/"
CMOR should not give warnings about myVariable as well, should it?

@mauzey1
Collaborator

mauzey1 commented May 13, 2019

@taylor13 @durack1 @doutriaux1
Is this a feature that we are still interested in adding to CMOR? Have we settled on how the algorithm for checking the version number should work?

@durack1
Contributor

durack1 commented May 13, 2019

@mauzey1 I am not close enough to know what @taylor13 had in mind here, so am happy to contribute to a discussion, but am not up-to-date.

@zklaus

zklaus commented Jun 28, 2019

We are cmorizing cmip6 data for EC-Earth and are running into the following issues:

  • We relatively often get at least two versions for one simulation because the cmorization runs across midnight.
  • We will probably publish at least some of our longer (~800 yrs piControl) runs in chunks, meaning a couple of hundred years now to support the historical experiments, a couple of hundred years later this year for longtime model behavior studies. According to CMIP6 standards, all time steps of one variable must have the same version, so we will cmorize and publish with a version that might be a couple of months in the past.

Our approach to this is at the moment to sort out the version number after the cmorization process with scripts and for simplicity to keep the same version for all variables in one experiment.

Should we need to fix some data later on, we will likely give a new version only to those variables that have been changed.

For us, just being able to specify the version manually would be a great help.
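
A post-processing step of the kind described above, merging all version subdirectories of one variable into a single one, might be sketched in Python as follows. The function name `merge_versions` and the directory layout (version directories directly under the given directory, NetCDF files directly inside them) are assumptions for illustration, not EC-Earth's actual scripts. It only moves files and does not touch any metadata stored inside them.

```python
import re
import shutil
from pathlib import Path

VERSION_DIR = re.compile(r"v\d{8}")


def merge_versions(dataset_dir, target_version):
    """Move *.nc files from all other vYYYYMMDD subdirectories into one.

    Hypothetical helper; it refuses to overwrite, so a name clash raises
    FileExistsError. Returns the names of the files that were moved.
    """
    dataset_dir = Path(dataset_dir)
    target = dataset_dir / target_version
    target.mkdir(exist_ok=True)
    moved = []
    for version_dir in sorted(dataset_dir.iterdir()):
        if not version_dir.is_dir() or version_dir == target:
            continue
        if not VERSION_DIR.fullmatch(version_dir.name):
            continue  # leave anything that is not a version directory alone
        for nc_file in sorted(version_dir.glob("*.nc")):
            destination = target / nc_file.name
            if destination.exists():
                raise FileExistsError(destination)
            shutil.move(str(nc_file), str(destination))
            moved.append(nc_file.name)
        if not any(version_dir.iterdir()):
            version_dir.rmdir()  # remove the now-empty directory
    return moved
```

Running it once per variable directory with the version chosen for the experiment would collapse a midnight split into a single version directory.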
