Support localization of natural language attributes and variables #528

Open
turnbullerin opened this issue Jul 8, 2024 · 32 comments
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@turnbullerin

turnbullerin commented Jul 8, 2024

Moderator

TBD

Moderator Status Review

None

Requirement Summary

Metadata includes natural language text in several places, notably the title and long_name attributes, and potentially in character data variables. Other metadata standards, such as ISO-19115, support translation of such text, and translation is mandatory in some contexts, such as files generated by the Canadian Government. By standardizing how these elements are specified in a way that is both human-readable and machine-readable, users can more easily identify metadata in their preferred language, and applications can display metadata that matches the user's preferences or, where that is not possible, at least present it using appropriate accessibility techniques. Compatibility with applications such as ERDDAP, which uses NetCDF files following the CF conventions to create a web interface for selecting and downloading data, is also of key importance. For this reason, we decided not to require the new .fr-CA suffix format, as it would not be compatible with ERDDAP; instead, data providers are free to choose suffixes that meet their use case.

Technical Proposal Summary

Based on discussions in cf-convention/discuss#244, the following proposal seemed acceptable: (1) the creation of a new global attribute that maps suffixes to BCP 47 language tags as well as specifying the default language tag in the file, (2) designating that any attribute or data variable with such a suffix is a localized version of the text in the non-suffixed attribute or variable.

Benefits

Data producers who are required to produce metadata or data in multiple languages, applications that offer multilingual interfaces for viewing or manipulating NetCDF data based on the CF standards, data users who wish to access metadata in the language of their choice

Status Quo

Currently no NetCDF standard offers a standard for localized metadata. However, such standards exist in other metadata formats, such as ISO-19115.

Associated pull request

Not present yet

Detailed Proposal

The addition of a new section to the CF conventions that specifies the following:

  • The use of the global attribute localizations, which will be a space-separated list of paired suffixes and BCP 47 language tags (similar to how cell_methods is formatted): for example :localizations = "default: en-US _fr: fr-CA _es: es-MX";
  • The use of the special word default instead of a suffix to indicate the default locale of the document
  • A note that the use of localized attributes is not limited to CF specified attributes but may be used with non CF attributes as well
  • Recommendation that the most complete set of metadata/data is used as the "default" or, all else being equal, the original language the text was produced in
  • Where the localizations attribute is present, attributes and variables may not be named with a suffix except that they indicate localized versions of non-suffixed attributes or variables
  • A recommendation for applications to follow BCP 47 in determining which version of localized content to show the user
  • An example of a localized global and variable attribute
  • An example of a localized data variable

In addition, the following changes would be proposed:

  • An addition to Appendix A to document which CF attributes are intended to contain natural language text (currently: title, comment, institution, long_name, references, source and, to be discussed, history and flag_meanings)
  • An addition to sections 2.5 and 2.6 to let readers know that localization is available and refer them to the new section
@turnbullerin turnbullerin added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Jul 8, 2024
@turnbullerin
Author

Here is the draft text we were looking at in the discussion thread

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.

NEW SECTION

TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. the Latin alphabet), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as en-CA for Canadian English in the default script. This section defines the standard pattern for localizing the contents of a NetCDF file.

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.

TBD.1 Localized Files

A "localized file" is one that provides the global attribute localizations. If present, the attribute must contain a space-delimited list of words in the format suffix: language_tag. For example, the string default: en _fr: fr-CA _es: es-MX specifies that the default locale of the file is en, that the suffix _fr indicates content in the fr-CA locale, and that the suffix _es indicates content in the es-MX locale. Suffixes may be any text string allowed in an attribute or variable name, but it is strongly recommended that they be chosen so that they are clearly associated with the locale they represent.
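
As a rough illustration of the format described above, the following Python sketch splits a localizations value into the default tag and a suffix-to-tag mapping. The function name parse_localizations is hypothetical and not part of the draft text, and the sketch assumes the spaced "suffix: language_tag" form used in this section.

```python
def parse_localizations(value):
    """Split a draft-style localizations string into (default_tag, {suffix: tag}).

    Assumes whitespace-separated "key: tag" pairs, as in the example above.
    """
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected 'suffix: language_tag' pairs")
    default_tag = None
    suffixes = {}
    for key, tag in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"missing ':' after {key!r}")
        key = key[:-1]  # drop the trailing colon
        if key == "default":
            default_tag = tag
        else:
            suffixes[key] = tag
    return default_tag, suffixes


default_tag, suffixes = parse_localizations("default: en _fr: fr-CA _es: es-MX")
print(default_tag, suffixes)
```

A stricter implementation would probably also validate each tag against BCP 47 and reject duplicate suffixes; that is left out here to keep the sketch short.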

The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.

An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.

Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.
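
One possible reading of "apply BCP 47" here is the lookup scheme from RFC 4647 (one of the two RFCs that make up BCP 47), in which each preferred tag is progressively truncated from the right until a match is found. The sketch below assumes that reading; the function name pick_locale and its arguments are hypothetical.

```python
def pick_locale(preferred, available, default):
    """RFC 4647-style lookup: try each preferred tag, truncating from the right.

    preferred: user's language tags in priority order
    available: language tags present in the file
    default:   the file's default locale, used when nothing matches
    """
    by_lower = {tag.lower(): tag for tag in available}
    for wanted in preferred:
        parts = wanted.lower().split("-")
        while parts:
            candidate = "-".join(parts)
            if candidate in by_lower:
                return by_lower[candidate]
            parts.pop()
            # A trailing single-character subtag (such as "x") is removed too.
            if parts and len(parts[-1]) == 1:
                parts.pop()
    return default


available = ["en", "fr-CA", "es-MX"]
print(pick_locale(["fr-CH", "es-mx"], available, "en"))
print(pick_locale(["de"], available, "en"))
```

Note that under this scheme a user asking for fr-CH is not served fr-CA content; whether such a sibling-region match is desirable is a design decision the draft does not settle.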

TBD.2 Localized Attributes

Localized attributes are created by appending a locale suffix to the usual attribute name. For example:


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";
    :title = "English Title";        // English title 
    :title_fr = "Titre française";   // French title
    :title_es = "Título en español"; // Spanish title
    :summary = "English Summary";
    :summary_fr = "Sommaire française";
    // omitted Spanish summary means English will be used instead

    double salinity(i);
    salinity:long_name = "Salinity";
    salinity:long_name_fr = "Salinité";
    salinity:long_name_es = "Salinidad";
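
To make the fallback rule in the example concrete, here is a minimal Python sketch that reads a localized attribute from a plain dictionary of attributes and falls back to the unsuffixed value when no localized version exists. The dictionary stands in for a file's global attributes, and the helper name get_localized is hypothetical.

```python
# Stand-in for the global attributes of the example file above.
global_attrs = {
    "localizations": "default: en-CA _fr: fr-CA _es: es-MX",
    "title": "English Title",
    "title_fr": "Titre français",
    "title_es": "Título en español",
    "summary": "English Summary",
    "summary_fr": "Sommaire français",
}


def get_localized(attrs, name, suffix):
    """Return attrs[name + suffix] if present, else the default attrs[name]."""
    localized_name = name + suffix
    if localized_name in attrs:
        return attrs[localized_name]
    return attrs[name]


print(get_localized(global_attrs, "title", "_es"))
print(get_localized(global_attrs, "summary", "_es"))  # no Spanish summary, so English
```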

TBD.3 Localized Variables

Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e. weather_obs[0] is the English text and weather_obs_fr[0] is the French text of the same value).


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";

    char weather_obs(i);
    weather_obs:long_name = "Weather Conditions";

    char weather_obs_fr(i);
    weather_obs_fr:long_name = "Observations Météorologiques";
    
    char weather_obs_es(i);
    weather_obs_es:long_name = "Observaciones Meteorológicas";

data:
    weather_obs = "sunny", "rainy", ...;
    weather_obs_fr = "ensoleillé", "pluvieux", ...;
    weather_obs_es = "soleado", "lluvioso", ...;
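
The "same data type and dimensions" requirement can be checked mechanically. The sketch below is one hypothetical way to do so, assuming variables are described as a name-to-(type, dimensions) mapping; the function name and the data layout are illustrative only. It cannot verify that elements appear in the same order, since that depends on the meaning of the text.

```python
def check_localized_variables(variables, suffixes):
    """Report localized variables that do not match their default variable.

    variables: {name: (dtype, dims)}, e.g. {"weather_obs": ("char", ("i",))}
    suffixes:  locale suffixes declared in the localizations attribute
    """
    problems = []
    for name, signature in variables.items():
        for suffix in suffixes:
            if name.endswith(suffix) and len(name) > len(suffix):
                base = name[: -len(suffix)]
                if base not in variables:
                    problems.append(f"{name}: no default variable {base!r}")
                elif variables[base] != signature:
                    problems.append(f"{name}: type or dimensions differ from {base!r}")
    return problems


variables = {
    "weather_obs": ("char", ("i",)),
    "weather_obs_fr": ("char", ("i",)),
    "weather_obs_es": ("char", ("j",)),  # deliberately wrong dimension
    "notes_fr": ("char", ("i",)),        # deliberately missing default
}
for problem in check_localized_variables(variables, ["_fr", "_es"]):
    print(problem)
```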

ADDITION TO APPENDIX A

  • Add a column for "Locale-Aware" (Y or N) or maybe add a new data type of S for non-locale-aware string and S-L for locale-aware string?
  • Locale-aware string attributes:
    • comment
    • flag_meanings (? they have underscores but would be helpful ?)
    • history (? I feel like this will be complex since they are automatically updated but having a translated version of the history would be helpful ?)
    • institution
    • long_name
    • references
    • source
    • title

References
https://www.rfc-editor.org/info/bcp47
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

EDIT NOTES:

  1. I added a note that variables must be the same type, dimensions, and size as each other
  2. I noted the format of cell_methods and updated mine to match

@sethmcg
Contributor

sethmcg commented Jul 8, 2024

This looks good. Nice work!

flag_meanings seems like it ought to be locale-aware.

I think it could be hard to make history locale-aware. The history attributes in the files I work with mostly consist of sequences of command-line invocations, which you don't want to translate because then it's not the command that was actually used. When there is a comment, it's usually something that a piece of software added automatically, rather than the file creator, so there's no control over what language it's in. Depending on the tools you use, it seems like it wouldn't be hard to end up with a history attribute that's multi-lingual or in a different language than the one you want to declare as the default.

@DocOtak
Member

DocOtak commented Jul 9, 2024

I'm struggling with this for reasons I'm having trouble articulating (and am not sure are even valid reasons), but I'll try.

I don't think CF should introduce attribute and variable name modifications as a concept. Section 2.5 starts out with:

This convention does not standardize variable names.

With this proposal, it would be except in cases of localization. This proposal also places restrictions on variable/attribute names and their suffixes.

I think the ERDDAP "elephant in the room" is important and relevant here. For ERDDAP adoption in Canada, it needs to display localized content, so this proposal must work in a real world application. I'm fearful that this will not work in ERDDAP for some reason we don't know and won't know until an implementation is worked on and the standard has already been codified.

Would there be willingness on the ERDDAP side to implement something that isn't quite in CF yet? And on the CF side, would we be willing to codify whatever ends up working best in ERDDAP? Realizing that someone involved in both communities might need to actually do the coding.

@rmendels

rmendels commented Jul 9, 2024

@DocOtak @turnbullerin As I told Erin, the best way to help see if this can be done is to contribute code to ERDDAP to do so and be tested. There are ready-to-go development environments, and compiling and testing have been simplified. Chris is only part-time, but I am sure he would be happy to work with people on this.

@turnbullerin
Author

For an update, the ERDDAP issue now has a link to a working prototype I developed based on this proposal for the title attribute - see ERDDAP/erddap#114

@sethmcg
Contributor

sethmcg commented Aug 26, 2024

This is heading towards resolution and I don't want to impede things if the answer is 'no', but over in discussion #341 we're talking about whether string arrays should be allowed, so I just want to ask the question:

Would it be dramatically simpler to solve this problem if we used an array of strings for different localizations of free-text attributes?

The current solution is effectively creating an ad hoc array of attributes by appending suffixes to the attribute name. Would it make things easier on the implementation side to have a for-real array of attribute values with, say, a localization prefix (e.g., [fr-CA]:) at the beginning of each string?

My guess is that the answer on the ERDDAP side is probably 'no', and since that's the primary driver here, ERDDAP-viability is paramount, so if that's the case, I think things should proceed as if I had not commented at all. But I wanted to bring it up before we get locked in to a solution, just in case the alternative would make everyone's lives a lot easier.

@larsbarring
Contributor

Interesting idea -- thanks for a potential use case!

But I fully agree with you @sethmcg that, for this particular issue of localization, we should go with whatever most effectively, and with the least effort, moves this forward towards a conclusion.

@DocOtak
Member

DocOtak commented Aug 27, 2024

@sethmcg @larsbarring

I think more progress will be made at the CF Workshop hackathon. I'd personally like to revisit/present again a container variable solution that does not need any string parsing or manipulation, other than the BCP 47 language tag which tends to have library support.

I'm hopeful I'll be able to attend that workshop in person.

@larsbarring
Contributor

Andrew, @DocOtak, it would be great if you were able to come in person!
Just a friendly reminder to register for the event so that the security clearance formality can be sorted (and please register even if attending only virtually :-)

@DocOtak
Member

DocOtak commented Aug 28, 2024

@larsbarring filled out the form yesterday, hopefully it's all ok.

@larsbarring
Contributor

All good -- thanks! /Lars

@Dave-Allured
Contributor

I suggest that a container solution will be workable, and will have about the same complexity compared to the original concept for multiple attribute names with suffixes. I favor suffixes over container, for readability and transparency. By transparency I mean that the meaning of suffixes would be obvious to the casual user, with only the knowledge that the suffixes are BCP 47, and no other details from the CF document.

One particular advantage of suffixes is that you get backward compatibility with no complications. If you have e.g. title and title.fr, then existing naive applications will continue to operate normally and display title, with no awareness nor interference from alternative attributes. With a container, I am afraid that you would need either duplicate copies of the default string, or else some other awkward device to tell the container to look elsewhere for the default string.

@Dave-Allured
Contributor

Erin, you have put a lot of thought and care into your current proposal seen above. What you have envisioned is quite workable. However I recommend simplification. We have discussed some of this before.

  • "... instead, data providers are free to choose suffixes that meet their use case" No. CF needs to choose a single concise solution, not try to please everyone including me.

  • Omit the localizations attribute, and the inventorying and indirection that it represents. Or at least make it optional. Code BCP 47 abbreviations directly into whatever tagging mechanism is chosen. E.g. title.fr-CA = "Titre français". The same can be done with containers. As I have said before, file scanning to build an inventory is easy and efficient.

  • Omit section TBD.1 Localized Files. Allow independent localization for each object in a Netcdf file. This does not block any external requirement for a data set to be "localization complete".

  • Move localization of variables to a separate discussion, and focus on attributes only. This is only a mechanical suggestion because the attributes discussion is hard enough, and variables have different constraints and considerations. I have a nice alternative suggestion for variables when the time is right.
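
The "file scanning to build an inventory" idea from the list above could look roughly like the Python sketch below, which groups attribute names of the form name.tag by language tag. The regular expression is a deliberately simplified approximation of BCP 47 syntax, not the full grammar, and the function name build_inventory is hypothetical.

```python
import re

# Simplified approximation: a 2-3 letter language subtag followed by
# optional "-subtag" parts. This is NOT the complete BCP 47 grammar.
TAGGED_NAME = re.compile(
    r"^(?P<base>.+?)\.(?P<tag>[A-Za-z]{2,3}(?:-[A-Za-z0-9]{1,8})*)$"
)


def build_inventory(attr_names):
    """Map each language tag to the base attribute names localized in it."""
    found = {}
    for name in attr_names:
        match = TAGGED_NAME.match(name)
        if match:
            found.setdefault(match.group("tag"), []).append(match.group("base"))
    return found


names = ["title", "title.fr-CA", "title.es-MX", "long_name.fr-CA", "_FillValue"]
print(build_inventory(names))
```

One caveat with this approach: an unrelated attribute whose name happens to end in a dot and two or three letters (for example a hypothetical "file.id") would be picked up as a localization, so a real implementation would likely validate tags against the IANA subtag registry.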

@DocOtak
Member

DocOtak commented Sep 24, 2024

I was really hoping to participate in the hackathon for this. I had even managed to get the ERDDAP source working on my computer but couldn't do much more than build/run it. I'm not a Java person and was really hoping for help on this. While at the hackathon, I was planning (see above thread) to have a container variable possible solution implemented in ERDDAP that could be compared to the one done already. I'm motivated by two things:

  1. Encoding meaning in the attribute keys themselves is entirely new/foreign to CF and is something I think we must avoid introducing.

  2. I strongly feel that CF should avoid introducing new attributes whose values require custom parsing algorithms. The complexity of parsing the strings in e.g. cell methods I think was a misstep, and in very casual conversation at the workshop (I think over beers in the evening), the feeling was these are easy to write, but very difficult to parse and use as someone receiving the data. CF should trend toward attributes whose values are either exactly defined by CF in an enumeration, even a massive one like the standard_name, or are defined by an external standard that we don't change/extend (like BCP 47). Even in the example ERDDAP implementation, some of the difficulties of dealing with custom parsing were noted in a comment:

    // TODO: we should enforce a better removal of ":" to ensure the string is well formatted"

    A container variable was just added for describing lossy compression via quantization.

I think I made some example files a year or two ago, but IIRC they were complex to show capabilities. I'll try to make some simple files soon and show them here.

@Dave-Allured
Contributor

One particular advantage of suffixes is that you get backward compatibility with no complications. If you have e.g. title and title.fr, then existing naive applications will continue to operate normally and display title, with no awareness nor interference from alternative attributes. With a container, I am afraid that you would need either duplicate copies of the default string, or else some other awkward device to tell the container to look elsewhere for the default string.

I was mistaken, so I retract this comment. This backward compatibility could be done exactly the same way with either containers or suffixes. Either way, I expect there would normally be a traditional attribute in the default locale (to be determined elsewhere). Then the "extended attributes" would consist ONLY of the non-default values. Here is a simple imagined container example.

:title = "English Title";
:title_localized = "fr-CA:Titre français; es-MX:Título en español";
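
For comparison with the other encodings in this thread, a reader of such a packed attribute would need something like the following Python sketch. It assumes ";" separates entries and the first ":" in each entry separates the tag from the text; both are guesses from the single example above, and the function name is hypothetical.

```python
def parse_localized_value(value):
    """Split 'tag:text; tag:text' into {tag: text}.

    Assumes ';' never occurs inside the text itself, which a real
    convention would have to guarantee or provide an escape for.
    """
    result = {}
    for entry in value.split(";"):
        tag, sep, text = entry.strip().partition(":")
        if sep:  # skip empty or malformed entries
            result[tag.strip()] = text.strip()
    return result


print(parse_localized_value("fr-CA:Titre français; es-MX:Título en español"))
```

The need to reserve ";" (or define escaping) is one instance of the custom-parsing concern raised earlier in the thread.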

@turnbullerin
Author

turnbullerin commented Sep 25, 2024 via email

@JonathanGregory
Contributor

Dear Erin

Sorry for your illness; I'm sure we all hope you recover soon.

In Sect 2.3 of the working draft, we quite recently added, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." Therefore .fr-CA in an attribute name would not go against CF recommendations. If I remember correctly, this was done to support the present issue.

Best wishes

Jonathan

@larsbarring
Contributor

larsbarring commented Oct 9, 2024

Erin,

I hope that your situation is improving.

I just want to make a couple of comments in relation to your status update.

  • Is the ERDDAP folks' resistance towards a .fr-CA solution simply based on the fact that the period and hyphen are allowed but not recommended, or are there deeper concerns regarding interoperability? I believe the specific wording was more of a mistake than a deliberate choice. And we are right now discussing updating which characters (Unicode codepoints) are recommended, allowed and disallowed, as well as making the distinction between these categories more clear.
  • You mention that __fr_CA is on the table, and I just want to draw attention to the fact that double underscores (Unicode low_lines) already have special meaning in relation to OGC netCDF-LD, specifically for prefixes. As I remember the previous discussions on localization, I think we mainly thought of the tags as suffixes. As I have no knowledge of OGC netCDF-LD, I cannot judge whether these two uses of double underscore are consistent with each other, or whether they might be contradictory, thus creating interoperability clashes.

Ping @ChrisJohnNOAA

@ChrisJohnNOAA

The main concern I have from ERDDAP is that we allow for exporting of data in many different formats intended for use in a variety of programming languages, both of which can have limitations on what characters can be used in variable names. If this suffix is included in the data files that ERDDAP exports, we are severely limited in what characters can be included.

If you need more context than that, this part of the ERDDAP discussion might be useful: ERDDAP/erddap#114 (comment)

@Dave-Allured
Contributor

both of which can have limitations on what characters can be used in variable names.

@ChrisJohnNOAA, okay. How about we focus on attribute names only, and skip variable names for now? By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default. Will that be okay for ERDDAP?

@DocOtak
Member

DocOtak commented Oct 9, 2024

Here is my "Container Variable" proposal for localizations

Define two new attributes:

  • locale - contains a single BCP 47 language tag
  • localizations - char array with a space separated list of variable names containing other available localizations

Localization data is contained in a container variable, like the existing geometry container and the newly added quantization container variables. Localization containers must contain a locale attribute with a BCP 47 language tag. All other attributes on this container variable are localized versions of attributes in the referencing scope (global or variable).

Why I like this:

  • It exactly defines two new attributes that can be looked for and added to the attribute tables.
  • The entire value of the locale attribute is defined well by an external standard
  • The entire value of localizations attribute is using well established CF convention of a space separated string (e.g. coordinates, ancillary variables)
  • It uses a now well established mechanism of container variables for extra data (geometry and quantization parameters)
  • All versions of a localized attribute have the same name.
  • Parsers for BCP 47 strings seem to be available for many languages.
  • It does not potentially conflict with OGC's netCDF-LD
  • It does not require any additional characters to be allowed as either variable or attribute names.
  • It does not require parsing/regex/etc of attribute or variable names, only simple exact matching.
  • It does not require parsing/regex/etc of a prefix/suffix mapping to BCP 47 strings

The last three bullet points I view as a security feature that might be relevant when implementing in something like ERDDAP that I think has some US Government requirements imposed on it.

Here is a simple example for what this looks like in a full CDL:

netcdf locale_example {
variables:
	double alt_locale ;
		alt_locale:locale = "fr-CA" ;
		alt_locale:title = "Titre en français" ;

// global attributes:
		:locale = "en-CA" ;
		:title = "English Title" ;
		:localizations = "alt_locale" ;
data:

 alt_locale = NaN ;
}

If the user's locale was Canadian French, the attributes from the available Canadian French localization container variable would replace those in the global attributes. BCP 47 defines some algorithm on quality weights and how to fall back to find a localized string; we must define a default behavior if no matching locale is found, which for us would be the locale that is not in a container variable. In the above example, the default locale is en-CA. If a variable referenced by localizations does not exist in the netCDF file, it should be ignored.

For ERDDAP, I think the combination of combinedGlobalAttributes and the contents of the edv array should allow the construction of a data structure that could hold all the localization data, and the java.util.Locale class looks like it can handle all the matching and parsing. Since this proposal uses the BCP 47 language tags without any additions, the contents of the locale attribute could be passed in without additional processing (presumably the library would throw if the tag is ill-formed).

Other situations:

  • Each attribute should be searched for according to the priority list from the user.
  • It is invalid for a localization container to contain an attribute not present in the default container, i.e. localizations must be subsets of the default locale
  • In addition to global attributes, variable attributes may have localizations using the same locale and localizations list mechanism. (Question: should the locale attribute be optional if locale is present in the global attributes?)
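
The container-variable rules above can be sketched in a few lines of Python. The dictionaries below stand in for the global attributes and container variables of the CDL example, and the function names are hypothetical. For brevity the sketch matches locales exactly (case-insensitively); a real reader would presumably use a BCP 47 matching library instead.

```python
def resolve_attr(global_attrs, variables, name, wanted_locale):
    """Return the attribute from a matching container, else the default."""
    for var_name in global_attrs.get("localizations", "").split():
        container = variables.get(var_name)
        if container is None:
            continue  # referenced variable is missing: ignore it
        same_locale = container.get("locale", "").lower() == wanted_locale.lower()
        if same_locale and name in container:
            return container[name]
    return global_attrs[name]


def invalid_container_attrs(global_attrs, container):
    """List container attributes that have no counterpart in the default set."""
    return sorted(k for k in container if k != "locale" and k not in global_attrs)


global_attrs = {
    "locale": "en-CA",
    "title": "English Title",
    "localizations": "alt_locale missing_var",
}
variables = {"alt_locale": {"locale": "fr-CA", "title": "Titre en français"}}

print(resolve_attr(global_attrs, variables, "title", "fr-CA"))
print(resolve_attr(global_attrs, variables, "title", "es-MX"))
```

Only exact name lookups and a whitespace split are needed here, which is the property the bullet points above emphasize.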

@rmendels

@DocOtak @larsbarring - Above it says:

"The last three bullet points I view as a security feature that might be relevant when implementing in something like ERDDAP that I think has some US Government requirements imposed on it."

Besides the points made by @ChrisJohnNOAA that we try to be compatible with as many platforms as possible, which puts constraints on us that we can't control, we get all sorts of security scans, many of which flag everything under the sun. These are scanners like Nessus and Qualys, and we try to make sure that to the best of our knowledge an ERDDAP release will not fail a security scan. This can also place some restrictions on what we can do (and is why we encourage people to upgrade to the latest version - the last thing we want is for an ERDDAP running anywhere to fail a scan, which could lead to a lot of shutdowns until fixed - many of you may not remember when that happened to OPeNDAP/TDS quite a few years ago).

@DocOtak
Member

DocOtak commented Oct 10, 2024

@rmendels I knew ERDDAP has some additional constraints like that, but wasn't sure of the details. I do know that input sanitization is "hard" and want something that the code owners of ERDDAP can be confident in.

@larsbarring
Contributor

larsbarring commented Oct 10, 2024

@DocOtak In principle I like the 10 bullet points of your "Container Variable" solution, but I am not sure how it plays out when other components of a file are also localized. Would you be able to create a small but complete CDL file?

At the same time I (still) kind of like my own suggestion based on Erin's previous ideas. Here is the CDL for a small complete mockup example:

netcdf test_LB {
dimensions:
	lat = 5 ;
	lon = 1 ;
variables:
	double lat(lat) ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
	double lon(lon) ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
	float uas(lat, lon) ;
		uas:standard_name = "eastward_wind" ;
		uas:long_name = "Zonal Surface Wind Speed" ;
		uas:long_name_sv = "Zonal vindhastighet nära marken" ;
		uas:long_name_fr = "Vitesse du vent zonal en surface" ;
		uas:long_name_esmx = "Velocidad de viento en superficie" ;
		uas:units = "m s-1" ;
		uas:_FillValue = 1.e+20f ;
		uas:ancillary_variables = "uas_qc" ;
	byte uas_qc(lat, lon) ;
		uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
		uas_qc:long_name_sv = "Data kvalitet hos zonal vindhastighet nära marken" ;
		uas_qc:long_name_fr = "Qualité des données sur Vitesse du vent zonal en surface" ;
		uas_qc:long_name_esmx = "Calidad de datos de la velocidad de viento zonal en superficie" ;
		uas_qc:standard_name = "status_flag" ;
		uas_qc:_FillValue = -128b ;
		uas_qc:valid_range = 0b, 2b ;
		uas_qc:flag_values = 0b, 1b, 2b ;
		uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
		uas_qc:flag_meanings_sv = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
		uas_qc:flag_meanings_fr = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
		uas_qc:flag_meanings_esmx = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
// global attributes:
	:Conventions = "CF-1.8" ;
        :locales = "default:en-US _sv:sv _fr:fr _esmx:es-MX" ;
        :title = "This is a test" ;
        :title_sv = "Detta är ett test" ;
        :title_fr = "Ceci est un essai" ;
        :title_esmx = "Este es un ensayo" ;
data:
 lat = 0, 5, 10, 15, 20 ;
 lon = 0 ;
 uas = 1, 2, 4, 48, 160 ;
 uas_qc = 0, 0, 0, 2, 1 ;
}

It does not tick off as many of the 10 bullet points, and I will in no way try to push for this solution.

But maybe @DocOtak you could do something similar for your "Container Variable" solution to let the ERDDAP folks -- and everyone else of course -- see for themselves and create small test files using ncgen -b.

@ChrisJohnNOAA

both of which can have limitations on what characters can be used in variable names.

@ChrisJohnNOAA, okay. How about we focus on attribute names only, and skip variable names for now? By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default. Will that be okay for ERDDAP?

I was asked what the ERDDAP concern with ".fr-CA" was and I mentioned the concerns because in the past I've seen recommendations to treat Attribute and Variable names the same in CF. To be clear I think allowing '.' and '-' in variable names is a very bad idea.

I don't think attribute names are generally exported from ERDDAP for non-nc file formats. Most likely using '.' and '-' in attribute names would be fine from the ERDDAP perspective, though I haven't fully audited the ERDDAP code for that.

@sethmcg
Contributor

sethmcg commented Oct 10, 2024

By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default.

I'm not convinced that's a safe assumption. I mean, if you want to use that as a criterion for whether a user program is reasonable, fair enough. It's definitely what programs should do. But in terms of software that people actually use, I don't know that we can rely on that being true.

We have to remember that plenty of scientific software is written by people who are scientists first and coders second, and who may not follow best practices of software engineering. I will freely admit to being one of those people, and I have written lots of code that is very cavalier about things like checking inputs...

@DocOtak
Member

DocOtak commented Oct 10, 2024

@larsbarring
I translated your example into what I have proposed. It is a bit longer; an actual netCDF file of it is also attached. A thing that I might do is pull out the flag definitions into their own container variables and reference the common localizations from all the variables that use them, the same way the quantization parameters container variable is meant to be referenced.

I have two concerns with the way your example encodes this information:

  1. attribute name modification via suffixes or any other "mangling" is a completely new concept in CF that has no precedent as a way CF encodes any information. I think I've only seen this in the OGC netCDF-LD proposal. I personally think this is a concept that should not be introduced into CF at all and deserves significant debate on its own as a concept.
  2. The grammar that encodes the suffix-to-locale information seems to come from or be inspired by cell methods. I was having a conversation with one of your colleagues at SMHI (I'm having great trouble remembering his name); if I recall, @davidhassell was also present. We were lamenting the difficulty of parsing these: they are easy for a data producer to write, but very difficult to parse. In your example "default:en-US _sv:sv _fr:fr _esmx:es-MX" ; you have no space between the suffix and the language tag, but in the proposal text above (copied here) "default: en-CA _fr: fr-CA _es: es-MX"; there is a space. Either we have defined no delimiter between the different key:value pairs, or we are using spaces in multiple contexts. In my opinion, this type of encoding should not be used in any new concepts within CF. CF should avoid anything that requires data readers/users/consumers to implement their own parsing to correctly interpret a string.

While the following example is long and verbose, I think that it is worth it because everything is mostly in data structures that are in some ways more "ready to go" in that no transformations need to occur. I also think that by including locale information within the netCDF file, we are explicitly making something that is not meant for humans to look at without the computer processing the locale data first. I don't ever look at all the .po or .mo files. For client software reading these netCDF files, I would expect the user to basically say "I want to look at this file, this is my locale" then the software sets the localized attributes on the actual variables, then removes all the container variables since they are not needed anymore.

test_LB_b.nc.gz

netcdf test_LB_b {
dimensions:
	lat = 5 ;
	lon = 1 ;
variables:
	double lat(lat) ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:axis = "Y" ;
	double lon(lon) ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
	float uas(lat, lon) ;
		uas:_FillValue = 1.e+20f ;
		uas:standard_name = "eastward_wind" ;
		uas:long_name = "Zonal Surface Wind Speed" ;
		uas:units = "m s-1" ;
		uas:ancillary_variables = "uas_qc" ;
		uas:localizations = "uas_locale1 uas_locale2 uas_locale3" ;
	byte uas_qc(lat, lon) ;
		uas_qc:_FillValue = -128b ;
		uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
		uas_qc:standard_name = "status_flag" ;
		uas_qc:valid_range = 0b, 2b ;
		uas_qc:flag_values = 0b, 1b, 2b ;
		uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
		uas_qc:localizations = "uas_qc_locale1 uas_qc_locale2 uas_qc_locale3" ;
	double g_locale1 ;
		g_locale1:title = "Detta är ett test" ;
		g_locale1:locale = "sv" ;
	double g_locale2 ;
		g_locale2:title = "Ceci est un essai" ;
		g_locale2:locale = "fr" ;
	double g_locale3 ;
		g_locale3:title = "Este es un ensayo" ;
		g_locale3:locale = "es-MX" ;
	double uas_locale1 ;
		uas_locale1:long_name = "Zonal vindhastighet nära marken" ;
		uas_locale1:locale = "sv" ;
	double uas_locale2 ;
		uas_locale2:long_name = "Vitesse du vent zonal en surface" ;
		uas_locale2:locale = "fr" ;
	double uas_locale3 ;
		uas_locale3:long_name = "Velocidad de viento en superficie" ;
		uas_locale3:locale = "es-MX" ;
	double uas_qc_locale1 ;
		uas_qc_locale1:long_name = "Data kvalitet hos zonal vindhastighet nära marken" ;
		uas_qc_locale1:flag_meanings = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
		uas_qc_locale1:locale = "sv" ;
	double uas_qc_locale2 ;
		uas_qc_locale2:long_name = "Qualité des données sur Vitesse du vent zonal en surface" ;
		uas_qc_locale2:flag_meanings = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
		uas_qc_locale2:locale = "fr" ;
	double uas_qc_locale3 ;
		uas_qc_locale3:long_name = "Calidad de datos de la velocidad de viento zonal en superficie" ;
		uas_qc_locale3:flag_meanings = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
		uas_qc_locale3:locale = "es-MX" ;

// global attributes:
		:Conventions = "CF-1.8" ;
		:title = "This is a test" ;
		:locale = "en-US" ;
		:localizations = "g_locale1 g_locale2 g_locale3" ;
data:

 lat = 0, 5, 10, 15, 20 ;

 lon = 0 ;

 uas =
  1,
  2,
  4,
  48,
  160 ;

 uas_qc =
  0,
  0,
  0,
  2,
  1 ;

 g_locale1 = _ ;

 g_locale2 = _ ;

 g_locale3 = _ ;

 uas_locale1 = _ ;

 uas_locale2 = _ ;

 uas_locale3 = _ ;

 uas_qc_locale1 = _ ;

 uas_qc_locale2 = _ ;

 uas_qc_locale3 = _ ;
}
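
The client-side step described above (the user states a locale, the software copies the matching localized attributes onto the real variables, then drops the container variables) could be sketched roughly as follows. This is only an illustration under assumptions: the file is modelled as a plain Python dict rather than read through any netCDF library, the key "" stands in for the global attributes, the function name localize is made up, and locales are matched by exact string comparison with no BCP 47 fallback.

```python
from copy import deepcopy

# Hypothetical in-memory model of part of the file above:
# variable name -> attribute dict, with "" holding the global attributes.
FILE = {
    "": {"title": "This is a test", "locale": "en-US",
         "localizations": "g_locale1 g_locale2 g_locale3"},
    "uas": {"long_name": "Zonal Surface Wind Speed",
            "localizations": "uas_locale1 uas_locale2 uas_locale3"},
    "g_locale1": {"title": "Detta är ett test", "locale": "sv"},
    "g_locale2": {"title": "Ceci est un essai", "locale": "fr"},
    "g_locale3": {"title": "Este es un ensayo", "locale": "es-MX"},
    "uas_locale1": {"long_name": "Zonal vindhastighet nära marken", "locale": "sv"},
    "uas_locale2": {"long_name": "Vitesse du vent zonal en surface", "locale": "fr"},
    "uas_locale3": {"long_name": "Velocidad de viento en superficie", "locale": "es-MX"},
}


def localize(file_attrs, wanted):
    """Return a copy with attributes for `wanted` applied and containers removed."""
    out = deepcopy(file_attrs)
    containers = set()
    for name, attrs in file_attrs.items():
        refs = attrs.get("localizations", "").split()
        containers.update(refs)
        for ref in refs:
            container = file_attrs.get(ref, {})
            # Each referenced container has to be inspected to learn its locale.
            if container.get("locale") == wanted:
                # Assumption: copying "locale" too records which language was applied.
                out[name].update(container)
        out[name].pop("localizations", None)
    # The containers are no longer needed once their attributes are applied.
    for ref in containers:
        out.pop(ref, None)
    return out


french = localize(FILE, "fr")
print(french[""]["title"])
print(french["uas"]["long_name"])
```

If the requested locale has no container, the original attributes are left as they are and only the bookkeeping (the localizations attributes and the containers) is removed.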

@JonathanGregory
Contributor

Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, as discussed in this issue.

If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently released version. @ChrisJohnNOAA commented above that "allowing '.' and '-' in variable names is a very bad idea". Please add your support to conventions issue 548 if you agree with @Dave-Allured that they should not be allowed.

@JonathanGregory
Contributor

Dear Andrew @DocOtak, @larsbarring, @turnbullerin et al.

I agree with Andrew that using an algorithm to predict the name of an attribute would be unlike previous CF practice. Although we choose meaningful names for CF attributes, all those names are explicitly defined (in Appendix A and elsewhere). They have to be hard-coded in software, and in that sense they are treated as if they were arbitrary, like variable and dimension names are.

Furthermore, although we could make a convention with suffixes for attributes work in netCDF, it might not work in other formats CF data could be converted into. Another format might have different rules about characters allowed in names, or it might not even have names at all.

Therefore I prefer the container variable as demonstrated by @DocOtak, but I'd combine it with a "keyword: value" syntax like the one Erin @turnbullerin suggested. That is because this kind of attribute on the data variable tells you which container variable you want. Without it, you have to search them all to identify the right one, which isn't the general CF pattern. With this convention, the allowed keywords in the localizations attribute are any of the IETF BCP 47 language tags. The locale attribute is a data variable and global attribute, rather than a container variable attribute, so there are fewer attributes in total.

If I have understood correctly, @DocOtak, you don't like this kind of syntax. However, quite a lot of CF attributes use it. I don't think it's difficult to parse. It's a blank separated list of words, some of which (the keywords) end in :. There should be space between : and the value. Perhaps we should clarify this in the convention with a general statement. However, it's not hard to repair the mistake, if you want to tolerate it. The values and keywords never contain any other :, so a string substitution of ":" → ": " will convert the string into the correct format.
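
To make the parsing rule just described concrete, here is a minimal sketch of it: apply the ":" → ": " repair, split on blanks, and treat every word ending in ":" as a keyword. The function name parse_keyword_values is invented for illustration and is not part of CF or any library, and the sketch assumes, as stated above, that keywords and values never contain another ":".

```python
def parse_keyword_values(text):
    """Parse a blank-separated "keyword: value" string into a dict."""
    # Repair step: guarantee a blank after every colon, then split on blanks.
    tokens = text.replace(":", ": ").split()
    collected = {}
    keyword = None
    for token in tokens:
        if token.endswith(":"):
            # A word ending in ":" starts a new keyword.
            keyword = token[:-1]
            collected[keyword] = []
        elif keyword is None:
            raise ValueError(f"value {token!r} appears before any keyword")
        else:
            collected[keyword].append(token)
    # Join multi-word values back together with single blanks.
    return {key: " ".join(words) for key, words in collected.items()}


print(parse_keyword_values("sv: uas_locale1 fr: uas_locale2 es-MX: uas_locale3"))
print(parse_keyword_values("sv:uas_locale1 fr:uas_locale2"))
```

Both spellings, with and without the blank after the colon, give the same mapping from language tag to container variable name.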

With this syntax, the example would be as below.

Best wishes

Jonathan

netcdf test_LB_b {
dimensions:
  lat = 5 ;
  lon = 1 ;
variables:
  double lat(lat) ;
    lat:standard_name = "latitude" ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;
    lat:axis = "Y" ;
  double lon(lon) ;
    lon:standard_name = "longitude" ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
    lon:axis = "X" ;
  float uas(lat, lon) ;
    uas:_FillValue = 1.e+20f ;
    uas:standard_name = "eastward_wind" ;
    uas:long_name = "Zonal Surface Wind Speed" ;
    uas:units = "m s-1" ;
    uas:ancillary_variables = "uas_qc" ;
    uas:locale = "en-US" ;
    uas:localizations = "sv: uas_locale1 fr: uas_locale2 es-MX: uas_locale3" ;
  byte uas_qc(lat, lon) ;
    uas_qc:_FillValue = -128b ;
    uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
    uas_qc:standard_name = "status_flag" ;
    uas_qc:valid_range = 0b, 2b ;
    uas_qc:flag_values = 0b, 1b, 2b ;
    uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
    uas_qc:locale = "en-US" ;
    uas_qc:localizations = "sv: uas_qc_locale1 fr: uas_qc_locale2 es-MX: uas_qc_locale3" ;
  double g_locale1 ;
    g_locale1:title = "Detta är ett test" ;
  double g_locale2 ;
    g_locale2:title = "Ceci est un essai" ;
  double g_locale3 ;
    g_locale3:title = "Este es un ensayo" ;
  double uas_locale1 ;
    uas_locale1:long_name = "Zonal vindhastighet nära marken" ;
  double uas_locale2 ;
    uas_locale2:long_name = "Vitesse du vent zonal en surface" ;
  double uas_locale3 ;
    uas_locale3:long_name = "Velocidad de viento en superficie" ;
  double uas_qc_locale1 ;
    uas_qc_locale1:long_name = "Data kvalitet hos zonal vindhastighet nära marken" ;
    uas_qc_locale1:flag_meanings = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
  double uas_qc_locale2 ;
    uas_qc_locale2:long_name = "Qualité des données sur Vitesse du vent zonal en surface" ;
    uas_qc_locale2:flag_meanings = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
  double uas_qc_locale3 ;
    uas_qc_locale3:long_name = "Calidad de datos de la velocidad de viento zonal en superficie" ;
    uas_qc_locale3:flag_meanings = "calidad_buena sensor_no_funcional fuera_rango_válido" ;

// global attributes:
    :Conventions = "CF-1.8" ;
    :title = "This is a test" ;
    :locale = "en-US" ;
    :localizations = "sv: g_locale1 fr: g_locale2 es-MX: g_locale3" ;
data:

 lat = 0, 5, 10, 15, 20 ;

 lon = 0 ;

 uas =
  1,
  2,
  4,
  48,
  160 ;

 uas_qc =
  0,
  0,
  0,
  2,
  1 ;

// container variables, contents immaterial:
 g_locale1 = _ ;

 g_locale2 = _ ;

 g_locale3 = _ ;

 uas_locale1 = _ ;

 uas_locale2 = _ ;

 uas_locale3 = _ ;

 uas_qc_locale1 = _ ;

 uas_qc_locale2 = _ ;

 uas_qc_locale3 = _ ;
}

@DocOtak
Member

DocOtak commented Oct 20, 2024

@JonathanGregory strong disagree that needing to search the attributes of referenced variables is not a general CF pattern. It is very pervasive and probably even a fundamental CF pattern, with prominent usage in the ancillary_variables and coordinates attributes. Ancillary variables contain status flags, uncertainties (e.g. standard error), etc. There are 19 standard names that say you need to use ancillary_variables to figure out linkage. Things like angles, wavelengths, and vertical extents are all referenced in coordinates; there are so many examples of where coordinates are used that I couldn't figure out how to get an accurate count from the standard name table by searching it (it's in the hundreds). For all of these, you won't know what the referenced variable contains until you read its attributes.

In your specific example, I'm not sure I like the lack of locale on the localization variables themselves. Without the context of the referencing variables, what locale should I assume they are? Would it be the global locale attribute, which is "en-US" in this case? Continuing with that line of thinking, I don't think the repeated "en-US" locale on the uas and uas_qc variables is necessary.

Re the key: value pattern, you are correct that I really don't like it, and I find issue not with the "ease" of parsing, but that we are asking someone to write their own parser at all:

  • It doesn't make use of the netCDF structures already available: attributes are key-value pairs, and variables can be used as collections of key-value pairs.
  • The strings are not some existing markup language: it's not JSON, YAML, TOML, or even XML, etc.
  • There is no unique delimiter between the key: value pairs.
  • Most instances that I can find in the CF document define by example and not by actual rules.

The most recent addition of something that looks like key: value is the units metadata, which avoids the problems by defining three exact strings that only look like key: value pairs.

I feel somewhat strongly that if CF wants to continue to use its own bespoke syntax for these string attributes, it needs to define the grammar of them formally in some way, e.g. using EBNF/ISO 14977. I don't think that a regex would be acceptable here either.

@JonathanGregory JonathanGregory added the CF1.12? We might conclude this issue in time for CF1.12 label Oct 20, 2024
@JonathanGregory
Contributor

Good morning, @DocOtak

Regarding your comment that the need to search the attributes of referenced variables is a pervasive CF pattern, whereas I said that it isn't CF-like: I'm sorry that I didn't consider this remark more carefully! You're right that there are cases where an attribute names several variables (such as ancillary_variables and coordinates, as you say) and you have to inspect them to find out which is which.

I was thinking instead of the situations where an attribute identifies variables by their purpose, such as formula_terms and cell_measures, which use the "keyword: value" syntax. Where we can do it, this method seems more convenient to me, because it's easier to find what you need.

In my version of the localization example, you have the data variable uas, which has a long_name, and the locale attribute tells you the long_name is English. If you want a French version, you inspect the localizations attribute to see if it has a fr: keyword. You find that it does, and the value uas_locale2 names the variable which contains the long_name attribute in French. On the other hand, you can see there is no de: keyword in the localizations attribute, so without inspecting any variables you know that the long_name doesn't have a German version. I think that's convenient.

My version is no more than a rearrangement of yours. I have replaced the locale attributes of the container variables with the keywords of the localizations attribute of the referencing variable - that's all.

I don't think the container variables need a locale, because they are subsidiary variables, the way I understand it. They are adjuncts to the referencing variable. They host alternative versions of some of its attributes. I regard this as like boundary variables, which are subsidiary to coordinate variables, and therefore they don't need metadata of their own in general. They're adequately described by the variable which references them.

I agree with you that we don't need the locale attribute of the data variables uas and uas_qc if we say that the file attribute locale supplies a default locale for all data variables, as well as the locale for any other file attributes. That would be even simpler and better, and I would prefer it.

Best wishes

Jonathan

@JonathanGregory
Contributor

Dear Andrew @DocOtak

You made a good point that "attributes are key-value pairs". We use that idea for various kinds of container variable, such as grid mapping. Here's a modified version of my previous example (itself a modified version of yours), in which I use a "supercontainer" variable, instead of an attribute containing key-value pairs, to point to the localized metadata containers. The supercontainer may have any attribute name which is a legal language tag, and no other attributes.

Do you prefer this? I have also assumed that the file locale attribute is a default for data variables, as discussed above.

Best wishes

Jonathan

netcdf test_LB_b {
dimensions:
  lat = 5 ;
  lon = 1 ;
variables:
  double lat(lat) ;
    lat:standard_name = "latitude" ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;
    lat:axis = "Y" ;
  double lon(lon) ;
    lon:standard_name = "longitude" ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
    lon:axis = "X" ;
  float uas(lat, lon) ;
    uas:_FillValue = 1.e+20f ;
    uas:standard_name = "eastward_wind" ;
    uas:long_name = "Zonal Surface Wind Speed" ;
    uas:units = "m s-1" ;
    uas:ancillary_variables = "uas_qc" ;
    uas:localizations = "uas_localizations";
  float uas_localizations;
    uas_localizations:sv="uas_locale1";
    uas_localizations:fr="uas_locale2";
    uas_localizations:es-MX="uas_locale3" ;
  byte uas_qc(lat, lon) ;
    uas_qc:_FillValue = -128b ;
    uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
    uas_qc:standard_name = "status_flag" ;
    uas_qc:valid_range = 0b, 2b ;
    uas_qc:flag_values = 0b, 1b, 2b ;
    uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
    uas_qc:localizations = "uas_qc_localizations";
  float uas_qc_localizations;
    uas_qc_localizations:sv="uas_qc_locale1";
    uas_qc_localizations:fr="uas_qc_locale2";
    uas_qc_localizations:es-MX="uas_qc_locale3" ;
  double uas_locale1 ;
    uas_locale1:long_name = "Zonal vindhastighet nära marken" ;
  double uas_locale2 ;
    uas_locale2:long_name = "Vitesse du vent zonal en surface" ;
  double uas_locale3 ;
    uas_locale3:long_name = "Velocidad de viento en superficie" ;
  double uas_qc_locale1 ;
    uas_qc_locale1:long_name = "Data kvalitet hos zonal vindhastighet nära marken" ;
    uas_qc_locale1:flag_meanings = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
  double uas_qc_locale2 ;
    uas_qc_locale2:long_name = "Qualité des données sur Vitesse du vent zonal en surface" ;
    uas_qc_locale2:flag_meanings = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
  double uas_qc_locale3 ;
    uas_qc_locale3:long_name = "Calidad de datos de la velocidad de viento zonal en superficie" ;
    uas_qc_locale3:flag_meanings = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
  float localizations;
    localizations:sv="g_locale1";
    localizations:fr="g_locale2";
    localizations:es-MX="g_locale3" ;
  double g_locale1 ;
    g_locale1:title = "Detta är ett test" ;
  double g_locale2 ;
    g_locale2:title = "Ceci est un essai" ;
  double g_locale3 ;
    g_locale3:title = "Este es un ensayo" ;

// global attributes:
    :Conventions = "CF-1.8" ;
    :title = "This is a test" ;
    :locale = "en-US" ;
    :localizations = "localizations";
data:

 lat = 0, 5, 10, 15, 20 ;

 lon = 0 ;

 uas =
  1,
  2,
  4,
  48,
  160 ;

 uas_qc =
  0,
  0,
  0,
  2,
  1 ;

// container variables, contents immaterial:
 g_locale1 = _ ;

 g_locale2 = _ ;

 g_locale3 = _ ;

 uas_locale1 = _ ;

 uas_locale2 = _ ;

 uas_locale3 = _ ;

 uas_qc_locale1 = _ ;

 uas_qc_locale2 = _ ;

 uas_qc_locale3 = _ ;
}
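
For comparison, looking up a localized attribute in the supercontainer layout above needs no string parsing at all, only two attribute lookups. The sketch below is an illustration under assumptions: the file is modelled as a plain Python dict (the key "" stands in for the file attributes), the function name localized_attribute is invented, language tags are matched by exact string comparison, and falling back to the default-locale attribute when no translation exists is a choice made for this example rather than anything agreed in this issue.

```python
# Hypothetical in-memory model of part of the file above:
# variable name -> attribute dict, with "" holding the file attributes.
FILE = {
    "": {"title": "This is a test", "locale": "en-US",
         "localizations": "localizations"},
    "localizations": {"sv": "g_locale1", "fr": "g_locale2", "es-MX": "g_locale3"},
    "g_locale2": {"title": "Ceci est un essai"},
    "uas": {"long_name": "Zonal Surface Wind Speed", "units": "m s-1",
            "localizations": "uas_localizations"},
    "uas_localizations": {"sv": "uas_locale1", "fr": "uas_locale2",
                          "es-MX": "uas_locale3"},
    "uas_locale2": {"long_name": "Vitesse du vent zonal en surface"},
}


def localized_attribute(file_attrs, var, attr, wanted):
    """Return (value, locale) for `attr` of `var`, preferring the `wanted` locale."""
    attrs = file_attrs[var]
    # The file-level locale acts as the default for every data variable.
    default_locale = attrs.get("locale", file_attrs[""].get("locale"))
    # First hop: the supercontainer named by the localizations attribute.
    super_name = attrs.get("localizations")
    supercontainer = file_attrs.get(super_name, {}) if super_name else {}
    # Second hop: the container whose name is stored under the language tag.
    container_name = supercontainer.get(wanted)
    if container_name is not None:
        container = file_attrs.get(container_name, {})
        if attr in container:
            return container[attr], wanted
    # No translation available: fall back to the attribute on the variable itself.
    return attrs.get(attr), default_locale


print(localized_attribute(FILE, "uas", "long_name", "fr"))
print(localized_attribute(FILE, "uas", "long_name", "de"))
```

As in the keyword version, the absence of a de attribute on the supercontainer tells you there is no German text without opening any of the locale containers.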

@JonathanGregory JonathanGregory removed the CF1.12? We might conclude this issue in time for CF1.12 label Nov 11, 2024