Add support for variables of type string #139
Comments
It is probably better to debate changes in GitHub issues rather than in pull requests. Therefore I am re-posting my comment about character arrays here (next comment), and removing it from the pull request #138. Also I was once told that there is a preference for discussing something in issues first, then create the pull request later. This issue was started in the reverse order, thus I was a bit confused. I am a relative newbie to GitHub, so please educate me if I am off base about this. Just trying to do The Right Thing. |
Variable length strings can be stored in character arrays by the use of either null padding or blank padding. The storage is fixed length, but the purpose is variable length. The following sentence in the original pull request conflicts with this usage: "If the character array option is chosen, the strings in a given array are defined to be equal in length." This sentence is not essential to the description of strings within character type. It is an unneeded constraint. This sentence is also not necessary for the introduction of string data type. Therefore I request this sentence simply be removed. Thanks. |
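The padding idea in the comment above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `pad_fixed` and `unpad_fixed` and the sample place names are hypothetical, and no netCDF library is involved. It assumes ASCII text and shows that strings of different logical lengths can share one fixed storage width, with either null or blank padding.

```python
def pad_fixed(text: str, width: int, pad: bytes = b"\x00") -> bytes:
    """Store text in exactly `width` bytes, padding on the right."""
    raw = text.encode("ascii")
    if len(raw) > width:
        raise ValueError(f"{text!r} needs {len(raw)} bytes, only {width} available")
    return raw.ljust(width, pad)


def unpad_fixed(raw: bytes) -> str:
    """Recover the text by dropping trailing null or blank padding."""
    return raw.rstrip(b"\x00 ").decode("ascii")


names = ["Reading", "Oslo", "Rio"]           # logical lengths 7, 4, 3
width = max(len(n) for n in names)           # one fixed storage length for all
null_padded = [pad_fixed(n, width) for n in names]
blank_padded = [pad_fixed(n, width, b" ") for n in names]

# Every stored value has the same length, yet the original strings come back.
recovered = [unpad_fixed(b) for b in null_padded]
```

One caveat of this sketch: a string that genuinely ends in blanks cannot be told apart from blank padding, which is one reason a reader has to know which padding convention was used.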
If it is discussing a suggested localized change to text -- a PR is the perfect place for it. Which this is, yes? Is this a contentious issue? Or just trying to come up with the best wording? -CHB |
I will bear that in mind for future issues. On this one, there is substance in my proposed wording. Let's just keep it here and see if there is any objection to my view of variable length string storage. |
Dear all I support this change being made, but I think it may require changes in more than one place in the conventions document. It needs a general statement somewhere early that an array of character data with a trailing dimension of maximum string length is equivalent to an array of strings without that dimension. Anywhere in the text which mentions the use of a character array to contain a string should be modified, and also any examples which show this could mention it as an alternative. Following the discussion at the meeting, I think this applies only to character variables, not character attributes. Is that correct? Possibly changes are needed in the conformance document too - I haven't looked. Cheers Jonathan |
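The equivalence described above, between a character array with a trailing string-length dimension and an array of strings without that dimension, might be pictured as follows. This is a rough sketch using nested Python lists in place of real netCDF variables; the function names, the `strlen` parameter, and the sample labels are assumptions made for illustration, and null padding is assumed.

```python
from typing import List


def strings_to_chars(strings: List[str], strlen: int) -> List[List[bytes]]:
    """Build a 2-D char array of shape (len(strings), strlen)."""
    rows = []
    for s in strings:
        raw = s.encode("ascii")
        if len(raw) > strlen:
            raise ValueError(f"{s!r} does not fit in {strlen} characters")
        raw = raw.ljust(strlen, b"\x00")              # null-pad to the fixed width
        rows.append([raw[i:i + 1] for i in range(strlen)])
    return rows


def chars_to_strings(char_array: List[List[bytes]]) -> List[str]:
    """Collapse the trailing (string-length) dimension into one string per row."""
    return [b"".join(row).rstrip(b"\x00 ").decode("ascii") for row in char_array]


labels = ["north", "south", "east"]                   # string array, shape (3,)
char_array = strings_to_chars(labels, strlen=5)       # char array, shape (3, 5)
round_trip = chars_to_strings(char_array)
```

Both forms carry the same three values; only the char form exposes the extra dimension.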
I was hoping that stating both ways of forming string variables were valid and effectively interchangeable would be sufficient. I'm not sure where string variables show up in the rest of the document. Any assistance with tracking those places down would be appreciated. I'm working on another change to address string attributes. The issue there is slightly different. |
Dear Jim I think that would make the document somewhat less useful, especially for anyone who looks up something in the middle of it with a presumption that strings should be allowed, and doesn't find them shown in text or examples. Most users of the document don't sit down and read it from start to finish. Moreover I feel that we must check carefully that this won't cause problems - I don't think it will, but it is better to be safe. In fact the work involved in making this change is mostly that of reading the documents to check what might be affected, I suspect. Best wishes Jonathan |
@Dave-Allured, I've modified the wording. See what you think. |
In regards to the other sections concern raised by @JonathanGregory, the chapter 6 section 1 on "Labels" has some language talking about character arrays. This was the only other part of the document I could find which talks explicitly about strings in variable data. |
Appendix H contains quite a few examples using string variables of type |
There are small changes in chapters 5 and 7 as well. |
If |
Perhaps alternate examples? Adding alternatives verbiage to each seems excessive. |
I agree with the new wording in PR #138. This handles the concept of variable length strings stored within fixed dimension character arrays. Thanks, Jim. |
I agree with @ChrisBarker-NOAA in that I feel updating existing examples is unnecessary unless Since there are now two ways of representing equivalent data values, maybe just provide a single example of how to represent the same text variable data as both string and character arrays. Other examples are trying to demonstrate different things. |
Replaced my first pull request by a more complete one, and added a pull request for the Conformance document. |
Trac is up again, so I made a Trac ticket as a placeholder. |
Dear Jim Many thanks for doing a thorough job in finding where changes are needed.
Best wishes Jonathan |
Sorry, I didn't mean to close it! By the way, I found it really convenient to look at your changes with the rich-diff comparison that I learned about at the meeting. Thanks. |
@JonathanGregory I have added the issue number to the history and attempted to fix the backticks. I had changed half the examples, overall, to string from char. I can add an example if you think we still need one. |
Note: Recent experiments with the string data type showed that some tools, like MATLAB, still don't support that data type via the NetCDF4 interface. The workaround is to reopen the file using the HDF5 interface.
|
@JimBiardCics - Thanks for the changes. Yes, I think an illustration of an equivalent string array and char array, with nothing else, would be useful in ch02. You could also comment that the examples in the remainder of the document are some of one and some of the other. Jonathan |
@JonathanGregory - How's that? |
Lovely. Many thanks. Jonathan |
@hrajagers - I've added a sentence addressing the netCDF-4 issue for |
@hrajagers brings up encoding -- always a challenge! Any reason we can't say that ALL "things" of the string type are utf-8? period. end of story. I'd also love to say that all CHAR data is ASCII (or maybe latin-1) -- if you need unicode, use a string. The odds are very good that if you are dealing with software that can only handle char, it isn't going to handle Unicode well anyway. Reasoning: For "over the wire" encoding, utf-8 encoding is the best choice, and has become a de facto standard - (and an actual standard for e.g. JSON). And lots of people think "utf-8" == "Unicode" -- they are wrong, but if we always use utf-8, then people and tools that handle unicode properly will work well, and tools and people that don't will still mostly work. See: http://utf8everywhere.org/ for a strong opinion. Personally, I think they are wrong about "in memory", but their arguments do apply to "on disk" or "over the wire" -- essentially any interchange situation. As for ASCII for CHAR -- the char type (at least in arrays) has to be fixed length -- utf-8 is not a fixed-length encoding -- that is, 10 "characters" may require 10 or more bytes to store. And if a string is truncated naively, it could result in an invalid string. Since netcdf provides a variable length string that's the obvious way to deal with Unicode. And UTF-8 is the obvious way to encode Unicode. |
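The point about UTF-8 not being a fixed-length encoding, and about naive truncation producing an invalid string, can be checked with plain Python. This is a minimal sketch; the helper name `truncate_utf8` and the sample word are hypothetical choices for illustration only.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Truncate to at most max_bytes without splitting a multi-byte character."""
    raw = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so only a dangling partial sequence at the
    # end can be invalid; errors="ignore" drops it.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


word = "Zürich"                      # 6 characters
encoded = word.encode("utf-8")       # 7 bytes, because "ü" takes two
naive = encoded[:2]                  # byte-wise cut lands in the middle of "ü"

try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False           # the naive cut is not valid UTF-8

safe = truncate_utf8(word, 2)        # keeps only the complete character "Z"
```

A fixed-width char slot sized in bytes therefore cannot promise a fixed number of characters once non-ASCII text is stored in it.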
To discuss ASCII / UTF-8, please go to issue #141. I'm hoping to address those questions there. |
I was about to post there, but that seems to be about attributes in particular, which is only a subset of the issue. But we're cross-referenced now :-) |
@ChrisBarker-NOAA It seems to me that the encoding question more directly comes to a head with attributes, but the two are definitely intertwined. I didn't want to do all of it in one giant change, so that's where I decided to make the break. |
So, getting back to this issue, I like @ChrisBarker-NOAA's idea of stating that string variables (and attributes too, but I'll talk about that in #141) should be treated as utf-8 and that char variables are to be restricted to ASCII (or perhaps latin-1). If there is demand, we could open a new ticket to add an 'encoding' attribute for other cases, but I feel like we could let that sleeping dog lie for now. Thoughts? |
+1 (obviously). And I agree -- the "encoding" attribute could be added later if there is real demand, though I'd much rather not allow arbitrary encoding. We may want to have an encoding attribute that says "utf-8", so that it will be clear to folks reading the file, but that would be the only valid value for CF. |
One more note: from #141, someone wrote: "In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. " So that's it -- "strings will be UTF-8" we're done with that part :-). Dealing with unicode with bare CHAR arrays is a bad idea -- sure you can do it, as a CHAR array can hold anything -- but we don't recommend using char arrays to hold, e.g. floating point data with an attribute saying that it's an IEEE 754 32 bit float -- we could do that, but it would be a really bad idea. The one use case that makes at least a little bit of sense is to use CHAR arrays to pass encoded data around, if and only if the library doing the passing around is not expected to interpret the data as text in any way. This is why *nix systems have been able to get away with poorly specified encoding of filenames for so long. But do CF-data handling libraries ever do that? In short, does anyone need Unicode that can't use a string type? If modern Fortran netcdf libs can't handle strings, I can't imagine they do the right thing with Unicode anyway. |
That would be me and I got called out on this so let me set the record straight. The HDF5 library does not verify or enforce the encoding of bytes stored in HDF5 strings. The HDF5 string datatype has a tag for declaring the encoding and currently supports two values, for ASCII and UTF-8. The application is responsible for setting the correct tag value. |
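Since, as described above, the encoding tag is declared by the application rather than checked by the library, a writer may want to validate the bytes itself before choosing a tag. The following is a sketch of that idea in plain Python; it does not call HDF5 or any netCDF API, and the function name `narrowest_charset` and the sample values are hypothetical.

```python
from typing import Optional


def narrowest_charset(raw: bytes) -> Optional[str]:
    """Return "ascii", "utf-8", or None if raw is valid as neither."""
    for name in ("ascii", "utf-8"):
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


plain = narrowest_charset(b"temperature")                       # pure ASCII
accented = narrowest_charset("température".encode("utf-8"))     # needs UTF-8
legacy = narrowest_charset("température".encode("latin-1"))     # fits neither tag
```

In this sketch, bytes for which the helper returns None would be mislabeled under either of the two tag values mentioned above, which is the situation the application is responsible for avoiding.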
OK, so maybe HDF5 doesn't require UTF-8 only, but it also doesn't support anything else -- at least for now. It seems it would be a really bad idea to do something unsupported for CF. |
Hello @JimBiardCics CFers I'm happy to offer to moderate this discussion, if that is helpful. @JimBiardCics please may I ask that you update the description of the pull request to use the text thank you |
Please may I ask contributors to this ticket to consider whether there are any in-principle objections, concerns, or key factors relating to this change that are still outstanding. If there are targeted comments on the text of the proposed change, please direct them to #140, as a review of the textual change. Thank you |
I support this change almost as it stands, and am grateful to Jim for preparing it. I have one textual concern. In ch06, in the "Labels" section, "not" has been deleted in the sentence, "[a scalar coordinate variable] has the same information content and can be used in the same contexts as a string-valued auxiliary coordinate variable of a size one dimension which has not been added to the data variable. This is a convenience feature." The deletion of "not" shows that it wasn't clear what the sentence is trying to say. Rather than trying to explain it here, I suggest ending the sentence at "size one dimension", followed by the next sentence, "This is a convenience feature." It's explained fully in Section 5.7 on scalar coordinate variables, so there's no need for a half-hearted explanation here. Cheers, Jonathan |
@marqh I added a |
@JonathanGregory I think that's a great idea. I'll take care of it. |
Hi @JimBiardCics , this looks good. I'm not sure I've understood the pull-request workflow here, but it looks as though there have been 9 updates recorded in |
@martinjuckes We'll do a merge. If there are colliding changes, we'll sort them out. |
Thank you to all contributors for the input. There have now been 26 days with no objections. I think this change proposal is suitable for acceptance. I suggest that @JimBiardCics and I ensure that PR #140 is finalised and suitable for merge; then we will confirm that there is no material change from the current commit, 8c7d66d, and implement the adoption of this change into the master version of the conventions. |
While this is not appropriate for this PR/issue, since the topic here is strings, perhaps we could also address the current support netCDF-4 has for actual unsigned int types in another issue? |
@DocOtak It sounds like a great candidate for another issue/PR! |
This change is the outcome of discussions at the 2018 netCDF/CF Workshop meeting in Reading, UK. The proposal is to change the language of Chapter 2 to include the string type as a valid variable type, and to modify the description of variables containing text to include variables with a type of string. The changes can be seen in pull requests #140 and cf-convention/Conformance#7
Placeholder Trac ticket at https://cf-trac.llnl.gov/trac/ticket/175#ticket