From dae91fec6927dcc8a546bced441ae8975008b68e Mon Sep 17 00:00:00 2001
From: Jonathan Gregory <j.m.gregory@reading.ac.uk>
Date: Tue, 22 Oct 2024 16:19:17 +0100
Subject: [PATCH 1/7] update following review

---
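
Reviewer note (not part of the patch): the month-name example in ch02.adoc can be sketched in Python as below. This is only an illustration under stated assumptions: `to_char_rows` and `from_char_row` are hypothetical helper names, plain `bytes` objects stand in for the rows of a (12,9) **`char`** variable, and NULL padding is chosen although the text equally allows spaces. Note that the row width is counted in UTF-8 bytes, which equals the character count only for ASCII text.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def to_char_rows(strings, pad=b"\x00"):
    """Encode each string as UTF-8 and pad all rows to a common width.

    Hypothetical helper: returns (rows, width), where width is the size
    of the most rapidly varying dimension, counted in bytes.
    """
    encoded = [s.encode("utf-8") for s in strings]
    width = max(len(b) for b in encoded)
    return [b.ljust(width, pad) for b in encoded], width


def from_char_row(row):
    """Strip trailing NULL or space padding and decode the row as UTF-8."""
    return row.rstrip(b"\x00 ").decode("utf-8")


rows, width = to_char_rows(MONTHS)
print(len(rows), width)        # shape of the char variable
print(from_char_row(rows[4]))  # the padded "May" row, decoded
```

A **`string`** variable holding the same data would simply keep the 12 unpadded values, one per element.
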
 ch02.adoc    | 19 +++++++++++++------
 history.adoc |  1 +
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/ch02.adoc b/ch02.adoc
index a0e569bf..548ed985 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -12,18 +12,25 @@ NetCDF files should have the file name extension "**`.nc`**".
 
 // TODO: Check, should this be a bullet list?
 Data variables must be one of the following data types: **`string`**, **`char`**, **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, **`unsigned int64`**, **`float`** or **`real`**, and **`double`** (which are all the link:$$https://docs.unidata.ucar.edu/nug/current/md_types.html$$[netCDF external data types] supported by netCDF-4).
-The **`string`** type is only available in files using the netCDF version 4 (netCDF-4) format.
+The **`string`** type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.
 The **`char`** and **`string`** types are not intended for numeric data.
 One byte numeric data should be stored using the **`byte`** or **`unsigned byte`** data types.
 It is possible to treat the **`byte`** and **`short`** types as unsigned by using the NUG convention of indicating the unsigned range using the **`valid_min`**, **`valid_max`**, or **`valid_range`** attributes.
 In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
-Strings in variables may be represented one of two ways - as atomic strings or as character arrays.
-An n-dimensional array of strings may be implemented as a variable of type **`string`** with _n_ dimensions, or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
-For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
-The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
-If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
+Text strings must be represented in Unicode. Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].
+A Unicode text string may be stored in netCDF either in a variable-length netCDF **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
+Note that the ASCII character one-byte characters are a subset of Unicode, and their UTF-8 encodings are the same as their ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
+
+Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
+If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
+If the data-user has no information about the encoding, we suggest UTF-8 as a first guess.
+
+An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
+For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
+The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled.
+A **`string`** variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length.
 The CDL example below shows one variable of each type.
 
 [[char-and-string-variables-ex]]
diff --git a/history.adoc b/history.adoc
index d7da1bf1..bfc2084f 100644
--- a/history.adoc
+++ b/history.adoc
@@ -7,6 +7,7 @@
 
 === Working version (most recent first)
 
+* {issues}141[Issue #141]: Clarification that text-valued variables and attributes can be Unicode vlen strings or UTF-8 char arrays.
 * {issues}367{Issue #367}: Remove the AMIP and GRIB columns from the standard name table format defined by Appendix B.
 * {issues}403[Issue #403]: Metadata to encode quantization properties
 * {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)".

From 62c92a361fb03e9a07091eb4c9070acba90d20d1 Mon Sep 17 00:00:00 2001
From: Jonathan Gregory <j.m.gregory@reading.ac.uk>
Date: Tue, 22 Oct 2024 16:29:15 +0100
Subject: [PATCH 2/7] add conformance

---
 ch02.adoc        | 2 +-
 conformance.adoc | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/ch02.adoc b/ch02.adoc
index 548ed985..09a656ac 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -20,7 +20,7 @@ In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
 Text strings must be represented in Unicode. Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].
-A Unicode text string may be stored in netCDF either in a variable-length netCDF **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
+A Unicode text string can be stored in netCDF either in a variable-length netCDF **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
 Note that the ASCII character one-byte characters are a subset of Unicode, and their UTF-8 encodings are the same as their ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
 
 Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
diff --git a/conformance.adoc b/conformance.adoc
index 4e490ab9..bfadb47b 100644
--- a/conformance.adoc
+++ b/conformance.adoc
@@ -27,7 +27,11 @@ See https://github.com/ugrid-conventions/ugrid-conventions for the UGRID conform
 
 *Requirements:*
 
-* CF attributes that take string values must be 1D character arrays or single atomic strings.
+* Any text stored in a CF attribute or variable must be represented in Unicode.
+
+* Unicode data stored as a `char` array must be encoded in UTF-8.
+
+* If a text-valued attribute is stored in a variable-length `string`, it must have a scalar value.
 
 [[section-1]]
 

From 995de3d609f0be7b0f113ebd7a659683005289bf Mon Sep 17 00:00:00 2001
From: Jonathan Gregory <j.m.gregory@reading.ac.uk>
Date: Tue, 22 Oct 2024 16:36:53 +0100
Subject: [PATCH 3/7] simplify wording of string and ASCII sentences

---
 ch02.adoc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ch02.adoc b/ch02.adoc
index 09a656ac..0e0e5d8e 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -20,8 +20,8 @@ In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
 Text strings must be represented in Unicode. Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].
-A Unicode text string can be stored in netCDF either in a variable-length netCDF **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
-Note that the ASCII character one-byte characters are a subset of Unicode, and their UTF-8 encodings are the same as their ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
+A Unicode text string can be stored in netCDF either in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
+Note that the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
 
 Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
 If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.

From f7eff54eb151b3f0e5d7565cbdbbde0c5a9058cd Mon Sep 17 00:00:00 2001
From: Jonathan Gregory <j.m.gregory@reading.ac.uk>
Date: Wed, 23 Oct 2024 18:20:12 +0100
Subject: [PATCH 4/7] require UTF-8 for both string and char text

---
 ch02.adoc | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/ch02.adoc b/ch02.adoc
index 0e0e5d8e..08415df5 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -19,13 +19,13 @@ It is possible to treat the **`byte`** and **`short`** types as unsigned by usin
 In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
-Text strings must be represented in Unicode. Any composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].
-A Unicode text string can be stored in netCDF either in a variable-length **`string`** or encoded as UTF-8 in a fixed-length **`char`** array.
-Note that the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
+A text string can be stored either in a variable-length **`string`** or in a fixed-length **`char`** array.
+In both cases, text strings must be represented in Unicode and encoded according to UTF-8.
+A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
+Any Unicode composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].
 
-Before version 1.12, CF did not require text in **`char`** arrays to be encoded with UTF-8, and did not provide or endorse any convention to record what encoding was used.
-If the array is stored in a variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
-If the data-user has no information about the encoding, we suggest UTF-8 as a first guess.
+Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used.
+However, if the text string is stored in a **`char`** variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
 
 An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
 For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.

From d5e4bde4e1816aef7b538fff6410ea1f47f49e28 Mon Sep 17 00:00:00 2001
From: JonathanGregory <j.m.gregory@reading.ac.uk>
Date: Fri, 25 Oct 2024 22:09:54 +0100
Subject: [PATCH 5/7] require Unicode Normalization Form C

---
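
Reviewer note (not part of the patch): a minimal sketch of how a writer or checker might test the NFC and UTF-8 requirement, using only the Python standard library. The function names `is_nfc_utf8` and `to_nfc_utf8` are hypothetical and do not come from CF or any netCDF library.

```python
import unicodedata


def is_nfc_utf8(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 whose decoded text is already NFC."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return unicodedata.normalize("NFC", text) == text


def to_nfc_utf8(text: str) -> bytes:
    """Normalize text to NFC, then encode it as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


decomposed = "e\u0301"  # "e" followed by a combining acute accent (not NFC)
print(is_nfc_utf8(decomposed.encode("utf-8")))  # decomposed form fails the check
print(is_nfc_utf8(to_nfc_utf8(decomposed)))     # normalized form passes
print(is_nfc_utf8(b"September"))                # pure ASCII passes
```

Pure ASCII input passes unchanged, which is the guarantee the ASCII sentence in ch02.adoc relies on.
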
 ch02.adoc        | 3 +--
 conformance.adoc | 4 +---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/ch02.adoc b/ch02.adoc
index 08415df5..e5e47632 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -20,9 +20,8 @@ In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
 A text string can be stored either in a variable-length **`string`** or in a fixed-length **`char`** array.
-In both cases, text strings must be represented in Unicode and encoded according to UTF-8.
+In both cases, text strings must be represented in Unicode Normalization Form C (NFC, link:$$https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf$$[section 3.11] and link:$$https://unicode.org/reports/tr15$$[Annex 15] of the Unicode standard) and encoded according to UTF-8.
 A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
-Any Unicode composite characters must be link:$$https://unicode.org/reports/tr15$$[NFC-normalized].
 
 Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used.
 However, if the text string is stored in a **`char`** variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
diff --git a/conformance.adoc b/conformance.adoc
index bfadb47b..0c064405 100644
--- a/conformance.adoc
+++ b/conformance.adoc
@@ -27,9 +27,7 @@ See https://github.com/ugrid-conventions/ugrid-conventions for the UGRID conform
 
 *Requirements:*
 
-* Any text stored in a CF attribute or variable must be represented in Unicode.
-
-* Unicode data stored as a `char` array must be encoded in UTF-8.
+* Any text stored in a CF attribute or variable must be represented in Unicode Normalization Form C and encoded in UTF-8.
 
 * If a text-valued attribute is stored in a variable-length `string`, it must have a scalar value.
 

From 6866cfd387b23c0bba8a0df152e55f969b36ce0f Mon Sep 17 00:00:00 2001
From: JonathanGregory <j.m.gregory@reading.ac.uk>
Date: Tue, 29 Oct 2024 14:03:16 +0000
Subject: [PATCH 6/7] updates following review

---
 ch02.adoc        | 2 +-
 conformance.adoc | 2 +-
 history.adoc     | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/ch02.adoc b/ch02.adoc
index e5e47632..7f9d83cb 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -21,7 +21,7 @@ When the phrase "integer type" is used in this document, it should be understood
 
 A text string can be stored either in a variable-length **`string`** or in a fixed-length **`char`** array.
 In both cases, text strings must be represented in Unicode Normalization Form C (NFC, link:$$https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf$$[section 3.11] and link:$$https://unicode.org/reports/tr15$$[Annex 15] of the Unicode standard) and encoded according to UTF-8.
-A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
+A text string consisting only of ASCII characters is guaranteed to conform to this requirement, because the ASCII characters are a subset of Unicode, are unchanged by NFC normalization, and have UTF-8 encodings identical to their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
 
 Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used.
 However, if the text string is stored in a **`char`** variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
diff --git a/conformance.adoc b/conformance.adoc
index 0c064405..b1f28878 100644
--- a/conformance.adoc
+++ b/conformance.adoc
@@ -29,7 +29,7 @@ See https://github.com/ugrid-conventions/ugrid-conventions for the UGRID conform
 
 * Any text stored in a CF attribute or variable must be represented in Unicode Normalization Form C and encoded in UTF-8.
 
-* If a text-valued attribute is stored in a variable-length `string`, it must have a scalar value.
+* Any attribute of variable-length string type must be a scalar (not an array).
 
 [[section-1]]
 
diff --git a/history.adoc b/history.adoc
index bfc2084f..2609ffb8 100644
--- a/history.adoc
+++ b/history.adoc
@@ -7,7 +7,7 @@
 
 === Working version (most recent first)
 
-* {issues}141[Issue #141]: Clarification that text-valued variables and attributes can be Unicode vlen strings or UTF-8 char arrays.
+* {issues}141[Issue #141]: Clarification that text may be stored in variables and attributes as either vlen strings or char arrays, and must be represented in Unicode Normalization Form C and encoded according to UTF-8.
 * {issues}367{Issue #367}: Remove the AMIP and GRIB columns from the standard name table format defined by Appendix B.
 * {issues}403[Issue #403]: Metadata to encode quantization properties
 * {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)".

From 6fbe865b62d7247a41b21afc9eccb3f608f537de Mon Sep 17 00:00:00 2001
From: JonathanGregory <j.m.gregory@reading.ac.uk>
Date: Tue, 29 Oct 2024 14:04:43 +0000
Subject: [PATCH 7/7] typo

---
 history.adoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/history.adoc b/history.adoc
index 2609ffb8..f1812116 100644
--- a/history.adoc
+++ b/history.adoc
@@ -8,7 +8,7 @@
 === Working version (most recent first)
 
 * {issues}141[Issue #141]: Clarification that text may be stored in variables and attributes as either vlen strings or char arrays, and must be represented in Unicode Normalization Form C and encoded according to UTF-8.
-* {issues}367{Issue #367}: Remove the AMIP and GRIB columns from the standard name table format defined by Appendix B.
+* {issues}367[Issue #367]: Remove the AMIP and GRIB columns from the standard name table format defined by Appendix B.
 * {issues}403[Issue #403]: Metadata to encode quantization properties
 * {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)".
 * {issues}163[Issue #163]: Provide a convention for boundary variables for grids whose cells do not all have the same number of sides.