Replies: 2 comments
-
That's correct.
The ORB has a "native code set for wchar" (and one for char) and a list of "conversion code sets for wchar" - these are configurable. The peer ORBs need to have some conversion code set in common, and one of those becomes the "transmission code set", which is used on the wire. The application data sent through the ORB is in the "native code set" (this is probably a simplification). TAO's default native code set for wchar is UTF-16. It can be changed at compile time by defining TAO_DEFAULT_WCHAR_CODESET_ID.
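To make the negotiation described above concrete, here is a minimal sketch of how two peers might settle on a transmission code set. It is not TAO's implementation: the struct, the function name, and the order of the checks are assumptions for illustration, and the two code set ID constants are my understanding of the OSF registry values and should be verified before use.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed registry IDs (verify against the OSF code set registry).
constexpr std::uint32_t kUtf16 = 0x00010109;
constexpr std::uint32_t kUcs4  = 0x00010104;

// Hypothetical description of one ORB's wchar code set configuration.
struct WcharCodeSets {
    std::uint32_t native;                   // native code set for wchar
    std::vector<std::uint32_t> conversion;  // conversion code sets for wchar
};

static bool contains(const std::vector<std::uint32_t>& ids, std::uint32_t id) {
    return std::find(ids.begin(), ids.end(), id) != ids.end();
}

// Returns the chosen transmission code set, or 0 if the peers share none.
std::uint32_t pick_transmission_code_set(const WcharCodeSets& client,
                                         const WcharCodeSets& server) {
    // Same native code set on both sides: no conversion needed.
    if (client.native == server.native) return client.native;
    // One side can convert to or from the other's native code set.
    if (contains(server.conversion, client.native)) return client.native;
    if (contains(client.conversion, server.native)) return server.native;
    // Otherwise fall back to any conversion code set both sides list.
    for (std::uint32_t id : server.conversion) {
        if (contains(client.conversion, id)) return id;
    }
    return 0;  // no common code set
}
```

With a client whose native code set is UCS-4 and a server whose native code set is UTF-16, this sketch picks UTF-16 as long as the client lists it as a conversion code set.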
TAO doesn't provide a translator; instead, it provides the framework that an application can use to install its own translator.
-
See #2145 for a PR related to an issue we ran into when attempting to use the WCS4_UTF16.cpp file.
-
We are attempting to use wide strings (CORBA::WChar*) on Linux; however, we are finding that the encoding does not behave as expected for characters whose representations do not fit into a single UTF-16 code unit.
On Linux, wchar_t is a 32-bit type. When a wide string comes over the wire, including from other ORBs, the default behavior is that each element of the string holds the next 16-bit code unit of the UTF-16 representation of the string, rather than the next code point in the native UTF-32 layout. Is this intended?
For example, 🂡🂮🂭🂫🂪 is being marshaled as L"\xd83c\xdca1\xd83c\xdcae\xd83c\xdcad\xd83c\xdcab\xd83c\xdcaa"
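The values above are the UTF-16 surrogate pairs for those code points. The sketch below shows the standard surrogate arithmetic in both directions; the function names are hypothetical, input validation is omitted, and the decoder is only an application-level workaround for a wchar_t array that holds one UTF-16 code unit per element, not something TAO provides.

```cpp
#include <cstddef>
#include <string>

// Encode UTF-32 code points as UTF-16 code units (input is not validated).
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
    }
    return out;
}

// Decode a wide string whose elements each hold one UTF-16 code unit
// into UTF-32 code points.
std::u32string utf16_units_to_utf32(const std::wstring& units) {
    std::u32string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t u = static_cast<char32_t>(units[i]);
        const bool is_high = u >= 0xD800 && u <= 0xDBFF;
        if (is_high && i + 1 < units.size()) {
            const char32_t lo = static_cast<char32_t>(units[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
                ++i;  // consumed the low surrogate as well
                continue;
            }
        }
        out.push_back(u);  // BMP character, or an unpaired surrogate passed through
    }
    return out;
}
```

For U+1F0A1 (🂡), the encoder yields the two units 0xD83C and 0xDCA1, matching the first pair in the marshaled string, and the decoder turns that pair back into the single code point 0x1F0A1.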
A search of the source tree found references to -ORBNativeWcharCodeSet UCS-4 -ORBWcharCodesetTranslator WUCS4_UTF16_Factory in TAO/tests/CodeSets/simple/. With -ORBNativeWcharCodeSet UCS-4, the string is instead represented as an array of wchar_t twice the length of the UTF-32 representation, where each pair of elements holds the low 16 bits of the UTF-32 character followed by its high 16 bits.
For example, 🂡🂮🂭🂫🂪 is being marshaled as L"¡\001®\001¬\001«\001ª\001"
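The low-then-high layout described above can be illustrated with a small sketch. The helper names are hypothetical and this only models the observed element order; it does not claim to reproduce what TAO does internally.

```cpp
#include <cstdint>
#include <utility>

// Split a UTF-32 code point into (low 16 bits, high 16 bits),
// the element order described in the report.
std::pair<std::uint16_t, std::uint16_t> split_halves(char32_t cp) {
    return { static_cast<std::uint16_t>(cp & 0xFFFF),
             static_cast<std::uint16_t>(cp >> 16) };
}

// Recombine the two halves into the original code point.
char32_t join_halves(std::uint16_t low, std::uint16_t high) {
    return (static_cast<char32_t>(high) << 16) | low;
}
```

For U+1F0A1 the high half is 0x0001, which is consistent with the \001 element that follows each character in the output above.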
Character reference: https://en.wikipedia.org/wiki/Playing_cards_in_Unicode
The only references to WUCS4_UTF16_Factory seem to be in the tests tree. Is that class intended to be a supported feature? If so, does anything need to be built in a special way to enable it? Passing in the -ORBWcharCodesetTranslator WUCS4_UTF16_Factory argument seems to have no effect and there are no symbols by that name in any of the libraries, per nm -gC.
We are using ACE/TAO version 6.5.5. Have any changes or improvements been made in newer releases?