perldelta for utf8_to_uv() family
khwilliamson committed Dec 2, 2024
1 parent ae865e7 commit cffb5af
Showing 2 changed files with 39 additions and 3 deletions.
36 changes: 36 additions & 0 deletions pod/perldelta.pod
@@ -406,6 +406,42 @@ well.

=item *

New API functions are introduced to convert strings encoded in UTF-8 to
their ordinal code point equivalent. These are safe to use by default,
and generally more convenient to use than the existing ones.

L<perlapi/C<utf8_to_uv>> replaces L<perlapi/C<utf8_to_uvchr>> (which is
retained for backwards compatibility), but you should convert to use the
new form, as likely you aren't using the old one safely.

There are also two new functions, L<perlapi/C<strict_utf8_to_uv>> and
L<perlapi/C<c9strict_utf8_to_uv>> which do the same thing except when
the input string represents a code point that Unicode doesn't accept as
legal for interchange, using either the strict original definition
(C<strict_utf8_to_uv>), or the looser one given by
L<Unicode Corrigendum #9|https://www.unicode.org/versions/corrigendum9.html>
(C<c9strict_utf8_to_uv>). When the input string represents one of the
restricted code points, these functions return the Unicode
C<REPLACEMENT CHARACTER> instead.

Also L<perlapi/C<extended_utf8_to_uv>> is a synonym for C<utf8_to_uv>, for use
when you want to emphasize that the entire range of Perl extended UTF-8
is acceptable.

There are also replacement functions for the three more specialized
conversion functions that you are unlikely to need to use. Again, the
old forms are kept for backwards compatibility, but you should convert
to use the new forms.

L<perlapi/C<utf8_to_uv_flags>> replaces L<perlapi/C<utf8n_to_uvchr>>.

L<perlapi/C<utf8_to_uv_errors>> replaces L<perlapi/C<utf8n_to_uvchr_error>>.

L<perlapi/C<utf8_to_uv_msgs>> replaces
L<perlapi/C<utf8n_to_uvchr_msgs>>.

=item *

Three new API functions are introduced to convert strings encoded in
UTF-8 to native bytes format (if possible). These are easier to use
than the existing ones, and they avoid unnecessary memory allocations.
6 changes: 3 additions & 3 deletions utf8.c
@@ -1065,20 +1065,20 @@ syntactically invalid UTF-8.
=over 4
-=item C<strict_utf8_to_uv>
+=item * C<strict_utf8_to_uv>
additionally rejects any UTF-8 that translates into a code point that isn't
specified by Unicode to be freely exchangeable, namely the surrogate characters
and non-character code points (besides non-Unicode code points, any above
0x10FFFF). It does not raise a warning when rejecting.
-=item C<c9strict_utf8_to_uv>
+=item * C<c9strict_utf8_to_uv>
instead uses the exchangeable definition given by Unicode's Corrigendum #9,
which accepts non-character code points while still rejecting surrogates. It
does not raise a warning when rejecting.
-=item C<extended_utf8_to_uv>
+=item * C<extended_utf8_to_uv>
accepts all syntactically valid UTF-8, as extended by Perl to allow 64-bit code
points to be encoded.
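
For readers comparing the three variants documented in this hunk, the difference can be sketched as a policy applied to an already-decoded code point. This is only an illustration of the acceptance rules described above and in the perldelta entry. It does not call the Perl API, and every name in it (`cp_policy`, `apply_policy`, the `is_*` helpers) is hypothetical rather than taken from the Perl source.

```c
#include <stdbool.h>
#include <stdint.h>

/* U+FFFD, which the strict variants are described as returning on rejection. */
#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical policy names mirroring the three documented variants. */
typedef enum {
    POLICY_EXTENDED,   /* like extended_utf8_to_uv: accept everything   */
    POLICY_C9_STRICT,  /* like c9strict_utf8_to_uv: Corrigendum #9 rule */
    POLICY_STRICT      /* like strict_utf8_to_uv: original strict rule  */
} cp_policy;

/* Surrogates occupy U+D800..U+DFFF. */
static bool is_surrogate(uint64_t cp) {
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Anything above U+10FFFF is outside Unicode (Perl-extended range). */
static bool is_above_unicode(uint64_t cp) {
    return cp > 0x10FFFF;
}

/* Non-characters: U+FDD0..U+FDEF, plus the last two code points of
 * each plane (U+xxFFFE and U+xxFFFF), all within the Unicode range. */
static bool is_noncharacter(uint64_t cp) {
    if (cp > 0x10FFFF)
        return false;
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Return cp unchanged if the policy accepts it, else U+FFFD. */
static uint64_t apply_policy(uint64_t cp, cp_policy policy) {
    bool reject = false;

    /* Both strict flavors reject surrogates and above-Unicode values. */
    if (policy != POLICY_EXTENDED)
        reject = is_surrogate(cp) || is_above_unicode(cp);

    /* Only the original strict definition also rejects non-characters. */
    if (policy == POLICY_STRICT)
        reject = reject || is_noncharacter(cp);

    return reject ? REPLACEMENT_CHARACTER : cp;
}
```

Under this sketch, U+FFFE is replaced only under `POLICY_STRICT`, U+D800 is replaced under both strict policies, and 0x110000 passes only under `POLICY_EXTENDED`. The real functions decode UTF-8 bytes and report success or failure as well, which this sketch leaves out.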