Add utf8_to_uv_or_die()
This new function dies if its UTF-8 input is malformed.  There are
quite a few places in the core that expect the input to be well-formed
and simply assume that it is.  This function is a drop-in replacement
for those, so we no longer blindly continue if the assumption is wrong.
There are also a bunch of places that don't make that assumption, but
check the input and die immediately if it is malformed.  This function
replaces those too, along with the code needed to test the return and
die.
khwilliamson committed Dec 16, 2024
1 parent c60ca96 commit 15d8147
Showing 6 changed files with 54 additions and 12 deletions.
4 changes: 4 additions & 0 deletions embed.fnc
@@ -3780,6 +3780,10 @@ CTp |bool |utf8_to_uv_msgs_helper_ \
|U32 flags \
|NULLOK U32 *errors \
|NULLOK AV **msgs
+ATdip |UV |utf8_to_uv_or_die \
+|NN const U8 * const s \
+|NN const U8 *e \
+|NULLOK Size_t *advance_p
CDbdp |UV |utf8_to_uvuni |NN const U8 *s \
|NULLOK STRLEN *retlen
: Used in perly.y
1 change: 1 addition & 0 deletions embed.h
@@ -870,6 +870,7 @@
# define utf8_to_uv_flags Perl_utf8_to_uv_flags
# define utf8_to_uv_msgs Perl_utf8_to_uv_msgs
# define utf8_to_uv_msgs_helper_ Perl_utf8_to_uv_msgs_helper_
+# define utf8_to_uv_or_die Perl_utf8_to_uv_or_die
# define utf8n_to_uvchr Perl_utf8n_to_uvchr
# define utf8n_to_uvchr_error Perl_utf8n_to_uvchr_error
# define utf8n_to_uvchr_msgs Perl_utf8n_to_uvchr_msgs
10 changes: 10 additions & 0 deletions inline.h
@@ -3138,6 +3138,16 @@ Perl_utf8_to_uv_msgs(const U8 * const s0,
return utf8_to_uv_msgs_helper_(s0, e, cp_p, advance_p, flags, errors, msgs);
}

+PERL_STATIC_INLINE UV
+Perl_utf8_to_uv_or_die(const U8 *s, const U8 *e, STRLEN *advance_p)
+{
+    PERL_ARGS_ASSERT_UTF8_TO_UV_OR_DIE;
+
+    UV cp;
+    (void) utf8_to_uv_flags(s, e, &cp, advance_p, UTF8_DIE_IF_MALFORMED);
+    return cp;
+}

PERL_STATIC_INLINE UV
Perl_utf8n_to_uvchr_msgs(const U8 * const s0,
STRLEN curlen,
7 changes: 4 additions & 3 deletions pod/perldelta.pod
@@ -436,9 +436,10 @@ New API functions are introduced to convert strings encoded in UTF-8 to
their ordinal code point equivalent. These are safe to use by default,
and generally more convenient to use than the existing ones.

-L<perlapi/C<utf8_to_uv>> replaces L<perlapi/C<utf8_to_uvchr>> (which is
-retained for backwards compatibility), but you should convert to use the
-new form, as likely you aren't using the old one safely.
+L<perlapi/C<utf8_to_uv>> and L<perlapi/C<utf8_to_uv_or_die>> replace
+L<perlapi/C<utf8_to_uvchr>> (which is retained for backwards
+compatibility), but you should convert to use the new forms, as likely
+you aren't using the old one safely.

To convert in the opposite direction, you can now use
L<perlapi/C<uv_to_utf8>>. This is not a new function, but a new synonym
5 changes: 5 additions & 0 deletions proto.h

(Generated file; its diff is not rendered.)

39 changes: 30 additions & 9 deletions utf8.c
@@ -1003,6 +1003,7 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
=for apidoc_item extended_utf8_to_uv
=for apidoc_item strict_utf8_to_uv
=for apidoc_item c9strict_utf8_to_uv
+=for apidoc_item utf8_to_uv_or_die
=for apidoc_item utf8_to_uvchr_buf
=for apidoc_item utf8_to_uvchr
@@ -1099,6 +1100,11 @@ sequence. You can use that function or C<L</utf8_to_uv_flags>> to exert more
control over the input that is considered acceptable, and the warnings that are
raised.
+C<utf8_to_uv_or_die> has a simpler interface, for use when any errors are
+fatal. It returns the code point instead of using an output parameter, and
+throws an exception with any errors found where the other functions here would
+have returned false.
Often, C<s> is an arbitrarily long string containing the UTF-8 representations
of many code points in a row, and these functions are called in the course of
parsing C<s> to find all those code points.
@@ -1107,8 +1113,8 @@ If your code doesn't know how to deal with illegal input, as would be typical
of a low level routine, the loop could look like:
    while (s < e) {
-       UV cp;
        Size_t advance;
+       UV cp;
        (void) utf8_to_uv(s, e, &cp, &advance);
        <handle 'cp'>
        s += advance;
@@ -1118,11 +1124,24 @@ A REPLACEMENT CHARACTER will be inserted everywhere that malformed input
occurs. Obviously, we aren't expecting such outcomes, but your code will be
protected from attacks and many harmful effects that could otherwise occur.
+If the situation is such that it would be a bug for the input to be invalid, a
+somewhat simpler loop suffices:
+    while (s < e) {
+        Size_t advance;
+        UV cp = utf8_to_uv_or_die(s, e, &advance);
+        <handle 'cp'>
+        s += advance;
+    }
+This will throw an exception on invalid input, so your code doesn't have to
+concern itself with that possibility.
If you do have a plan for handling malformed input, you could instead write:
    while (s < e) {
-       UV cp;
        Size_t advance;
+       UV cp;
        if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance))) {
            <bail out or convert to handleable>
@@ -1142,9 +1161,10 @@ attacks against such code; and it is extra work always, as the functions have
already done the equivalent work and return the correct value in C<advance>,
regardless of whether the input is well-formed or not.
-You must always pass a non-NULL pointer into which to store the (first) code
-point C<s> represents. If you don't care about this value, you should be using
-one of the C<L</isUTF8_CHAR>> functions instead.
+Except with C<utf8_to_uv_or_die>, you must always pass a non-NULL pointer into
+which to store the (first) code point C<s> represents. If you don't care about
+this value, you should be using one of the C<L</isUTF8_CHAR>> functions
+instead.
=item C<utf8_to_uvchr> forms
@@ -1274,8 +1294,8 @@ This flag is ignored if C<UTF8_CHECK_ONLY> is also set.
=item C<UTF8_WARN_SURROGATE>
These reject and/or warn about UTF-8 sequences that represent surrogate
-characters. The warning categories C<utf8> and C<super> control if warnings
-are actually raised.
+characters. The warning categories C<utf8> and C<non_unicode> control if
+warnings are actually raised.
=item C<UTF8_DISALLOW_NONCHAR>
@@ -1290,7 +1310,7 @@ are actually raised.
=item C<UTF8_WARN_SUPER>
These reject and/or warn about UTF-8 sequences that represent code points
-above 0x10FFFF. The warning categories C<utf8> and C<super> control if
+above 0x10FFFF. The warning categories C<utf8> and C<non_unicode> control if
warnings are actually raised.
=item C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>
@@ -1324,7 +1344,8 @@ These reject and/or warn on encountering sequences that require Perl's
extension to UTF-8 to represent them. These are all for code points above
0x10FFFF, so these sequences are a subset of the ones controlled by SUPER or
either of the illegal interchange sets of flags. The warning categories
-C<utf8>, C<super>, and C<portable> control if warnings are actually raised.
+C<utf8>, C<non_unicode>, and C<portable> control if warnings are actually
+raised.
Perl predates Unicode, and earlier standards allowed for code points up through
0x7FFF_FFFF (2**31 - 1). Perl, of course, would like you to be able to
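The relationship between the two calling styles documented in this commit can be tried outside the Perl core with a small standalone sketch. Everything named `demo_*` below is hypothetical and is not Perl's implementation: `demo_utf8_to_uv` is a simplified, strict stand-in for a status-returning decoder (its acceptance rules may differ from what Perl's own functions accept), and `demo_utf8_to_uv_or_die` wraps it in the same way the new inline function wraps `utf8_to_uv_flags`, except that it calls `abort()` where Perl would throw an exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_REPLACEMENT 0xFFFDu

/* Hypothetical stand-in for a status-returning decoder.  Decodes one code
 * point from [s, e).  On success returns true.  On malformed input returns
 * false and stores U+FFFD in *cp.  *advance is always set to at least 1, so
 * a calling loop makes progress either way. */
static bool
demo_utf8_to_uv(const uint8_t *s, const uint8_t *e,
                uint32_t *cp, size_t *advance)
{
    assert(s < e);                      /* caller must supply >= 1 byte */

    size_t avail = (size_t) (e - s);
    uint8_t b0 = s[0];
    size_t need;                        /* total bytes in the sequence */
    uint32_t min;                       /* smallest non-overlong value */
    uint32_t v;

    if (b0 < 0x80) { *cp = b0; *advance = 1; return true; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; min = 0x80;    v = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; min = 0x800;   v = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; min = 0x10000; v = b0 & 0x07; }
    else { *cp = DEMO_REPLACEMENT; *advance = 1; return false; }

    for (size_t i = 1; i < need; i++) {
        /* Truncated input or a non-continuation byte ends the sequence. */
        if (i >= avail || (s[i] & 0xC0) != 0x80) {
            *cp = DEMO_REPLACEMENT; *advance = i; return false;
        }
        v = (v << 6) | (uint32_t) (s[i] & 0x3F);
    }

    /* This sketch rejects overlongs, surrogates and values above 0x10FFFF. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
        *cp = DEMO_REPLACEMENT; *advance = need; return false;
    }

    *cp = v; *advance = need; return true;
}

/* Hypothetical analogue of the "or die" form: returns the code point
 * directly, accepts a NULL advance pointer, and never returns on malformed
 * input (abort() stands in for Perl's exception). */
static uint32_t
demo_utf8_to_uv_or_die(const uint8_t *s, const uint8_t *e, size_t *advance)
{
    uint32_t cp;
    size_t adv;

    if (! demo_utf8_to_uv(s, e, &cp, &adv)) {
        fprintf(stderr, "Malformed UTF-8 input\n");
        abort();
    }
    if (advance) *advance = adv;
    return cp;
}

/* The simpler loop: no error branch is needed in the caller. */
static size_t
demo_count_code_points(const uint8_t *s, const uint8_t *e)
{
    size_t count = 0;

    while (s < e) {
        size_t advance;
        (void) demo_utf8_to_uv_or_die(s, e, &advance);
        count++;
        s += advance;
    }
    return count;
}
```

Calling `demo_count_code_points(buf, buf + len)` on a well-formed buffer returns the number of code points it contains. On a malformed buffer the process aborts instead of returning, which is the trade-off this sketch is meant to show: the caller's loop loses its error branch, and in exchange it gives up any chance to recover.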
