Add new function bytes_to_utf8_free_me #22823

khwilliamson · 2024-12-05T00:22:59Z

This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped.

This presents optimization possibilities.
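To make "the same in UTF-8 as it is in native format" concrete, here is a minimal sketch. It assumes an ASCII platform whose native encoding is Latin-1, so the only bytes that change under UTF-8 are those at or above 0x80. `count_variants` is a hypothetical stand-in for the `variant_under_utf8_count` call seen in the diff, not Perl's actual implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for variant_under_utf8_count(): counts the bytes in
 * [s, send) whose UTF-8 encoding differs from the native byte.
 * Assumption: ASCII platform with Latin-1 native encoding, so the variant
 * bytes are exactly those >= 0x80. */
static size_t count_variants(const unsigned char *s, const unsigned char *send)
{
    size_t count = 0;

    assert(s <= send);
    for (; s < send; s++) {
        if (*s >= 0x80) {
            count++;
        }
    }
    return count;
}
```

A count of zero means the input bytes are already their own UTF-8 encoding, which is the case where allocation can be skipped.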

Suggestions for a better name are welcome.

This set of changes requires a perldelta entry, which is yet to be furnished.

tonycoz · 2024-12-10T02:58:42Z

utf8.c

+
+    const U8 * const send = s + *lenp;
+    Size_t variant_count = variant_under_utf8_count(s, send);
+    if (free_me_ptr != NULL && variant_count == 0 && s[*lenp-1] == '\0') {


Here it looks like *lenp includes the terminating NUL, but below it doesn't.

Consider when *lenp starts as 1: this expects s[0] to be NUL, which doesn't match what I expect from this API.

That said, we would be assuming that s[*lenp] is valid. s+*lenp would always be valid as a one-past-the-end pointer, but such a pointer cannot be dereferenced.

So s[*lenp] being safe to read becomes a precondition when free_me_ptr isn't NULL.

khwilliamson · 2024-12-10

I'm trying to understand this comment. I don't see how I'm dereferencing s[*lenp]; I am dereferencing s[*lenp - 1].

The input is not required to be NUL-terminated, but the output is. So if *lenp is 1, it looks at s[0]. If that is NUL, the function returns s unchanged, as it is a NUL-terminated string whose representation doesn't change when encoded in UTF-8. If it isn't a NUL, the function allocates new memory that includes whatever byte is in s[0] and appends a NUL to it.

In re-reading the code, I see I failed to consider the possibility that *lenp is 0, and that I might be overallocating the new memory by 1 byte. I also did a bit more cleanup, so I now dereference *(send - 1) instead. And I think it is better to dereference a pointer once into a local variable than to dereference it multiple times.
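The zero-length case and the single dereference described above can be sketched as follows. This is only an illustration under assumed names: `last_byte_is_nul` is hypothetical and does not appear in the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: reports whether the last counted byte of s is NUL.
 * The len == 0 guard matters: without it, send - 1 would point before the
 * start of the buffer. */
static bool last_byte_is_nul(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    const unsigned char *send = s + len;  /* one past the end; never dereferenced */

    if (len == 0) {
        return false;
    }

    const unsigned char last = *(send - 1);  /* read once into a local */
    return last == '\0';
}
```

Only bytes inside the counted range are read, so no precondition on s[len] is needed.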

tonycoz · 2024-12-11

There are no tests or example code, so it's hard to tell how it's meant to be called.

Let's say I call:

    Size_t len = 10; /* does not count the NUL, is that typical/expected? */
    const U8 *free_me;
    const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
    ...
    Safefree(free_me);

As the code is written now, this will always allocate a new string, but if I call it with:

    Size_t len = 11; /* does count the NUL */
    const U8 *free_me;
    const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
    ...
    Safefree(free_me);

result will be a pointer to the string passed in, and free_me will be NULL.

Is including the NUL in the count passed in the intended way to call this function?

Note that if you do expect that, then when a string is allocated due to variants the resulting string will have double NUL termination, which is a bit unexpected.
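The double NUL can be reproduced with a simplified model of the conversion. The sketch below assumes Latin-1 input on an ASCII platform and uses plain `malloc`; `latin1_to_utf8` is a hypothetical name, not Perl's `bytes_to_utf8`, though like that function it always appends a NUL to the output.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified model of a bytes-to-UTF-8 conversion (assumption: Latin-1
 * input, ASCII platform).  Always allocates and always appends a NUL.
 * On return *lenp holds the output length, not counting the appended NUL.
 * The caller releases the result with free(). */
static unsigned char *latin1_to_utf8(const unsigned char *s, size_t *lenp)
{
    assert(s != NULL && lenp != NULL);

    /* Worst case: every byte expands to two, plus the appended NUL. */
    unsigned char *out = malloc(*lenp * 2 + 1);
    if (out == NULL) {
        return NULL;
    }

    unsigned char *d = out;
    for (size_t i = 0; i < *lenp; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];                                   /* invariant byte */
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));    /* lead byte */
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));  /* continuation */
        }
    }
    *d = '\0';
    *lenp = (size_t)(d - out);
    return out;
}
```

If the caller's length counts the input's own terminating NUL, that NUL is copied as data and a second one is appended after it, so the output ends in two NULs.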

khwilliamson · 2024-12-16

I think you reviewed this before refreshing with the latest version available at the time. I had already noticed the double NUL and fixed it.

I don't know what to do about the length disparity. If you include the NUL in the length in blead, you will get a double NUL. In order for the new form to know that there is a trailing NUL, it has to be able to examine that byte, and so the length has to include it. I added a paragraph to the pod that (hopefully) explains it.

khwilliamson · 2024-12-16

It has since occurred to me that it might be better not to guarantee a trailing NUL when there is no other reason to allocate new memory.
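A rough model of that relaxed contract is sketched below, assuming Latin-1 input on an ASCII platform and `malloc`/`free` in place of Perl's allocator. `to_utf8_or_same` is a hypothetical name, and unlike the function in the PR this sketch requires a non-NULL `free_me`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the relaxed contract, not Perl's implementation.
 * If no byte changes under UTF-8, returns s itself, sets *free_me to NULL,
 * and makes no promise about a trailing NUL.  Otherwise returns a new
 * NUL-terminated buffer, also stored in *free_me for the caller to free(). */
static const unsigned char *to_utf8_or_same(const unsigned char *s, size_t *lenp,
                                            unsigned char **free_me)
{
    size_t variants = 0;

    assert(s != NULL && lenp != NULL && free_me != NULL);
    for (size_t i = 0; i < *lenp; i++) {
        if (s[i] >= 0x80) {
            variants++;
        }
    }

    if (variants == 0) {
        *free_me = NULL;  /* nothing was allocated */
        return s;
    }

    /* Each variant byte grows by one byte; add one more for the NUL. */
    unsigned char *out = malloc(*lenp + variants + 1);
    if (out == NULL) {
        *free_me = NULL;
        return NULL;
    }

    unsigned char *d = out;
    for (size_t i = 0; i < *lenp; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    *d = '\0';
    *lenp = (size_t)(d - out);
    *free_me = out;
    return out;
}
```

In this model the caller can call `free(free_me)` unconditionally, since `free(NULL)` is a no-op, and the length passed in never needs to include a NUL.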

tonycoz reviewed Dec 10, 2024

khwilliamson force-pushed the bytes_to_utf8 branch 2 times, most recently from ce98a9c to 5d31895, December 16, 2024 03:50

tonycoz approved these changes Dec 17, 2024

khwilliamson added 2 commits December 18, 2024 14:06

Add new function bytes_to_utf8_free_me

528340c

This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped. This presents optimization possibilities.

Squash Change bytes_to_utf8_free_me to not look for trailing NUL

4a3e5ea

I think this may be a better option.

khwilliamson force-pushed the bytes_to_utf8 branch from 5d31895 to 4a3e5ea, December 18, 2024 23:00


tonycoz Dec 10, 2024

khwilliamson Dec 10, 2024

tonycoz Dec 11, 2024

khwilliamson Dec 16, 2024

khwilliamson Dec 16, 2024
