Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new function bytes_to_utf8_free_me #22823

Open
wants to merge 2 commits into
base: blead
Choose a base branch
from

Conversation

khwilliamson
Copy link
Contributor

@khwilliamson khwilliamson commented Dec 5, 2024

This is like bytes_to_utf8, but if the representation of the input string is the same in UTF-8 as it is in native format, the allocation of new memory is skipped.

This presents optimization possibilities.

Suggestions for a better name are welcome

  • This set of changes requires a perldelta entry, to be furnished

utf8.c Outdated

const U8 * const send = s + *lenp;
Size_t variant_count = variant_under_utf8_count(s, send);
if (free_me_ptr != NULL && variant_count == 0 && s[*lenp-1] == '\0') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it looks like *lenp includes the terminating NUL, but below it doesn't.

Consider when *lenp starts as 1, this expects s[0] to be NUL which doesn't match what I expect from this API.

That said, we would be assuming that s[*lenp] is valid, s+*lenp would always be valid as a one-past-the-end pointer, but such a pointer cannot be dereferenced.

So that s[*lenp] is safe becomes a pre-condition when free_me_ptr isn't NULL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to understand this comment. I don't see how I'm dereferencing s[*lenp]. I am dereferencing s[*lenp - 1].

The input is not required to be NUlL terminated, but the output is. So if *lenp is 1, it looks at s[0]. If that is NUL, the function returns s unchanged, as it is a NUL-terminated string whose representation doesn't change when encoded in UTF-8. If it isn't a NUL, the function allocates new memory that includes whatever byte is in s[0] and appends a NUL to it.

In re-reading the code, I see I failed to consider the possibility that *lenp is 0, and that I might be overallocating the new memory by 1 byte. And I did a bit more clean up, so I dereference instead *(send -1). And I think it is better to dereference a pointer once into a local variable, rather than to dereference it multiple times

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no tests or example code so it's hard to tell how it's meant to be called.

Let's say I call:

Size_t len = 10; /* does not count the NUL, is that typical/expected? */
const U8 *free_me;
const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
...
Safefree(free_me);

As the code is written now, this will always allocate a new string, but if I call it with:

Size_t len = 11; /* does count the NUL */
const U8 *free_me;
const U8 *result = bytes_to_utf8_free_me("0123456789", &len, &free_me);
...
Safefree(free_me);

result will be a pointer to the string passed in, and free_me will be NULL.

Is including the NUL in the count passed in the intended way to call this function?

Note that if you do expect that, then when a string is allocated due to variants the resulting string will have double NUL termination, which is a bit unexpected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you reviewed this before refreshing with the latest version available at the time. I had already noticed the double NUL and fixed it.

I don't know what to do about the length disparity. If you include the NUL in the length in blead, you will get a double NUL. In order for the new form to know that there is a trailing NUL, it has to be able examine that byte, and so the length has to include it. I added a paragraph to the pod explaining it. (hopefully)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has since occurred to me that it might be better to just not make the guarantee of a trailing NUL if there is no other reason to allocate new memory.

@khwilliamson khwilliamson force-pushed the bytes_to_utf8 branch 2 times, most recently from ce98a9c to 5d31895 Compare December 16, 2024 03:50
This is like bytes_to_utf8, but if the representation of the input
string is the same in UTF-8 as it is in native format, the allocation of
new memory is skipped.

This presents optimization possibilities.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants