Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] StringHelper: check if string as utf8 chars before doing utf8 conversion? #19

Open
andrepereiradasilva opened this issue Jun 26, 2017 · 19 comments

Comments

@andrepereiradasilva
Copy link

andrepereiradasilva commented Jun 26, 2017

Steps to reproduce the issue

Was testing the performance of StringHelper calls.
For what i understand they are mostly a way to make native php functions utf8 compatible.

But this makes some performance overhead since even pure ascii strings will try to do the utf8 conversion.

So, made some tests with 1000 iterations in 3 scenarios:

  • Current: using StringHelper method
  • Utf8CheckBefore: just use utf8_decode($str) === $str (or even StringHelper::is_ascii($str) === true) to decide if it should use the StringHelper method or the php native method
  • Native: PHP native method (with utf8 issues)

Used to strings to test:

  • ASCII String: "Without utf-8 chars"
  • UTF-8 String: "With utf-8 chars 纹身馆简介你好"

Results on some methods

strpos tests

> ASCII String: [Current: 1.308 ms | Utf8CheckBefore:  0.740 ms | Native: 0.140 ms]
> UTF-8 String: [Current: 1.313 ms | Utf8CheckBefore:  2.268 ms | Native: 0.132 ms]

strtoupper tests

> ASCII String [Current: 6.142 ms | Utf8CheckBefore:  0.339 ms | Native: 0.080 ms]
> UTF-8 String: [Current: 6.977 ms | Utf8CheckBefore:  7.769 ms | Native: 0.103 ms]

strrev tests

> ASCII String [Current: 3.864 ms | Utf8CheckBefore:  0.378 ms  | Native: 0.095 ms]
> UTF-8 String: [Current: 4.524 ms | Utf8CheckBefore:  5.275 ms  | Native: 0.111 ms]

Remarks

It's clear that the Utf8CheckBefore method is much faster in plain ascii strings. And a little slower in strings with utf-8 chars.
Since many of the StringHelper usage across the cms (and overall) should be plain ascii shouldn't be better to check if the strign as utf-8 chars for deciding if it should use utf8 methods or not?

System information (as much as possible)

String 2.0
PHP 7.0.19 with mbstring extension (without is even slower)

Additional comments

@frankmayer would aprreciate your contribution here too

@mbabker
Copy link
Contributor

mbabker commented Jun 26, 2017

I can't say I've benchmarked any of Joomla's code in quite some time, so in places where we can make improvements, I'm all for it.

One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it. There's a part of me that wants to sit down and look at maintained alternatives to replace it, or one day bite the bullet and just do a hard fork (including pulling the tests out of http://phputf8.cvs.sourceforge.net/viewvc/phputf8/utf8/ for the library code).

@frankmayer
Copy link

@andrepereiradasilva I had done some of those performance tests for the current CMS and came to about the same conclusions as you. This is a good step forward in performance and if I am not mistaken, I removed some stringhelper calls where it was sure that no utf-8 was involved.

Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring.

It is simple. we would gain a lot of performance by not having to call stringhelper methods which in turn have to find out if mbstring is loaded then call that or phputf8 etc, etc.

@andrepereiradasilva
Copy link
Author

andrepereiradasilva commented Jun 26, 2017

One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it.

if everything ok, don't need to change phputf8 library to make a change like this.

just something like this at the start in the StringHelper methods would do it

if (self::is_ascii($str) === true)
{
    return php_native_method($str, [...]);
}

@mbabker
Copy link
Contributor

mbabker commented Jun 26, 2017

Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring.

If we can get away with it, cool. https://github.com/symfony/polyfill-mbstring exists so even if we go for "require mbstring" to some extent we can avoid making it a "hard" requirement.

@frankmayer
Copy link

Where to propose such "drastic" change?

@frankmayer
Copy link

@andrepereiradasilva

if everything ok, don't need to change phputf8 library to make a change like this.

just something like this at the start in the StringHelper methods would do it

if (self::is_ascii($str) === true)
{
return php_native_method($str, [...]);
}

Yes, I think that would be ok.

@mbabker
Copy link
Contributor

mbabker commented Jun 26, 2017

Where to propose such "drastic" change?

Prepare a patch, do some testing to ensure it is suitable for our needs, and open a PR. George and I can handle getting attention from the right people as required.

@andrepereiradasilva
Copy link
Author

ok so i solved the issue (some test code i left behind ...) so some results where not correct. Updated them.

@andrepereiradasilva
Copy link
Author

andrepereiradasilva commented Jun 26, 2017

@frankmayer please note that for some tests even using native mb_* seems to be better with checking utf8 before

strtolower 1000 tests

ASCII String

  • [11.345 ms] StringHelper::strtolower($str)
  • [0.739 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
  • [0.726 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
  • [11.317 ms] mb_strtolower($str)
  • [0.157 ms] strtolower($str)

UTF-8 String

  • [13.100 ms] StringHelper::strtolower($str)
  • [13.813 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
  • [13.383 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
  • [12.788 ms] mb_strtolower($str)
  • [0.214 ms] strtolower($str)

@frankmayer
Copy link

@andrepereiradasilva yes, as long as the overall stringhelper overhead is not too expensive that should be a lot faster. Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)?

@andrepereiradasilva
Copy link
Author

also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases

$currentLocale = setlocale(LC_CTYPE, 0);

setlocale(LC_CTYPE, 'en_GB');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;

setlocale(LC_CTYPE, 'en_GB.UTF-8');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;

setlocale(LC_CTYPE, $currentLocale);

Results in:

������� ������ ����� ������ ��, ����������� ���� ������ �����
τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός
τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός

@frankmayer
Copy link

@mbabker thanks. Will take a shot at that, then.

@andrepereiradasilva
Copy link
Author

Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)?

Both. the code used in there

@frankmayer
Copy link

@andrepereiradasilva

also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases

Not sure if this works in all cases (languages and also methods) correctly, though. To be sure, I have always used mbstrings for international chars.

@andrepereiradasilva
Copy link
Author

andrepereiradasilva commented Jun 26, 2017

ok, but it's a lot faster not even using mb_* in those cases for UTF-8 strings. that i guess are most linux installations.

strtolower 1000 tests

ASCII String

  • [11.345 ms] StringHelper::strtolower($str)
  • [0.739 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
  • [0.726 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
  • [0.794 ms] utf8_decode($str) === $str || strpos(setlocale(LC_CTYPE, 0), 'UTF-8') !== false ? strtolower($str) : mb_strtolower($str);
  • [11.317 ms] mb_strtolower($str)
  • [0.157 ms] strtolower($str)

UTF-8 String

  • [13.100 ms] StringHelper::strtolower($str)
  • [13.813 ms] utf8_decode($str) === $str ? strtolower($str) : StringHelper::strtolower($str);
  • [13.383 ms] utf8_decode($str) === $str ? strtolower($str) : mb_strtolower($str);
  • [1.506 ms] utf8_decode($str) === $str || strpos(setlocale(LC_CTYPE, 0), 'UTF-8') !== false ? strtolower($str) : mb_strtolower($str);
  • [12.788 ms] mb_strtolower($str)
  • [0.214 ms] strtolower($str)

@frankmayer
Copy link

If this is doable and cross platform it would be great!
But, you are not dealing with linux only and also make sure that PHP uses the correct locale (Might be more than one installed) 😸

But my point was that it might work with some methods and not work with others. Same for languages.
This would be needing a lot of testing to make sure, nothing breaks.
But, yes, it would be a great performance improvement.

@andrepereiradasilva
Copy link
Author

But, you are not dealing with linux only and also make sure that PHP uses the correct locale (Might be more than one installed) 😸

right, even if it's not possible in Mac/windows, if linux, as it seems, does this conversion when locale is utf-8 if should be no need to make these mb_* overhead in those systems.

But my point was that it might work with some methods and not work with others. Same for languages.
This would be needing a lot of testing to make sure, nothing breaks.

of course

@andrepereiradasilva
Copy link
Author

i see what you mean now. the setlocale is not trusty. just noticed in the test above of strtolower the first letter is still an uppercase with the utf-8 locale. so forget it.

@nibra
Copy link
Contributor

nibra commented Sep 21, 2021

We're preparing for replacing phputf8 (see #37). After that replacement, new benchmarks might show, if this optimisation still makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants