-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] StringHelper: check if string as utf8 chars before doing utf8 conversion? #19
Comments
I can't say I've benchmarked any of Joomla's code in quite some time, so in places where we can make improvements, I'm all for it. One thing to keep in mind with the phputf8 library. Even though that version of it has been long abandoned, we've still treated it as an external resource and never tried to fork/patch it. There's a part of me that wants to sit down and look at maintained alternatives to replace it, or one day bite the bullet and just do a hard fork (including pulling the tests out of http://phputf8.cvs.sourceforge.net/viewvc/phputf8/utf8/ for the library code). |
@andrepereiradasilva I had done some of those performance tests for the current CMS and came to about the same conclusions as you. This is a good step forward in performance and if I am not mistaken, I removed some stringhelper calls where it was sure that no utf-8 was involved. Also, in my opinion, the next major version of Joomla should not use the phputf8 library. It should require the use of mbstring. Most - if not all - hosters already support the use of mbstring. It is simple. we would gain a lot of performance by not having to call stringhelper methods which in turn have to find out if mbstring is loaded then call that or phputf8 etc, etc. |
if everything ok, don't need to change phputf8 library to make a change like this. just something like this at the start in the StringHelper methods would do it if (self::is_ascii($str) === true)
{
return php_native_method($str, [...]);
} |
If we can get away with it, cool. https://github.com/symfony/polyfill-mbstring exists so even if we go for "require mbstring" to some extent we can avoid making it a "hard" requirement. |
Where to propose such "drastic" change? |
Yes, I think that would be ok. |
Prepare a patch, do some testing to ensure it is suitable for our needs, and open a PR. George and I can handle getting attention from the right people as required. |
ok so i solved the issue (some test code i left behind ...) so some results where not correct. Updated them. |
@frankmayer please note that for some tests even using native mb_* seems to be better with checking utf8 before strtolower 1000 tests
|
@andrepereiradasilva yes, as long as the overall stringhelper overhead is not too expensive that should be a lot faster. Did you do these (mb_strtolower() vs strtolower() ) tests by invoking the stringhelper methods or by themselves (with or without checking if is_ascii)? |
also shouldn't it also take in considerantion that if locale is utf-8, not an expert in this charset things so could be missing something, but seems there's even no need for those mb_* in those cases $currentLocale = setlocale(LC_CTYPE, 0);
setlocale(LC_CTYPE, 'en_GB');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
setlocale(LC_CTYPE, 'en_GB.UTF-8');
echo strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
echo mb_strtolower('Τάχιστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός') . PHP_EOL;
setlocale(LC_CTYPE, $currentLocale); Results in:
|
@mbabker thanks. Will take a shot at that, then. |
Both. the code used in there |
Not sure if this works in all cases (languages and also methods) correctly, though. To be sure, I have always used mbstrings for international chars. |
ok, but it's a lot faster not even using mb_* in those cases for UTF-8 strings. that i guess are most linux installations. strtolower 1000 tests
|
If this is doable and cross platform it would be great! But my point was that it might work with some methods and not work with others. Same for languages. |
right, even if it's not possible in Mac/windows, if linux, as it seems, does this conversion when locale is utf-8 if should be no need to make these mb_* overhead in those systems.
of course |
i see what you mean now. the setlocale is not trusty. just noticed in the test above of strtolower the first letter is still an uppercase with the utf-8 locale. so forget it. |
We're preparing for replacing phputf8 (see #37). After that replacement, new benchmarks might show, if this optimisation still makes sense. |
Steps to reproduce the issue
Was testing the performance of StringHelper calls.
For what i understand they are mostly a way to make native php functions utf8 compatible.
But this makes some performance overhead since even pure ascii strings will try to do the utf8 conversion.
So, made some tests with 1000 iterations in 3 scenarios:
utf8_decode($str) === $str
(or evenStringHelper::is_ascii($str) === true
) to decide if it should use the StringHelper method or the php native methodUsed to strings to test:
Results on some methods
strpos tests
strtoupper tests
strrev tests
Remarks
It's clear that the
Utf8CheckBefore
method is much faster in plain ascii strings. And a little slower in strings with utf-8 chars.Since many of the StringHelper usage across the cms (and overall) should be plain ascii shouldn't be better to check if the strign as utf-8 chars for deciding if it should use utf8 methods or not?
System information (as much as possible)
String 2.0
PHP 7.0.19 with mbstring extension (without is even slower)
Additional comments
@frankmayer would aprreciate your contribution here too
The text was updated successfully, but these errors were encountered: