[4.0] Using language specific tokeniser and stemmer for com_finder #20391

Hackwar · 2018-05-14T09:20:10Z

Tokenisation

In order to build the index, com_finder has to tokenise the input from a string into single terms. Right now com_finder here basically only really supports latin based languages and chinese (by a hack). Languages that do not seperate words by white space are not supported. This PR introduces a FinderIndexerLanguage class, which tokenises strings and which can be overridden by language specific tokenisers, so that all languages can be supported. The different tokenisation scripts would have to be supported by the translation teams.

Stemming

com_finder also supports stemming words in order to improve the search results. Stemming reduces a term to its base and thus allows for searching both singular and plural variants of a word with the same search term or different timecases of a word. Unfortunately com_finder up till now only had 3 stemmers, which supported only a small range of languages. Again, this has been outsourced into a file in the language pack.

Providing language specific tokeniser and stemmer

Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.

How to test

Unfortunately this is a pretty opaque change as long as there is only the en-GB implementation of the language support file. Simply said, you should be able to take a current 4.0-dev, test indexing and searching, apply these changes and do the same tests, seeing that you got the same results.

Hackwar · 2018-05-14T09:21:12Z

Should we rename methods named "tokenize" to "tokenise" while we are at it?

infograf768 · 2018-05-14T09:31:39Z

Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.

Where is that en-GB example? What will happen when a language has no language.finder.php

What happens when a site has a mix of languages in its contents? or even is pure multilingual?

Hackwar · 2018-05-14T09:39:50Z

The en-GB example is in /language/en-GB/en-GB.finder.php and translation teams would have to add their version of that file in that same place in their language folder in their language.

When a language does not have such a file, it will fall back on the latin based tokenisation that we are using right now. So tokenisation only has to be overriden for languages that will not work with the white-space-based tokenisation. Stemming is language specific and if not implemented, will only return the original term instead. So translation teams don't HAVE to provide such a file, but their language would benefit from this.

If a site has several content items with different languages, it will choose the right tokeniser and stemmer based on the language of that content item. If people mix languages in a content item that has been marked with a specific language, it will use that tokeniser (which normally shouldn't be an issue) and that stemmer. (What happens then would be dependent on the stemmer implementation. Normally that isn't a problem. In worst case you don't get stemmed words and thus can only search for the specific words.) If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled.

infograf768 · 2018-05-14T09:41:43Z

I hope you are also going to propose the stemmers as I doubt any TT would be able to code this. I include myself in that group.

Hackwar · 2018-05-14T09:52:03Z

Here is a good implementation of stemmers for several languages: https://github.com/wamania/php-stemmer
That should cover all languages that we are supporting today and a few more. There are more stemmers out there that can be used.

brianteeman · 2018-05-14T10:20:35Z

if i read the changes correctly then this PR removes existing stemmers eg for french

Bakual · 2018-05-14T10:26:54Z

I see two issues:
First is like JM said that no TT will be able to write that, even if example code is there. TTs will not understand code which hardly any Joomla contributor understands 😄
Second is that you introduce a new PHP file in the language directory. I would rather love to see us getting rid of PHP files and use different technics if somehow possible. Because translation tools in general struggle to deal with that.
In any case, it would imho make more sense to move the function into the existing localise.php file and use existing workflows to deal with it.

The idea of this PR is certainly good and valid, but the proposal in its current state isn't going to work out well in practice simply due to translaters not being coders.

Hackwar · 2018-05-14T10:55:47Z

@brianteeman Yes, this removes the old stemmers, since the translation packages need to provide those. Up to this PR, we effectively only supported english and french, with this PR these at least can be extended for every language out there.

@Bakual The stemmer (and in parts the tokeniser) are depending on the respective language and unless we want to go the same way like we did in the past and only support X languages, this has to be made modular and be installed by default when users install a language.

Regarding the coding of a stemmer: Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them. As I said, there are stemmers for most languages out there. I'd be happy to provide translators with the necessary code for at least all the stemmers that the project that I linked above provides. That would bring us from 2 to 12 supported languages. Also keep in mind that stemming is not a requirement, but an optional feature.

Stemmer and tokeniser for a specific component don't have any business in a global localisation file. (Yes, we should remove the methods that are com_search specific from those localise.php files.) If you have a better solution where to put the code, then I'm all ears, but it has to be part of the translation/localisation package, otherwise the localisation simply wouldn't be complete. (And again, this file is not mandatory, but an optional feature.)

Bakual · 2018-05-14T11:21:41Z

Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them.

If we (read "You") have to write them anyway, we can as well just put them into core. Same result in the end.
The feature is nice, but a feature that never gets used because nobody will be able to use it, is useless.

this file is not mandatory, but an optional feature

Theoretically yes. But the file will be on Crowdin and a language will only ever be 100% translated if that file is also "translated". And it has to be working code or the language pack will be broken.
So in practice it will be a mandatory file for translation which nobody is able to "translate".

Again, I think the idea is nice and in theory everything is fine. It's the practice where it will fail.

infograf768 · 2018-05-14T11:59:03Z

As I said, there are stemmers for most languages out there.

I hardly can consider most languages there in https://github.com/wamania/php-stemmer#languages (12 only as of today). Only one using non-latin alphabet.
Looks also that the code there would have to be modified to fit (as far as I can see by comparing with the en-GB.finder.php included here).
As for white-space-tokenisation I see no language there who do not use the white space and none at all using multibyte spaces. What happens with Chinese for example?

I certainly would love it if this was practically feasible for the majority of J languages and, as @Bakual rightfully says, independent from lang packs.

brianteeman · 2018-05-14T12:00:46Z

although to be fair the current codebase doesnt support those languages either

Hackwar · 2018-05-14T12:30:17Z

Sorry, I don't get you guys. As brian wrote, right now we don't support none-white-space languages and in terms of stemming, we only support english and french and there is no way to fix that. Now I was so foolish to provide a solution that allows translation packs to provide the best (optional) solution for their language for both situations and you are shooting this down because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there.
If you are worried about not having the nice "100% complete" behind the language, then people can simply put an empty class in the file if you really need that file to get to a 100%.
Yes, that one project "only" supports 12 stemmers. There are more projects out there than that one. And again: If there is no stemming algorithm out there for a language, then there is nothing that we can do. But right now I really would like to be able to use com_finder with stemming on a german, spanish or russian website.

Bakual · 2018-05-14T13:53:24Z

I'm not shooting it down. Again I think it's a valid idea and I like the idea. I'm just saying that in its current form it will not work out as expected because it will not work in practice.

because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there.

That point was to make it clear the current proposal wasn't going to work. Because I'm absolutely clear that you will not provide this, And Michael will also not do that. And then we've run out of developers who understand the stemmer.

If you are worried about not having the nice "100% complete" behind the language, then people can simply put an empty class in the file if you really need that file to get to a 100%.

It's not about reaching an arbitary label. It's important to have a language at 100%. If we ever want to automate something around language packs, this will trigger it (currently it already triggers the building of the pack).
Having to create an empty file so our prcoess works sounds quite stupid. And even creating that empty file is not that easy in Crowdin.

I'm not trying to shoot it down. I'm trying to get it improved somehow. I just don't know how yet.

infograf768 · 2018-05-14T15:05:30Z

@Hackwar

If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled.

I do not understand this. As you should know, when multilingual is not implemented in 4.0, then all content items are marked as ‘*’.

Does that mean that the stemmers would only work when the site is muultilingual?
I am confused

ReLater · 2018-05-14T17:14:29Z

There are so many changes in J4 where I say only some isolated programmers and experts understand them and that others (also extension providers!) can*t adapt/use them in the future without learning/understanding lots of completely new stuff ... ... or have to quit ... ... or have to fall back into Joomla3 times and techniques.

Thus, for me arguments like "poor supporters won't understand how to" are not really comprehensible here.

BTW: Nearly nobody understood the stemmer files before this pr, too. Not to mention how to integrate them. That's the reason why we only have 2 language specific stemmers for years now in Joomla ;-)

ggppdk · 2018-05-15T06:49:50Z

This PR looks like a major improvement, that will be usable not only in com_finder but also elsewhere (by 3rd party too !!)

How does the current solution make it any "easier" for translation teams to provide a word stemmer ?
Also what would an "easier" solution be ?

Stemmers need to be coded, right ? there will never be an easy solution
So i don't understand the "difficult" argument

Also why there is need to involve the translation teams ?
12 language packs will just call the 3rd party code: e.g.
https://github.com/wamania/php-stemmer
which we should include in core libraries

And any other language packs that want this ... will need a programmer to add it,
and then the translation teams will never touch it, as they will be told that it is file that can only be done by a programmer

because stemmer (and tokeniser), are not like language strings that are frequently updated

Bakual · 2018-05-15T12:38:29Z

And any other language packs that want this ... will need a programmer to add it,
and then the translation teams will never touch it, as they will be told that it is file that can only be done by a programmer

It's one thing to write the stemmer. Assuming the team finds a developer who can write it (doubtfully), we still can't integrate the file into Crowdin. Which means our workflow (as decided by OSM) will be broken. We would need to host that file somewhere on GitHub and have an own build process for the language packs, while currently the packs are created by Crowdin. If someone wants to write a custom solution for this, we can talk further. Or maybe someone else has an idea how we can include an optional code (pseudo-translation) file in Crowdin.

infograf768 · 2018-05-15T13:37:40Z

@Bakual
as far as I understood, @ggppdk is not suggesting that the xx-XX.finder.php be part of the language packs, but available in a library. Therefore Crowdin would be out of the scope.
I agree with that kind of solution, but still waiting for Hannes to explain how precisely this is supposed to work with multilingual/monolingual sites.

Hackwar · 2018-05-20T20:17:17Z

@infograf768 So far there is no plan/defined behavior for non-multilanguage sites. Right now (regardless of this change or not), Finder/Smart Search reads the language of the content item and nothing else. If the language is not set or is *, then it wont stem the word. Finder/Smart Search does not care if the site is set to be mono- or multi-lingual. That would be yet another (indeed necessary) change for this component, but mostly irrelevant to this PR.

Again, as I wrote before, it is nice that you bring up all the issues in Finder/Smart Search, but this PR is not meant to fix all of those, but concentrates on making tokenising and stemming possible for every language out there. I'm very much willing to work on this component and its plugins more, but I refuse to blow up the scope of the current PRs even further, just because you now notice that Finder/Smart Search has lots of issues. The changes that I'm proposing are only possible now with Joomla 4.0, otherwise we have to wait till Joomla 5.0 and that would simply be to long. So if we please could look at the current PRs, decide if they do what I claim they do, decide if that is an improvement or not, then test if they do their work correctly and then merge those, then I can get back to adressing the other issues of Finder/Smart Search. But if these PRs stay in limbo for the next year, then get closed because they are outdated and/or 4.0 is released, then we would loose a valuable opportunity to actually improve Joomla here. The longterm goal for 4.0 for me is to get rid of com_search, make com_finder useable and make it useable for all languages. That wont happen if PRs get put on hold because people noticed further issues which are out of scope of the original PR... </rant>

Hackwar · 2018-05-21T08:19:36Z

Just to list all current PRs for Finder/Smart Search:

[4.0] Removing artificial sharding from com_finder #20185 [4.0] Removing artificial sharding from com_finder
[4.0] Make the tuple length of com_finder configurable #20384 [4.0] Removing terms tuples from com_finder
[4.0] Using language specific tokeniser and stemmer for com_finder #20391 [4.0] Using language specific tokeniser and stemmer for com_finder

infograf768 · 2018-05-21T08:20:08Z

Honestly, if the stemmers are not used at all for monolanguage sites (through a parameter at least), this is totally useless.
In any case, I stand by the fact that these new stemmers do NOT need to be included in the language packs.
One goes with the other.

…o j4finder_language

Hackwar · 2018-06-18T16:41:49Z

I fixed the searching. Please test again.

Hackwar · 2018-06-18T17:22:13Z

I don't know what the issue with the CI failure is. Does not look like an issue on my end.

carlitorweb · 2018-06-18T18:05:06Z

@Hackwar can you give me an update for test this? is a lot to read!!, and so far I read, seen the translation team is the one mainly need to test this, right?

Hackwar · 2018-06-18T18:28:58Z

How to test

Create a multi-lingual website, enable the Smart Search content plugin.
Create an article with some text. Make sure that the text contains a word in singular (like "extension") and that it is set to a language.
Create another article with some text. Make sure that that text contains a word with the same stem (like "extensions" or "extensively") and that it is set to a language.
Search in the frontend for your first word. See that it only returns that one result.
Apply patch and rebuild the index in the backend.
Search in the frontend for your first word and see that it returns both articles.

It would be best to not use english or french, since those 2 languages were also supported in the past. If you use english or french, you have to know that your results in step 4 depend on the setting for the stemmer in the configuration of the component.

I hope that covers everything?

carlitorweb · 2018-06-18T18:37:55Z

Thank you!

infograf768 · 2018-06-19T05:15:49Z

I fixed the searching. Please test again.

Solved. Now searching for an unexisting term displays correctly.

I confirm that the search for a term existing in many languages only returns a result for the language in use.

Looks fine to me now concerning stemmer.

infograf768 · 2018-06-19T05:46:33Z

I have tested this item ✅ successfully on 10d252e

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.}

carlitorweb · 2018-06-19T15:20:50Z

I left something out?. When I rebuild the index, after applied the PR, I got this:

UPDATE: The JPatch Tester, lose some diff. This is a long PR, and this tool is not recommended for this scenario.

Hackwar · 2018-06-19T17:37:16Z

@carlitorweb this sounds like your branch did not check out properly. Or did you recreate/change the composer files?

carlitorweb · 2018-06-19T20:40:34Z

Search in the frontend for your first word and see that it returns both articles

@Hackwar no return both article only returns a result for the language in use. But this is the expected behavior

Tested with es and it

carlitorweb · 2018-06-19T20:41:44Z

I have tested this item ✅ successfully on 10d252e

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.}

Quy · 2018-06-19T22:10:35Z

administrator/components/com_finder/config.xml

-		<field
-			name="stem"
-			type="radio"
-			label="COM_FINDER_CONFIG_STEMMER_ENABLE_LABEL"


Remove language strings?

infograf768 · 2018-06-20T05:23:53Z

BTW: tested with Serbian sr-YU (after correcting the localise.php because of JString) and it works fine although we do not have any stemmer for it.

Looks like we have some conflicting files now but for me this can go RTC.

infograf768 · 2018-06-20T05:26:32Z

RTC after correcting conflicts.

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.}

…4finder_language # Conflicts: # composer.lock # libraries/vendor/composer/autoload_classmap.php # libraries/vendor/composer/autoload_psr4.php # libraries/vendor/composer/autoload_static.php

Hackwar · 2018-06-20T07:26:46Z

I've updated the composer files and hopefully those are correct now. So unless someone has an issue with the composer stuff, we can indeed merge this now.

wilsonge · 2018-06-20T10:59:09Z

Awesome work! Thanks Hannes and to all the people who got involved and tested!

Hackwar · 2018-06-20T11:59:48Z

Thanks for merging!

Hackwar added 2 commits May 14, 2018 10:57

Implementing language specific tokeniser and stemmer for com_finder

fe22e20

Fixing comments

d95ee89

Hackwar requested review from brianteeman and mbabker as code owners May 14, 2018 09:20

joomla-cms-bot added Language Change This is for Translators PR-4.0-dev labels May 14, 2018

Fixing codestyle

2f10b16

This was referenced May 21, 2018

[4.0] Removing artificial sharding from com_finder #20185

Merged

[4.0] Make the tuple length of com_finder configurable #20384

Merged

Merge branch 'j4finder_language' of github.com:Hackwar/joomla-cms int…

fa2f221

…o j4finder_language

Merge branch '4.0-dev' into j4finder_language

7940d61

Quy reviewed Jun 19, 2018

View reviewed changes

joomla-cms-bot added the RTC This Pull Request is Ready To Commit label Jun 20, 2018

Hackwar added 2 commits June 20, 2018 08:26

Merge branch '4.0-dev' of https://github.com/joomla/joomla-cms into j…

3fad22c

…4finder_language # Conflicts: # composer.lock # libraries/vendor/composer/autoload_classmap.php # libraries/vendor/composer/autoload_psr4.php # libraries/vendor/composer/autoload_static.php

Updating composer files

65e8994

Removing obsolete language strings

3e3aa06

joomla-cms-bot added the Language Change This is for Translators label Jun 20, 2018

wilsonge merged commit ff3ed42 into joomla:4.0-dev Jun 20, 2018

joomla-cms-bot removed the RTC This Pull Request is Ready To Commit label Jun 20, 2018

Hackwar deleted the j4finder_language branch June 20, 2018 11:59

zero-24 added this to the Joomla 4.0 milestone Jun 20, 2018

infograf768 mentioned this pull request Mar 13, 2019

[4.0] Fix uninstalling extensions #24163

Merged

Hackwar mentioned this pull request Mar 23, 2021

[4.0] Prevent finding no stemmer in finder when language is "*" #31795

Closed

[4.0] Using language specific tokeniser and stemmer for com_finder #20391

[4.0] Using language specific tokeniser and stemmer for com_finder #20391

Conversation

Hackwar commented May 14, 2018 • edited by infograf768 Loading

Tokenisation

Stemming

Providing language specific tokeniser and stemmer

How to test

Hackwar commented May 14, 2018

infograf768 commented May 14, 2018 • edited Loading

Hackwar commented May 14, 2018

infograf768 commented May 14, 2018

Hackwar commented May 14, 2018

brianteeman commented May 14, 2018

Bakual commented May 14, 2018

Hackwar commented May 14, 2018

Bakual commented May 14, 2018

infograf768 commented May 14, 2018

brianteeman commented May 14, 2018

Hackwar commented May 14, 2018

Bakual commented May 14, 2018

infograf768 commented May 14, 2018

ReLater commented May 14, 2018

ggppdk commented May 15, 2018 • edited Loading

Bakual commented May 15, 2018

infograf768 commented May 15, 2018

Hackwar commented May 20, 2018

Hackwar commented May 21, 2018

infograf768 commented May 21, 2018

Hackwar commented Jun 18, 2018

Hackwar commented Jun 18, 2018

carlitorweb commented Jun 18, 2018 • edited Loading

Hackwar commented Jun 18, 2018

How to test

carlitorweb commented Jun 18, 2018

infograf768 commented Jun 19, 2018

infograf768 commented Jun 19, 2018

carlitorweb commented Jun 19, 2018 • edited Loading

Hackwar commented Jun 19, 2018

carlitorweb commented Jun 19, 2018 • edited Loading

carlitorweb commented Jun 19, 2018

Quy Jun 19, 2018

Choose a reason for hiding this comment

infograf768 commented Jun 20, 2018

infograf768 commented Jun 20, 2018

Hackwar commented Jun 20, 2018

wilsonge commented Jun 20, 2018 • edited Loading

Hackwar commented Jun 20, 2018

Hackwar commented May 14, 2018 •

edited by infograf768

Loading

infograf768 commented May 14, 2018 •

edited

Loading

ggppdk commented May 15, 2018 •

edited

Loading

carlitorweb commented Jun 18, 2018 •

edited

Loading

carlitorweb commented Jun 19, 2018 •

edited

Loading

carlitorweb commented Jun 19, 2018 •

edited

Loading

wilsonge commented Jun 20, 2018 •

edited

Loading