-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[4.0] Using language specific tokeniser and stemmer for com_finder #20391
Conversation
Should we rename methods named "tokenize" to "tokenise" while we are at it? |
Where is that en-GB example? What will happen when a language has no language.finder.php What happens when a site has a mix of languages in its contents? or even is pure multilingual? |
The en-GB example is in /language/en-GB/en-GB.finder.php and translation teams would have to add their version of that file in that same place in their language folder in their language. When a language does not have such a file, it will fall back on the latin based tokenisation that we are using right now. So tokenisation only has to be overriden for languages that will not work with the white-space-based tokenisation. Stemming is language specific and if not implemented, will only return the original term instead. So translation teams don't HAVE to provide such a file, but their language would benefit from this. If a site has several content items with different languages, it will choose the right tokeniser and stemmer based on the language of that content item. If people mix languages in a content item that has been marked with a specific language, it will use that tokeniser (which normally shouldn't be an issue) and that stemmer. (What happens then would be dependent on the stemmer implementation. Normally that isn't a problem. In worst case you don't get stemmed words and thus can only search for the specific words.) If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled. |
I hope you are also going to propose the stemmers as I doubt any TT would be able to code this. I include myself in that group. |
Here is a good implementation of stemmers for several languages: https://github.com/wamania/php-stemmer |
if i read the changes correctly then this PR removes existing stemmers eg for french |
I see two issues: The idea of this PR is certainly good and valid, but the proposal in its current state isn't going to work out well in practice simply due to translaters not being coders. |
@brianteeman Yes, this removes the old stemmers, since the translation packages need to provide those. Up to this PR, we effectively only supported english and french, with this PR these at least can be extended for every language out there. @Bakual The stemmer (and in parts the tokeniser) are depending on the respective language and unless we want to go the same way like we did in the past and only support X languages, this has to be made modular and be installed by default when users install a language. Regarding the coding of a stemmer: Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them. As I said, there are stemmers for most languages out there. I'd be happy to provide translators with the necessary code for at least all the stemmers that the project that I linked above provides. That would bring us from 2 to 12 supported languages. Also keep in mind that stemming is not a requirement, but an optional feature. Stemmer and tokeniser for a specific component don't have any business in a global localisation file. (Yes, we should remove the methods that are com_search specific from those localise.php files.) If you have a better solution where to put the code, then I'm all ears, but it has to be part of the translation/localisation package, otherwise the localisation simply wouldn't be complete. (And again, this file is not mandatory, but an optional feature.) |
If we (read "You") have to write them anyway, we can as well just put them into core. Same result in the end.
Theoretically yes. But the file will be on Crowdin and a language will only ever be 100% translated if that file is also "translated". And it has to be working code or the language pack will be broken. Again, I think the idea is nice and in theory everything is fine. It's the practice where it will fail. |
I hardly can consider most languages there in https://github.com/wamania/php-stemmer#languages (12 only as of today). Only one using non-latin alphabet. I certainly would love it if this was practically feasible for the majority of J languages and, as @Bakual rightfully says, independent from lang packs. |
although to be fair the current codebase doesnt support those languages either |
Sorry, I don't get you guys. As brian wrote, right now we don't support none-white-space languages and in terms of stemming, we only support english and french and there is no way to fix that. Now I was so foolish to provide a solution that allows translation packs to provide the best (optional) solution for their language for both situations and you are shooting this down because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there. |
I'm not shooting it down. Again I think it's a valid idea and I like the idea. I'm just saying that in its current form it will not work out as expected because it will not work in practice.
That point was to make it clear the current proposal wasn't going to work. Because I'm absolutely clear that you will not provide this, And Michael will also not do that. And then we've run out of developers who understand the stemmer.
It's not about reaching an arbitary label. It's important to have a language at 100%. If we ever want to automate something around language packs, this will trigger it (currently it already triggers the building of the pack). I'm not trying to shoot it down. I'm trying to get it improved somehow. I just don't know how yet. |
I do not understand this. As you should know, when multilingual is not implemented in 4.0, then all content items are marked as ‘*’. Does that mean that the stemmers would only work when the site is muultilingual? |
There are so many changes in J4 where I say only some isolated programmers and experts understand them and that others (also extension providers!) can*t adapt/use them in the future without learning/understanding lots of completely new stuff ... ... or have to quit ... ... or have to fall back into Joomla3 times and techniques. Thus, for me arguments like "poor supporters won't understand how to" are not really comprehensible here. BTW: Nearly nobody understood the stemmer files before this pr, too. Not to mention how to integrate them. That's the reason why we only have 2 language specific stemmers for years now in Joomla ;-) |
This PR looks like a major improvement, that will be usable not only in com_finder but also elsewhere (by 3rd party too !!)
Stemmers need to be coded, right ? there will never be an easy solution Also why there is need to involve the translation teams ? And any other language packs that want this ... will need a programmer to add it, because stemmer (and tokeniser), are not like language strings that are frequently updated |
It's one thing to write the stemmer. Assuming the team finds a developer who can write it (doubtfully), we still can't integrate the file into Crowdin. Which means our workflow (as decided by OSM) will be broken. We would need to host that file somewhere on GitHub and have an own build process for the language packs, while currently the packs are created by Crowdin. If someone wants to write a custom solution for this, we can talk further. Or maybe someone else has an idea how we can include an optional code (pseudo-translation) file in Crowdin. |
@Bakual |
@infograf768 So far there is no plan/defined behavior for non-multilanguage sites. Right now (regardless of this change or not), Finder/Smart Search reads the language of the content item and nothing else. If the language is not set or is *, then it wont stem the word. Finder/Smart Search does not care if the site is set to be mono- or multi-lingual. That would be yet another (indeed necessary) change for this component, but mostly irrelevant to this PR. Again, as I wrote before, it is nice that you bring up all the issues in Finder/Smart Search, but this PR is not meant to fix all of those, but concentrates on making tokenising and stemming possible for every language out there. I'm very much willing to work on this component and its plugins more, but I refuse to blow up the scope of the current PRs even further, just because you now notice that Finder/Smart Search has lots of issues. The changes that I'm proposing are only possible now with Joomla 4.0, otherwise we have to wait till Joomla 5.0 and that would simply be to long. So if we please could look at the current PRs, decide if they do what I claim they do, decide if that is an improvement or not, then test if they do their work correctly and then merge those, then I can get back to adressing the other issues of Finder/Smart Search. But if these PRs stay in limbo for the next year, then get closed because they are outdated and/or 4.0 is released, then we would loose a valuable opportunity to actually improve Joomla here. The longterm goal for 4.0 for me is to get rid of com_search, make com_finder useable and make it useable for all languages. That wont happen if PRs get put on hold because people noticed further issues which are out of scope of the original PR... </rant> |
Just to list all current PRs for Finder/Smart Search:
|
Honestly, if the stemmers are not used at all for monolanguage sites (through a parameter at least), this is totally useless. |
…o j4finder_language
I fixed the searching. Please test again. |
I don't know what the issue with the CI failure is. Does not look like an issue on my end. |
@Hackwar can you give me an update for test this? is a lot to read!!, and so far I read, seen the translation team is the one mainly need to test this, right? |
How to test
It would be best to not use english or french, since those 2 languages were also supported in the past. If you use english or french, you have to know that your results in step 4 depend on the setting for the stemmer in the configuration of the component. I hope that covers everything? |
Thank you! |
I have tested this item ✅ successfully on 10d252e This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391. |
@carlitorweb this sounds like your branch did not check out properly. Or did you recreate/change the composer files? |
@Hackwar no return both article only returns a result for the language in use. But this is the expected behavior Tested with |
I have tested this item ✅ successfully on 10d252e This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391. |
<field | ||
name="stem" | ||
type="radio" | ||
label="COM_FINDER_CONFIG_STEMMER_ENABLE_LABEL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove language strings?
BTW: tested with Serbian sr-YU (after correcting the localise.php because of JString) and it works fine although we do not have any stemmer for it. Looks like we have some conflicting files now but for me this can go RTC. |
RTC after correcting conflicts. This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391. |
…4finder_language # Conflicts: # composer.lock # libraries/vendor/composer/autoload_classmap.php # libraries/vendor/composer/autoload_psr4.php # libraries/vendor/composer/autoload_static.php
I've updated the composer files and hopefully those are correct now. So unless someone has an issue with the composer stuff, we can indeed merge this now. |
Awesome work! Thanks Hannes and to all the people who got involved and tested! |
Thanks for merging! |
Tokenisation
In order to build the index, com_finder has to tokenise the input from a string into single terms. Right now com_finder here basically only really supports latin based languages and chinese (by a hack). Languages that do not seperate words by white space are not supported. This PR introduces a FinderIndexerLanguage class, which tokenises strings and which can be overridden by language specific tokenisers, so that all languages can be supported. The different tokenisation scripts would have to be supported by the translation teams.
Stemming
com_finder also supports stemming words in order to improve the search results. Stemming reduces a term to its base and thus allows for searching both singular and plural variants of a word with the same search term or different timecases of a word. Unfortunately com_finder up till now only had 3 stemmers, which supported only a small range of languages. Again, this has been outsourced into a file in the language pack.
Providing language specific tokeniser and stemmer
Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.
How to test
Unfortunately this is a pretty opaque change as long as there is only the en-GB implementation of the language support file. Simply said, you should be able to take a current 4.0-dev, test indexing and searching, apply these changes and do the same tests, seeing that you got the same results.