Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] Using language specific tokeniser and stemmer for com_finder #20391

Merged
merged 40 commits into from
Jun 20, 2018

Conversation

Hackwar
Copy link
Member

@Hackwar Hackwar commented May 14, 2018

Tokenisation

In order to build the index, com_finder has to tokenise the input from a string into single terms. Right now com_finder here basically only really supports latin based languages and chinese (by a hack). Languages that do not seperate words by white space are not supported. This PR introduces a FinderIndexerLanguage class, which tokenises strings and which can be overridden by language specific tokenisers, so that all languages can be supported. The different tokenisation scripts would have to be supported by the translation teams.

Stemming

com_finder also supports stemming words in order to improve the search results. Stemming reduces a term to its base and thus allows for searching both singular and plural variants of a word with the same search term or different timecases of a word. Unfortunately com_finder up till now only had 3 stemmers, which supported only a small range of languages. Again, this has been outsourced into a file in the language pack.

Providing language specific tokeniser and stemmer

Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.

How to test

Unfortunately this is a pretty opaque change as long as there is only the en-GB implementation of the language support file. Simply said, you should be able to take a current 4.0-dev, test indexing and searching, apply these changes and do the same tests, seeing that you got the same results.

@Hackwar
Copy link
Member Author

Hackwar commented May 14, 2018

Should we rename methods named "tokenize" to "tokenise" while we are at it?

@infograf768
Copy link
Member

infograf768 commented May 14, 2018

Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.

Where is that en-GB example? What will happen when a language has no language.finder.php

What happens when a site has a mix of languages in its contents? or even is pure multilingual?

@Hackwar
Copy link
Member Author

Hackwar commented May 14, 2018

The en-GB example is in /language/en-GB/en-GB.finder.php and translation teams would have to add their version of that file in that same place in their language folder in their language.

When a language does not have such a file, it will fall back on the latin based tokenisation that we are using right now. So tokenisation only has to be overriden for languages that will not work with the white-space-based tokenisation. Stemming is language specific and if not implemented, will only return the original term instead. So translation teams don't HAVE to provide such a file, but their language would benefit from this.

If a site has several content items with different languages, it will choose the right tokeniser and stemmer based on the language of that content item. If people mix languages in a content item that has been marked with a specific language, it will use that tokeniser (which normally shouldn't be an issue) and that stemmer. (What happens then would be dependent on the stemmer implementation. Normally that isn't a problem. In worst case you don't get stemmed words and thus can only search for the specific words.) If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled.

@infograf768
Copy link
Member

I hope you are also going to propose the stemmers as I doubt any TT would be able to code this. I include myself in that group.

@Hackwar
Copy link
Member Author

Hackwar commented May 14, 2018

Here is a good implementation of stemmers for several languages: https://github.com/wamania/php-stemmer
That should cover all languages that we are supporting today and a few more. There are more stemmers out there that can be used.

@brianteeman
Copy link
Contributor

if i read the changes correctly then this PR removes existing stemmers eg for french

@Bakual
Copy link
Contributor

Bakual commented May 14, 2018

I see two issues:
First is like JM said that no TT will be able to write that, even if example code is there. TTs will not understand code which hardly any Joomla contributor understands 😄
Second is that you introduce a new PHP file in the language directory. I would rather love to see us getting rid of PHP files and use different technics if somehow possible. Because translation tools in general struggle to deal with that.
In any case, it would imho make more sense to move the function into the existing localise.php file and use existing workflows to deal with it.

The idea of this PR is certainly good and valid, but the proposal in its current state isn't going to work out well in practice simply due to translaters not being coders.

@Hackwar
Copy link
Member Author

Hackwar commented May 14, 2018

@brianteeman Yes, this removes the old stemmers, since the translation packages need to provide those. Up to this PR, we effectively only supported english and french, with this PR these at least can be extended for every language out there.

@Bakual The stemmer (and in parts the tokeniser) are depending on the respective language and unless we want to go the same way like we did in the past and only support X languages, this has to be made modular and be installed by default when users install a language.

Regarding the coding of a stemmer: Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them. As I said, there are stemmers for most languages out there. I'd be happy to provide translators with the necessary code for at least all the stemmers that the project that I linked above provides. That would bring us from 2 to 12 supported languages. Also keep in mind that stemming is not a requirement, but an optional feature.

Stemmer and tokeniser for a specific component don't have any business in a global localisation file. (Yes, we should remove the methods that are com_search specific from those localise.php files.) If you have a better solution where to put the code, then I'm all ears, but it has to be part of the translation/localisation package, otherwise the localisation simply wouldn't be complete. (And again, this file is not mandatory, but an optional feature.)

@Bakual
Copy link
Contributor

Bakual commented May 14, 2018

Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them.

If we (read "You") have to write them anyway, we can as well just put them into core. Same result in the end.
The feature is nice, but a feature that never gets used because nobody will be able to use it, is useless.

this file is not mandatory, but an optional feature

Theoretically yes. But the file will be on Crowdin and a language will only ever be 100% translated if that file is also "translated". And it has to be working code or the language pack will be broken.
So in practice it will be a mandatory file for translation which nobody is able to "translate".

Again, I think the idea is nice and in theory everything is fine. It's the practice where it will fail.

@infograf768
Copy link
Member

As I said, there are stemmers for most languages out there.

I hardly can consider most languages there in https://github.com/wamania/php-stemmer#languages (12 only as of today). Only one using non-latin alphabet.
Looks also that the code there would have to be modified to fit (as far as I can see by comparing with the en-GB.finder.php included here).
As for white-space-tokenisation I see no language there who do not use the white space and none at all using multibyte spaces. What happens with Chinese for example?

I certainly would love it if this was practically feasible for the majority of J languages and, as @Bakual rightfully says, independent from lang packs.

@brianteeman
Copy link
Contributor

although to be fair the current codebase doesnt support those languages either

@Hackwar
Copy link
Member Author

Hackwar commented May 14, 2018

Sorry, I don't get you guys. As brian wrote, right now we don't support none-white-space languages and in terms of stemming, we only support english and french and there is no way to fix that. Now I was so foolish to provide a solution that allows translation packs to provide the best (optional) solution for their language for both situations and you are shooting this down because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there.
If you are worried about not having the nice "100% complete" behind the language, then people can simply put an empty class in the file if you really need that file to get to a 100%.
Yes, that one project "only" supports 12 stemmers. There are more projects out there than that one. And again: If there is no stemming algorithm out there for a language, then there is nothing that we can do. But right now I really would like to be able to use com_finder with stemming on a german, spanish or russian website.

@Bakual
Copy link
Contributor

Bakual commented May 14, 2018

I'm not shooting it down. Again I think it's a valid idea and I like the idea. I'm just saying that in its current form it will not work out as expected because it will not work in practice.

because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there.

That point was to make it clear the current proposal wasn't going to work. Because I'm absolutely clear that you will not provide this, And Michael will also not do that. And then we've run out of developers who understand the stemmer.

If you are worried about not having the nice "100% complete" behind the language, then people can simply put an empty class in the file if you really need that file to get to a 100%.

It's not about reaching an arbitary label. It's important to have a language at 100%. If we ever want to automate something around language packs, this will trigger it (currently it already triggers the building of the pack).
Having to create an empty file so our prcoess works sounds quite stupid. And even creating that empty file is not that easy in Crowdin.

I'm not trying to shoot it down. I'm trying to get it improved somehow. I just don't know how yet.

@infograf768
Copy link
Member

@Hackwar

If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled.

I do not understand this. As you should know, when multilingual is not implemented in 4.0, then all content items are marked as ‘*’.

Does that mean that the stemmers would only work when the site is muultilingual?
I am confused

@ReLater
Copy link
Contributor

ReLater commented May 14, 2018

There are so many changes in J4 where I say only some isolated programmers and experts understand them and that others (also extension providers!) can*t adapt/use them in the future without learning/understanding lots of completely new stuff ... ... or have to quit ... ... or have to fall back into Joomla3 times and techniques.

Thus, for me arguments like "poor supporters won't understand how to" are not really comprehensible here.

BTW: Nearly nobody understood the stemmer files before this pr, too. Not to mention how to integrate them. That's the reason why we only have 2 language specific stemmers for years now in Joomla ;-)

@ggppdk
Copy link
Contributor

ggppdk commented May 15, 2018

This PR looks like a major improvement, that will be usable not only in com_finder but also elsewhere (by 3rd party too !!)

  1. How does the current solution make it any "easier" for translation teams to provide a word stemmer ?

  2. Also what would an "easier" solution be ?

Stemmers need to be coded, right ? there will never be an easy solution
So i don't understand the "difficult" argument

Also why there is need to involve the translation teams ?
12 language packs will just call the 3rd party code: e.g.
https://github.com/wamania/php-stemmer
which we should include in core libraries

And any other language packs that want this ... will need a programmer to add it,
and then the translation teams will never touch it, as they will be told that it is file that can only be done by a programmer

because stemmer (and tokeniser), are not like language strings that are frequently updated

@Bakual
Copy link
Contributor

Bakual commented May 15, 2018

And any other language packs that want this ... will need a programmer to add it,
and then the translation teams will never touch it, as they will be told that it is file that can only be done by a programmer

It's one thing to write the stemmer. Assuming the team finds a developer who can write it (doubtfully), we still can't integrate the file into Crowdin. Which means our workflow (as decided by OSM) will be broken. We would need to host that file somewhere on GitHub and have an own build process for the language packs, while currently the packs are created by Crowdin. If someone wants to write a custom solution for this, we can talk further. Or maybe someone else has an idea how we can include an optional code (pseudo-translation) file in Crowdin.

@infograf768
Copy link
Member

@Bakual
as far as I understood, @ggppdk is not suggesting that the xx-XX.finder.php be part of the language packs, but available in a library. Therefore Crowdin would be out of the scope.
I agree with that kind of solution, but still waiting for Hannes to explain how precisely this is supposed to work with multilingual/monolingual sites.

@Hackwar
Copy link
Member Author

Hackwar commented May 20, 2018

@infograf768 So far there is no plan/defined behavior for non-multilanguage sites. Right now (regardless of this change or not), Finder/Smart Search reads the language of the content item and nothing else. If the language is not set or is *, then it wont stem the word. Finder/Smart Search does not care if the site is set to be mono- or multi-lingual. That would be yet another (indeed necessary) change for this component, but mostly irrelevant to this PR.

Again, as I wrote before, it is nice that you bring up all the issues in Finder/Smart Search, but this PR is not meant to fix all of those, but concentrates on making tokenising and stemming possible for every language out there. I'm very much willing to work on this component and its plugins more, but I refuse to blow up the scope of the current PRs even further, just because you now notice that Finder/Smart Search has lots of issues. The changes that I'm proposing are only possible now with Joomla 4.0, otherwise we have to wait till Joomla 5.0 and that would simply be to long. So if we please could look at the current PRs, decide if they do what I claim they do, decide if that is an improvement or not, then test if they do their work correctly and then merge those, then I can get back to adressing the other issues of Finder/Smart Search. But if these PRs stay in limbo for the next year, then get closed because they are outdated and/or 4.0 is released, then we would loose a valuable opportunity to actually improve Joomla here. The longterm goal for 4.0 for me is to get rid of com_search, make com_finder useable and make it useable for all languages. That wont happen if PRs get put on hold because people noticed further issues which are out of scope of the original PR... </rant>

@Hackwar
Copy link
Member Author

Hackwar commented May 21, 2018

Just to list all current PRs for Finder/Smart Search:

@infograf768
Copy link
Member

Honestly, if the stemmers are not used at all for monolanguage sites (through a parameter at least), this is totally useless.
In any case, I stand by the fact that these new stemmers do NOT need to be included in the language packs.
One goes with the other.

@Hackwar
Copy link
Member Author

Hackwar commented Jun 18, 2018

I fixed the searching. Please test again.

@Hackwar
Copy link
Member Author

Hackwar commented Jun 18, 2018

I don't know what the issue with the CI failure is. Does not look like an issue on my end.

@carlitorweb
Copy link
Member

carlitorweb commented Jun 18, 2018

@Hackwar can you give me an update for test this? is a lot to read!!, and so far I read, seen the translation team is the one mainly need to test this, right?

@Hackwar
Copy link
Member Author

Hackwar commented Jun 18, 2018

How to test

  1. Create a multi-lingual website, enable the Smart Search content plugin.
  2. Create an article with some text. Make sure that the text contains a word in singular (like "extension") and that it is set to a language.
  3. Create another article with some text. Make sure that that text contains a word with the same stem (like "extensions" or "extensively") and that it is set to a language.
  4. Search in the frontend for your first word. See that it only returns that one result.
  5. Apply patch and rebuild the index in the backend.
  6. Search in the frontend for your first word and see that it returns both articles.

It would be best to not use english or french, since those 2 languages were also supported in the past. If you use english or french, you have to know that your results in step 4 depend on the setting for the stemmer in the configuration of the component.

I hope that covers everything?

@carlitorweb
Copy link
Member

Thank you!

@infograf768
Copy link
Member

I fixed the searching. Please test again.

Solved. Now searching for an unexisting term displays correctly.

screen shot 2018-06-19 at 06 58 50

I confirm that the search for a term existing in many languages only returns a result for the language in use.

Looks fine to me now concerning stemmer.

@infograf768
Copy link
Member

I have tested this item ✅ successfully on 10d252e


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.

@carlitorweb
Copy link
Member

carlitorweb commented Jun 19, 2018

I left something out?. When I rebuild the index, after applied the PR, I got this:

screenshot_20180619112205

UPDATE: The JPatch Tester, lose some diff. This is a long PR, and this tool is not recommended for this scenario.

@Hackwar
Copy link
Member Author

Hackwar commented Jun 19, 2018

@carlitorweb this sounds like your branch did not check out properly. Or did you recreate/change the composer files?

@carlitorweb
Copy link
Member

carlitorweb commented Jun 19, 2018

Search in the frontend for your first word and see that it returns both articles

@Hackwar no return both article only returns a result for the language in use. But this is the expected behavior

Tested with es and it

screenshot_20180619164926

@carlitorweb
Copy link
Member

I have tested this item ✅ successfully on 10d252e


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.

<field
name="stem"
type="radio"
label="COM_FINDER_CONFIG_STEMMER_ENABLE_LABEL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove language strings?

@infograf768
Copy link
Member

BTW: tested with Serbian sr-YU (after correcting the localise.php because of JString) and it works fine although we do not have any stemmer for it.

Looks like we have some conflicting files now but for me this can go RTC.

@infograf768
Copy link
Member

RTC after correcting conflicts.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.

@joomla-cms-bot joomla-cms-bot added the RTC This Pull Request is Ready To Commit label Jun 20, 2018
Hackwar added 2 commits June 20, 2018 08:26
…4finder_language

# Conflicts:
#	composer.lock
#	libraries/vendor/composer/autoload_classmap.php
#	libraries/vendor/composer/autoload_psr4.php
#	libraries/vendor/composer/autoload_static.php
@Hackwar
Copy link
Member Author

Hackwar commented Jun 20, 2018

I've updated the composer files and hopefully those are correct now. So unless someone has an issue with the composer stuff, we can indeed merge this now.

@joomla-cms-bot joomla-cms-bot added the Language Change This is for Translators label Jun 20, 2018
@wilsonge wilsonge merged commit ff3ed42 into joomla:4.0-dev Jun 20, 2018
@joomla-cms-bot joomla-cms-bot removed the RTC This Pull Request is Ready To Commit label Jun 20, 2018
@wilsonge
Copy link
Contributor

wilsonge commented Jun 20, 2018

Awesome work! Thanks Hannes and to all the people who got involved and tested!

@Hackwar
Copy link
Member Author

Hackwar commented Jun 20, 2018

Thanks for merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.