Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] Prevent finding no stemmer in finder when language is "*" #31795

Closed
wants to merge 7 commits into from

Conversation

bembelimen
Copy link
Contributor

Pull Request for Issue #30372 .

Summary of Changes

Try to find the correct language when "default star (*)" is given in finder stemmmer.

Testing Instructions

See: #30372

Actual result BEFORE applying this Pull Request

Current language not found when "*"

Expected result AFTER applying this Pull Request

Current language found.

@bembelimen
Copy link
Contributor Author

@Hackwar could you please check here and give your feedback?

@Hackwar
Copy link
Member

Hackwar commented Dec 28, 2020

There are loads of languages which have no stemmer, thus we can't assume a stemmer is present. If the language is *, we shouldn't even try finding a stemmer in the constructor of the Language class. Generally however we are handling this correctly from my perspective, since we are catching the exception correctly. If xdebug is too eager here, I don't really think this is our fault... Anyway, the correct and easiest solution here is to simply exclude * from getting a stemmer and that's it.

@Hackwar
Copy link
Member

Hackwar commented Dec 28, 2020

To extend a bit: stemming is a process that takes processing power and I can see several situations where you don't want to have a stemmer at all. * is the equivalent of "I don't want a stemmer"

@bembelimen
Copy link
Contributor Author

It was not the intention to prevent the exception but to find a stemmer if possible.

But I have no feelings if this improvement should be merge or not.

@richard67
Copy link
Member

richard67 commented Feb 1, 2021

Anyway, the correct and easiest solution here is to simply exclude * from getting a stemmer and that's it.

@bembelimen Could you change this PR so it follows @Hackwar 's suggestion? To me it seems to be the better way, too.

@brianteeman
Copy link
Contributor

That would mean that a default install will never use the stemmer even though the stemmer is available for english, wouldnt it?

@richard67
Copy link
Member

That would mean that a default install will never use the stemmer even though the stemmer is available for english, wouldnt it?

Hmm, that's true ... not what's desired.

@Hackwar
Copy link
Member

Hackwar commented Feb 1, 2021

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

@brianteeman
Copy link
Contributor

brianteeman commented Feb 1, 2021

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

And where do they do that?

* is the equivalent of "I don't want a stemmer"

* is used to mean either required or everything - I have never seen it used to mean nothing

@Hackwar
Copy link
Member

Hackwar commented Feb 1, 2021

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

And where do they do that?

The option is in the component configuration.

* is the equivalent of "I don't want a stemmer"

* is used to mean either required or everything - I have never seen it used to mean nothing

That is why that isn't visible to the user. I did not decide that * is the internal identifier for items without a language. *means I don't know the language (unless set in the component configuration) and thus I wouldn't know which stemmer to use. Since the wrong stemmer would be worse than no stemmer, people should make a conscious decision what they need.

@brianteeman
Copy link
Contributor

That is intended behavior. The stemmer is resource intensive and I want people to make a conscious decision to enable this, also to select the right one/configure the search in its entirety for their system instead of simply enabling it and not looking into it further.

And where do they do that?

The option is in the component configuration.

I asked as I dont see anything regarding a stemmer in the component configuration for Joomla 4.

I can see it in Joomla 3 but not 4

image

@Hackwar
Copy link
Member

Hackwar commented Feb 1, 2021

It is the "Default Language" option.

@brianteeman
Copy link
Contributor

Oh that is so obvious - not

How on earth is anyone to know then that by selecting a default language there is (as you claim) a performance hit.

That list is showing the languages on the site. Nothing to do with if those languages have a stemmer available

@brianteeman
Copy link
Contributor

And as the j3 field no longer exists you're breaking functionality on upgrade without any documentation

@rdeutz
Copy link
Contributor

rdeutz commented Mar 23, 2021

Can we fix this with better text in the default language select?

Maybe

None -> None (Stemmer disabled)
Default Site Languge (Stemmer enabled)

@brianteeman
Copy link
Contributor

brianteeman commented Mar 23, 2021

From what @Hackwar has said then no because the default site language might not have a stemmer available.

@brianteeman
Copy link
Contributor

Plus there is nothing in the option to indicate that it is anything to do with a stemmer at all

image

@Hackwar
Copy link
Member

Hackwar commented Mar 23, 2021

Because it's not only the stemmer. This setting selects which language to use for indexing the content. That selection both influences how the text is tokenised as well as the stemmer to use. For example chinese texts don't have spaces between words, so we can't simply do a $words = explode(' ', $text); to split up the text into words. It also influences which list of stopwords to use and maybe in the future this could also select a synonym/thesaurus system. If it were just about the stemmer, it would be called "Select the stemmer".

@brianteeman
Copy link
Contributor

WOW - it's no wonder we get nowhere when you keep changing your mind on what information you will share about your code.

This is the first time in this entire thread that you have said it was about something more than a stemmer. Every comment you have made here has only referred to this as being a stemmer. @bembelimen obviously thought it was just about a stemmer as he wrote in the PR. You didnt correct him then. I thought it was about a stemmer and you didnt correct me then.

Only now three months later you wake up and start telling everyone its more than a stemmer.

As I said before and now even more re-enforced by your latest revelation this is all an undocumented backwards compatibility break that is being hidden by you from everyone.

@Hackwar
Copy link
Member

Hackwar commented Mar 23, 2021

Since this feature was broken in J3, there is no b/c to be broken. Besides the fact that our b/c claims don't actually extend to components and a major version actually does allow us to break b/c.

I've described the functionality in the original PR here: #20391

I did not go into an elaborate rant about how the complete indexing system of Smart Search works, because I tend to not write 6-page-long essays to shift blame or try to make others feel stupid. I was asked to give feedback to this PR and concentrated on the context of this PR, which was the code to generate a stemmer object.

All the relevant code in the Joomla\Component\Finder\Administrator\Indexer\Language class has been there since Joomla 2.5, except it was spread out over several classes and actually did not work, especially for languages other than english. It didn't bother you until now.

@brianteeman
Copy link
Contributor

Breaking b/c is fine it just has to be documented

@rdeutz
Copy link
Contributor

rdeutz commented Apr 6, 2021

So the missing point is that we need to document this somewhere, can @Hackwar or @bembelimen do that please, Thanks.

@brianteeman
Copy link
Contributor

I dont understand why you have removed the release blocker. Its not resolved

@chmst chmst changed the base branch from 4.0-dev to 4.1-dev January 31, 2022 15:28
@chmst chmst removed the PR-4.0-dev label Jan 31, 2022
@HLeithner HLeithner changed the base branch from 4.1-dev to 4.2-dev June 27, 2022 13:08
@HLeithner
Copy link
Member

This pull request has automatically rebased to 4.2-dev.

@joomla-bot
Copy link
Contributor

This pull requests has been automatically converted to the PSR-12 coding standard.

@rdeutz
Copy link
Contributor

rdeutz commented Oct 21, 2022

Closing because this change is not a solution for the initial problem. Reopened issue #30372

@rdeutz rdeutz closed this Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants