Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] Make the tuple length of com_finder configurable #20384

Merged
merged 17 commits into from
Jul 18, 2018

Conversation

Hackwar
Copy link
Member

@Hackwar Hackwar commented May 13, 2018

[UPDATE] After the initial discussion, I changed this to make the tuple length of com_finder be configurable. This should make this less controversial.

Original text for the old PR

Finder is indexing not only every word, but also every combination of 2- and 3-word-tuples in the text to be indexed. This results in gigantic storage consumption by the index and a performance decrease when indexing of about 20%. At the same time, it only marginally improves the search results. When removing this feature, we are sacrificing the possibility to search for specific phrases of up to 3 terms (like "Joomla is great") for a reduction in necessary storage space of about 2/3. (and the performance increase in indexing)

How to test

Install 4.0-dev and go to the backend of Smart Search. Enable the Smart Search Plugin and start the indexing process. Notice that it takes a long time (preferably measure the time with a stopwatch). Go to the frontend and open the Smart Search menu item. Search for something and notice the results.
Now apply these changes and go into the backend of Smart Search. Clear the index and rebuild the index, again checking with a stopwatch. You should see a 20% performance increase here. Now go to the frontend and do the same search. You should get the same result.
Do a search for a specific phrase, wrapping the phrase in quotes. See that this does not work anymore.

Please also see #20185 for further improvements for com_finder.

I want to thank cloudaccess.net for supporting these changes. Without their generous help, this wouldn't be possible.

@chrisdavenport
Copy link
Contributor

I'd be opposed to this change as it is removing a feature that is required on some sites. A better solution would be to make the number of words per tuple a configurable parameter of the indexer (default to 1 for new installations).

@Hackwar
Copy link
Member Author

Hackwar commented May 13, 2018

Making this configurable would mean yet another option and quite a bit more code for something that 98% out there don't need. At the same time I think that smart search is not and can not be the ultimate site search engine, simply because we are writing a CMS and not a search engine. This component is a feature for all those sites out there that need a slightly better search than com_search, but if you need something really sophisticated, use a pro tool like Elastic Search.

So I'm advocating bit less complexity here. Right now this component is so complex that no one really worked out the issues in the last 5 years. Don't force this to stay abandonware.

@Hackwar
Copy link
Member Author

Hackwar commented May 21, 2018

Just to list all current PRs for Finder/Smart Search:

@laoneo
Copy link
Member

laoneo commented May 28, 2018

I'm here with @chrisdavenport. This is a feature I like. Perhaps we can think of a new option to enable/disable tuples. But I would not remove the functionality. I'm sure with some use of the showon property, we can make to options more lightweight.

@Hackwar Hackwar force-pushed the j4finder_termsphrases branch from 7a5ab32 to c26f898 Compare June 3, 2018 18:16
@joomla-cms-bot joomla-cms-bot added the Language Change This is for Translators label Jun 3, 2018
@Hackwar Hackwar changed the title [4.0] Removing terms tuples from com_finder [4.0] Make the tuple length of com_finder configurable Jun 3, 2018
@Hackwar
Copy link
Member Author

Hackwar commented Jun 3, 2018

I've now made this configurable.

@@ -168,33 +174,28 @@ public static function tokenize($input, $lang, $phrase = false)
$tokens[] = new FinderIndexerToken($terms[$i], $lang);
}

// Create two and three word phrase tokens from the individual words.
for ($i = 0, $n = count($tokens); $i < $n; $i++)
// Create multi-word phrases tokens from the individual words.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phrase not phrases

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -40,6 +40,7 @@ COM_FINDER_CONFIG_STEMMER_FR="French Only"
COM_FINDER_CONFIG_STEMMER_LABEL="Select Language Stemmer"
COM_FINDER_CONFIG_STEMMER_PORTER_EN="English Only"
COM_FINDER_CONFIG_STEMMER_SNOWBALL="Snowball"
COM_FINDER_CONFIG_TUPLECOUNT_LABEL="Length indexed tuples"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really hard for me to say i this label is correct or not as I dont understand it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I have no idea how to phrase this short and elegantly. This is the setting how Joomla should index the text. If you set this to 1, then each word is just indexed separately. If you set it to something > 1, it also indexes every tuple of length 2 to $tuplecount.

Practical example: Text is

Joomla is a good CMS for everything

Then Joomla would index the following tuples for a setting of 3:

  • Joomla
  • is
  • a
  • good
  • CMS
  • for
  • everything
  • Joomla is
  • is a
  • a good
  • good CMS
  • CMS for
  • for everything
  • Joomla is a
  • is a good
  • a good CMS
  • good CMS for
  • CMS for everything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply label it with
Tuple Length
or
Maximum Tuple Length

??

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@brianteeman brianteeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approve language strings

@@ -40,6 +40,7 @@ COM_FINDER_CONFIG_STEMMER_FR="French Only"
COM_FINDER_CONFIG_STEMMER_LABEL="Select Language Stemmer"
COM_FINDER_CONFIG_STEMMER_PORTER_EN="English Only"
COM_FINDER_CONFIG_STEMMER_SNOWBALL="Snowball"
COM_FINDER_CONFIG_TUPLECOUNT_LABEL="Tuple Length"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if "Words Per Phrase" might be a little clearer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's correct then it's better. Although to be honest all the terms here are pretty meaningless. Snowball? Stemmer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change the options to include an "advanced" switch that hides those options...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

@infograf768
Copy link
Member

FYI I looked all over for a wording in French for tuple easily understandable (specialists use the word tuple).
Found "Ligne d'enregistrement d'une table relationnelle. " is also a definition for specialists...
Therefore, without a tip to explain what it is about, "Words per phrase" looks like the less bad solution.

@Hackwar
Copy link
Member Author

Hackwar commented Jun 12, 2018

I changed the wording. I would put the change with the advanced option for the configuration in another PR, since there are a few changes in the other PRs for com_finder.

@carlitorweb
Copy link
Member

WITHOUT THE PR (Index perform)

without-pr

WITH THE PR (Index perform)

with-pr

WITHOUT THE PR (Search perform)

search-without-pr

WITH THE PR (Search perform)

search-with-pr

IMO Sacrifice the ability to search for a specific phrase, if with this we obtain a better performance, worth it. It may be that after we have all these PRs inside, we can find a way to return this ability, or something similar.

@carlitorweb
Copy link
Member

carlitorweb commented Jun 18, 2018

Also, I notice, while I try search, the focus of the input gone, and I needed to click on the input again to continue writing. Some JS wrong? need check the console. Anyway, this is not related to this PR

@carlitorweb
Copy link
Member

I have tested this item ✅ successfully on 0e20307


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20384.

@infograf768
Copy link
Member

With these changes a lot of the "Advanced" tips are not necessary in the frontend form when "tuplecount' is set to 1. String is COM_FINDER_ADVANCED_TIPS

I am speaking of
screen shot 2018-07-03 at 16 33 31

I suggest therefore to condition their display.

Other than that, this looks fine. Thank you.

@Hackwar
Copy link
Member Author

Hackwar commented Jul 4, 2018

I added separate translation strings for each search feature and an intro and outro string.

@infograf768
Copy link
Member

will test asap

@Hackwar
Copy link
Member Author

Hackwar commented Jul 7, 2018

@carlitorweb @infograf768 would you have a few minutes to test this again?

@Quy
Copy link
Contributor

Quy commented Jul 8, 2018

Installed the sample demo, searched "blog" and not "sample" and got the following error:

PHP Notice:  Undefined offset: 0 in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1180
PHP Notice:  Trying to get property 'phrase' of non-object in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1262
PHP Notice:  Trying to get property 'term' of non-object in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1271
PHP Notice:  Trying to get property 'stem' of non-object in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1271
PHP Warning:  Creating default object from empty value in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1280
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1301
PHP Notice:  Undefined property: stdClass::$phrase in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1302
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1321
PHP Notice:  Undefined offset: 0 in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 1191
PHP Notice:  Undefined property: stdClass::$term in C:\xampp\htdocs\joomla-cms-j4finder_termsphrases\administrator\components\com_finder\helpers\indexer\query.php on line 458

@infograf768
Copy link
Member

infograf768 commented Jul 8, 2018

@Quy
The format is not
"blog" and not "sample"
but
blog not sample

EDIT: One should be able to also obtain the right results with
"blog" not "sample"
can you test again?

@infograf768
Copy link
Member

I have fatal error here before and after patch
[08-Jul-2018 11:02:36 Europe/Berlin] PHP Fatal error: Cannot use Joomla\CMS\Factory as Factory because the name is already in use in /Applications/MAMP/htdocs/installmulti/joomla40/components/com_finder/View/Search/HtmlView.php on line 20

Looking at the file I see that
use Joomla\CMS\Factory;
is used twice
IMHO this needs a specific urgent PR or direct change by maintainers of 4.0 ( @Webdongle @laoneo @mbabker )

@infograf768
Copy link
Member

After correcting this locally, all works fine.

@infograf768
Copy link
Member

PR for the Fatal Error : #21016

@Hackwar
Copy link
Member Author

Hackwar commented Jul 8, 2018

@Quy This is an interesting error. I just tested this and this is also present in 3.x. Can we merge/test this one without fixing that bug here and now? I will provide another PR to fix this soon. Just want this one merged in...

@Hackwar
Copy link
Member Author

Hackwar commented Jul 8, 2018

@infograf768 thanks for your PR. 😄

@infograf768
Copy link
Member

How may using a wrong format be a code error?

@Hackwar
Copy link
Member Author

Hackwar commented Jul 8, 2018

using a "wrong" format should indeed not be a big issue. In worst case you get no result at all. But it should NEVER throw an error like it does right now. @Quy might not get the result that he is looking for, but he definitely should not get the warnings that he got right now. That is an issue that we also have to fix in 3.x and you could consider that a (low-level) security issue. It is an information disclosure.

@carlitorweb
Copy link
Member

@wilsonge

Can we merge/test this one without fixing that bug here and now? I will provide another PR to fix this soon. Just want this one merged in...

@infograf768
Copy link
Member

I have tested this item ✅ successfully on 5f21dc9


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20384.

@ghost
Copy link

ghost commented Jul 10, 2018

@carlitorweb can you please retest?

@carlitorweb
Copy link
Member

I have tested this item ✅ successfully on 5f21dc9


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20384.

@ghost
Copy link

ghost commented Jul 10, 2018

Ready to Commit after two successful tests.

@joomla-cms-bot joomla-cms-bot added the RTC This Pull Request is Ready To Commit label Jul 10, 2018
@wilsonge wilsonge merged commit 1c59a62 into joomla:4.0-dev Jul 18, 2018
@joomla-cms-bot joomla-cms-bot removed the RTC This Pull Request is Ready To Commit label Jul 18, 2018
@wilsonge wilsonge added this to the Joomla 4.0 milestone Jul 18, 2018
@Hackwar
Copy link
Member Author

Hackwar commented Jul 18, 2018

Thank you!

@wilsonge
Copy link
Contributor

Thankyou!

@Hackwar Hackwar deleted the j4finder_termsphrases branch October 23, 2020 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Language Change This is for Translators
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants