While working on #96, in which my goal was to ignore markup (`<em>`), some Unicode chars (`Á`, `ø`, etc.), and unimportant characters at the beginning of titles (`"`, `[`), I was 99% of the way there when I ran into an interesting problem with fields that had multiple values pushed to them.
When using a `top_hits` aggregation and asking for the `_source` field back, on single-valued fields I got something along the lines of this (pseudo code):
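Something like this, as the Rails app sees the response hash (the field name and values are made up for illustration):

```ruby
# Pseudo/hypothetical shape of a top_hits bucket's hit when the field is
# single-valued — _source hands back one string, and all is well:
{
  "_source" => {
    "title" => "The Road"
  }
}
```

HOWEVER, if there were multiple values from specific documents which were determined to be the "top hit", then this happened:

```ruby
# With a multi-valued field, _source hands back the whole array, with no
# indication of which value actually produced the match:
{
  "_source" => {
    "title" => ["The Road", "Á Road", "[The] Road"]
  }
}
```

I looked into the idea of using a "scripted" field instead to try to return only the SINGLE most relevant result, but I kind of bogged down there trying to figure it out. Also, the documentation for script fields says (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-script-fields):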
> It’s important to understand the difference between `doc['my_field'].value` and `params['_source']['my_field']`. The first, using the `doc` keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the `doc[...]` notation only allows for simple valued fields (you can’t return a json object from it) and makes sense only for non-analyzed or single term based fields. However, using `doc` is still the recommended way to access values from the document, if at all possible, because `_source` must be loaded and parsed every time it’s used. Using `_source` is very slow.
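For what it's worth, the kind of `script_fields` attempt I was fiddling with looked roughly like this — a sketch only (hypothetical field name), and it just grabs the *first* value, which isn't necessarily the most relevant one; that's part of where I got stuck:

```ruby
# Hedged sketch, not working app code: ask for a script field that collapses
# a multi-valued _source field down to a single value.
search_body = {
  query: { match: { title: "road" } },
  script_fields: {
    one_title: {
      script: {
        lang: "painless",
        # _source access is slow (per the docs quoted above), but doc[...]
        # only works for simple, non-analyzed fields:
        source: "def t = params['_source']['title']; return t instanceof List ? t[0] : t;"
      }
    }
  }
}
```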
For now, I am normalizing the `_source` fields coming back only if they are an array, and then attempting to match them against the already-normalized version in order to figure out which one to display. It is not a good solution, and I would like to investigate this more in the future.
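Concretely, the workaround looks something like this — a hedged sketch, not the actual app code, and the normalizer only imitates part of the ingest-time logic:

```ruby
# Mirror (roughly) the normalization applied at ingest time.
def normalize(value)
  value.gsub(%r{</?em>}, "")  # ignore markup like <em>
       .sub(/\A["\[]+/, "")   # ignore unimportant leading chars (", [)
  # (Unicode folding for chars like Á and ø is elided here)
end

# Only bother when _source gave us an array: match each raw value against
# the already-normalized top hit to figure out which one to display.
def value_to_display(source_field, normalized_hit)
  return source_field unless source_field.is_a?(Array)
  source_field.find { |v| normalize(v) == normalized_hit } || source_field.first
end
```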
Do you think your current solution will really be a big performance problem? It seems like pretty straightforward code that won't be operating over huge sets of data.
I assume the lack of enthusiasm is just that you haven't figured out a way to get what you want back directly from Elasticsearch without further massaging it in the Rails app. Am I missing anything? 🤔
Yes, that's essentially the source of my lack of enthusiasm. Also, I'm just not that excited that I have to imitate the normalization logic that we were already doing when things are ingested into elasticsearch, but I don't know a better way around it for this particular task. Sigh.