
Issue with multivalued keyword_normalized field display #109

Open
jduss4 opened this issue Feb 6, 2020 · 3 comments

Comments

jduss4 (Contributor) commented Feb 6, 2020

While working on #96, where my goal was to ignore markup (<em>), some Unicode characters (Á, ø, etc.), and unimportant characters at the beginning of titles (", [), I was 99% of the way there when I ran into an interesting problem with fields that have multiple values pushed to them.

When using a top_hits aggregation and asking for the _source field back, single-valued fields gave me something along the lines of this (pseudocode):

"facets": {
  "author":{
    "aaa" : {
       "num" : 4,
       "source": "Áaa"
    },
    "ben benjamin" : {
      "num" : 10,
      "source": "[Ben] Benjamin"
    }
  }
}

HOWEVER, if the document selected as the "top hit" for a bucket had multiple values in that field, then this happened:

"facets": {
  "title":{
    "my antonia" : {
       "num" : 40,
       "source": [ "Death Comes for the Archbishop", "My Ántonia", "The Professor's House" ]
    }
  }
}
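
To make the problem concrete, here is a hypothetical document that would produce the bucket above: the top-hit document holds all three titles, so _source hands back the whole array and nothing indicates which element produced the normalized key "my antonia".

doc = {
  "title" => [
    "Death Comes for the Archbishop",
    "My Ántonia",
    "The Professor's House"
  ]
}
# The terms aggregation buckets each normalized value separately,
# but top_hits returns the winning document's full _source array.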

I looked into the idea of using a "scripted" field instead, to try to return only the SINGLE most relevant result, but I got kind of bogged down trying to figure it out. Also, the documentation for script fields says (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-script-fields):

It’s important to understand the difference between doc['my_field'].value and params['_source']['my_field']. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can’t return a json object from it) and makes sense only for non-analyzed or single term based fields. However, using doc is still the recommended way to access values from the document, if at all possible, because _source must be loaded and parsed every time it’s used. Using _source is very slow.
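
A script_fields attempt inside top_hits might look roughly like the sketch below (not what was actually tried here, and "title" is just an example field). It runs into the same wall anyway: the script can't see the terms bucket key, so it can't choose which element of a multivalued field to return.

# Sketch only: a script field inside top_hits that reads the raw value from
# _source instead of the normalized doc value. For a multivalued field this
# still returns the whole array.
top_matches_agg = {
  "top_hits" => {
    "script_fields" => {
      "display_title" => {
        "script" => {
          "lang" => "painless",
          "source" => "params['_source']['title']"
        }
      }
    },
    "size" => 1
  }
}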

For now, I am normalizing the "source" fields coming back (only when they are an array) and then attempting to match them against the already-normalized bucket key in order to figure out which one to display. It is not a good solution, and I would like to investigate this more in the future.
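
Roughly, that app-side matching step looks like the sketch below (made-up method names; the normalize method has to imitate the escapes char_filter plus asciifolding and lowercase from the schema in the next comment):

# Sketch: re-apply an imitation of the index-time normalizer to each raw
# _source value and keep the one whose normalized form equals the bucket key.
def normalize(value)
  value.gsub(%r{</?(em|u|strong)>}, "")   # markup stripped by the char_filter
       .gsub(/["'\[\]\-&:;,.$@~]/, "")    # punctuation from the mappings list
       .unicode_normalize(:nfkd)
       .gsub(/\p{Mn}/, "")                # crude stand-in for asciifolding
       .downcase
end

def display_value(bucket_key, source_value)
  return source_value unless source_value.is_a?(Array)
  source_value.find { |v| normalize(v) == bucket_key } || source_value.first
end

# display_value("my antonia", ["Death Comes for the Archbishop", "My Ántonia"])
#   => "My Ántonia"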

jduss4 (Contributor, Author) commented Feb 6, 2020

Some documents that might be important while solving this problem.

Schema Setup: Small Example

settings:
  analysis:
    char_filter:
      # strip markup tags and unwanted punctuation before the keyword value is stored
      escapes:
        type: mapping
        mappings:
          - "<em> => "
          - "</em> => "
          - "<u> => "
          - "</u> => "
          - "<strong> => "
          - "</strong> => "
          - "- => "
          - "& => "
          - ": => "
          - "; => "
          - ", => "
          - ". => "
          - "$ => "
          - "@ => "
          - "~ => "
          - "\" => "
          - "' => "
          - "[ => "
          - "] => "
    normalizer:
      # custom normalizer: run the escapes char_filter, then fold accents and lowercase
      keyword_normalized:
        type: custom
        char_filter:
          - escapes
        filter:
          - asciifolding
          - lowercase
mappings:
  properties:
    works:
      type: keyword
      normalizer: keyword_normalized
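
As a usage note (a sketch, not part of the original issue): with the elasticsearch-ruby gem, an index using this analysis block could be created along these lines (the index name works_index is made up, and the mappings list is abbreviated):

require "elasticsearch"

client = Elasticsearch::Client.new

client.indices.create(
  index: "works_index",
  body: {
    settings: {
      analysis: {
        char_filter: {
          # abbreviated: the full list of mappings appears in the YAML above
          escapes: { type: "mapping", mappings: ["<em> => ", "</em> => ", "\" => ", "[ => "] }
        },
        normalizer: {
          keyword_normalized: {
            type: "custom",
            char_filter: ["escapes"],
            filter: ["asciifolding", "lowercase"]
          }
        }
      }
    },
    mappings: {
      properties: {
        works: { type: "keyword", normalizer: "keyword_normalized" }
      }
    }
  }
)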

Crude format of the Elasticsearch request (the Ruby code that builds the aggregations):

      # f: facet field name; type/dir: bucket sort key and direction; size: number
      # of buckets -- all set earlier in the surrounding facet-building code
      # if nested, has extra syntax
      elsif f.include?(".")
        path = f.split(".").first
        aggs[f] = {
          "nested" => {
            "path" => path
          },
          "aggs" => {
            f => {
              "terms" => {
                "field" => f,
                "order" => { type => dir },
                "size" => size
              },
              "aggs" => {
                "top_matches" => {
                  "top_hits" => {
                    "_source" => {
                      "includes" => [ f ]
                    },
                    "size" => 1
                  }
                }
              }
            }
          }
        }
      else
        aggs[f] = {
          "terms" => {
            "field" => f,
            "order" => { type => dir },
            "size" => size
          },
          "aggs" => {
            "top_matches" => {
              "top_hits" => {
                "_source" => {
                  "includes" => [ f ]
                },
                "size" => 1
              }
            }
          }
        }
      end
    end

Ends up looking like the screenshot attached in the original issue.
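
To tie it together (a sketch, not taken from the issue or the screenshot; client, works_index, and the field name title are assumed): the built aggs hash is sent as a zero-hit search, and each response bucket then carries both the normalized key and the raw _source value(s).

# Sketch: send the aggregations and read back each bucket's raw value.
response = client.search(
  index: "works_index",
  body: { "size" => 0, "aggs" => aggs }
)

response["aggregations"]["title"]["buckets"].each do |bucket|
  key    = bucket["key"]    # normalized value, e.g. "my antonia"
  source = bucket["top_matches"]["hits"]["hits"].first["_source"]["title"]
  puts "#{key}: #{source.inspect}"   # source may be a string or an array
end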

techgique (Member) commented

Do you think your current solution will really be a big performance problem? It seems like pretty straightforward code that won't be operating over huge sets of data.

I assume the lack of enthusiasm is just that you haven't figured out a way to get what you want back directly from Elasticsearch without further massaging it in the Rails app. Am I missing anything? 🤔

jduss4 (Contributor, Author) commented Feb 17, 2020

Yes, that's essentially the source of my lack of enthusiasm. Also, I'm just not that excited that I have to imitate, in the app, the normalization logic we already apply when things are ingested into Elasticsearch, but I don't know a better way around it for this particular task. Sigh.

@wkdewey wkdewey added this to the v3 future features milestone May 12, 2022