SNO-172-upgrade-to-elasticsearch7 #296
base: dev
916faeb to 2b86c6e
33d8cba to bdc6355
This looks close, just a couple of questions.
It also looks like there is a real test failure: cherry^ is no longer returning results (404). I also noticed that, comparing demo and prod, the same query returns fewer cherry^ results (looks like Files specifically).
return {
    'type': 'object',
    'include_in_all': False,
    'properties': properties,
If this always had properties, why the if/else now?
If no mapping directive exists in the schema, the object should be stored without its fields being analyzed. The else clause prevents unnecessary dynamic mapping and update_mapping events in Elasticsearch.
Right, the previous assumption was that if a field was important, it would be in the schema. This can be fixed; the implication is that any other inserted object that doesn't have a mapping directive in the schema will have all of its field values analyzed as keyword.
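For readers following the if/else discussion above, here is a minimal sketch of what such a branch could look like. The helper name `object_mapping` is hypothetical, and the use of `enabled: False` in the else branch is an assumption (it is one Elasticsearch mapping option for keeping an object in `_source` without indexing its fields); the PR's actual else clause is not shown in this thread.

```python
def object_mapping(properties):
    """Return a mapping for an object field (hypothetical helper, not the PR's code)."""
    if properties:
        # The schema declares sub-fields, so map them explicitly.
        return {"type": "object", "properties": properties}
    # Assumption: with no declared sub-fields, keep the object in _source but
    # do not index it, so unseen keys never trigger a dynamic mapping update.
    return {"type": "object", "enabled": False}


declared = object_mapping({"term_name": {"type": "keyword"}})
undeclared = object_mapping({})
```

With this shape, an object that has no mapping directive in the schema produces a fixed mapping up front instead of being mapped dynamically on first insert.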
@@ -133,10 +134,6 @@ def schema_mapping(name, schema):
        'type': field_type
    }

    # these fields are unintentially partially matching some small search
    # keywords because fields are analyzed by nGram analyzer
    if name in NON_SUBSTRING_FIELDS:
No equivalent replacement of copy_to: _all here?
Here is the newer commit: afbe476#diff-13be881525721a1251be911ff12f4c7a722ced10eb689a96d8c5821e8d9cd0e8R140
> This looks close, just a couple of questions.
> Also looks like real test failure where cherry^ is no longer returning results (404). I noticed the same query between demo and prod also returning less cherry^ results (looks like Files specifically)..

We are back to the same assertions in the test, and cherry^ is returning 1 result as before in the tests. With the latest change, the File count is equal now.
f1e5cae to 8cb4e1a
if name in NON_SUBSTRING_FIELDS:
    sub_mapping['include_in_all'] = False
if name not in NON_SUBSTRING_FIELDS:
    if depth == 1 or (depth == 2 and parent == 'array'):
I think this is the logic I'm most uncomfortable with, though the results certainly look closer than before. What's the implication of not tracking depth, as it didn't seem like it was tracked before? Is it possible to always copy to _all if name not in NON_SUBSTRING_FIELDS?
Older demos showed that always setting "copy_to": "_all" here is equivalent to boosting every single field in the nested objects (minus the NON_SUBSTRING_FIELDS). Depth tracking stops "copy_to": "_all" after level 1. Before, if a file or experiment award description had "brain" and "blood" in the text, the files or experiments showed up under either the "blood" or the "brain" search term. The non-specific experiments that showed up in the results of earlier demos were due to "copy_to" being applied to the fields of deeply nested objects. The older include_in_all was applied by default, and depth tracking is applied under the hypothesis that this default application of the setting was weighed more towards top fields rather than nested object fields. The result is that all top fields of the object now have copy_to: _all, and nested fields need to be boosted like so: "file.award.description": 1.0. We could also have achieved the same matching results without depth tracking by boosting all of the individual fields of a type, which is still an option but may take more effort to match results.
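To make the depth rule concrete, here is a minimal sketch of a recursive mapping builder that applies the `depth == 1 or (depth == 2 and parent == 'array')` condition from the diff. It is not the PR's actual `schema_mapping`: the `depth` and `parent` bookkeeping, the `FIELD_TYPES` table, the contents of `NON_SUBSTRING_FIELDS`, and the sample `file_schema` are all assumptions for illustration.

```python
# Assumed contents; the real NON_SUBSTRING_FIELDS list is not shown in this thread.
NON_SUBSTRING_FIELDS = {"uuid", "@id"}

# Assumed JSON-schema-to-Elasticsearch type table, for illustration only.
FIELD_TYPES = {"string": "text", "integer": "long", "number": "float", "boolean": "boolean"}


def schema_mapping(name, schema, depth=0, parent=None):
    """Build a mapping, copying only shallow fields to _all (illustrative sketch)."""
    schema_type = schema.get("type", "string")
    if schema_type == "array":
        # Array items sit one level deeper and remember their array parent.
        return schema_mapping(name, schema["items"], depth + 1, "array")
    if schema_type == "object":
        properties = {
            key: schema_mapping(key, sub, depth + 1, "object")
            for key, sub in schema.get("properties", {}).items()
        }
        return {"type": "object", "properties": properties}
    sub_mapping = {"type": FIELD_TYPES.get(schema_type, "keyword")}
    if name not in NON_SUBSTRING_FIELDS:
        # Top-level scalars (depth 1) and top-level arrays of scalars (depth 2).
        if depth == 1 or (depth == 2 and parent == "array"):
            sub_mapping["copy_to"] = "_all"
    return sub_mapping


file_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "uuid": {"type": "string"},
        "aliases": {"type": "array", "items": {"type": "string"}},
        "award": {"type": "object", "properties": {"description": {"type": "string"}}},
    },
}
mapping = schema_mapping("file", file_schema)
```

In this sketch `title` and `aliases` are copied to `_all`, while `award.description` is not, which is why a nested field like that would need an explicit boost entry to stay searchable.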
It seems like defining these in boost would make sense, given we already have top-level fields there, e.g.:
"boost_values": {
"accession": 20.0,
"@type": 1.0,
"alternate_accessions": 1.0,
"assay_term_name": 20.0,
"assay_term_id": 1.0,
"assay_title": 20.0,
"assay_slims": 5.0,
"dbxrefs": 1.0,
"aliases": 1.0,
"biosample_ontology.term_id": 1.0,
"biosample_ontology.term_name": 10.0,
> this default application of the setting was weighed more towards top fields rather than nested object fields.

I don't think this explains the case of award.description with blood or brain. Even if the weight was low on these they would still show up in results, though at the bottom. However, we're not seeing them at all on current prod.
My understanding is that if _all is enabled on old ES then include_in_all is True by default. However, you can set include_in_all: False for a certain field and it will apply to all subfields unless otherwise specified. I think looking at the new and old mappings is informative here. For file:
I'm a little concerned since the default has changed from "don't include unless specifically set" to "always include if in first level" (probably explains why we are seeing more hits on new demo?). The boost offers more specific control, while always including top-level fields doesn't.
In any case this might be close enough for our current purposes, and could be tweaked later. But I think it's worth looking into more.
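As a rough illustration of the "more specific control" that boost values offer, the sketch below turns a `boost_values` dict into the `field^boost` strings that an Elasticsearch `query_string` query accepts in its `fields` list. The helper names `boosted_fields` and `build_query` are hypothetical, and how the application actually consumes `boost_values` is not shown in this thread.

```python
def boosted_fields(boost_values):
    """Format each field as 'name^boost' for a query_string 'fields' list."""
    return [f"{field}^{boost}" for field, boost in boost_values.items()]


def build_query(term, boost_values):
    """Build a query_string search body (hypothetical helper)."""
    return {
        "query": {
            "query_string": {
                "query": term,
                "fields": boosted_fields(boost_values),
            }
        }
    }


# Two entries taken from the discussion: a top-level field and a nested one.
example_boosts = {"accession": 20.0, "file.award.description": 1.0}
query = build_query("brain", example_boosts)
```

Here only the fields listed in `example_boosts` are searched, so a nested field is matched only when someone has deliberately added it.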
You are right that boost offers more specific control. I find that it's not feasible to determine what to boost for the number of types we have multiplied by the number of top fields for each type, make a demo for each combination, and compare results with the old ES. ctcf search bringing up File type results was due to unintentional boost on the aliases field. Determining which other objects and fields were important to match the existing result count in the old ES may be challenging.
I think it's likely the difference is just due to not copying the links and unique_keys stuff to _all, which is by default include_in_all: True on old ES:
Pretty sure it was the alias in unique keys that was matching ctcf, not the alias in embedded properties. (Can confirm that alias in embedded properties is False for old mapping.)
It doesn't look like any of the embedded properties in old mapping are include_in_all: True unless they are defined in boost values for that schema. Could we go back to only copying embedded properties to _all if they are defined in boost, and just make the dynamic template links/unique keys have copy_to?

> ctcf search bringing up File type results was due to unintentional boost on the aliases field

Yeah, looks like the only fields where ctcf is mentioned in file are alias and submitted_file name, but it's probably fine to have these intentionally searchable.
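The proposal above (copy embedded properties to _all only when they appear in boost values, and give the links/unique_keys dynamic templates a copy_to) could be sketched roughly as follows. The helper names, the template names, the `path_match` patterns, and the `keyword` type are assumptions for illustration; the project's real dynamic_template definitions are not shown in this thread.

```python
def dynamic_templates(copy_paths=("links", "unique_keys")):
    """Build dynamic templates whose string fields are copied to _all (sketch)."""
    return [
        {
            f"template_{path}": {
                "path_match": f"{path}.*",
                "match_mapping_type": "string",
                "mapping": {"type": "keyword", "copy_to": "_all"},
            }
        }
        for path in copy_paths
    ]


def embedded_field_mapping(path, field_type, boost_values):
    """Copy an embedded field to _all only if its path is listed in boost_values."""
    field_mapping = {"type": field_type}
    if path in boost_values:
        field_mapping["copy_to"] = "_all"
    return field_mapping


templates = dynamic_templates()
boosted = embedded_field_mapping("assay_title", "text", {"assay_title": 20.0})
unboosted = embedded_field_mapping("award.description", "text", {"assay_title": 20.0})
```

Under these assumptions, what reaches `_all` is decided by the boost list and the two templates rather than by field depth.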
Here is another demo with the changes: https://elasticsearch7-without-depth-tracking.demo.encodedcc.org/
8cb4e1a to 157a13e
return {
    'type': 'object',
    'include_in_all': False,
    'properties': properties,
One other thing. I suppose we should take a look at the dynamic_template definitions one more time. Looking at the original mapping definition I guess these are included in _all by default. I don't think being able to search principals_allowed is that useful but maybe links and unique_keys would be.
It does seem like Possibly this also explains the file type search differences before you added the depth tracking?
Think this is fine. File for a long time didn't have any access to experiment biosample or assay information. Now that it does we should probably explicitly add these to boost values. (In other words it's not that we expect
00dc931 to 0c22a37
No description provided.