Aggregation updates #2774
Merged
…ection/catalogue-pipeline into Aggregation-updates
StepanBrychta force-pushed the Aggregation-updates branch from b7792da to a4e0312 (November 25, 2024 15:15)
StepanBrychta force-pushed the Aggregation-updates branch from 89bcff8 to 3a0d62e (November 25, 2024 16:44)
StepanBrychta force-pushed the Aggregation-updates branch from 3a0d62e to 4ebb3e8 (November 26, 2024 10:28)
kenoir approved these changes (Dec 3, 2024)
What does this change?
wellcomecollection/platform#5825
This PR changes how we handle aggregations in the final indexes so that we can switch over to aggregating concepts based on IDs rather than labels.
The largest aspect of this change involves moving away from using stringified JSON fields for aggregations. This change had to be made because there is a one-to-many relationship between concept IDs and concept labels. A stringified JSON value contains both the ID and the label, so continuing to aggregate on it would split a single concept ID across several aggregation buckets (one per label), which is not what we want. See here for more information.
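A minimal sketch of the bucketing problem, in Python. The concept IDs and labels below are made up for illustration and are not taken from the catalogue. The point is only that counting on the whole stringified JSON value yields one bucket per distinct string, whereas counting on the ID alone yields one bucket per concept.

```python
import json
from collections import Counter

# Hypothetical data: the same concept ID appears with two different labels.
docs = [
    {"id": "abc123", "label": "Tuberculosis", "type": "Concept"},
    {"id": "abc123", "label": "TB", "type": "Concept"},
    {"id": "abc123", "label": "Tuberculosis", "type": "Concept"},
    {"id": "xyz789", "label": "Cholera", "type": "Concept"},
]

# Old approach: aggregate on the stringified JSON value as a single key.
stringified_buckets = Counter(
    json.dumps(doc, separators=(",", ":")) for doc in docs
)

# New approach: aggregate on the ID subfield only.
id_buckets = Counter(doc["id"] for doc in docs)

# Bucket keys that the old approach produces for the single concept abc123.
stringified_for_abc = [
    key for key in stringified_buckets if json.loads(key)["id"] == "abc123"
]

print(len(stringified_for_abc))  # buckets for abc123 under the old approach
print(id_buckets["abc123"])      # document count in the single ID-based bucket
```

Under the old approach the concept `abc123` lands in two buckets, one per label. Under the ID-based approach it lands in one bucket covering all three documents.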
All aggregatable fields are now indexed as nested fields with an `id` subfield and a `label` subfield for consistency. Fields which do not have both a label and an ID (e.g. dates) store the same value in the `id` subfield and the `label` subfield. This allows us to provide a consistent aggregation interface: the frontend no longer needs to distinguish between `IdentifiedBucketData` and `UnidentifiedBucketData`, since all buckets are now identified.

Note that this change does not remove support for current label-based aggregations. We can continue to use label-based aggregations in the frontend and switch to ID-based aggregations later.
For example, an aggregatable field which used to be stored like this:
"""{"id":"eng","label":"English","type":"Language"}"""
is now stored like this:
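Going by the description above, the new form is an object carrying only `id` and `label`. A minimal sketch that derives it from the old stringified value follows; the helper name `from_stringified` is hypothetical, and the actual indexed document may differ in field naming or nesting.

```python
import json

# The old stringified value quoted above.
OLD_VALUE = """{"id":"eng","label":"English","type":"Language"}"""


def from_stringified(value: str) -> dict:
    """Parse an old stringified value and keep only the id and label subfields."""
    parsed = json.loads(value)
    return {"id": parsed["id"], "label": parsed["label"]}


new_value = from_stringified(OLD_VALUE)
print(json.dumps(new_value))  # {"id": "eng", "label": "English"}
```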
Note that we are no longer storing the `type` field because the frontend does not make use of this information. If we ever need to surface this field to the frontend, we can use a different method for indexing and retrieving it (e.g. top hits aggregations) without relying on stringified JSON.

Note: This PR on its own is not sufficient to switch to ID-based aggregations for concepts. For an acceptable user experience, we likely also need to address a separate issue with duplicate concept IDs.
How to test
Automated testing should suffice for now. We should extensively test this change after the planned reindex, but it makes more sense to do this testing locally from the frontend.
How can we measure success?
Added support for ID-based aggregations without negative side effects.
Have we considered potential risks?