Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation updates #2774

Merged
merged 13 commits into from
Dec 4, 2024
Merged

Aggregation updates #2774

merged 13 commits into from
Dec 4, 2024

Conversation

StepanBrychta
Copy link
Contributor

@StepanBrychta StepanBrychta commented Nov 25, 2024

What does this change?

wellcomecollection/platform#5825

This PR changes how we handle aggregations in the final indexes so that we can switch over to aggregating concepts based on IDs rather than labels.

The largest aspect of this change involves moving away from using stringified JSON fields for aggregations. This change had to be made because there is a one-to-many relationship between concept IDs and concept labels. Therefore, continuing to use stringified JSON fields for aggregations would mean returning some concept IDs in different aggregation buckets, which is not what we want. See here for more information.

All aggregatable fields are now indexed as nested fields with an id subfield and a label subfield for consistency. Fields which do not have both a label and an ID (e.g. dates) store the same value in the id subfield and the label subfield. This allows us to provide a consistent aggregation interface — the frontend no longer needs to distinguish between IdentifiedBucketData and UnidentifiedBucketData since all buckets are now identified.

Note that this change does not remove support for current label-based aggregations — we can continue to use label-based aggregations in the frontend and switch to ID-based aggregations later.


For example, an aggregatable field which used to be stored like this:

"""{"id":"eng","label":"English","type":"Language"}"""

is now stored like this:

{
  "id": "eng",
  "label": "English"
}

Note that we are no longer storing the type field because the frontend does not make use of this information. If we ever need to surface this field to the frontend, we can use a different method for indexing and retrieving it (e.g. top hits aggregations) without relying on stringified JSONs.

Note

This PR on its own is not sufficient to switch to ID-based aggregations for concepts. For an acceptable user experience, we likely also need to address a separate issue with duplicate concept IDs.

How to test

Automated testing should suffice for now. We should extensively test this change after the planned reindex, but it makes more sense to do this testing locally from the frontend.

How can we measure success?

Added support for ID-based aggregations without negative side effects.

Have we considered potential risks?

@StepanBrychta StepanBrychta marked this pull request as ready for review December 3, 2024 10:52
@StepanBrychta StepanBrychta requested a review from a team as a code owner December 3, 2024 10:52
@StepanBrychta StepanBrychta merged commit 6fb2856 into main Dec 4, 2024
5 checks passed
@StepanBrychta StepanBrychta deleted the Aggregation-updates branch December 4, 2024 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants