Aggregation updates #2774

StepanBrychta · 2024-11-25T14:18:25Z

What does this change?

This PR changes how we handle aggregations in the final indexes so that we can switch over to aggregating concepts based on IDs rather than labels.

The largest part of this change is moving away from stringified JSON fields for aggregations. This was necessary because there is a one-to-many relationship between concept IDs and concept labels: a single concept ID can appear with several labels. Because the stringified JSON value includes the label, each distinct label produces a different string, so aggregating on it would return the same concept ID in several different buckets, which is not what we want. See here for more information.
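To illustrate the bucket-splitting problem, here is a minimal Python sketch. It is not pipeline code: the concept ID, labels, and helper names are hypothetical, and the bucketing is simulated with a `Counter` rather than a real Elasticsearch aggregation.

```python
import json
from collections import Counter

# Hypothetical data: one concept ID that appears with two different labels.
concepts = [
    {"id": "abc123", "label": "Darwin, Charles", "type": "Person"},
    {"id": "abc123", "label": "Charles Darwin", "type": "Person"},
    {"id": "abc123", "label": "Darwin, Charles", "type": "Person"},
]


def bucket_by_stringified_json(values):
    """Simulate aggregating on a stringified JSON field (the old approach)."""
    return Counter(
        json.dumps(v, sort_keys=True, separators=(",", ":")) for v in values
    )


def bucket_by_id(values):
    """Simulate aggregating on the id subfield only (the new approach)."""
    return Counter(v["id"] for v in values)


stringified_buckets = bucket_by_stringified_json(concepts)
id_buckets = bucket_by_id(concepts)

# The same ID ends up in two stringified buckets, but in a single ID bucket.
print(len(stringified_buckets), len(id_buckets))
```

Since the label is part of the stringified value, every label variant creates its own bucket, whereas keying on the ID alone keeps the concept in one bucket.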

For consistency, all aggregatable fields are now indexed as nested fields with an id subfield and a label subfield. Fields that do not have both a label and an ID (e.g. dates) store the same value in both subfields. This gives us a consistent aggregation interface: the frontend no longer needs to distinguish between IdentifiedBucketData and UnidentifiedBucketData, since all buckets are now identified.
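The rule for fields without a separate ID can be sketched as follows. This is an illustrative Python helper, not the pipeline's implementation; the function name `to_aggregatable` and its signature are assumptions made for the example.

```python
from typing import Optional


def to_aggregatable(label: str, identifier: Optional[str] = None) -> dict:
    """Build an {id, label} value (hypothetical helper).

    When no identifier is available (e.g. for dates), the label is reused
    as the id so that every bucket is "identified".
    """
    return {
        "id": identifier if identifier is not None else label,
        "label": label,
    }


language = to_aggregatable("English", "eng")
date = to_aggregatable("1900")
```

With this shape, a consumer can always read both `id` and `label` from a bucket without checking which kind of field it came from.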

Note that this change does not remove support for the current label-based aggregations: we can continue to use label-based aggregations in the frontend and switch to ID-based aggregations later.
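As a rough sketch of how both styles could coexist, the same nested field can be aggregated on either subfield. The helper below builds an Elasticsearch request fragment using the standard `nested` and `terms` aggregations. The field path `aggregatableValues.languages` and the function name are hypothetical, and this is not necessarily the query the API issues.

```python
def nested_terms_aggregation(path: str, subfield: str, size: int = 20) -> dict:
    """Build a nested terms aggregation on the id or label subfield.

    `path` is the nested field (hypothetical in this example); `subfield`
    selects whether buckets are keyed by id or by label.
    """
    if subfield not in ("id", "label"):
        raise ValueError("subfield must be 'id' or 'label'")
    return {
        "nested": {"path": path},
        "aggs": {
            "buckets": {
                "terms": {"field": f"{path}.{subfield}", "size": size}
            }
        },
    }


# Hypothetical field path, used for illustration only.
by_label = nested_terms_aggregation("aggregatableValues.languages", "label")
by_id = nested_terms_aggregation("aggregatableValues.languages", "id")
```

Only the `field` inside the `terms` aggregation differs between the two requests, which is why a later switch from labels to IDs would not require another change to the index.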

For example, an aggregatable field which used to be stored like this:

"""{"id":"eng","label":"English","type":"Language"}"""

is now stored like this:

{
  "id": "eng",
  "label": "English"
}

Note that we no longer store the type field, because the frontend does not use it. If we ever need to surface this field to the frontend, we can index and retrieve it another way (e.g. top hits aggregations) without relying on stringified JSON.
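The relationship between the old and new representations can be shown with a short Python sketch. It is for illustration only: the pipeline builds these values when indexing, and the helper name `convert_stringified` is an assumption.

```python
import json


def convert_stringified(value: str) -> dict:
    """Turn an old stringified JSON value into the new {id, label} object.

    The type field is deliberately dropped, matching the new format.
    """
    parsed = json.loads(value)
    return {"id": parsed["id"], "label": parsed["label"]}


old_value = '{"id":"eng","label":"English","type":"Language"}'
new_value = convert_stringified(old_value)
print(new_value)
```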

Note

This PR on its own is not sufficient to switch to ID-based aggregations for concepts. For an acceptable user experience, we likely also need to address a separate issue with duplicate concept IDs.

How to test

Automated testing should suffice for now. We should extensively test this change after the planned reindex, but it makes more sense to do this testing locally from the frontend.

How can we measure success?

Success means ID-based aggregations are supported without negative side effects.

Have we considered potential risks?

StepanBrychta and others added 10 commits November 21, 2024 12:01

Refactor aggregations #5825

d9d13c3

Apply auto-formatting rules

1e34fcc

Refactor aggregations #5825

7c88ca0

Merge branch 'Aggregation-updates' of https://github.com/wellcomecollection/catalogue-pipeline into Aggregation-updates

958683f

Apply auto-formatting rules

9763012

AggregatableField refactoring #5825

f759000

Apply auto-formatting rules

1ffa071

Add comment explaining AggregatableField #5825

411dbdc

Update test documents #5825

341330e

Merge branch 'main' into Aggregation-updates

a4e0312

StepanBrychta force-pushed the Aggregation-updates branch from b7792da to a4e0312 Compare November 25, 2024 15:15

StepanBrychta mentioned this pull request Nov 25, 2024

Aggregation updates wellcomecollection/catalogue-api#832

Open

2 tasks

StepanBrychta force-pushed the Aggregation-updates branch from 89bcff8 to 3a0d62e Compare November 25, 2024 16:44

Update test images #5825

4ebb3e8

StepanBrychta force-pushed the Aggregation-updates branch from 3a0d62e to 4ebb3e8 Compare November 26, 2024 10:28

StepanBrychta added 2 commits November 26, 2024 16:05

Explicitly set aggregation field type to 'nested' #5825

5fab5a7

Merge branch 'main' into Aggregation-updates

2a29a74

StepanBrychta marked this pull request as ready for review December 3, 2024 10:52

StepanBrychta requested a review from a team as a code owner December 3, 2024 10:52

kenoir approved these changes Dec 3, 2024

StepanBrychta merged commit 6fb2856 into main Dec 4, 2024
5 checks passed

StepanBrychta deleted the Aggregation-updates branch December 4, 2024 09:53
