LLM Indexing Strategy: Generic Data structure & Opensearch Index Schema Design #3536

t83714 · 2024-05-27T00:09:47Z

Description

This ticket is about the data structure & OpenSearch index schema design.

Technical Requirements

The data structure should be generic enough to:
- cover our evolving overall indexing strategy described in the epic: LLM Powered Search Engine #3503
- cover all kinds of logical / physical data items
- store sufficient metadata to recover the original context & retrieve the original item

Proposed Data structure & Indexing Structure

We will only use one index for storing all LLM indexing information.

We will define the following fields:

itemType:
- type: keyword
- possible values:
  - registryRecord: a registry record. It could be a dataset, distribution, organisation etc. We will likely only care about datasets & distributions initially. When itemType = registryRecord, the recordId & aspectId fields must be present.
  - storageObject: indicates that the index target of this index item is a storage object (file).
  - In future, we could add more itemType values to support more use cases, e.g. api: we could index an API's purpose plus its OpenAPI schema so that it's available as a tool for the LLM to choose from.
recordId: optional; the registry record id of the record that we index for. Only available when itemType = registryRecord
- type: keyword
aspectId: optional; the aspect id of the text field that we index on. Only available when itemType = registryRecord
- type: keyword
fieldName: optional; the field name of the field that we index on. Only available when itemType = registryRecord
- type: keyword
fileFormat: optional; Only available when itemType = storageObject
- type: keyword
subObjectId: optional; Only available when itemType = storageObject, when we need to index some non-text item, and when the in-context id of this sub-item is available.
- e.g. some papers might identify the first diagram as fig.1
- Could also be other referenceable non-text content, e.g. a data table.
subObjectType: optional; Only available when itemType = storageObject and when we need to index some referenceable non-text item.
- possible values:
  - diagram
  - chart
  - table
index_text_chunk:
- type: keyword
- Please note: it's up to the indexing strategy defined in LLM Powered Search Engine #3503 and relevant indexing strategy tickets to define how to construct the index_text_chunk.
  - e.g. for a dataset's description field, it would simply be a text chunk of the original text content
  - e.g. for indexing a diagram in a PDF paper, you might want to include:
    - the short description of the diagram, often found underneath the diagram
    - text chunks where the diagram is referenced in the paper
embedding: stores the embedding of the indexed text chunk (i.e. the content of the index_text_chunk field).
- type: knn_vector
- dimension: 256 or 512, to be decided
only_one_index_text_chunk: indicates whether the item is indexed by a single text chunk (true) or by more than one text chunk (false).
- type: boolean
index_text_chunk_length:
- type: integer
index_text_chunk_position: the start position of the text chunk within the original full-text content
- type: integer
index_text_chunk_padding: the number of characters that should be cut off at the joining point of each chunk when joining more than one chunk together
- type: integer
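
To make the chunk-related fields more concrete, here is a minimal Python sketch of how index_text_chunk_position, index_text_chunk_length and index_text_chunk_padding might be used to split a text into overlapping chunks and join them back together. It rests on assumptions that this ticket does not confirm: chunks overlap by a fixed number of characters, the padding is cut from the start of every chunk except the first (whose padding is 0), and lengths are counted in characters. The `Chunk` class and the `split_into_chunks` / `join_chunks` helpers are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Field names mirror the proposed index fields.
    index_text_chunk: str
    index_text_chunk_position: int   # start offset within the original text
    index_text_chunk_length: int     # assumption: counted in characters
    index_text_chunk_padding: int    # assumption: leading chars overlapping the previous chunk
    only_one_index_text_chunk: bool = False


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> List[Chunk]:
    """Split text into chunks of up to chunk_size chars that overlap by `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks: List[Chunk] = []
    step = chunk_size - overlap
    start = 0
    while True:
        piece = text[start:start + chunk_size]
        # The first chunk has nothing before it, so its padding is 0.
        padding = overlap if start > 0 else 0
        chunks.append(Chunk(piece, start, len(piece), padding))
        if start + chunk_size >= len(text):
            break
        start += step
    single = len(chunks) == 1
    for chunk in chunks:
        chunk.only_one_index_text_chunk = single
    return chunks


def join_chunks(chunks: List[Chunk]) -> str:
    """Rebuild the original text by cutting each chunk's padding at the joining point."""
    ordered = sorted(chunks, key=lambda c: c.index_text_chunk_position)
    return "".join(c.index_text_chunk[c.index_text_chunk_padding:] for c in ordered)


original = "abcdefghij"
pieces = split_into_chunks(original, chunk_size=4, overlap=1)
restored = join_chunks(pieces)
```

Under these assumptions, `restored` equals `original`. If the padding is instead meant to be trimmed from both ends of a chunk, only `join_chunks` would need to change.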
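
The field list above could translate into an OpenSearch index-creation body roughly as follows. This is a sketch under stated assumptions, not a final schema: the embedding dimension is still undecided (512 is used here only as a placeholder default), the ticket does not give a type for subObjectId and subObjectType (keyword is assumed), and `build_index_body` is a hypothetical helper name. The `index.knn` setting is what enables k-NN search on a knn_vector field in OpenSearch.

```python
import json

# Assumption: the ticket leaves the dimension open (256 or 512).
EMBEDDING_DIMENSION = 512


def build_index_body(embedding_dimension: int = EMBEDDING_DIMENSION) -> dict:
    """Build the request body for creating the single LLM indexing index."""

    def keyword() -> dict:
        return {"type": "keyword"}

    def integer() -> dict:
        return {"type": "integer"}

    return {
        # Required so that the knn_vector field can be searched with k-NN queries.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "itemType": keyword(),
                "recordId": keyword(),
                "aspectId": keyword(),
                "fieldName": keyword(),
                "fileFormat": keyword(),
                # Assumption: no type is stated in the ticket for these two fields.
                "subObjectId": keyword(),
                "subObjectType": keyword(),
                # Typed as keyword, as proposed in the ticket.
                "index_text_chunk": keyword(),
                "embedding": {
                    "type": "knn_vector",
                    "dimension": embedding_dimension,
                },
                "only_one_index_text_chunk": {"type": "boolean"},
                "index_text_chunk_length": integer(),
                "index_text_chunk_position": integer(),
                "index_text_chunk_padding": integer(),
            }
        },
    }


# Print the body so it can be inspected or sent with an index-creation request.
print(json.dumps(build_index_body(), indent=2))
```

Keeping the dimension as a parameter means the 256 vs 512 decision only touches one value, although an existing knn_vector field cannot change its dimension without reindexing.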

t83714 added the refined-unreviewed label (issues that have been refined by one person but not been reviewed by the rest of the team) May 27, 2024

t83714 self-assigned this May 27, 2024

t83714 mentioned this issue May 27, 2024

LLM Indexing Strategy: registry record metadata #3537

Open

t83714 added this to the v5.0.0 milestone Jun 28, 2024

t83714 mentioned this issue Jul 7, 2024

Hybrid Search #3549

Open
