LLM Indexing Strategy: Generic Data structure & Opensearch Index Schema Design #3536

Open
t83714 opened this issue May 27, 2024 · 0 comments
t83714 commented May 27, 2024

Description

This ticket covers the design of the generic data structure & the OpenSearch index schema used for LLM indexing.

Technical Requirements

  • The data structure should be generic enough to cover:
    • our evolving overall indexing strategy described in epic: LLM Powered Search Engine #3503
    • all kinds of logical / physical data items
  • The data structure should store sufficient metadata to recover the original context & retrieve the original item

Proposed Data structure & Indexing Structure

We will use a single index to store all LLM indexing information.

We will define the following fields:

  • itemType:
    • type: keyword
    • possible values:
      • registryRecord: a registry record. It could be a dataset, distribution, organisation etc. We will likely only care about datasets & distributions initially. When itemType = 'registryRecord', the recordId & aspectId fields must be present.
      • storageObject: indicates that the index target of this index item is a storage object (file).
    • In future, we could add more itemType values to support more use cases, e.g. api: we could index an API's purpose plus its OpenAPI schema so that it's available as a tool for the LLM to choose from.
  • recordId: optional; the id of the registry record that we index. Only available when itemType = registryRecord
    • type: keyword
  • aspectId: optional; the aspect id of the text field that we index on. Only available when itemType = registryRecord
    • type: keyword
  • fieldName: optional; the field name of the field that we index on. Only available when itemType = registryRecord
    • type: keyword
  • fileFormat: optional; Only available when itemType = storageObject
    • type: keyword
  • subObjectId: optional; only available when itemType = storageObject, we need to index a non-text item, and the in-context id of this sub-item is available.
    • e.g. some papers might identify the first diagram as fig.1
    • Could also be other referenceable non-text content, e.g. a data table.
  • subObjectType: optional; only available when itemType = storageObject and we need to index a referenceable non-text item.
    • possible values:
      • diagram
      • chart
      • table
  • index_text_chunk:
    • type: keyword
    • Please note: it's up to the indexing strategy defined in LLM Powered Search Engine #3503 and relevant indexing strategy tickets to define how to construct the index_text_chunk.
      • e.g. for a dataset's description field, it would simply be a text chunk of the original text content
      • e.g. for indexing a diagram in a PDF paper, you might want to include:
        • the short description of the diagram, often found underneath the diagram
        • the text chunks where the diagram is referenced in the paper
  • embedding: stores the embedding of the indexed text chunk (i.e. the content of the index_text_chunk field).
    • type: knn_vector
    • dimension: 256 or 512 (to be decided)
  • only_one_index_text_chunk: indicates whether the item is indexed by a single text chunk (true) or by more than one text chunk (false).
    • type: boolean
  • index_text_chunk_length:
    • type: integer
  • index_text_chunk_position: the start position of the text chunk within the original full-text content
    • type: integer
  • index_text_chunk_padding: the number of characters that should be cut off at the joining point of each chunk when joining more than one chunk together
    • type: integer
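
As a rough illustration, the field list above could translate into an OpenSearch create-index body along the following lines. This is only a sketch under assumptions, not a settled schema: the function name `build_llm_index_body` is hypothetical, the embedding dimension is left as a parameter because it is still to be decided, and `subObjectId` / `subObjectType` are assumed to be `keyword` fields since the proposal does not state their types.

```python
from typing import Any, Dict


def build_llm_index_body(embedding_dimension: int = 512) -> Dict[str, Any]:
    """Build a create-index body for the proposed single LLM index.

    Hypothetical helper: the dimension default (512) is a placeholder,
    as the proposal has not yet chosen between 256 and 512.
    """
    return {
        # k-NN must be enabled on the index for knn_vector fields to be searchable.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "itemType": {"type": "keyword"},
                # Only present when itemType = registryRecord
                "recordId": {"type": "keyword"},
                "aspectId": {"type": "keyword"},
                "fieldName": {"type": "keyword"},
                # Only present when itemType = storageObject
                "fileFormat": {"type": "keyword"},
                # Assumption: the proposal gives no type for these two fields.
                "subObjectId": {"type": "keyword"},
                "subObjectType": {"type": "keyword"},
                "index_text_chunk": {"type": "keyword"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": embedding_dimension,
                },
                "only_one_index_text_chunk": {"type": "boolean"},
                "index_text_chunk_length": {"type": "integer"},
                "index_text_chunk_position": {"type": "integer"},
                "index_text_chunk_padding": {"type": "integer"},
            }
        },
    }
```

The returned dictionary could then be passed as the request body when creating the index (for example via a client's create-index call), with the index name chosen by the implementation.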
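
To make the chunk bookkeeping fields more concrete, here is a minimal sketch of how index_text_chunk_position, index_text_chunk_length, index_text_chunk_padding and only_one_index_text_chunk might work together. It assumes one possible interpretation that the ticket does not confirm: consecutive chunks overlap by `padding` characters, and when joining, the first `padding` characters of every chunk except the first are cut off. The names `Chunk`, `split_text` and `join_chunks` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    index_text_chunk: str
    index_text_chunk_position: int  # start offset within the original text
    index_text_chunk_length: int    # number of characters in this chunk
    index_text_chunk_padding: int   # overlap to cut off when joining
    only_one_index_text_chunk: bool


def split_text(text: str, chunk_size: int, padding: int) -> List[Chunk]:
    """Split text into chunks that overlap by `padding` characters (assumed scheme)."""
    if not 0 <= padding < chunk_size:
        raise ValueError("padding must satisfy 0 <= padding < chunk_size")
    step = chunk_size - padding
    # Stop before a chunk that would contain nothing beyond the overlap.
    starts = list(range(0, max(len(text) - padding, 1), step))
    single = len(starts) == 1
    chunks = []
    for start in starts:
        piece = text[start:start + chunk_size]
        chunks.append(Chunk(piece, start, len(piece), padding, single))
    return chunks


def join_chunks(chunks: List[Chunk]) -> str:
    """Rebuild the original text by dropping the overlap at each joining point."""
    ordered = sorted(chunks, key=lambda c: c.index_text_chunk_position)
    parts = []
    for i, chunk in enumerate(ordered):
        # The first chunk is kept whole; later chunks lose their leading overlap.
        cut = 0 if i == 0 else chunk.index_text_chunk_padding
        parts.append(chunk.index_text_chunk[cut:])
    return "".join(parts)
```

Under this interpretation, `join_chunks(split_text(text, chunk_size, padding))` returns the original text, so a search hit on any chunk can be expanded back to its surrounding context.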
@t83714 t83714 added the refined-unreviewed Issues that have been refined by one person but not been reviewed by the rest of the team label May 27, 2024
@t83714 t83714 self-assigned this May 27, 2024
@t83714 t83714 added this to the v5.0.0 milestone Jun 28, 2024