LLM Indexing Strategy: Generic Data Structure & OpenSearch Index Schema Design #3536
Labels
refined-unreviewed
Description
This ticket is about the data structure & OpenSearch index schema design.
Technical Requirements
Proposed Data structure & Indexing Structure
We will only use one index for storing all LLM indexing information.
We will define the following fields:
- `itemType`: `keyword`
  - `registryRecord`: a registry record. It could be a `dataset`, `distribution`, `organisation`, etc. We likely only care about `dataset` & `distribution` initially. When `itemType` = `registryRecord`, the `recordId` & `aspectId` fields must be present.
  - `storageObject`: indicates that the index target of this index item is a storage object (file).
  - More `itemType` values may be added to support more use cases, e.g. `api`: we could index an API's purpose plus its OpenAPI schema so that it is available as a tool for the LLM to choose from.
- `recordId`: `keyword`, optional; the registry record ID of the record that we index for. Only available when `itemType` = `registryRecord`.
- `aspectId`: `keyword`, optional; the aspect ID of the text field that we index on. Only available when `itemType` = `registryRecord`.
- `fieldName`: `keyword`, optional; the field name of the field that we index on. Only available when `itemType` = `registryRecord`.
- `fileFormat`: `keyword`, optional. Only available when `itemType` = `storageObject`.
- `subObjectId`: optional. Only available when `itemType` = `storageObject`, when we need to index some non-text item, and when the in-context ID of this sub-item is available, e.g. `fig.1`.
- `subObjectType`: optional. Only available when `itemType` = `storageObject` and when we need to index some referenceable non-text item.
  - `diagram`
  - `chart`
  - `table`
- `index_text_chunk`: `keyword`
- `index_text_chunk.embedding`: `knn_vector`; stores the embedding of the text chunk of the indexed text content (i.e. the content of the `index_text_chunk` field).
- `only_one_index_text_chunk`: indicates whether the item is indexed as a single text chunk or as more than one.
- `index_text_chunk_length`:
- `index_text_chunk_position`: the start position of the text chunk within the original full-text content.
- `index_text_chunk_padding`: the number of characters that should be cut off at the joining point of each chunk when joining more than one chunk together.
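
To make the field list and its `itemType` rules concrete, here is a minimal sketch of an index body and a validation helper. Several things are assumptions and not part of the proposal: the names `LLM_INDEX_BODY`, `EMBEDDING_DIMENSION` and `validate_item`, the vector dimension of 768, and the types of every field the list leaves untyped (marked "type assumed"). The proposal does not say whether the dotted name `index_text_chunk.embedding` is meant as a sub-field, so the sketch stores the vector in a sibling field, `index_text_chunk_embedding`, which is also an assumption.

```python
from typing import Any

EMBEDDING_DIMENSION = 768  # assumption: depends on the embedding model chosen

# Hypothetical index body; field names follow the proposed field list.
LLM_INDEX_BODY: dict[str, Any] = {
    "settings": {"index": {"knn": True}},  # k-NN must be enabled for knn_vector search
    "mappings": {
        "properties": {
            "itemType": {"type": "keyword"},
            "recordId": {"type": "keyword"},
            "aspectId": {"type": "keyword"},
            "fieldName": {"type": "keyword"},
            "fileFormat": {"type": "keyword"},
            "subObjectId": {"type": "keyword"},    # type assumed
            "subObjectType": {"type": "keyword"},  # type assumed
            "index_text_chunk": {"type": "keyword"},
            # Assumption: sibling field used in place of the dotted name.
            "index_text_chunk_embedding": {
                "type": "knn_vector",
                "dimension": EMBEDDING_DIMENSION,
            },
            "only_one_index_text_chunk": {"type": "boolean"},  # type assumed
            "index_text_chunk_length": {"type": "integer"},    # type assumed
            "index_text_chunk_position": {"type": "integer"},  # type assumed
            "index_text_chunk_padding": {"type": "integer"},   # type assumed
        }
    },
}

REGISTRY_ONLY = ("recordId", "aspectId", "fieldName")
STORAGE_ONLY = ("fileFormat", "subObjectId", "subObjectType")


def validate_item(item: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the item is valid."""
    item_type = item.get("itemType")
    errors: list[str] = []
    if item_type == "registryRecord":
        # recordId and aspectId must be present for registry records.
        for name in ("recordId", "aspectId"):
            if not item.get(name):
                errors.append(f"{name} is required when itemType is registryRecord")
        disallowed = STORAGE_ONLY
    elif item_type == "storageObject":
        disallowed = REGISTRY_ONLY
    else:
        return [f"unknown itemType: {item_type!r}"]
    # Fields that are only available for the other itemType must be absent.
    for name in disallowed:
        if name in item:
            errors.append(f"{name} is not available when itemType is {item_type}")
    return errors
```

For example, `validate_item({"itemType": "registryRecord"})` reports both `recordId` and `aspectId` as missing, while a `storageObject` item that carries a `recordId` is flagged because that field is only available for registry records.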
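
The chunk position and padding fields suggest how the original text could be rebuilt from its chunks. The sketch below shows one possible reading, not a confirmed design: it assumes that neighbouring chunks overlap by `index_text_chunk_padding` characters and that the overlap is cut from the start of every chunk after the first. The function name `join_chunks` is hypothetical.

```python
from typing import Any


def join_chunks(chunks: list[dict[str, Any]]) -> str:
    """Rebuild the full text from index items that belong to the same source text."""
    # Order chunks by their start position within the original content.
    ordered = sorted(chunks, key=lambda c: c["index_text_chunk_position"])
    parts = []
    for i, chunk in enumerate(ordered):
        text = chunk["index_text_chunk"]
        # Assumption: the overlap is trimmed from the start of each later chunk.
        cut = chunk["index_text_chunk_padding"] if i > 0 else 0
        parts.append(text[cut:])
    return "".join(parts)


chunks = [
    {"index_text_chunk": "efghij", "index_text_chunk_position": 4, "index_text_chunk_padding": 2},
    {"index_text_chunk": "abcdef", "index_text_chunk_position": 0, "index_text_chunk_padding": 2},
]
full_text = join_chunks(chunks)
```

Sorting by `index_text_chunk_position` means the chunks can be fetched from the index in any order before joining.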