[destination-milvus] Field names are included as text to embed changing semantic meaning of the vector #48627
Comments
@frankiebromage1 - Thanks for logging this and for the linked PR. For context, can you provide an example from your use case where the prefixed field name causes a sub-optimal embedding?
@aaronsteers our data science team wants to use Airbyte for a Snowflake -> Milvus pipeline to embed the field 'LINE_ITEM_DESCRIPTION' for bill items. They then intend to run vector searches comparing new bill line item descriptions against those already in the database. Because the bill item descriptions are often quite short, adding 'LINE_ITEM_DESCRIPTION :', or any replacement such as '_ :', as a prefix to the text to be embedded affects the results of their vector search, even if the same prefix is added to the new items. In the example below you can see that the cosine similarity is much lower when there is no prefix within the embedding.
Because of this issue, we need to build a custom pipeline for our use case and are unable to use the Airbyte connector.
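To illustrate the mechanism described above, here is a minimal sketch using a toy bag-of-words "embedding" (plain token counts), not a real embedding model and not the connector's code. The item texts `office chair` and `printer ink` are made-up examples. It only shows why a shared prefix can dominate the similarity of short texts; real embedding models will give different numbers.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for an embedder: lowercase whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


PREFIX = "LINE_ITEM_DESCRIPTION: "  # field-name prefix, as described in the issue
item_a = "office chair"             # hypothetical short line item
item_b = "printer ink"              # hypothetical, unrelated line item

# Without the prefix the two unrelated items share no tokens.
plain = cosine(bow_vector(item_a), bow_vector(item_b))

# With the prefix, one of the three tokens in each text is identical,
# so the similarity rises even though the items are unrelated.
prefixed = cosine(bow_vector(PREFIX + item_a), bow_vector(PREFIX + item_b))

print(f"no prefix:   {plain:.3f}")
print(f"with prefix: {prefixed:.3f}")
```

The shorter the description, the larger the share of the vector the prefix takes up, which is consistent with the effect reported for short bill line items.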
Topic
vector db based destination issue
Relevant information
It appears that when using Milvus as a destination and defining a field in the data, e.g. 'TEXT', under "Text fields to embed", the text that gets passed to the embedder is 'TEXT: example text' rather than just 'example text'. This changes the values of the embeddings produced and therefore their similarity to other vectors. It would be nice to be able to embed only the text without the field names.
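A rough sketch of the requested behavior follows. The function `build_text_to_embed` and the flag `include_field_names` are hypothetical names chosen for illustration; they are not taken from the Airbyte codebase, and the connector's actual formatting logic may differ.

```python
def build_text_to_embed(record, text_fields, include_field_names=True):
    """Join the configured text fields of a record into one string.

    Hypothetical helper: with include_field_names=True each value is
    prefixed with its field name (the behavior reported in this issue);
    with False only the raw values are used.
    """
    parts = []
    for field in text_fields:
        value = record.get(field)
        if value is None:
            continue  # skip fields missing from this record
        if include_field_names:
            parts.append(f"{field}: {value}")
        else:
            parts.append(str(value))
    return "\n".join(parts)


record = {"TEXT": "example text"}

# Reported behavior: the field name becomes part of the embedded text.
with_names = build_text_to_embed(record, ["TEXT"])

# Requested behavior: only the field value is embedded.
without_names = build_text_to_embed(record, ["TEXT"], include_field_names=False)
```

Defaulting the hypothetical flag to `True` would keep existing pipelines unchanged while letting users with short single-field texts opt out of the prefix.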