[destination-milvus] Field names are included as text to embed changing semantic meaning of the vector #48627

frankiebromage1 · 2024-11-22T23:27:31Z

Topic

vector db based destination issue

Relevant information

When using Milvus as a destination and listing a field (e.g. 'TEXT') under "Text fields to embed", the text passed to the embedder appears to be 'TEXT: example text' rather than just 'example text'. This changes the resulting embeddings and therefore their similarity to other vectors. It would be nice to have an option to embed only the text, without the field names.
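
To make the difference concrete, here is a minimal sketch of the two behaviors. It is not the connector's actual code: the function name `build_text_to_embed` and the `include_field_names` flag are hypothetical, chosen only to illustrate what the embedder receives with and without the field-name prefix.

```python
from typing import Any, Mapping, Sequence


def build_text_to_embed(
    record: Mapping[str, Any],
    text_fields: Sequence[str],
    include_field_names: bool = True,  # hypothetical option, not a real connector setting
) -> str:
    """Join the selected fields of a record into the string sent to the embedder."""
    parts = []
    for field in text_fields:
        if field not in record:
            continue  # skip fields that are absent from this record
        value = str(record[field])
        # With field names: "TEXT: example text"; without: "example text"
        parts.append(f"{field}: {value}" if include_field_names else value)
    return "\n".join(parts)


record = {"TEXT": "example text"}
print(build_text_to_embed(record, ["TEXT"]))                             # behavior described in this issue
print(build_text_to_embed(record, ["TEXT"], include_field_names=False))  # requested behavior
```

With an option like this, a short text field would be embedded on its own, so the field name would not contribute to the vector.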

aaronsteers · 2024-12-02T21:41:05Z

@frankiebromage1 - Thanks for logging this and for the linked PR. For context, can you provide an example in your use case where the prefixed field name causes a sub-optimal embedding?

frankiebromage1 · 2024-12-04T00:37:39Z

@aaronsteers our data science team wants to use Airbyte for a Snowflake -> Milvus pipeline in order to embed a field 'LINE_ITEM_DESCRIPTION' for bill items. They then intend to perform vector search to compare new bill line item descriptions to the existing ones in the database. Because the bill item descriptions are often quite short, adding 'LINE_ITEM_DESCRIPTION:', or even a replacement such as '_:', as a prefix to the text to be embedded affects the results of their vector search, even if the same prefix is added to the new items. In the example below you can see that the cosine similarity between two unrelated items is much lower when no prefix is included in the embedded text.

from openai import OpenAI
import numpy as np
from numpy.linalg import norm
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model).data[0].embedding

item_1_with_field_prefix = get_embedding("LINE_ITEM_DESCRIPTION: DA Ext Door - Install Satin Nickel Single Cylinder Deadbolt Featuring Smartkey Security") #default airbyte embedding
item_1_with_symbol_prefix  = get_embedding("_: DA Ext Door - Install Satin Nickel Single Cylinder Deadbolt Featuring Smartkey Security") #airbyte embedding when field name is mapped to '_'
item_1_no_prefix = get_embedding("DA Ext Door - Install Satin Nickel Single Cylinder Deadbolt Featuring Smartkey Security") #preferred behavior when field name is omitted

item_2_with_field_prefix  = get_embedding("LINE_ITEM_DESCRIPTION: DA Trash Out (Adjust for Severity)") #default airbyte embedding
item_2_with_symbol_prefix = get_embedding("_: DA Trash Out (Adjust for Severity)") #airbyte embedding when field name is mapped to '_'
item_2_no_prefix = get_embedding("DA Trash Out (Adjust for Severity)") #preferred behavior when field name is omitted

def cosine_similarity(a, b):
   return np.dot(a, b) / (norm(a) * norm(b))

print(cosine_similarity(item_1_with_field_prefix, item_2_with_field_prefix))  # 0.3356051697289621
print(cosine_similarity(item_1_with_symbol_prefix, item_2_with_symbol_prefix))  # 0.2732279689867596
print(cosine_similarity(item_1_no_prefix, item_2_no_prefix))  # 0.15904993145660737

Because of this issue, we need to build a custom pipeline for our use case and are unable to use the Airbyte connector.

frankiebromage1 added the needs-triage label Nov 22, 2024

octavia-squidington-iii added autoteam community team/use labels Nov 22, 2024

frankiebromage linked a pull request Nov 27, 2024 that will close this issue

Feat: destination vector db add option for embedding without field names airbytehq/airbyte-python-cdk#91

Open

marcosmarxm removed the needs-triage label Dec 16, 2024
