[destination-milvus] Field names are included as text to embed changing semantic meaning of the vector #48627

Open
frankiebromage1 opened this issue Nov 22, 2024 · 2 comments · May be fixed by airbytehq/airbyte-python-cdk#91

Comments

@frankiebromage1
Topic

vector db based destination issue

Relevant information

When Milvus is used as a destination and a field in the data (e.g. 'TEXT') is configured under "Text fields to embed", the text passed to the embedder appears to be 'TEXT: example text' rather than just 'example text'. This changes the embeddings produced and therefore their similarity to other vectors. It would be nice to be able to embed only the text, without the field names.
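
To make the difference concrete, here is a minimal sketch of the two ways a record could be turned into text for the embedder. The helper `build_text_to_embed` and its `include_field_names` flag are hypothetical names used only for illustration; they are not the connector's actual code or an existing configuration option.

```python
def build_text_to_embed(record, text_fields, include_field_names=True):
    """Join the selected fields of a record into one string for the embedder.

    Hypothetical helper: it mimics the reported behaviour, it is not Airbyte's code.
    """
    parts = []
    for field in text_fields:
        value = str(record[field])
        # With field names the embedder sees "FIELD: value"; without, just "value".
        parts.append(f"{field}: {value}" if include_field_names else value)
    return "\n".join(parts)


record = {"ID": 1, "TEXT": "example text"}
print(build_text_to_embed(record, ["TEXT"]))                             # TEXT: example text
print(build_text_to_embed(record, ["TEXT"], include_field_names=False))  # example text
```

The second call shows the requested behaviour: only the field value reaches the embedder.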

@aaronsteers
Collaborator

@frankiebromage1 - Thanks for logging this and for the linked PR. For context, can you provide an example in your use case where the prefixed field name causes a sub-optimal embedding?

@frankiebromage1
Author

frankiebromage1 commented Dec 4, 2024

@aaronsteers our data science team wants to use Airbyte for a Snowflake -> Milvus pipeline that embeds the field 'LINE_ITEM_DESCRIPTION' for bill items. They then intend to run vector search to compare new bill line item descriptions against those already in the database. Because the descriptions are often quite short, adding 'LINE_ITEM_DESCRIPTION :' as a prefix to the text to be embedded, or even a replacement such as '_ :', affects the results of their vector search, even when the same prefix is added to the new items. In the example below, the cosine similarity between two different line items is much lower when no prefix is included in the embedded text.

from openai import OpenAI
import numpy as np
from numpy.linalg import norm

client = OpenAI()


def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


item_1_with_field_prefix = get_embedding("LINE_ITEM_DESCRIPTION: DA Ext Door - Install Satin Nickel Single Cylinder Deadbolt Featuring Smartkey Security")  # default Airbyte embedding
item_1_with_symbol_prefix = get_embedding("_: DA Ext Door - Install Satin Nickel Single Cylinder Deadbolt Featuring Smartkey Security")  # Airbyte embedding when field name is mapped to '_'
item_1_no_prefix = get_embedding("DA Ext Door - Install Satin Nickel Single Cylinder Deadbolt Featuring Smartkey Security")  # preferred behavior when field name is omitted

item_2_with_field_prefix = get_embedding("LINE_ITEM_DESCRIPTION: DA Trash Out (Adjust for Severity)")  # default Airbyte embedding
item_2_with_symbol_prefix = get_embedding("_: DA Trash Out (Adjust for Severity)")  # Airbyte embedding when field name is mapped to '_'
item_2_no_prefix = get_embedding("DA Trash Out (Adjust for Severity)")  # preferred behavior when field name is omitted


def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))


print(cosine_similarity(item_1_with_field_prefix, item_2_with_field_prefix))  # 0.3356051697289621
print(cosine_similarity(item_1_with_symbol_prefix, item_2_with_symbol_prefix))  # 0.2732279689867596
print(cosine_similarity(item_1_no_prefix, item_2_no_prefix))  # 0.15904993145660737
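
One rough way to see why a shared prefix raises the similarity of otherwise different items is a toy geometric model. The sketch below assumes (purely for illustration) that the two items are orthogonal vectors and that the prefix adds the same extra component to both; real embedding models do not combine text this linearly, so this only illustrates the direction of the effect, not its size.

```python
import numpy as np


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Toy assumption: two different items are orthogonal unit vectors.
item_1 = np.array([1.0, 0.0, 0.0])
item_2 = np.array([0.0, 1.0, 0.0])
# Toy assumption: a shared prefix contributes the same vector to both embeddings.
prefix = np.array([0.0, 0.0, 0.5])

without_prefix = cosine_similarity(item_1, item_2)
with_prefix = cosine_similarity(item_1 + prefix, item_2 + prefix)

print(without_prefix)  # 0.0
print(with_prefix)     # approximately 0.2
```

The shorter the item text relative to the prefix, the larger the shared component is in proportion, which matches the observation that short descriptions are affected most.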

Because of this issue we need to build a custom pipeline for our use case and are unable to utilize the airbyte connector.
