feat: add `prefix` and `suffix` to `SentenceTransformersDocumentEmbedder` #5745
Conversation
Why not use the prompt builder? That would be really powerful, as it would allow embedding metadata too.
I originally had a similar idea: to create an Embedding template for Documents. Then we had an internal discussion, in which the following aspects emerged:
That's not very flexible: a metadata key might not be natural language, but it also can't easily be changed, because changing the data itself might be difficult.
Maybe better to do some experiments?
This is a first implementation that we would like to merge. It makes it possible to properly use powerful embedding models such as E5. So far, we haven't noticed significant interest in, or identified a clear use case for, embedding templates within the community. If someone is interested in experimenting with this idea, they can easily create a custom component to directly manipulate the Document content. If evidence emerges that embedding templates offer substantial benefits, we can implement this feature in a future iteration.
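As a rough illustration of the custom-component route mentioned above, here is a minimal, self-contained sketch of what "directly manipulating the Document content" with a metadata-aware template could look like. The `Document` stand-in and the helper name are hypothetical, not actual Haystack classes:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-in for a Document (text content plus metadata),
# used only so this sketch is self-contained.
@dataclass
class Document:
    content: str
    meta: Dict[str, str] = field(default_factory=dict)

def apply_embedding_template(docs: List[Document], template: str) -> List[Document]:
    """Rewrite each document's content through a template that can
    reference the text ({content}) and any metadata keys. Illustrative only."""
    return [
        Document(content=template.format(content=d.content, **d.meta), meta=d.meta)
        for d in docs
    ]

docs = [Document("Bilbo Baggins is very nice.", {"author": "J. R. R. Tolkien"})]
out = apply_embedding_template(docs, "Author: {author}\n{content}")
# out[0].content now starts with the author line, followed by the review text.
```

Such a component would run before the embedder in a pipeline, leaving the embedder itself unchanged.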
Not trying to block the merge, only commenting. I think it would keep things simple. For example, templates would make sense for the …
The code looks good to me. However, I would agree with Mathis that it seems like a less powerful version of the PromptBuilder that we have already.

> We have the intuition that adding the same words to all Documents in a collection (e.g. "The author of this document is: {doc.metadata['author']}") would not have a good effect on the vector representation, diluting its information power.
I do not share this intuition and also think that it would need to be supported by experiments. I think this is clear if you consider the following (completely made up) example of book reviews where the metadata is author and title:
Title: J. R. R. Tolkien
Author: Humphrey Carpenter
Review: This biography is really great! I would recommend it to everyone.
Title: The Hobbit
Author: J. R. R. Tolkien
Review: Bilbo Baggins is very nice. This book is is great!
You could also remove the words that are the same:
J. R. R. Tolkien
Humphrey Carpenter
This biography is really great! I would recommend it to everyone.
The Hobbit
J. R. R. Tolkien
Bilbo Baggins defeats the dragon. This book is great!
If your question were "What characters has J. R. R. Tolkien written about?", you would need the additional structural information from those "shared words" to know that the document about Tolkien, but not written by him, is completely useless for this.
Of course, examples like this might occur seldom, or the models might not be able to pick up on this information, but we can't really tell without experiments.
On the other hand, a good reason to handle this in the SentenceTransformersDocumentEmbedder would be if the format were applied by the Embedder automatically. You could argue that the user shouldn't have to worry about such details, and that texts should be formatted correctly for the most popular models without any user intervention. Otherwise, it is an easy pitfall for users who don't know about this requirement.
Thanks, @MichelBartels! In the meantime, I'm going to merge this PR...
Related Issues

`SentenceTransformersDocumentEmbedder` #5741

Proposed Changes:

Add `prefix` and `suffix` attributes to `SentenceTransformersDocumentEmbedder`. They can be used to add a prefix and suffix to the Document text before embedding it. This is necessary to take full advantage of some modern embedding models, such as E5.
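Concretely, the change amounts to wrapping each document's text before it reaches the model. A self-contained sketch of that composition (the real embedder of course also runs the SentenceTransformers model; only the string handling is shown, and the helper name is hypothetical):

```python
from typing import List

def prepare_texts(texts: List[str], prefix: str = "", suffix: str = "") -> List[str]:
    """Mimic the embedder's preprocessing: wrap every document text
    with the configured prefix and suffix before embedding."""
    return [f"{prefix}{text}{suffix}" for text in texts]

# E5-family models expect document texts to start with "passage: "
# (and queries with "query: "), which is exactly what prefix enables.
texts = prepare_texts(["Bilbo Baggins is very nice."], prefix="passage: ")
```

With this in place, a user can embed documents for an E5 model simply by constructing the embedder with `prefix="passage: "`.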
How did you test it?
Updated existing tests, and added a new test.
Checklist

The PR title follows the conventional commit format, using one of the prefixes `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, or `test:`.