Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add prefix and suffix to SentenceTransformersDocumentEmbedder #5745

Merged
merged 4 commits into from
Sep 13, 2023

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Sep 8, 2023

Related Issues

Proposed Changes:

Add prefix and suffix attributes to SentenceTransformersDocumentEmbedder.
They can be used to add a prefix and suffix to the Document text before embedding it.
This is necessary to take full advantage of some modern embedding models, such as E5.

How did you test it?

Updated existing tests, and added a new test.

Checklist

@anakin87 anakin87 requested review from a team as code owners September 8, 2023 08:57
@anakin87 anakin87 requested review from dfokina and MichelBartels and removed request for a team September 8, 2023 08:57
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Sep 8, 2023
@mathislucka
Copy link
Member

Why not use the prompt builder? That would be really powerful as it would allow to embed metadata too.

@anakin87
Copy link
Member Author

anakin87 commented Sep 8, 2023

Why not use the prompt builder? That would be really powerful as it would allow to embed metadata too.

I originally had a similar idea: to create an Embedding template for Documents.

Then we had an internal discussion, in which the following aspects emerged:

  • We can already embed a list of metadata.

  • We have the intuition that adding the same words to all Documents in a collection
    (e.g. "The author of this document is: {doc.metadata['author']}")
    would not have a good effect on the vector representation, diluting its information power.

  • (The implementation of the Embedding template can be nontrivial.)

@mathislucka
Copy link
Member

We can already embed a list of metadata.

That's not very flexible, a metadata key might not be natural language but it also can't be easily changed because changing the data itself might be difficult.

We have the intuition that adding the same words to all Documents in a collection
(e.g. "The author of this document is: {doc.metadata['author']}")
would not have a good effect on the vector representation, diluting its information power.

Maybe better to do some experiments?

@anakin87
Copy link
Member Author

anakin87 commented Sep 8, 2023

This is a first implementation that we would like to merge. It unlocks the possibility to properly use powerful embedding models such as E5.

Currently, we haven't noticed significant interest or identified a clear use case for Embedding templates within the community.

If someone is interested in experimenting with this idea, they can easily create a custom component to directly manipulate the Document content.

If evidence emerges that Embedding templates offer substantial benefits, we can implement this feature in a future iteration.

@mathislucka
Copy link
Member

Not trying to block the merge, only commenting.

I think it would keep things simple. For example, templates would make sense for the ExtractiveReader too because there's currently no good way to account for metadata (see #5640). If templates could be applied before all inference nodes, then the users would only need to learn one concept and they would be pretty flexible in how they apply it. I'd prefer that over opaque embed_metadata and prefixes or suffixes but maybe that's just me.

Copy link
Contributor

@MichelBartels MichelBartels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me. However, I would agree with Mathis that it seems like a less powerful version of the PromptBuilder that we have already.

We have the intuition that adding the same words to all Documents in a collection
(e.g. "The author of this document is: {doc.metadata['author']}")
would not have a good effect on the vector representation, diluting its information power.

I do not share this intuition and also think that it would need to be supported by experiments. I think this is clear if you consider the following (completely made up) example of book reviews where the metadata is author and title:

Title: J. R. R. Tolkien
Author: Humphrey Carpenter
Review: This biography is really great! I would recommend it to everyone.
Title: The Hobbit
Author: J. R. R. Tolkien
Review: Bilbo Baggins is very nice. This book is is great!

You could also remove the words that are the same:

J. R. R. Tolkien
Humphrey Carpenter
This biography is really great! I would recommend it to everyone.
The Hobbit
J. R. R. Tolkien
Bilbo Baggins defeats the dragon. This book is is great!

If your question was "What characters has J. R. R. Tolkien written about?", you need the additional structural information through those "shared words" to know that the document about Tolkien and not written by him is completely useless for this.
Of course, examples like this might occur seldomly or the models might not be able to pickup on this information, but we can't really tell without experiments.

On the other hand, a good reason to handle this in the SentenceTransformersDocumentEmbedder would be if the format was handled by the Embedder automatically. You could argue that the user shouldn't have to worry about those details and that it should be formatted correctly for the most popular models without the user having to care about it. Otherwise, it would also be an easy pitfall for users who don't know about this requirement.

@anakin87
Copy link
Member Author

Thanks, @MichelBartels!
I see that there are mixed opinions about Embedding Templates, which should be clarified by experimentation.
I opened a dedicated issue for this: #5793

In the meantime, I'm going to merge this PR...

@anakin87 anakin87 self-assigned this Sep 13, 2023
@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 6171163319

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.006%) to 48.965%

Files with Coverage Reduction New Missed Lines %
preview/components/embedders/sentence_transformers_document_embedder.py 1 97.62%
utils/context_matching.py 1 95.7%
Totals Coverage Status
Change from base Build 6170931279: 0.006%
Covered Lines: 11851
Relevant Lines: 24203

💛 - Coveralls

@anakin87 anakin87 merged commit 283ecf2 into main Sep 13, 2023
@anakin87 anakin87 deleted the st-doc-embedder-prefix-suffix branch September 13, 2023 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prefix and suffix for SentenceTransformersDocumentEmbedder
4 participants