Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add INSTRUCTOR Embedder (v2) #5836

Closed
wants to merge 2 commits into from

Conversation

awinml
Copy link
Contributor

@awinml awinml commented Sep 19, 2023

Related Issues

Fixes #5242

Proposed Changes:

Add support for the INSTRUCTOR family of Embedding Models. They tailor the embeddings for different tasks and domains using a prompt (instruction) for each embedding.

  • The implementation is based on the Embedders Proposal and Embedder Design (Implement Embedders components (2.0) #5312).

  • Adds InstructorEmbeddingBackend, responsible for performing the actual embedding computation, implemented as a singleton class in order to reuse instances.

  • Adds InstructorTextEmbedder, a component that embeds a list of strings into a list of vectors.

  • Adds InstructorDocumentEmbedder, a component that embeds a list of Documents. The embedding of each Document is stored in the embedding field of the Document.

Since the INSTRUCTOR Embedding Models are a modification of the Sentence Transformers models, the implementation is similar to the Sentence Transformers implementation (#5567).

This code was written collaboratively with @vrunm.

How did you test it?

  • Unit Tests have been added for each component.

Notes for the reviewer

Checklist

@awinml awinml requested a review from a team as a code owner September 19, 2023 08:30
@awinml awinml requested review from ZanSara and removed request for a team September 19, 2023 08:30
@CLAassistant
Copy link

CLAassistant commented Sep 19, 2023

CLA assistant check
All committers have signed the CLA.

@awinml awinml requested a review from a team as a code owner September 19, 2023 08:43
@awinml awinml requested review from dfokina and removed request for a team September 19, 2023 08:43
@ZanSara
Copy link
Contributor

ZanSara commented Sep 19, 2023

@anakin87 FYI

@anakin87
Copy link
Member

Hello @awinml and thanks for your contribution!

Some tests in the CI are failing.
Please have a look and let us know if you need help...

@awinml
Copy link
Contributor Author

awinml commented Sep 22, 2023

@anakin87 The code uses the InstructorEmbedding library (PyPI; GitHub) to load the INSTRUCTOR Embedding model. The tests fails due to the missing import of this library.

I think we need to add InstructorEmbedding as a dependency to the haystack-preview-package, for these CI tests to pass. Should I open a PR there for the same?

@anakin87
Copy link
Member

Hey, @awinml!

Sorry for the delay...

Today I will try to understand how to manage Instructor dependency and get back to you...

@anakin87 anakin87 self-assigned this Sep 26, 2023
@anakin87
Copy link
Member

Hello @awinml and @vrunm and thanks for this PR,
which looks very good! 🙌

After an internal discussion, we prefer to not include this new dependency (InstructorEmbedding) in the Haystack 2.0 core project, which we want to keep lightweight.

In any case, the integration with INSTRUCTOR is interesting and we appreciate your efforts.

For this reason, I am creating a PR in haystack-extras to accommodate a new package called instructor-embedders-haystack with the functionality you are working on.
Once the PR is merged, we can guide you to move your work there.
In the future, to use INSTRUCTOR embedders with Haystack, you will simply type
pip install instructor-embedders-haystack.

I hope this solution sounds good to you...
Let me know if there are any doubts.

@anakin87
Copy link
Member

This is the subproject I prepared for your component: https://github.com/deepset-ai/haystack-extras/tree/main/components/instructor-embedders

Feel free to fork https://github.com/deepset-ai/haystack-extras and create a PR there.
Let me know if you need any help!

@awinml
Copy link
Contributor Author

awinml commented Sep 28, 2023

@anakin87 Thanks!

This solution sounds good. I will create a PR with this work in https://github.com/deepset-ai/haystack-extras.

@awinml
Copy link
Contributor Author

awinml commented Sep 28, 2023

@anakin87
Copy link
Member

Great! 🙌

Let's continue in deepset-ai/haystack-core-integrations#32
I will try to review it tomorrow...

@anakin87 anakin87 closed this Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for new embedding models
4 participants