Improve RAG capabilities of DocumentStore #8508

Open
wochinge opened this issue Oct 31, 2024 · 2 comments
Labels
P2 Medium priority, add to the next sprint if no P1 available

Comments

@wochinge
Contributor

Is your feature request related to a problem? Please describe.
If you are building a RAG pipeline, then the indexing pipeline is of course an essential part of it.
You usually don't run indexing just once; it is an ongoing process that synchronizes data from files to indexed documents. For this, one needs the following capabilities:

  • add new documents (covered)
  • delete documents by file ID (not covered - currently deletion is only based on document ID, but files are usually split into multiple documents as part of indexing)
  • update document meta by file ID (not covered)
    • update document meta by document ID (not covered, more of an edge case)
  • delete all documents (not covered - could be done via file IDs, but would be nice to have it as part of the protocol as it's more efficient)
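
A minimal sketch of how such an ongoing sync might decide which of these operations it needs, assuming every file has a stable ID and a content checksum. `SyncPlan`, `plan_sync`, and the checksum dictionaries are hypothetical names for illustration and are not part of Haystack.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    """File IDs grouped by the operation a sync run would need (hypothetical)."""

    to_add: list[str] = field(default_factory=list)
    to_reindex: list[str] = field(default_factory=list)
    to_delete: list[str] = field(default_factory=list)


def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> SyncPlan:
    """Compare file_id -> checksum maps of the source files and the index."""
    plan = SyncPlan()
    for file_id, checksum in source.items():
        if file_id not in indexed:
            # File is new: its documents only need to be added.
            plan.to_add.append(file_id)
        elif indexed[file_id] != checksum:
            # File changed: its old documents must be deleted by file ID first.
            plan.to_reindex.append(file_id)
    # Files that disappeared from the source must be deleted by file ID.
    plan.to_delete = [file_id for file_id in indexed if file_id not in source]
    return plan
```

In this sketch, both `to_reindex` and `to_delete` depend on being able to remove every document that belongs to a file, which is the capability that deletion by document ID alone does not provide.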

The current DocumentStore protocol is a bit too simple in that regard. For production-ready use cases you need the above methods so that you can actually build and maintain a RAG application.
Currently I have to implement these operations manually outside of the DocumentStore protocol, and therefore outside of Haystack. That is painful, and there is room to improve the developer experience here.

Describe the solution you'd like
Extend the DocumentStore protocol and add implementations for the existing document stores.
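
One possible shape for such an extension, as a rough sketch only. The method names (`delete_documents_by_file_id`, `update_meta_by_file_id`, `delete_all_documents`), the `file_id` meta key, and the toy in-memory store are assumptions made for illustration; they are not taken from Haystack's API, and documents are represented as plain dictionaries rather than Haystack `Document` objects.

```python
from __future__ import annotations

from typing import Any, Protocol


class FileAwareDocumentStore(Protocol):
    """Hypothetical additions to a document store protocol."""

    def delete_documents_by_file_id(self, file_id: str) -> int: ...

    def update_meta_by_file_id(self, file_id: str, meta: dict[str, Any]) -> int: ...

    def delete_all_documents(self) -> int: ...


class ToyInMemoryStore:
    """Toy store that keeps documents as dicts keyed by document ID."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, Any]] = {}

    def add_document(self, doc_id: str, content: str, meta: dict[str, Any]) -> None:
        self._docs[doc_id] = {"content": content, "meta": dict(meta)}

    def count(self) -> int:
        return len(self._docs)

    def _ids_for_file(self, file_id: str) -> list[str]:
        # Assumes every document stores its originating file under meta["file_id"].
        return [
            doc_id
            for doc_id, doc in self._docs.items()
            if doc["meta"].get("file_id") == file_id
        ]

    def delete_documents_by_file_id(self, file_id: str) -> int:
        doc_ids = self._ids_for_file(file_id)
        for doc_id in doc_ids:
            del self._docs[doc_id]
        return len(doc_ids)  # number of deleted documents

    def update_meta_by_file_id(self, file_id: str, meta: dict[str, Any]) -> int:
        doc_ids = self._ids_for_file(file_id)
        for doc_id in doc_ids:
            self._docs[doc_id]["meta"].update(meta)
        return len(doc_ids)  # number of updated documents

    def delete_all_documents(self) -> int:
        deleted = len(self._docs)
        self._docs.clear()
        return deleted
```

A real backend would presumably translate these calls into a single filtered delete or update on the underlying database instead of looping over documents, which is where the efficiency gain of having them in the protocol would come from.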

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Talk to your favorite deepset team if you want to get some more input on running RAG pipelines in production :-)

@Arputikos

Up!

@julian-risch
Member

Thanks @wochinge, makes sense to me. We could look into this starting with OpenSearchDocumentStore and, based on the feedback we get there, add the capabilities to other DocumentStores too. After that, we would update the protocol.

@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Dec 6, 2024