Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding an hierarchical document builder and a auto-merge retriever #56

Merged
merged 54 commits into from
Aug 26, 2024

Conversation

davidsbatista
Copy link
Contributor

@davidsbatista davidsbatista commented Aug 12, 2024

Related Issues

Proposed Changes:

Adds a HierarchicalDocumentBuilder: it's used to split a Document into multiple Document objects of different block sizes building a hierarchical tree structure where each smaller block is a child of a previous larger block.

Adds a AutoMergingRetriever leverages the hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. During retrieval, if the number of matched leaf documents below the same parent is higher than a defined threshold, the retriever will return the parent document instead of the individual leaf documents.

Picture to help understand what's being implemented

How did you test it?

  • unit tests
  • integration tests
  • manual verification

Notes for the reviewer

Checklist

@davidsbatista davidsbatista changed the title initial import feat: adding an hierarchical document builder and a auto-merge retriever Aug 12, 2024
@coveralls
Copy link

coveralls commented Aug 12, 2024

Pull Request Test Coverage Report for Build 10560416383

Details

  • 278 of 287 (96.86%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 93.376%

Changes Missing Coverage Covered Lines Changed/Added Lines %
haystack_experimental/components/retrievers/auto_merging_retriever.py 40 49 81.63%
Totals Coverage Status
Change from base Build 10524780287: 0.3%
Covered Lines: 3031
Relevant Lines: 3246

💛 - Coveralls

@davidsbatista davidsbatista marked this pull request as ready for review August 12, 2024 13:51
@davidsbatista davidsbatista requested a review from a team as a code owner August 12, 2024 13:51
@davidsbatista davidsbatista requested review from Amnah199, anakin87 and shadeMe and removed request for a team August 12, 2024 13:51
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a first pass.

An end-to-end raw example script would still be helpful.

@shadeMe shadeMe removed the request for review from Amnah199 August 13, 2024 14:08
@davidsbatista davidsbatista requested a review from anakin87 August 14, 2024 15:23
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments.

Let's also add these components to the Experiments catalog in README.

@davidsbatista davidsbatista requested a review from a team as a code owner August 16, 2024 08:57
@davidsbatista davidsbatista requested review from dfokina and removed request for a team August 16, 2024 08:57
@anakin87
Copy link
Member

Let's also remember to add pydoc configs here.

@davidsbatista
Copy link
Contributor Author

still work-in-progress, I'm testing this new Retriever against every possible DocumentStore in the integrations repo and collecting the issues.

@davidsbatista
Copy link
Contributor Author

Here is the current status regarding all the doc stores, I'm still investigating Weaviate

  AutoMergeRetriever NOTE
Astra error only if "id" is used in the filter instead of "_id"
Chroma no support  
ElasticSearch supports  
OpenSearch supports  
PGVector supports  
PineCone error cannot filter by ID; https://docs.pinecone.io/guides/data/query-data#querying-by-record-id; https://community.pinecone.io/t/does-pinecone-support-filtering-by-vector-id/3039/2
Qdrant supports  
Weaviate error DocumentStoreError: Failed to query documents in Weaviate. Error: no such prop with name 'id' found in class 'Default' in the schema. Check your schema files for which properties in this class are available

Copy link
Contributor

@shadeMe shadeMe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change then good to merge for me.

@davidsbatista davidsbatista merged commit f7eeb91 into main Aug 26, 2024
7 checks passed
@davidsbatista davidsbatista deleted the auto-merging-retriever branch August 26, 2024 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants