Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add JinaReaderConnector #1150

Merged
merged 48 commits into from
Nov 21, 2024
Merged

Conversation

jlonge4
Copy link
Contributor

@jlonge4 jlonge4 commented Oct 22, 2024

Related Issues

Proposed Changes:

Implement the Jina AI Reader as a haystack component

How did you test it?

Manual Verification

Notes for the reviewer

N/A

Checklist

@jlonge4 jlonge4 requested a review from a team as a code owner October 22, 2024 01:00
@jlonge4 jlonge4 requested review from anakin87 and removed request for a team October 22, 2024 01:00
@CLAassistant
Copy link

CLAassistant commented Oct 22, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added integration:jina type:documentation Improvements or additions to documentation labels Oct 22, 2024
@jlonge4
Copy link
Contributor Author

jlonge4 commented Oct 22, 2024

Hi @anakin87, @Anitha6g and I were trying to think of the best way to handle returning the output of the tool.
A document definitely makes sense for the read and search mode, but what about ground as it returns: Statement, Factuality, Result, Reason.

The unfortunate thing is that it is all markdown and not structured. Should we parse the content and put those things in doc metadata? Thoughts?

@anakin87
Copy link
Member

Let me have a look ⌛

@anakin87
Copy link
Member

I'm finally back, after doing a deeper investigation 🕵️ and discussing this with @bilgeyucel.

  • We think this component should be named JinaReaderConnector and placed in components/connectors

  • While studying the issue with grounding, I found that you can always get a JSON response by all endpoints (reader, search, grounding), if you pass the header 'Accept': 'application/json'. You can play with this and see the results at https://jina.ai/reader/.

  • After discovering this JSON response, I would say that we should use this by default because it is more granular and informative (we can add an init parameter called json_response wich defaults to True). This implies that for search we won't have a Document but one for each search result. So I would change the general return type to List[Document] and do something similar to what is discussed below.

    • If json_response==False: return a list of only one Document with only the content, which is a markdown string.
    • If json_response==True, this means we should do some parsing/packing of the results depending on the mode:
      • for reader, create a list of one Document with the returned content as content and all other fields (title, URL, ...) copied into meta
      • for search, create a list of Documents (one for each search result) with the returned content as content and all other fields (title, URL, ...) copied into meta
      • for grounding (not a great solution, but OK), create a list of one Document with the reason field as content and all fields (result, reference, ..., including reason) copied into meta
  • By playing with the website, I discovered that there are many other options to run this API which can be set using headers. For this reason, I would add a parameter called headers to the run method, to make these option available to people.

Let me know what you think. I am quite convinced about the plan, even if it requires some work.

@jlonge4
Copy link
Contributor Author

jlonge4 commented Oct 23, 2024

@anakin87 @bilgeyucel great points, will get right on it and rebase in a couple days. Thanks for the direction as always 😎

@vblagoje
Copy link
Member

Very nice @jlonge4 , I wanted to use such a component.

@jlonge4
Copy link
Contributor Author

jlonge4 commented Nov 11, 2024

@vblagoje thanks very much! Still have to whip up those tests but should be done(ish) today 🤞🏼

@jlonge4
Copy link
Contributor Author

jlonge4 commented Nov 12, 2024

@vblagoje @anakin87 what stupid mistake am I making with these test failures and import structure?

@anakin87
Copy link
Member

anakin87 commented Nov 12, 2024

@jlonge4 the circular import error should be fixed now.
Integrations mount paths are a bit tricky.

@jlonge4
Copy link
Contributor Author

jlonge4 commented Nov 19, 2024

@anakin87 how are we looking?

@anakin87
Copy link
Member

@jlonge4 in the coming days, I will take a proper look, so as to push this to the finish line.

@anakin87 anakin87 changed the title feat: jina reader #663 feat: add JinaReaderConnector Nov 21, 2024
@anakin87 anakin87 self-requested a review November 21, 2024 16:58
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the code a bit, added more tests and an example.

This is ready to be merged now.
We may need help with other stuff related to this integration.

Let's sync in #663!

@anakin87 anakin87 merged commit f9d0e77 into deepset-ai:main Nov 21, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration:jina type:documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants