feat: add JinaReaderConnector #1150

Merged
merged 48 commits into from
Nov 21, 2024
Commits
142c1fd
begin rough draft
jlonge4 Oct 20, 2024
2209946
begin rough draft
jlonge4 Oct 20, 2024
0220cdc
begin rough draft
jlonge4 Oct 20, 2024
6b0d287
small fixes
Anitha6g Oct 21, 2024
e4d6115
Haystack document conversion
Anitha6g Oct 21, 2024
ad626f8
git folder changes
Anitha6g Oct 21, 2024
a56486f
Merge branch 'jl-jina-reader' into ag-branch
jlonge4 Oct 21, 2024
09d62f3
Merge pull request #3 from Anitha6g/ag-branch
jlonge4 Oct 21, 2024
f71d7b4
add pipeline functions
jlonge4 Oct 21, 2024
6b198a8
correct mode map
jlonge4 Oct 21, 2024
aa03e4d
add reader mode Enum class file
jlonge4 Oct 22, 2024
588bc66
add docstrings
jlonge4 Oct 22, 2024
1eb75f9
add JINA url for ref
jlonge4 Oct 22, 2024
b8f0a3f
add mode norm for mode map check in run method
jlonge4 Oct 22, 2024
faa8d17
add mode norm for mode map check in run method
jlonge4 Oct 22, 2024
a8e6a5b
add json_response and associated parsing
jlonge4 Oct 30, 2024
72e29f4
ignore api key lint error
jlonge4 Oct 30, 2024
09d8891
ignore api key lint error
jlonge4 Oct 30, 2024
ff36a9a
reduce code redundancy
jlonge4 Oct 30, 2024
f7623e2
reduce code redundancy
jlonge4 Oct 30, 2024
2f3afb0
add headers option to run method
jlonge4 Nov 1, 2024
f10d0b2
Update integrations/jina/src/haystack_integrations/components/reader/…
jlonge4 Nov 1, 2024
cb183de
Update integrations/jina/src/haystack_integrations/components/reader/…
jlonge4 Nov 1, 2024
9a955ba
Update integrations/jina/src/haystack_integrations/components/reader/…
jlonge4 Nov 1, 2024
3ca7b1b
Update integrations/jina/src/haystack_integrations/components/reader/…
jlonge4 Nov 1, 2024
91c4d6e
update location / final edits
jlonge4 Nov 1, 2024
bb4f0a4
Update integrations/jina/src/haystack_integrations/components/convert…
jlonge4 Nov 8, 2024
fe28a1e
update paths
jlonge4 Nov 8, 2024
6a66ea1
add descriptions for json response/headers
jlonge4 Nov 8, 2024
08f488a
lint
jlonge4 Nov 8, 2024
2a73245
unit tests for reader-connector
jlonge4 Nov 12, 2024
8271def
unit tests for reader-connector
jlonge4 Nov 12, 2024
d1a76c5
unit tests for reader-connector
jlonge4 Nov 12, 2024
13ed157
fix circular import
anakin87 Nov 12, 2024
bec5a9b
Merge branch 'main' into jl-jina-reader
anakin87 Nov 12, 2024
d9c8ffe
update header test
jlonge4 Nov 13, 2024
4d16188
update test
jlonge4 Nov 13, 2024
a3b30e1
update test
jlonge4 Nov 13, 2024
f893130
update test
jlonge4 Nov 13, 2024
4cdea70
update test
jlonge4 Nov 13, 2024
38fad06
update test
jlonge4 Nov 13, 2024
915aad9
update test
jlonge4 Nov 13, 2024
86e8bdb
update test
jlonge4 Nov 13, 2024
8a61e3b
update test
jlonge4 Nov 18, 2024
5913862
refactoring + more tests
anakin87 Nov 21, 2024
c169146
example
anakin87 Nov 21, 2024
72ed31c
pydoc config
anakin87 Nov 21, 2024
4f48f08
examples can contain print
anakin87 Nov 21, 2024
47 changes: 47 additions & 0 deletions integrations/jina/examples/jina_reader_connector.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# to make use of the JinaReaderConnector, we first need to install the Haystack integration
# pip install jina-haystack

# then we must set the JINA_API_KEY environment variable
# export JINA_API_KEY=<your-api-key>


from haystack_integrations.components.connectors.jina import JinaReaderConnector

# we can use the JinaReaderConnector to process a URL and return the textual content of the page
reader = JinaReaderConnector(mode="read")
query = "https://example.com"
result = reader.run(query=query)

print(result)
# {'documents': [Document(id=fa3e51e4ca91828086dca4f359b6e1ea2881e358f83b41b53c84616cb0b2f7cf,
# content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...',
# meta: {'title': 'Example Domain', 'description': '', 'url': 'https://example.com/', 'usage': {'tokens': 42}})]}


# we can perform a web search by setting the mode to "search"
reader = JinaReaderConnector(mode="search")
query = "UEFA Champions League 2024"
result = reader.run(query=query)

print(result)
# {'documents': [Document(id=6a71abf9955594232037321a476d39a835c0cb7bc575d886ee0087c973c95940,
# content: '2024/25 UEFA Champions League: Matches, draw, final, key dates | UEFA Champions League | UEFA.com...',
# meta: {'title': '2024/25 UEFA Champions League: Matches, draw, final, key dates',
# 'description': 'What are the match dates? Where is the 2025 final? How will the competition work?',
# 'url': 'https://www.uefa.com/uefachampionsleague/news/...',
# 'usage': {'tokens': 5581}}), ...]}


# finally, we can perform fact-checking by setting the mode to "ground" (experimental)
reader = JinaReaderConnector(mode="ground")
query = "ChatGPT was launched in 2017"
result = reader.run(query=query)

print(result)
# {'documents': [Document(id=f0c964dbc1ebb2d6584c8032b657150b9aa6e421f714cc1b9f8093a159127f0c,
# content: 'The statement that ChatGPT was launched in 2017 is incorrect. Multiple references confirm that ChatG...',
# meta: {'factuality': 0, 'result': False, 'references': [
# {'url': 'https://en.wikipedia.org/wiki/ChatGPT',
# 'keyQuote': 'ChatGPT is a generative artificial intelligence (AI) chatbot developed by OpenAI and launched in 2022.',
# 'isSupportive': False}, ...],
# 'usage': {'tokens': 10188}})]}
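The `ground` output above carries its verdict in the Document's `meta`. A minimal sketch of how that metadata might be summarised, assuming only the keys visible in the example output (`factuality`, `result`, `references`, `isSupportive`). `summarize_grounding` is a hypothetical helper, not part of the integration, and it takes a plain dict so it runs without Haystack installed:

```python
from typing import Any, Dict, List


def summarize_grounding(meta: Dict[str, Any]) -> str:
    """Build a one-line summary from ground-mode metadata (hypothetical helper)."""
    references: List[Dict[str, Any]] = meta.get("references", [])
    # Count the references that the grounding engine marked as supporting the claim.
    supportive = sum(1 for ref in references if ref.get("isSupportive"))
    verdict = "supported" if meta.get("result") else "not supported"
    return (
        f"{verdict} (factuality={meta.get('factuality')}, "
        f"{supportive}/{len(references)} supportive references)"
    )


# Metadata copied from the example output above, truncated to one reference.
example_meta = {
    "factuality": 0,
    "result": False,
    "references": [
        {
            "url": "https://en.wikipedia.org/wiki/ChatGPT",
            "keyQuote": "ChatGPT is a generative artificial intelligence (AI) chatbot "
            "developed by OpenAI and launched in 2022.",
            "isSupportive": False,
        }
    ],
}

print(summarize_grounding(example_meta))
```

With a real result, the same helper could be called as `summarize_grounding(result["documents"][0].meta)`.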
1 change: 1 addition & 0 deletions integrations/jina/pydoc/config.yml
Original file line number Diff line number Diff line change
@@ -6,6 +6,7 @@ loaders:
"haystack_integrations.components.embedders.jina.document_embedder",
"haystack_integrations.components.embedders.jina.text_embedder",
"haystack_integrations.components.rankers.jina.ranker",
"haystack_integrations.components.connectors.jina.reader",
]
ignore_when_discovered: ["__init__"]
processors:
7 changes: 6 additions & 1 deletion integrations/jina/pyproject.toml
Original file line number Diff line number Diff line change
@@ -132,18 +132,23 @@ ban-relative-imports = "parents"
[tool.ruff.lint.per-file-ignores]
# Tests can use magic values, assertions, and relative imports
"tests/**/*" = ["PLR2004", "S101", "TID252"]
# examples can contain "print" commands
"examples/**/*" = ["T201"]

[tool.coverage.run]
source = ["haystack_integrations"]
branch = true
parallel = false


[tool.coverage.report]
omit = ["*/tests/*", "*/__init__.py"]
show_missing = true
exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]

[tool.pytest.ini_options]
minversion = "6.0"
markers = ["unit: unit tests", "integration: integration tests"]

[[tool.mypy.overrides]]
module = ["haystack.*", "haystack_integrations.*", "pytest.*"]
ignore_missing_imports = true
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
from .reader import JinaReaderConnector
from .reader_mode import JinaReaderMode

__all__ = ["JinaReaderConnector", "JinaReaderMode"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import json
from typing import Any, Dict, List, Optional, Union
from urllib.parse import quote

import requests
from haystack import Document, component, default_from_dict, default_to_dict
from haystack.utils import Secret, deserialize_secrets_inplace

from .reader_mode import JinaReaderMode

READER_ENDPOINT_URL_BY_MODE = {
    JinaReaderMode.READ: "https://r.jina.ai/",
    JinaReaderMode.SEARCH: "https://s.jina.ai/",
    JinaReaderMode.GROUND: "https://g.jina.ai/",
}


@component
class JinaReaderConnector:
    """
    A component that interacts with Jina AI's reader service to process queries and return documents.

    This component supports different modes of operation: `read`, `search`, and `ground`.

    Usage example:
    ```python
    from haystack_integrations.components.connectors.jina import JinaReaderConnector

    reader = JinaReaderConnector(mode="read")
    query = "https://example.com"
    result = reader.run(query=query)
    document = result["documents"][0]
    print(document.content)

    >>> "This domain is for use in illustrative examples..."
    ```
    """

    def __init__(
        self,
        mode: Union[JinaReaderMode, str],
        api_key: Secret = Secret.from_env_var("JINA_API_KEY"),  # noqa: B008
        json_response: bool = True,
    ):
        """
        Initialize a JinaReaderConnector instance.

        :param mode: The operation mode for the reader (`read`, `search` or `ground`).
            - `read`: process a URL and return the textual content of the page.
            - `search`: search the web and return textual content of the most relevant pages.
            - `ground`: call the grounding engine to perform fact checking.
            For more information on the modes, see the [Jina Reader documentation](https://jina.ai/reader/).
        :param api_key: The Jina API key. It can be explicitly provided or automatically read from the
            environment variable JINA_API_KEY (recommended).
        :param json_response: Controls the response format from the Jina Reader API.
            If `True`, requests a JSON response, resulting in Documents with rich structured metadata.
            If `False`, requests a raw response, resulting in one Document with minimal metadata.
        """
        self.api_key = api_key
        self.json_response = json_response

        if isinstance(mode, str):
            mode = JinaReaderMode.from_str(mode)
        self.mode = mode

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes the component to a dictionary.
        :returns:
            Dictionary with serialized data.
        """
        return default_to_dict(
            self,
            api_key=self.api_key.to_dict(),
            mode=str(self.mode),
            json_response=self.json_response,
        )

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "JinaReaderConnector":
        """
        Deserializes the component from a dictionary.
        :param data:
            Dictionary to deserialize from.
        :returns:
            Deserialized component.
        """
        deserialize_secrets_inplace(data["init_parameters"], keys=["api_key"])
        return default_from_dict(cls, data)

    def _json_to_document(self, data: dict) -> Document:
        """
        Convert a JSON response/record to a Document, depending on the reader mode.
        """
        if self.mode == JinaReaderMode.GROUND:
            content = data.pop("reason")
        else:
            content = data.pop("content")
        document = Document(content=content, meta=data)
        return document

    @component.output_types(documents=List[Document])
    def run(self, query: str, headers: Optional[Dict[str, str]] = None):
        """
        Process the query/URL using the Jina AI reader service.

        :param query: The query string or URL to process.
        :param headers: Optional headers to include in the request for customization. Refer to the
            [Jina Reader documentation](https://jina.ai/reader/) for more information.

        :returns:
            A dictionary with the following keys:
                - `documents`: A list of `Document` objects.
        """
        # copy the headers so the caller's dictionary is not mutated
        headers = dict(headers or {})
        headers["Authorization"] = f"Bearer {self.api_key.resolve_value()}"

        if self.json_response:
            headers["Accept"] = "application/json"

        endpoint_url = READER_ENDPOINT_URL_BY_MODE[self.mode]
        encoded_target = quote(query, safe="")
        url = f"{endpoint_url}{encoded_target}"

        response = requests.get(url, headers=headers, timeout=60)

        # raw response: we just return a single Document with text
        if not self.json_response:
            meta = {"content_type": response.headers["Content-Type"], "query": query}
            # response.text is the decoded body; response.content would be raw bytes
            return {"documents": [Document(content=response.text, meta=meta)]}

        response_json = json.loads(response.content).get("data", {})
        if self.mode == JinaReaderMode.SEARCH:
            documents = [self._json_to_document(record) for record in response_json]
            return {"documents": documents}

        return {"documents": [self._json_to_document(response_json)]}
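To make the request assembly in `run` easier to check in isolation, here is a minimal stdlib-only sketch of the URL and header construction. `build_request` and `ENDPOINTS` are hypothetical names introduced for illustration; the endpoints mirror `READER_ENDPOINT_URL_BY_MODE` but are keyed by plain strings, and the API key is passed as a plain string rather than a `Secret`:

```python
from typing import Dict, Optional, Tuple
from urllib.parse import quote

# Same endpoints as READER_ENDPOINT_URL_BY_MODE, keyed by plain strings (assumption for this sketch).
ENDPOINTS = {
    "read": "https://r.jina.ai/",
    "search": "https://s.jina.ai/",
    "ground": "https://g.jina.ai/",
}


def build_request(
    mode: str,
    query: str,
    api_key: str,
    json_response: bool = True,
    headers: Optional[Dict[str, str]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return the (url, headers) pair that would be handed to requests.get (hypothetical helper)."""
    merged = dict(headers or {})  # work on a copy of any caller-supplied headers
    merged["Authorization"] = f"Bearer {api_key}"
    if json_response:
        merged["Accept"] = "application/json"
    # safe="" also percent-encodes "/" and ":", so the whole query becomes one path segment
    return f"{ENDPOINTS[mode]}{quote(query, safe='')}", merged


url, request_headers = build_request("read", "https://example.com", api_key="dummy-key")
print(url)
print(request_headers)
```

Because `quote` is called with `safe=""`, a URL passed as the query is fully escaped before being appended to the endpoint, and a free-text search query has its spaces encoded as `%20`.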
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
from enum import Enum


class JinaReaderMode(Enum):
    """
    Enum representing modes for the Jina Reader.

    Modes:
        READ: Process a URL and return the textual content of the page.
        SEARCH: Search the web and return the textual content of the most relevant pages.
        GROUND: Call the grounding engine to perform fact checking.

    """

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self):
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "JinaReaderMode":
        """
        Create the reader mode from a string.

        :param string:
            String to convert.
        :returns:
            Reader mode.
        """
        enum_map = {e.value: e for e in JinaReaderMode}
        reader_mode = enum_map.get(string)
        if reader_mode is None:
            msg = f"Unknown reader mode '{string}'. Supported modes are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return reader_mode
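The string round trip that serialization relies on (`str(mode)` when saving, `from_str` when loading) can be exercised without Haystack. The sketch below uses a stand-in enum, `Mode`, which is a hypothetical copy of `JinaReaderMode` with the same three values, to show the round trip and the error raised for an unrecognised string:

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for JinaReaderMode with the same three values (illustration only)."""

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self) -> str:
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "Mode":
        enum_map = {e.value: e for e in cls}
        mode = enum_map.get(string)
        if mode is None:
            msg = f"Unknown reader mode '{string}'. Supported modes are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return mode


# Round trip: str() yields the stored value, from_str restores the same member.
assert Mode.from_str(str(Mode.SEARCH)) is Mode.SEARCH

# The lookup is exact: "READ" is not normalised to "read", so it is rejected.
error_message = ""
try:
    Mode.from_str("READ")
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Note that the lookup is case-sensitive as written, so callers that accept user input may want to lower-case the string first.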