Merge branch 'main' into remove-error-pypi-docs
Amnah199 authored Nov 22, 2024
2 parents 9a87e8d + 6db7399 commit 864e1e8
Showing 8 changed files with 431 additions and 9 deletions.
56 changes: 48 additions & 8 deletions integrations/jina/CHANGELOG.md
@@ -1,17 +1,49 @@
# Changelog

## [integrations/jina-v0.5.0] - 2024-11-21

### 🚀 Features

- Add `JinaReaderConnector` (#1150)

### 📚 Documentation

- Update docstrings of JinaDocumentEmbedder and JinaTextEmbedder (#1092)

### ⚙️ CI

- Adopt uv as installer (#1142)

### 🧹 Chores

- Update ruff linting scripts and settings (#1105)


## [integrations/jina-v0.4.0] - 2024-09-18

### 🧪 Testing

- Do not retry tests in `hatch run test` command (#954)

### ⚙️ Miscellaneous Tasks
### ⚙️ CI

- Retry tests to reduce flakiness (#836)

### 🧹 Chores

- Update ruff invocation to include check parameter (#853)
- Update Jina Embedder usage for V3 release (#1077)

### 🌀 Miscellaneous

- Remove references to Python 3.7 (#601)
- Jina - add missing ranker to API reference (#610)
- Jina ranker: fix wrong URL in docstring (#628)
- Chore: add license classifiers (#680)
- Chore: change the pydoc renderer class (#718)
- Ci: install `pytest-rerunfailures` where needed; add retry config to `test-cov` script (#845)
- Chore: Jina - ruff update, don't ruff tests (#982)

## [integrations/jina-v0.3.0] - 2024-03-19

### 🚀 Features
@@ -22,13 +54,17 @@

- Fix order of API docs (#447)

This PR will also push the docs to Readme

### 📚 Documentation

- Update category slug (#442)
- Disable-class-def (#556)

### 🌀 Miscellaneous

- Jina - remove dead code (#422)
- Jina - review docstrings (#504)
- Make tests show coverage (#566)

## [integrations/jina-v0.2.0] - 2024-02-14

### 🚀 Features
@@ -39,26 +75,30 @@ This PR will also push the docs to Readme

- Update paths and titles (#397)

### Jina
### 🌀 Miscellaneous

- Update secrets management (#411)

## [integrations/jina-v0.1.0] - 2024-01-22

### 🐛 Bug Fixes

- Fix project urls (#96)


- Fix project URLs (#96)

### 🚜 Refactor

- Use `hatch_vcs` to manage integrations versioning (#103)

### ⚙️ Miscellaneous Tasks
### 🧹 Chores

- [**breaking**] Rename model_name to model in the Jina integration (#230)

### 🌀 Miscellaneous

- Change metadata to meta (#152)
- Optimize API key reading (#162)
- Refact!:change import paths (#254)

## [integrations/jina-v0.0.1] - 2023-12-11

### 🚀 Features
47 changes: 47 additions & 0 deletions integrations/jina/examples/jina_reader_connector.py
@@ -0,0 +1,47 @@
# to make use of the JinaReaderConnector, we first need to install the Haystack integration
# pip install jina-haystack

# then we must set the JINA_API_KEY environment variable
# export JINA_API_KEY=<your-api-key>


from haystack_integrations.components.connectors.jina import JinaReaderConnector

# we can use the JinaReaderConnector to process a URL and return the textual content of the page
reader = JinaReaderConnector(mode="read")
query = "https://example.com"
result = reader.run(query=query)

print(result)
# {'documents': [Document(id=fa3e51e4ca91828086dca4f359b6e1ea2881e358f83b41b53c84616cb0b2f7cf,
# content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...',
# meta: {'title': 'Example Domain', 'description': '', 'url': 'https://example.com/', 'usage': {'tokens': 42}})]}


# we can perform a web search by setting the mode to "search"
reader = JinaReaderConnector(mode="search")
query = "UEFA Champions League 2024"
result = reader.run(query=query)

print(result)
# {'documents': [Document(id=6a71abf9955594232037321a476d39a835c0cb7bc575d886ee0087c973c95940,
# content: '2024/25 UEFA Champions League: Matches, draw, final, key dates | UEFA Champions League | UEFA.com...',
# meta: {'title': '2024/25 UEFA Champions League: Matches, draw, final, key dates',
# 'description': 'What are the match dates? Where is the 2025 final? How will the competition work?',
# 'url': 'https://www.uefa.com/uefachampionsleague/news/...',
# 'usage': {'tokens': 5581}}), ...]}


# finally, we can perform fact-checking by setting the mode to "ground" (experimental)
reader = JinaReaderConnector(mode="ground")
query = "ChatGPT was launched in 2017"
result = reader.run(query=query)

print(result)
# {'documents': [Document(id=f0c964dbc1ebb2d6584c8032b657150b9aa6e421f714cc1b9f8093a159127f0c,
# content: 'The statement that ChatGPT was launched in 2017 is incorrect. Multiple references confirm that ChatG...',
# meta: {'factuality': 0, 'result': False, 'references': [
# {'url': 'https://en.wikipedia.org/wiki/ChatGPT',
# 'keyQuote': 'ChatGPT is a generative artificial intelligence (AI) chatbot developed by OpenAI and launched in 2022.',
# 'isSupportive': False}, ...],
# 'usage': {'tokens': 10188}})]}
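
The `ground` output above keeps its verdict in `meta` (`factuality`, `result`, `references`). As a rough sketch, a small helper could condense that metadata into one line. `summarize_grounding` is a hypothetical name, not part of the integration, and the sample dictionary only mirrors the fields printed above.

```python
from typing import Any, Dict


def summarize_grounding(meta: Dict[str, Any]) -> str:
    """Condense ground-mode metadata into a one-line verdict (hypothetical helper)."""
    references = meta.get("references", [])
    # Count the references the grounding engine marked as supporting the statement.
    supportive = sum(1 for ref in references if ref.get("isSupportive"))
    verdict = "supported" if meta.get("result") else "not supported"
    return (
        f"{verdict} (factuality={meta.get('factuality')}, "
        f"{supportive}/{len(references)} supportive references)"
    )


# Sample metadata shaped like the ground-mode output shown above.
sample_meta = {
    "factuality": 0,
    "result": False,
    "references": [
        {
            "url": "https://en.wikipedia.org/wiki/ChatGPT",
            "keyQuote": "ChatGPT is a generative artificial intelligence (AI) chatbot "
            "developed by OpenAI and launched in 2022.",
            "isSupportive": False,
        }
    ],
    "usage": {"tokens": 10188},
}

print(summarize_grounding(sample_meta))
```

With a real result, the same call would presumably take `result["documents"][0].meta` as its argument.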
1 change: 1 addition & 0 deletions integrations/jina/pydoc/config.yml
@@ -6,6 +6,7 @@ loaders:
"haystack_integrations.components.embedders.jina.document_embedder",
"haystack_integrations.components.embedders.jina.text_embedder",
"haystack_integrations.components.rankers.jina.ranker",
"haystack_integrations.components.connectors.jina.reader",
]
ignore_when_discovered: ["__init__"]
processors:
7 changes: 6 additions & 1 deletion integrations/jina/pyproject.toml
@@ -132,18 +132,23 @@ ban-relative-imports = "parents"
[tool.ruff.lint.per-file-ignores]
# Tests can use magic values, assertions, and relative imports
"tests/**/*" = ["PLR2004", "S101", "TID252"]
# examples can contain "print" commands
"examples/**/*" = ["T201"]

[tool.coverage.run]
source = ["haystack_integrations"]
branch = true
parallel = false


[tool.coverage.report]
omit = ["*/tests/*", "*/__init__.py"]
show_missing = true
exclude_lines = ["no cov", "if __name__ == .__main__.:", "if TYPE_CHECKING:"]

[tool.pytest.ini_options]
minversion = "6.0"
markers = ["unit: unit tests", "integration: integration tests"]

[[tool.mypy.overrides]]
module = ["haystack.*", "haystack_integrations.*", "pytest.*"]
ignore_missing_imports = true
@@ -0,0 +1,7 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
from .reader import JinaReaderConnector
from .reader_mode import JinaReaderMode

__all__ = ["JinaReaderConnector", "JinaReaderMode"]
@@ -0,0 +1,141 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0

import json
from typing import Any, Dict, List, Optional, Union
from urllib.parse import quote

import requests
from haystack import Document, component, default_from_dict, default_to_dict
from haystack.utils import Secret, deserialize_secrets_inplace

from .reader_mode import JinaReaderMode

READER_ENDPOINT_URL_BY_MODE = {
    JinaReaderMode.READ: "https://r.jina.ai/",
    JinaReaderMode.SEARCH: "https://s.jina.ai/",
    JinaReaderMode.GROUND: "https://g.jina.ai/",
}


@component
class JinaReaderConnector:
    """
    A component that interacts with Jina AI's reader service to process queries and return documents.

    This component supports different modes of operation: `read`, `search`, and `ground`.

    Usage example:
    ```python
    from haystack_integrations.components.connectors.jina import JinaReaderConnector

    reader = JinaReaderConnector(mode="read")
    query = "https://example.com"
    result = reader.run(query=query)
    document = result["documents"][0]
    print(document.content)

    >>> "This domain is for use in illustrative examples..."
    ```
    """

    def __init__(
        self,
        mode: Union[JinaReaderMode, str],
        api_key: Secret = Secret.from_env_var("JINA_API_KEY"),  # noqa: B008
        json_response: bool = True,
    ):
        """
        Initialize a JinaReaderConnector instance.

        :param mode: The operation mode for the reader (`read`, `search` or `ground`).
            - `read`: process a URL and return the textual content of the page.
            - `search`: search the web and return textual content of the most relevant pages.
            - `ground`: call the grounding engine to perform fact checking.
            For more information on the modes, see the [Jina Reader documentation](https://jina.ai/reader/).
        :param api_key: The Jina API key. It can be explicitly provided or automatically read from the
            environment variable JINA_API_KEY (recommended).
        :param json_response: Controls the response format from the Jina Reader API.
            If `True`, requests a JSON response, resulting in Documents with rich structured metadata.
            If `False`, requests a raw response, resulting in one Document with minimal metadata.
        """
        self.api_key = api_key
        self.json_response = json_response

        if isinstance(mode, str):
            mode = JinaReaderMode.from_str(mode)
        self.mode = mode

    def to_dict(self) -> Dict[str, Any]:
        """
        Serializes the component to a dictionary.

        :returns:
            Dictionary with serialized data.
        """
        return default_to_dict(
            self,
            api_key=self.api_key.to_dict(),
            mode=str(self.mode),
            json_response=self.json_response,
        )

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "JinaReaderConnector":
        """
        Deserializes the component from a dictionary.

        :param data:
            Dictionary to deserialize from.
        :returns:
            Deserialized component.
        """
        deserialize_secrets_inplace(data["init_parameters"], keys=["api_key"])
        return default_from_dict(cls, data)

    def _json_to_document(self, data: dict) -> Document:
        """
        Convert a JSON response/record to a Document, depending on the reader mode.
        """
        if self.mode == JinaReaderMode.GROUND:
            content = data.pop("reason")
        else:
            content = data.pop("content")
        document = Document(content=content, meta=data)
        return document

    @component.output_types(documents=List[Document])
    def run(self, query: str, headers: Optional[Dict[str, str]] = None):
        """
        Process the query/URL using the Jina AI reader service.

        :param query: The query string or URL to process.
        :param headers: Optional headers to include in the request for customization. Refer to the
            [Jina Reader documentation](https://jina.ai/reader/) for more information.
        :returns:
            A dictionary with the following keys:
            - `documents`: A list of `Document` objects.
        """
        # copy the headers so the caller's dictionary is not mutated
        headers = dict(headers or {})
        headers["Authorization"] = f"Bearer {self.api_key.resolve_value()}"

        if self.json_response:
            headers["Accept"] = "application/json"

        endpoint_url = READER_ENDPOINT_URL_BY_MODE[self.mode]
        encoded_target = quote(query, safe="")
        url = f"{endpoint_url}{encoded_target}"

        response = requests.get(url, headers=headers, timeout=60)

        # raw response: we just return a single Document with the decoded text
        if not self.json_response:
            meta = {"content_type": response.headers["Content-Type"], "query": query}
            return {"documents": [Document(content=response.text, meta=meta)]}

        response_json = json.loads(response.content).get("data", {})
        # search mode returns a list of records; read and ground return a single record
        if self.mode == JinaReaderMode.SEARCH:
            documents = [self._json_to_document(record) for record in response_json]
            return {"documents": documents}

        return {"documents": [self._json_to_document(response_json)]}
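
The request and response handling in `run` can be tried without network access or Haystack. The following is a minimal sketch under stated assumptions: `build_request` and `split_records` are hypothetical helper names, modes are plain strings rather than `JinaReaderMode` members, and the payloads are made-up samples with the `data` envelope the code above expects.

```python
from typing import Any, Dict, List, Optional, Tuple
from urllib.parse import quote

# Endpoints copied from READER_ENDPOINT_URL_BY_MODE, keyed by plain strings here.
ENDPOINTS = {
    "read": "https://r.jina.ai/",
    "search": "https://s.jina.ai/",
    "ground": "https://g.jina.ai/",
}


def build_request(
    mode: str,
    query: str,
    api_key: str,
    headers: Optional[Dict[str, str]] = None,
    json_response: bool = True,
) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for a request (hypothetical helper)."""
    request_headers = dict(headers or {})  # copy, so the caller's dict is untouched
    request_headers["Authorization"] = f"Bearer {api_key}"
    if json_response:
        request_headers["Accept"] = "application/json"
    # safe="" also percent-encodes "/" and ":", so a URL query stays one path segment
    return f"{ENDPOINTS[mode]}{quote(query, safe='')}", request_headers


def split_records(mode: str, payload: Dict[str, Any]) -> List[Tuple[str, Dict[str, Any]]]:
    """Turn a decoded JSON payload into (content, meta) pairs (hypothetical helper)."""
    data = payload.get("data", {})
    # search mode yields a list of records; read and ground yield a single record
    records = data if mode == "search" else [data]
    content_key = "reason" if mode == "ground" else "content"
    pairs = []
    for record in records:
        meta = dict(record)  # copy before popping, to leave the payload intact
        pairs.append((meta.pop(content_key), meta))
    return pairs


url, request_headers = build_request("read", "https://example.com", api_key="dummy-key")
print(url)

sample_payload = {"data": [{"content": "a", "title": "A"}, {"content": "b", "title": "B"}]}
pairs = split_records("search", sample_payload)
print(pairs)
```

Copying the record before `pop` is a design choice of this sketch; `_json_to_document` pops from the decoded record directly, which is harmless there because the record is not reused.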
@@ -0,0 +1,40 @@
# SPDX-FileCopyrightText: 2023-present deepset GmbH <info@deepset.ai>
#
# SPDX-License-Identifier: Apache-2.0
from enum import Enum


class JinaReaderMode(Enum):
    """
    Enum representing modes for the Jina Reader.

    Modes:
        READ: Process a URL and return the textual content of the page.
        SEARCH: Search the web and return the textual content of the most relevant pages.
        GROUND: Call the grounding engine to perform fact checking.
    """

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self):
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "JinaReaderMode":
        """
        Create the reader mode from a string.

        :param string:
            String to convert.
        :returns:
            Reader mode.
        """
        enum_map = {e.value: e for e in JinaReaderMode}
        reader_mode = enum_map.get(string)
        if reader_mode is None:
            msg = f"Unknown reader mode '{string}'. Supported modes are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return reader_mode
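
The string round trip matters because `to_dict` stores `str(self.mode)` and `__init__` converts strings back through `from_str`. A self-contained sketch of that behaviour follows; `Mode` is a stand-in re-declared here so the snippet runs without the integration installed, and `"scan"` is just an arbitrary unsupported value.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for JinaReaderMode, re-declared so the snippet runs on its own."""

    READ = "read"
    SEARCH = "search"
    GROUND = "ground"

    def __str__(self) -> str:
        return self.value

    @classmethod
    def from_str(cls, string: str) -> "Mode":
        enum_map = {e.value: e for e in cls}
        mode = enum_map.get(string)
        if mode is None:
            msg = f"Unknown reader mode '{string}'. Supported modes are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return mode


# str() yields the raw value, and from_str() maps it back to the same member.
round_tripped = Mode.from_str(str(Mode.SEARCH))
print(round_tripped is Mode.SEARCH)

# An unsupported string raises ValueError listing the accepted values.
try:
    Mode.from_str("scan")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```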