community[minor]: Add Baichuan Text Embedding Model and Baichuan Inc introduction (#16568)
- **Description:** Adding Baichuan Text Embedding Model and Baichuan Inc introduction. Baichuan Text Embedding ranks #1 in C-MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
Co-authored-by: BaiChuanHelper <[email protected]>
1 parent
5b5115c
commit 70ff54e
Showing 7 changed files with 252 additions and 25 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Baichuan | ||
|
||
>[Baichuan Inc.](https://www.baichuan-ai.com/) is a Chinese startup in the era of AGI, dedicated to addressing fundamental human needs: Efficiency, Health, and Happiness. | ||
## Visit Us | ||
Visit us at https://www.baichuan-ai.com/. | ||
Register and get an API key if you are trying out our APIs. | ||
|
||
## Baichuan Chat Model | ||
An example is available at [example](/docs/integrations/chat/baichuan). | ||
|
||
## Baichuan Text Embedding Model | ||
An example is available at [example](/docs/integrations/text_embedding/baichuan). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Baichuan Text Embeddings\n", | ||
"\n", | ||
"As of today (Jan 25th, 2024) BaichuanTextEmbeddings ranks #1 in C-MTEB (Chinese Multi-Task Embedding Benchmark) leaderboard.\n", | ||
"\n", | ||
"Leaderboard (Under Overall -> Chinese section): https://huggingface.co/spaces/mteb/leaderboard\n", | ||
"\n", | ||
"Official Website: https://platform.baichuan-ai.com/docs/text-Embedding\n", | ||
"An API key is required to use this embedding model. You can get one by registering at https://platform.baichuan-ai.com/docs/text-Embedding.\n", | ||
"BaichuanTextEmbeddings supports a 512-token window and produces vectors with 1024 dimensions.\n", | ||
"\n", | ||
"Please NOTE that BaichuanTextEmbeddings only supports Chinese text embedding. Multi-language support is coming soon.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "plaintext" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain_community.embeddings import BaichuanTextEmbeddings\n", | ||
"\n", | ||
"# Place your Baichuan API-key here.\n", | ||
"embeddings = BaichuanTextEmbeddings(baichuan_api_key=\"sk-*\")\n", | ||
"\n", | ||
"text_1 = \"今天天气不错\"\n", | ||
"text_2 = \"今天阳光很好\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "plaintext" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"query_result = embeddings.embed_query(text_1)\n", | ||
"query_result" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "plaintext" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"doc_result = embeddings.embed_documents([text_1, text_2])\n", | ||
"doc_result" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
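The notebook stops at printing the raw vectors. As a minimal sketch of what one might do next, the two example sentences can be compared by cosine similarity. `cosine_similarity` and `compare` are hypothetical helpers, not part of LangChain, and the short vectors are stand-ins for the real 1024-dimensional output. The `None` check reflects the fact that `embed_documents` in this commit returns `None` when the request fails.

```python
import math
from typing import List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare(doc_result: Optional[List[List[float]]]) -> Optional[float]:
    """Similarity of the first two embeddings, or None if the call failed."""
    if doc_result is None or len(doc_result) < 2:
        return None
    return cosine_similarity(doc_result[0], doc_result[1])


# Stand-in vectors (assumption); real input would come from
# embeddings.embed_documents([text_1, text_2]).
fake_doc_result = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
print(compare(fake_doc_result))
print(compare(None))
```

A score close to 1 indicates that the two texts point in nearly the same direction in embedding space.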
113 changes: 113 additions & 0 deletions
libs/community/langchain_community/embeddings/baichuan.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
from typing import Any, Dict, List, Optional | ||
|
||
import requests | ||
from langchain_core.embeddings import Embeddings | ||
from langchain_core.pydantic_v1 import BaseModel, SecretStr, root_validator | ||
from langchain_core.utils import convert_to_secret_str, get_from_dict_or_env | ||
|
||
BAICHUAN_API_URL: str = "https://api.baichuan-ai.com/v1/embeddings" | ||
|
||
# BaichuanTextEmbeddings is an embedding model provided by Baichuan Inc. (https://www.baichuan-ai.com/home). | ||
# As of today (Jan 25th, 2024) BaichuanTextEmbeddings ranks #1 in C-MTEB | ||
# (Chinese Multi-Task Embedding Benchmark) leaderboard. | ||
# Leaderboard (Under Overall -> Chinese section): https://huggingface.co/spaces/mteb/leaderboard | ||
|
||
# Official Website: https://platform.baichuan-ai.com/docs/text-Embedding | ||
# An API key is required to use this embedding model. You can get one by registering | ||
# at https://platform.baichuan-ai.com/docs/text-Embedding. | ||
# BaichuanTextEmbeddings supports a 512-token window and produces vectors with | ||
# 1024 dimensions. | ||
|
||
|
||
# NOTE!! BaichuanTextEmbeddings only supports Chinese text embedding. | ||
# Multi-language support is coming soon. | ||
class BaichuanTextEmbeddings(BaseModel, Embeddings): | ||
    """Baichuan Text Embedding models.""" | ||
|
||
    session: Any  #: :meta private: | ||
    model_name: str = "Baichuan-Text-Embedding" | ||
    baichuan_api_key: Optional[SecretStr] = None | ||
|
||
    @root_validator(allow_reuse=True) | ||
    def validate_environment(cls, values: Dict) -> Dict: | ||
        """Validate that auth token exists in environment.""" | ||
        try: | ||
            baichuan_api_key = convert_to_secret_str( | ||
                get_from_dict_or_env(values, "baichuan_api_key", "BAICHUAN_API_KEY") | ||
            ) | ||
        except ValueError as original_exc: | ||
            # Fall back to the alternative field / environment variable name. | ||
            try: | ||
                baichuan_api_key = convert_to_secret_str( | ||
                    get_from_dict_or_env( | ||
                        values, "baichuan_auth_token", "BAICHUAN_AUTH_TOKEN" | ||
                    ) | ||
                ) | ||
            except ValueError: | ||
                raise original_exc | ||
        session = requests.Session() | ||
        session.headers.update( | ||
            { | ||
                "Authorization": f"Bearer {baichuan_api_key.get_secret_value()}", | ||
                "Accept-Encoding": "identity", | ||
                "Content-type": "application/json", | ||
            } | ||
        ) | ||
        values["session"] = session | ||
        return values | ||
|
||
    def _embed(self, texts: List[str]) -> Optional[List[List[float]]]: | ||
        """Internal method to call Baichuan Embedding API and return embeddings. | ||
        Args: | ||
            texts: A list of texts to embed. | ||
        Returns: | ||
            A list of lists of floats representing the embeddings, or None if an | ||
            error occurs. | ||
        """ | ||
        try: | ||
            response = self.session.post( | ||
                BAICHUAN_API_URL, json={"input": texts, "model": self.model_name} | ||
            ) | ||
            # Check if the response status code indicates success | ||
            if response.status_code == 200: | ||
                resp = response.json() | ||
                embeddings = resp.get("data", []) | ||
                # Sort resulting embeddings by index | ||
                sorted_embeddings = sorted(embeddings, key=lambda e: e.get("index", 0)) | ||
                # Return just the embeddings | ||
                return [result.get("embedding", []) for result in sorted_embeddings] | ||
            else: | ||
                # Log error or handle unsuccessful response appropriately | ||
                print( | ||
                    f"Error: Received status code {response.status_code} from " | ||
                    "embedding API" | ||
                ) | ||
                return None | ||
        except Exception as e: | ||
            # Log the exception or handle it as needed | ||
            print(f"Exception occurred while trying to get embeddings: {e}") | ||
            return None | ||
|
||
    def embed_documents(self, texts: List[str]) -> Optional[List[List[float]]]: | ||
        """Public method to get embeddings for a list of documents. | ||
        Args: | ||
            texts: The list of texts to embed. | ||
        Returns: | ||
            A list of embeddings, one for each text, or None if an error occurs. | ||
        """ | ||
        return self._embed(texts) | ||
|
||
    def embed_query(self, text: str) -> Optional[List[float]]: | ||
        """Public method to get embedding for a single query text. | ||
        Args: | ||
            text: The text to embed. | ||
        Returns: | ||
            Embeddings for the text, or None if an error occurs. | ||
        """ | ||
        result = self._embed([text]) | ||
        return result[0] if result is not None else None |
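The sort inside `_embed` ensures the returned vectors line up with the input texts even if the API lists them in a different order. The sketch below isolates that parsing step so it can be exercised without a network call. `parse_embeddings` is a hypothetical helper, and the payload shape (a `data` list of objects carrying `index` and `embedding`) is inferred from the code in this diff rather than taken from Baichuan's API documentation.

```python
from typing import Any, Dict, List


def parse_embeddings(resp: Dict[str, Any]) -> List[List[float]]:
    """Return the embeddings in `resp`, ordered by their `index` field."""
    items = resp.get("data", [])
    ordered = sorted(items, key=lambda e: e.get("index", 0))
    return [item.get("embedding", []) for item in ordered]


# Hand-written payload (assumption) with entries deliberately out of order.
sample = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}
result = parse_embeddings(sample)
print(result)
```

Keeping the parsing in a pure function like this would also make it straightforward to unit-test separately from the HTTP request.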
19 changes: 19 additions & 0 deletions
libs/community/tests/integration_tests/embeddings/test_baichuan.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
"""Test Baichuan Text Embedding.""" | ||
from langchain_community.embeddings.baichuan import BaichuanTextEmbeddings | ||
|
||
|
||
def test_baichuan_embedding_documents() -> None: | ||
    """Test Baichuan Text Embedding for documents.""" | ||
    documents = ["今天天气不错", "今天阳光灿烂"] | ||
    embedding = BaichuanTextEmbeddings() | ||
    output = embedding.embed_documents(documents) | ||
    assert len(output) == 2 | ||
    assert len(output[0]) == 1024 | ||
|
||
|
||
def test_baichuan_embedding_query() -> None: | ||
    """Test Baichuan Text Embedding for query.""" | ||
    document = "所有的小学生都会学过只因兔同笼问题。" | ||
    embedding = BaichuanTextEmbeddings() | ||
    output = embedding.embed_query(document) | ||
    assert len(output) == 1024 |
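These integration tests construct `BaichuanTextEmbeddings()` with no arguments, so they rely on the key being present in the environment. The following is a rough sketch of the lookup order that `validate_environment` appears to follow: the `baichuan_api_key` argument, then `BAICHUAN_API_KEY`, then the `baichuan_auth_token` / `BAICHUAN_AUTH_TOKEN` pair. `resolve_api_key` is a hypothetical stand-in written for illustration; it does not call LangChain's `get_from_dict_or_env`, and its error message is invented.

```python
from typing import Mapping, Optional


def resolve_api_key(
    values: Mapping[str, Optional[str]], env: Mapping[str, str]
) -> str:
    """Return the first non-empty key found, in the validator's order."""
    lookups = [
        ("baichuan_api_key", "BAICHUAN_API_KEY"),
        ("baichuan_auth_token", "BAICHUAN_AUTH_TOKEN"),
    ]
    for field, env_var in lookups:
        # An explicit constructor argument wins over the environment variable.
        candidate = values.get(field) or env.get(env_var)
        if candidate:
            return candidate
    raise ValueError(
        "Did not find baichuan_api_key; pass it as a named parameter "
        "or set the BAICHUAN_API_KEY environment variable."
    )


# Placeholder key strings; in real use `env` would be os.environ.
print(resolve_api_key({"baichuan_api_key": "sk-explicit"}, {"BAICHUAN_API_KEY": "sk-env"}))
print(resolve_api_key({}, {"BAICHUAN_AUTH_TOKEN": "sk-token"}))
```

Under this reading, exporting `BAICHUAN_API_KEY` before running the tests should be enough for both test functions to authenticate.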