Pathway vectorstore and rag-pathway template #14859
--------- Co-authored-by: mlewandowski <[email protected]> Co-authored-by: Berke <[email protected]> Co-authored-by: Jan Chorowski <[email protected]> Co-authored-by: Adrian Kosowski <[email protected]>
Let's not import this into langchain; langchain should remain unchanged.
Only langchain-community should be updated, and we should import directly from there.
from typing import Callable, List, Optional

import pathway as pw
This should be a conditional import.
Sorry about the omission, all should be optional now.
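For readers unfamiliar with the pattern requested in the review, here is a minimal sketch of a conditional (lazy) import. The helper `import_optional` and the class `LazyClient` are hypothetical names used only for illustration; they are not the code that was merged in this PR.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency only when it is actually needed.

    Hypothetical helper: raises an ImportError with an install hint
    if the package is missing.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import `{module_name}`. "
            f"Install it with `pip install {pip_name}`."
        ) from exc


class LazyClient:
    """Hypothetical class: importing this module never requires `pathway`."""

    def __init__(self) -> None:
        # The optional package is resolved here, at construction time,
        # rather than at module import time.
        self._pw = import_optional("pathway", "pathway")
```

With this shape, users who never construct the client are not forced to install the optional package, and users who do get an actionable error message.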
Change to a conditional import. --------- Co-authored-by: mlewandowski <[email protected]>
Fix documentation markdown formatting. --------- Co-authored-by: mlewandowski <[email protected]>
It was done as follows:
1. fetch fresh langchain master
2. `poetry add --optional pathway@latest --python ">=3.10"`
3. `poetry lock --no-update`
@hwchase17 we have fixed poetry lock and used type annotations suitable for Py3.8, can you re-trigger the CI run?
@efriis I tried to fix the formatting, now CI should be clean.
@efriis we have simplified the PR, leaving only the client and changing the instruction for a quick start using a publicly available server, then pointing to instructions on how to run it. The template is also removed, and we have removed the Please review!
@efriis I fixed linters
@efriis please trigger CI, we resolved a merge conflict
Head branch was pushed to by a user without write access
@efriis @baskaryan sorry to bother you, I re-merged master again and reran linters. On my end locally
- **Description:** Integration with the pathway.com data processing pipeline acting as an always-updated vectorstore
- **Issue:** not applicable
- **Dependencies:** optional dependency on [`pathway`](https://pypi.org/project/pathway/)
- **Twitter handle:** pathway_com

The PR provides an integration with `pathway` to offer an easy-to-use, always-updated vector store:

```python
import os

import pathway as pw
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import PathwayVectorClient, PathwayVectorServer

# Source connectors the server will keep scanning for documents.
data_sources = []
data_sources.append(
    pw.io.gdrive.read(
        object_id="17H4YpBOAKQzEJ93xmC2z170l0bP2npMy",
        service_user_credentials_file="credentials.json",
        with_metadata=True,
    )
)

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
embeddings_model = OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"])

# Build the indexing pipeline and serve it in a background thread.
vector_server = PathwayVectorServer(
    *data_sources,
    embedder=embeddings_model,
    splitter=text_splitter,
)
vector_server.run_server(host="127.0.0.1", port="8765", threaded=True, with_cache=False)

# Query the running server through the VectorStore interface.
client = PathwayVectorClient(
    host="127.0.0.1",
    port="8765",
)
query = "What is Pathway?"
docs = client.similarity_search(query)
```

The `PathwayVectorServer` builds a data processing pipeline which continuously scans documents in a given source connector (Google Drive, S3, ...) and builds a vector store. The `PathwayVectorClient` implements LangChain's `VectorStore` interface and connects to the server to retrieve documents.
--------- Co-authored-by: Mateusz Lewandowski <[email protected]> Co-authored-by: mlewandowski <[email protected]> Co-authored-by: Berke <[email protected]> Co-authored-by: Adrian Kosowski <[email protected]> Co-authored-by: mlewandowski <[email protected]> Co-authored-by: berkecanrizai <[email protected]> Co-authored-by: Erick Friis <[email protected]> Co-authored-by: Harrison Chase <[email protected]> Co-authored-by: Bagatur <[email protected]> Co-authored-by: mlewandowski <[email protected]> Co-authored-by: Szymon Dudycz <[email protected]> Co-authored-by: Szymon Dudycz <[email protected]> Co-authored-by: Bagatur <[email protected]>