diff --git a/docs/docs/community/integrations/vector_stores.md b/docs/docs/community/integrations/vector_stores.md
index 5de5ca4a94749..a6a375f0b749b 100644
--- a/docs/docs/community/integrations/vector_stores.md
+++ b/docs/docs/community/integrations/vector_stores.md
@@ -39,6 +39,7 @@ as the storage backend for `VectorStoreIndex`.
- TiDB (`TiDBVectorStore`). [Quickstart](../../examples/vector_stores/TiDBVector.ipynb). [Installation](https://tidb.cloud/ai). [Python Client](https://github.com/pingcap/tidb-vector-python).
- TimeScale (`TimescaleVectorStore`). [Installation](https://github.com/timescale/python-vector).
- Upstash (`UpstashVectorStore`). [Quickstart](https://upstash.com/docs/vector/overall/getstarted)
+- Vertex AI Vector Search (`VertexAIVectorStore`). [Quickstart](https://cloud.google.com/vertex-ai/docs/vector-search/quickstart)
- Weaviate (`WeaviateVectorStore`). [Installation](https://weaviate.io/developers/weaviate/installation). [Python Client](https://weaviate.io/developers/weaviate/client-libraries/python).
- Zep (`ZepVectorStore`). [Installation](https://docs.getzep.com/deployment/quickstart/). [Python Client](https://docs.getzep.com/sdk/).
- Zilliz (`MilvusVectorStore`). [Quickstart](https://zilliz.com/doc/quick_start)
@@ -656,6 +657,19 @@ from llama_index.vector_stores.upstash import UpstashVectorStore
vector_store = UpstashVectorStore(url="YOUR_URL", token="YOUR_TOKEN")
```
+**Vertex AI Vector Search**
+
+```python
+from llama_index.vector_stores.vertexaivectorsearch import VertexAIVectorStore
+
+vector_store = VertexAIVectorStore(
+ project_id="[your-google-cloud-project-id]",
+ region="[your-google-cloud-region]",
+ index_id="[your-index-resource-name]",
+ endpoint_id="[your-index-endpoint-name]",
+)
+```
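+
+As a minimal follow-up sketch (assuming `documents` have already been loaded with a reader), the store can be plugged into an index via a `StorageContext`:
+
+```python
+from llama_index.core import StorageContext, VectorStoreIndex
+
+# "documents" is assumed to be a list of Document objects loaded elsewhere
+storage_context = StorageContext.from_defaults(vector_store=vector_store)
+index = VectorStoreIndex.from_documents(
+    documents, storage_context=storage_context
+)
+```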
+
**Weaviate**
```python
@@ -846,6 +860,7 @@ documents = reader.load_data(
- [Lantern](../../examples/vector_stores/LanternIndexDemo.ipynb)
- [Metal](../../examples/vector_stores/MetalIndexDemo.ipynb)
- [Milvus](../../examples/vector_stores/MilvusIndexDemo.ipynb)
+- [Milvus Hybrid Search](../../examples/vector_stores/MilvusHybridIndexDemo.ipynb)
- [MyScale](../../examples/vector_stores/MyScaleIndexDemo.ipynb)
- [Elasticsearch](../../examples/vector_stores/ElasticsearchIndexDemo.ipynb)
- [FAISS](../../examples/vector_stores/FaissIndexDemo.ipynb)
diff --git a/docs/docs/examples/data_connectors/WebPageDemo.ipynb b/docs/docs/examples/data_connectors/WebPageDemo.ipynb
index 0b1f036317aa2..dce71d477e4df 100644
--- a/docs/docs/examples/data_connectors/WebPageDemo.ipynb
+++ b/docs/docs/examples/data_connectors/WebPageDemo.ipynb
@@ -130,6 +130,91 @@
"display(Markdown(f\"{response}\"))"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "6e7b0a56",
+ "metadata": {},
+ "source": [
+ "# Using Spider Reader 🕷\n",
+ "[Spider](https://spider.cloud/?ref=llama_index) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.\n",
+ "\n",
+ "Spider allows you to use high performance proxies to prevent detection, caches AI actions, webhooks for crawling status, scheduled crawls etc... \n",
+ "\n",
+ "**Prerequisites:** you need to have a Spider api key to use this loader. You can get one on [spider.cloud](https://spider.cloud)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bdfb59f7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[Document(id_='54a6ecf3-b33e-41e9-8cec-48657aa2ed9b', embedding=None, metadata={'description': 'Collect data rapidly from any website. Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 101750, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider - Fastest Web Crawler[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 5secs ###To crawl 200 pages### 21x ###Faster than FireCrawl### 150x ###Faster than Apify Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best crawler for LLMs ###Don\\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labeling[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Scrape single URL\n",
+ "from llama_index.readers.web import SpiderWebReader\n",
+ "\n",
+ "spider_reader = SpiderWebReader(\n",
+ " api_key=\"YOUR_API_KEY\", # Get one at https://spider.cloud\n",
+ " mode=\"scrape\",\n",
+ " # params={} # Optional parameters see more on https://spider.cloud/docs/api\n",
+ ")\n",
+ "\n",
+ "documents = spider_reader.load_data(url=\"https://spider.cloud\")\n",
+ "print(documents)"
+ ]
+ },
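+ {
+ "cell_type": "markdown",
+ "id": "spider-index-note",
+ "metadata": {},
+ "source": [
+ "As a minimal follow-up sketch (assuming an embedding model and LLM are configured, e.g. via `OPENAI_API_KEY`), the loaded documents can be indexed and queried like any other reader output:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "spider-index-demo",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core import VectorStoreIndex\n",
+ "\n",
+ "# Build an in-memory index over the scraped documents and run a query\n",
+ "# (assumes default embedding model / LLM settings, e.g. OpenAI)\n",
+ "index = VectorStoreIndex.from_documents(documents)\n",
+ "query_engine = index.as_query_engine()\n",
+ "response = query_engine.query(\"What does Spider offer for LLM workloads?\")\n",
+ "print(response)"
+ ]
+ },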
+ {
+ "cell_type": "markdown",
+ "id": "780b794e",
+ "metadata": {},
+ "source": [
+ "Crawl domain following all deeper subpages"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "80c10c79",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[Document(id_='63f7ccbf-c6c8-4f69-80f7-f6763f761a39', embedding=None, metadata={'description': 'Our privacy policy and how it plays a part in the data collected.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 26647, 'keywords': None, 'pathname': '/privacy', 'resource_type': 'html', 'title': 'Privacy', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/privacy.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=\"Privacy[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Privacy Policy==========Learn about how we take privacy with the Spider project.[Spider](https://spider.cloud) offers a cutting-edge data scraping service with powerful AI capabilities. Our data collecting platform is designed to help users maximize the benefits of data collection while embracing the advancements in AI technology. With our innovative tools, we provide a seamless and fast interactive experience. This privacy policy details Spider's approach to product development, deployment, and usage, encompassing the Crawler, AI products, and features.[AI Development at Spider----------](#ai-development-at-spider)Spider leverages a robust combination of proprietary code, open-source frameworks, and synthetic datasets to train its cutting-edge products. To continuously improve our offerings, Spider may utilize inputs from user-generated prompts and content, obtained from trusted third-party providers. By harnessing this diverse data, Spider can deliver highly precise and pertinent recommendations to our valued users. While the foundational data crawling aspect of Spider is openly available on Github, the dashboard and AI components remain closed source. Spider respects all robots.txt files declared on websites allowing data to be extracted without harming the website.[Security, Privacy, and Trust----------](#security-privacy-and-trust)At Spider, our utmost priority is the development and implementation of Crawlers, AI technologies, and products that adhere to ethical, moral, and legal standards. We are dedicated to creating a secure and respectful environment for all users. Safeguarding user data and ensuring transparency in its usage are core principles we uphold. In line with this commitment, we provide the following important disclosures when utilizing our AI-related products:* Spider ensures comprehensive disclosure of features that utilize third-party AI platforms. To provide clarity, these integrations will be clearly indicated through distinct markers, designations, explanatory notes that appear when hovering, references to the underlying codebase, or any other suitable form of notification as determined by the system. Our commitment to transparency aims to keep users informed about the involvement of third-party AI platforms in our products.* We collect and use personal data as set forth in our [Privacy Policy](https://spider.cloud/privacy) which governs the collection and usage of personal data. If you choose to input personal data into our AI products, please be aware that such information may be processed through third-party AI providers. For any inquiries or concerns regarding data privacy, feel free to reach out to us at [Spider Help Github](https://github.com/orgs/spider-rs/discussions). 
We are here to assist you.* Except for user-generated prompts and/or content as inputs, Spider does not use customer data, including the code related to the use of Spider's deployment services, to train or finetune any models used.* We periodically review and update our policies and procedures in an effort to comply with applicable data protection regulations and industry standards.* We use reasonable measures designed to maintain the safety of users and avoid harm to people and the environment. Spider's design and development process includes considerations for ethical, security, and regulatory requirements with certain safeguards to prevent and report misuse or abuse.[Third-Party Service Providers----------](#third-party-service-providers)In providing AI products and services, we leverage various third-party providers in the AI space to enhance our services and capabilities, and will continue to do so for certain product features.This page will be updated from time to time with information about Spider's use of AI. The current list of third-party AI providers integrated into Spider is as follows:* [Anthropic](https://console.anthropic.com/legal/terms)* [Azure Cognitive Services](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/data-privacy)* [Cohere](https://cohere.com/terms-of-use)* [ElevenLabs](https://elevenlabs.io/terms)* [Hugging Face](https://huggingface.co/terms-of-service)* [Meta AI](https://www.facebook.com/policies_center/)* [OpenAI](https://openai.com/policies)* [Pinecone](https://www.pinecone.io/terms)* [Replicate](https://replicate.com/terms)We prioritize the safety of our users and take appropriate measures to avoid harm both to individuals and the environment. Our design and development processes incorporate considerations for ethical practices, security protocols, and regulatory requirements, along with established safeguards to prevent and report any instances of misuse or abuse. We are committed to maintaining a secure and respectful environment and upholding responsible practices throughout our services.[Acceptable Use----------](#acceptable-use)Spider's products are intended to provide helpful and respectful responses to user prompts and queries while collecting data along the web. We don't allow the use of our Scraper or AI tools, products and services for the following usages:* Denial of Service Attacks* Illegal activity* Inauthentic, deceptive, or impersonation behavior* Any other use that would violate Spider's standard published policies, codes of conduct, or terms of service.Any violation of this Spider AI Policy or any Spider policies or terms of service may result in termination of use of services at Spider's sole discretion. We will review and update this Spider AI Policy so that it remains relevant and effective. 
If you have feedback or would like to report any concerns or issues related to the use of AI systems, please reach out to [support@spider.cloud](mailto:support@spider.cloud).[More Information----------](#more-information)To learn more about Spider's integration of AI capabilities into products and features, check out the following resources:* [Spider-Rust](https://github.com/spider-rs)* [Spider](/)* [About](/)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='18e4d35d-ff48-4d00-b924-abab7a06fbec', embedding=None, metadata={'description': 'Learn how to crawl and scrape websites with the fastest web crawler built for the job.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 27058, 'keywords': None, 'pathname': '/guides', 'resource_type': 'html', 'title': 'Spider Guides', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider Guides[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Spider Guides==========Learn how to crawl and scrape websites easily.(4) Total Guides* [ Spider v1 Logo Spider Platform ---------- How to use the platform to collect data from the internet fast, affordable, and unblockable. ](/guides/spider)* [ Spider v1 Logo Spider API ---------- How to use the Spider API to curate data from any source blazing fast. The most advanced crawler that handles all workloads of all sizes. ](/guides/spider-api)* [ Spider v1 Logo Extract Contacts ---------- Get contact information from any website in real time with AI. The only way to accurately get dynamic information from websites. ](/guides/pipelines-extract-contacts)* [ Spider v1 Logo Website Archiving ---------- The programmable time machine that can store pages and all assets for easy website archiving. ](/guides/website-archiving)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='b10c6402-bc35-4fec-b97c-fa30bde54ce8', embedding=None, metadata={'description': 'Complete reference documentation for the Spider API. Includes code snippets and examples for quickly getting started with the system.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 195426, 'keywords': None, 'pathname': '/docs/api', 'resource_type': 'html', 'title': 'Spider API Reference', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/docs*_*api.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider API Reference[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)API Reference==========The Spider API is based on REST. 
Our API is predictable, returns [JSON-encoded](http://www.json.org/) responses, uses standard HTTP response codes, authentication, and verbs. Set your API secret key in the `authorization` header to commence. You can use the `content-type` header with `application/json`, `application/xml`, `text/csv`, and `application/jsonl` for shaping the response.The Spider API supports multi domain actions. You can work with multiple domains per request by adding the urls comma separated.The Spider API differs for every account as we release new versions and tailor functionality. You can add `v1` before any path to pin to the version.Just getting started?----------Check out our [development quickstart](/guides/spider-api) guide.Not a developer?----------Use Spiders [no-code options or apps](/guides/spider) to get started with Spider and to do more with your Spider account no code required.Base UrlJSONCopy```https://api.spider.cloud```Crawl websites==========Start crawling a website(s) to collect resources.POST https://api.spider.cloud/crawlRequest body* url\\xa0required\\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\\\_agent\\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\\\_data\\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. 
Set Example* gpt\\\\_config\\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps. Set Example* fingerprint\\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\\\_format\\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\\\_enabled\\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\\\_selector\\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\\\_resources\\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\\\_timeout\\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\\\_in\\\\_background\\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { \"content\": \"...\", \"error\": null, \"status\": 200, \"url\": \"http://www.example.com\" }, // more content...]```Crawl websites get links==========Start crawling a website(s) to collect links found.POST https://api.spider.cloud/linksRequest body* url\\xa0required\\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\\xa0string ---------- The locale to use for request, example `en-US`. 
Set Example* cookies\\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\\\_agent\\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\\\_data\\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\\\_config\\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps. Set Example* fingerprint\\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\\\_format\\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\\\_enabled\\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\\\_selector\\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\\\_resources\\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\\\_timeout\\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\\\_in\\\\_background\\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/links\\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { \"content\": \"\", \"error\": null, \"status\": 200, \"url\": \"http://www.example.com\" }, // more content...]```Screenshot websites==========Start taking screenshots of website(s) to collect images to base64 or binary.POST https://api.spider.cloud/screenshotRequest bodyGeneralSpecific* url\\xa0required\\xa0string ---------- The URI resource to crawl. 
This can be a comma split list for multiple urls. Test Url* request\\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\\\_agent\\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\\\_data\\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\\\_config\\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps. Set Example* fingerprint\\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\\\_format\\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\\\_enabled\\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\\\_selector\\xa0string ---------- The CSS query selector to use when extracting content from the markup. 
Test Query Selector* full\\\\_resources\\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\\\_timeout\\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\\\_in\\\\_background\\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/screenshot\\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { \"content\": \"base64...\", \"error\": null, \"status\": 200, \"url\": \"http://www.example.com\" }, // more content...]```Pipelines----------Create powerful workflows with our pipeline API endpoints. Use AI to extract contacts from any website or filter links with prompts with ease.Crawl websites and extract contacts==========Start crawling a website(s) to collect all contacts found leveraging AI.POST https://api.spider.cloud/pipeline/extract-contactsRequest bodyGeneralSpecific* url\\xa0required\\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. Set Example* budget\\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\\xa0boolean ---------- Allow subdomains to be included. 
Set Example* user\\\\_agent\\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\\\_data\\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\\\_config\\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps. Set Example* fingerprint\\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\\\_format\\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\\\_enabled\\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\\\_selector\\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\\\_resources\\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\\\_timeout\\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\\\_in\\\\_background\\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/pipeline/extract-contacts\\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { \"content\": [{ \"full_name\": \"John Doe\", \"email\": \"johndoe@gmail.com\", \"phone\": \"555-555-555\", \"title\": \"Baker\"}, ...], \"error\": null, \"status\": 200, \"url\": \"http://www.example.com\" }, // more content...]```Label website==========Crawl a website and accurately categorize it using AI.POST https://api.spider.cloud/pipeline/labelRequest bodyGeneralSpecific* url\\xa0required\\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test Url* request\\xa0string ---------- The request type to perform. Possible values are `http`, `chrome`, and `smart`. Use `smart` to perform HTTP request by default until JavaScript rendering is needed for the HTML. HTTP* limit\\xa0number ---------- The maximum amount of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages. Crawl Limit* depth\\xa0number ---------- The crawl limit for maximum depth. If zero, no limit will be applied. Crawl DepthSet Example* cache\\xa0boolean ---------- Use HTTP caching for the crawl to speed up repeated runs. 
Set Example* budget\\xa0object ---------- Object that has paths with a counter for limiting the amount of pages example `{\"*\":1}` for only crawling the root page. The wildcard matches all routes and you can set child paths preventing a depth level, example of limiting `{ \"/docs/colors\": 10, \"/docs/\": 100 }` which only allows a max of 100 pages if the route matches `/docs/:pathname` and only 10 pages if it matches `/docs/colors/:pathname`. Crawl Budget Set Example* locale\\xa0string ---------- The locale to use for request, example `en-US`. Set Example* cookies\\xa0string ---------- Add HTTP cookies to use for request. Set Example* stealth\\xa0boolean ---------- Use stealth mode for headless chrome request to help prevent being blocked. The default is enabled on chrome. Set Example* headers\\xa0string ---------- Forward HTTP headers to use for all request. The object is expected to be a map of key value pairs. Set Example* metadata\\xa0boolean ---------- Boolean to store metadata about the pages and content found. This could help improve AI interopt. Defaults to false unless you have the website already stored with the configuration enabled. Set Example* viewport\\xa0object ---------- Configure the viewport for chrome. Defaults to 800x600. Set Example* encoding\\xa0string ---------- The type of encoding to use like `UTF-8`, `SHIFT_JIS`, or etc. Set Example* subdomains\\xa0boolean ---------- Allow subdomains to be included. Set Example* user\\\\_agent\\xa0string ---------- Add a custom HTTP user agent to the request. Set Example* store\\\\_data\\xa0boolean ---------- Boolean to determine if storage should be used. If set this takes precedence over `storageless`. Defaults to false. Set Example* gpt\\\\_config\\xa0object ---------- Use AI to generate actions to perform during the crawl. You can pass an array for the`\"prompt\"` to chain steps. Set Example* fingerprint\\xa0boolean ---------- Use advanced fingerprint for chrome. Set Example* storageless\\xa0boolean ---------- Boolean to prevent storing any type of data for the request including storage and AI vectors embedding. Defaults to false unless you have the website already stored. Set Example* readability\\xa0boolean ---------- Use [readability](https://github.com/mozilla/readability) to pre-process the content for reading. This may drastically improve the content for LLM usage. Set Example* return\\\\_format\\xa0string ---------- The format to return the data in. Possible values are `markdown`, `raw`, `text`, and `html2text`. Use `raw` to return the default format of the page like `HTML` etc. Raw* proxy\\\\_enabled\\xa0boolean ---------- Enable high performance premium proxies for the request to prevent being blocked at the network level. Set Example* query\\\\_selector\\xa0string ---------- The CSS query selector to use when extracting content from the markup. Test Query Selector* full\\\\_resources\\xa0boolean ---------- Crawl and download all the resources for a website. Set Example* request\\\\_timeout\\xa0number ---------- The timeout to use for request. Timeouts can be from 5-60. The default is 30 seconds. Set Example* run\\\\_in\\\\_background\\xa0boolean ---------- Run the request in the background. Useful if storing data and wanting to trigger crawls to the dashboard. This has no effect if storageless is set. 
Set ExampleShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/pipeline/label\\', headers=headers, json=json_data)print(response.json())```ResponseCopy```[ { \"content\": [\"Government\"], \"error\": null, \"status\": 200, \"url\": \"http://www.example.com\" }, // more content...]```Crawl State==========Get the state of the crawl for the domain.POST https://api.spider.cloud/crawl/statusRequest body* url\\xa0required\\xa0string ---------- The URI resource to crawl. This can be a comma split list for multiple urls. Test UrlShow More Properties* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}response = requests.post(\\'https://api.spider.cloud/crawl/status\\', headers=headers)print(response.json())```ResponseCopy``` { \"content\": { \"data\": { \"id\": \"195bf2f2-2821-421d-b89c-f27e57ca71fh\", \"user_id\": \"6bd06efa-bb0a-4f1f-a29f-05db0c4b1bfg\", \"domain\": \"example.com\", \"url\": \"https://example.com/\", \"links\":1, \"credits_used\": 3, \"mode\":2, \"crawl_duration\": 340, \"message\": null, \"request_user_agent\": \"Spider\", \"level\": \"info\", \"status_code\": 0, \"created_at\": \"2024-04-21T01:21:32.886863+00:00\", \"updated_at\": \"2024-04-21T01:21:32.886863+00:00\" }, \"error\": \"\" }, \"error\": null, \"status\": 200, \"url\": \"http://www.example.com\" }```Credits Available==========Get the remaining credits available.GET https://api.spider.cloud/credits* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}response = requests.post(\\'https://api.spider.cloud/credits\\', headers=headers)print(response.json())```ResponseCopy```{ \"credits\": 52566 }```[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='44b350c3-f907-4767-84ec-a73fe59c190c', embedding=None, metadata={'description': 'End User License Agreement for the Spiderwebai and the spider project.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 20123, 'keywords': None, 'pathname': '/eula', 'resource_type': 'html', 'title': 'EULA', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/eula.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='EULA[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)End User License Agreement==========Our end user license agreement may change from time to time as we build out the software.Right to Ban----------Part of making sure the Spider is being used for the right purpose we will not allow malicious acts to be done with the system. 
If we find that you are using the tool to hack, crawl illegal pages, porn, or anything that falls into this line will be banned from the system. You can reach out to us to weigh out your reasons on why you should not be banned.License----------You can use the API and service to build ontop of. Replicating the features and re-selling the service is not allowed. We do not provide any custom license for the platform and encourage users to use our system to handle any crawling, scraping, or data curation needs for speed and cost effectiveness.### Adjustments to Plans ###The software is very new and while we figure out what we can charge to maintain the systems the plans may change. We will send out a notification of the changes in our [Discord](https://discord.gg/5bDPDxwTn3) or Github. For the most part plans will increase drastically with things set to scale costs that allow more usage for everyone. Spider is a product of[A11yWatch LLC](https://a11ywatch.com) the web accessibility tool. The crawler engine of Spider powers the curation for A11yWatch allowing auditing websites accessibility compliance extremely fast.#### Contact ####For information about how to contact Spider, please reach out to email below.[support@spider.cloud](mailto:support@spider.cloud)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='445c0c76-bfd5-4f89-a439-fbdeb8077a4c', embedding=None, metadata={'description': 'Spider is the fastest web crawler written in Rust. The Cloud version is a hosted version of open-source project.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 139080, 'keywords': None, 'pathname': '/about', 'resource_type': 'html', 'title': 'About', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/about.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='About[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) About==========Spider is the fastest web crawler written in Rust. The Cloud version is a hosted version of open-source project. Spider Features----------Our features that facilitate website scraping and provide swift insights in one platform. Deliver astonishing results using our powerful API.### Fast Unblockable Scraping ###When it comes to speed, the Spider project is the fastest web crawler available to the public. 
Utilize the foundation of open-source tools and make the most of your budget to scrape content effectively.Collecting Data Logo### Gain Website Insights with AI ###Enhance your crawls with AI to obtain relevant information fast from any website.AI Search### Extract Data Using Webhooks ###Set up webhooks across your websites to deliver the desired information anywhere you need.News Logo[A11yWatch](https://a11ywatch.com)maintains the project and the hosting for the service.[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='1a2d63a5-0315-4c5b-8fed-8ac460b82cc7', embedding=None, metadata={'description': 'Add the amount of credits you want to purchase for scraping the internet with AI and LLM data curation abilities fast.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 23083, 'keywords': None, 'pathname': '/credits/new', 'resource_type': 'html', 'title': 'Purchase Spider Credits', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/credits*_*new.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Purchase Spider Credits[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Add credits==========Add credits to start crawling any website today.|Default| Features | Amount ||-------|--------------------|--------------------||Default| Scraping Websites |$0.03 / gb bandwidth|| Extra | Premium Proxies |$0.01 / gb bandwidth|| Extra |Javascript Rendering|$0.01 / gb bandwidth|| Extra | Data Storage | $0.30 / gb month || Extra | AI Chat | $0.01 input/output |[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='6701b47a-0000-4111-8b5b-c77b01937a7d', embedding=None, metadata={'description': 'Collect data rapidly from any website. 
Seamlessly scrape websites and get data tailored for LLM workloads.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 101750, 'keywords': None, 'pathname': '/', 'resource_type': 'html', 'title': 'Spider - Fastest Web Crawler', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/index.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Spider - Fastest Web Crawler[Spider v1 Logo Spider ](/)[Pricing](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)The World\\'s Fastest and Cheapest Crawler API==========View Demo* Basic* StreamingExample requestPythonCopy```import requests, osheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":50,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl\\', headers=headers, json=json_data)print(response.json())```Example ResponseUnmatched Speed----------### 5secs ###To crawl 200 pages### 21x ###Faster than FireCrawl### 150x ###Faster than Apify Benchmarks displaying performance between Spider Cloud, Firecrawl, and Apify.[See framework benchmarks ](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md)Foundations for Crawling Effectively----------### Leading in performance ###Spider is written in Rust and runs in full concurrency to achieve crawling dozens of pages in secs.### Optimal response format ###Get clean and formatted markdown, HTML, or text content for fine-tuning or training AI models.### Caching ###Further boost speed by caching repeated web page crawls.### Smart Mode ###Spider dynamically switches to Headless Chrome when it needs to.Beta### Scrape with AI ###Do custom browser scripting and data extraction using the latest AI models.### Best crawler for LLMs ###Don\\'t let crawling and scraping be the highest latency in your LLM & AI agent stack.### Scrape with no headaches ###* Proxy rotations* Agent headers* Avoid anti-bot detections* Headless chrome* Markdown LLM Responses### The Fastest Web Crawler ###* Powered by [spider-rs](https://github.com/spider-rs/spider)* Do 20,000 pages in seconds* Full concurrency* Powerful and simple API* 5,000 requests per minute### Do more with AI ###* Custom browser scripting* Advanced data extraction* Data pipelines* Perfect for LLM and AI Agents* Accurate website labeling[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='91b98a80-7112-4837-8389-cb78221b254c', embedding=None, metadata={'description': 'Get contact information from any website in real time with AI. 
The only way to accurately get dynamic information from websites.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 25891, 'keywords': None, 'pathname': '/guides/pipelines-extract-contacts', 'resource_type': 'html', 'title': 'Guides - Extract Contacts', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*pipelines-extract-contacts.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Extract Contacts[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Extract Contacts==========Contents----------* [Seamless extracting any contact any website](#seamless-extracting-any-contact-any-website)* [UI (Extracting Contacts)](#ui-extracting-contacts)* [API Extracting Usage](#api-extracting-usage) * [API Extracting Example](#api-extracting-example) * [Pipelines Combo](#pipelines-combo)Seamless extracting any contact any website----------Extracting contacts from a website used to be a very difficult challenge involving many steps that would change often. The challenges typically faced involve being able to get the data from a website without being blocked and setting up query selectors for the information you need using javascript. This would often break in two folds - the data extracting with a correct stealth technique or the css selector breaking as they update the website HTML code. Now we toss those two hard challenges away - one of them spider takes care of and the other the advancement in AI to process and extract information.UI (Extracting Contacts)----------You can use the UI on the dashboard to extract contacts after you crawled a page. Go to the page youwant to extract and click on the horizontal dropdown menu to display an option to extract the contact.The crawl will get the data first to see if anything new has changed. Afterwards if a contact was found usually within 10-60 seconds you will get a notification that the extraction is complete with the data.![Extracting contacts with the spider app](/img/app/extract-contacts.png)After extraction if the page has contact related data you can view it with a grid in the app.![The menu displaying the found contacts after extracting with the spider app](/img/app/extract-contacts-found.png)The grid will display the name, email, phone, title, and host(website found) of the contact(s).![Grid display of all the contact information found for the web page](/img/app/extract-contacts-grid.png)API Extracting Usage----------The endpoint `/pipeline/extract-contacts` provides the ability to extract all contacts from a website concurrently.### API Extracting Example ###To extract contacts from a website you can follow the example below. All params are optional except `url`. Use the `prompt` param to adjust the way the AI handles the extracting. 
If you use the param `store_data` or if the website already exist in the dashboard the contact data will be saved with the page.```import requests, os, jsonheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":1,\"url\":\"http://www.example.com/contacts\", \"model\": \"gpt-4-1106-preview\", \"prompt\": \"A custom prompt to tailor the extracting.\"}response = requests.post(\\'https://api.spider.cloud/crawl/pipeline/extract-contacts\\', headers=headers, json=json_data, stream=True)for line in response.iter_lines(): if line: print(json.loads(line))```### Pipelines Combo ###Piplines bring a whole new entry to workflows for data curation, if you combine the API endpoints to only use the extraction on pages you know may have contacts can save credits on the system. One way would be to perform gathering all the links first with the `/links` endpoint. After getting the links for the pages use `/pipeline/filter-links` with a custom prompt that can use AI to reduce the noise of the links to process before `/pipline/extract-contacts`.Loading graph...Written on: 2/1/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='5e7ade0d-0a50-46de-8116-72ee5dca0b20', embedding=None, metadata={'description': 'How to use the Spider API to curate data from any source blazing fast. The most advanced crawler that handles all workloads of all sizes.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 24752, 'keywords': None, 'pathname': '/guides/spider-api', 'resource_type': 'html', 'title': 'Guides - Spider API', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*spider-api.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Spider API[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Getting started Spider API==========Contents----------* [API built to scale](#api-built-to-scale)* [API Usage](#api-usage)* [Crawling One Page](#crawling-one-page)* [Crawling Multiple Pages](#crawling-multiple-pages) * [Planet Scale Crawling](#planet-scale-crawling) * [Automatic Configuration](#automatic-configuration)API built to scale----------Welcome to our cutting-edge web crawler SaaS, renowned for its unparalleled speed.Our platform is designed to effortlessly manage thousands of requests per second, thanks to our elastically scalable system architecture and the Open-Source [spider](https://github.com/spider-rs/spider) project. We deliver consistent latency times ensuring swift processing for all responses.For an in-depth understanding of the request parameters supported, we invite you to explore our comprehensive API documentation. At present, we do not provide client-side libraries, as our API has been crafted with simplicity in mind for straightforward usage. 
However, we are open to expanding our offerings in the future to enhance user convenience.Dive into our [documentation]((/docs/api)) to get started and unleash the full potential of our web crawler today.API Usage----------Getting started with the API is simple and straight forward. After you get your [secret key](/api-keys)you can access our instance directly. We have one main endpoint `/crawl` that handles all things relatedto data curation. The crawler is highly configurable through the params to fit all needs.Crawling One Page----------Most cases you probally just want to crawl one page. Even if you only need one page, our system performs fast enough to lead the race.The most straight forward way to make sure you only crawl a single page is to set the [budget limit](./account/settings) with a wild card value or `*` to 1.You can also pass in the param `limit` in the JSON body with the limit of pages.Crawling Multiple Pages----------When you crawl multiple pages, the concurrency horsepower of the spider kicks in. You might wonder why and how one request may take (x)ms to come back, and 100 requests take about the same time! That’s because the built-in isolated concurrency allows for crawling thousands to millions of pages in no time. It’s the only current solution that can handle large websites with over 100k pages within a minute or two (sometimes even in a blink or two). By default, we do not add any limits to crawls unless specified.### Planet Scale Crawling ###If you plan on processing crawls that have over 200 pages, we recommend streaming the request from the client instead of parsing the entire payload once finished. We have an example of this with Python on the API docs page, also shown below.```import requests, os, jsonheaders = { \\'Authorization\\': os.environ[\"SPIDER_API_KEY\"], \\'Content-Type\\': \\'application/json\\',}json_data = {\"limit\":250,\"url\":\"http://www.example.com\"}response = requests.post(\\'https://api.spider.cloud/crawl/crawl\\', headers=headers, json=json_data, stream=True)for line in response.iter_lines(): if line: print(json.loads(line))```#### Automatic Configuration ####Spider handles automatic concurrency handling and ip rotation to make it simple to curate data.The more credits you have or usage available allows for a higher concurrency limit.Written on: 1/3/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='08e5f1d6-4ae7-4b68-ab96-4b6a3768e88c', embedding=None, metadata={'description': 'The programmable time machine that can store pages and all assets for easy website archiving.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 18970, 'keywords': None, 'pathname': '/guides/website-archiving', 'resource_type': 'html', 'title': 'Guides - Website Archiving', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*website-archiving.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Website Archiving[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Website Archiving==========With Spider you 
can easily backup or capture a website at any point in time.Enable Full Resource storing in the settings or website configuration to get a 1:1 copy of any websitelocally.Time Machine----------Time machine is storing data at a certain point of a time. Spider brings this to you with one simple configuration.After running the crawls you can simply download the data. This can help store assets incase the code is lost orversion control is removed.Written on: 2/7/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='024cb27e-21d2-49a5-8a1a-963e72038421', embedding=None, metadata={'description': 'How to use the platform to collect data from the internet fast, affordable, and unblockable.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 24666, 'keywords': None, 'pathname': '/guides/spider', 'resource_type': 'html', 'title': 'Guides - Spider Platform', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/guides*_*spider.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Guides - Spider Platform[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)Getting started collecting data with Spider==========Contents----------* [Data Curation](#data-curation) * [Crawling (Website)](#crawling-website) * [Crawling (API)](#crawling-api)* [Crawl Configuration](#crawl-configuration) * [Proxies](#proxies) * [Headless Browser](#headless-browser) * [Crawl Budget Limits](#crawl-budget-limits)* [Crawling and Scraping Websites](#crawling-and-scraping-websites) * [Transforming Data](#transforming-data) * [Leveraging Open Source](#leveraging-open-source)* [Subscription and Spider Credits](#subscription-and-spider-credits)Data Curation----------Collecting data with Spider can be fast and rewarding if done with some simple preliminary steps.Use the dashboard to collect data seamlessly across the internet with scheduled updates.You have two main ways of collecting data using Spider. The first and simplest is to use the UI available for scraping.The alternative is to use the API to programmatically access the system and perform actions.### Crawling (Website) ###1. Register or login to your account using email or Github.2. Purchase [credits](/credits/new) to kickstart crawls with `pay-as-you-go` go after credits deplete.3. Configure crawl [settings](/account/settings) to fit workflows that you need.4. Navigate to the [dashboard](/) and enter a website url or ask a question to get a url that should be crawled.5. Crawl the website and export/download the data as needed.### Crawling (API) ###1. Register or login to your account using email or Github.2. Purchase [credits](/credits/new) to kickstart crawls with `pay-as-you-go` after credits deplete.3. Configure crawl [settings](/account/settings) to fit workflows that you need.4. Navigate to [API keys](/api-keys) and create a new secret key.5. 
Go to the [API docs](/docs/api) page to see how the API works and perform crawls with code examples.Crawl Configuration----------Configuration your account for how you would like to crawl can help save costs or effectiveness of the content. Some of the configurations include setting Premium Proxies, Headless Browser Rendering, Webhooks, and Budgeting.### Proxies ###Using proxies with our system is straight forward. Simple check the toggle on if you want all request to use a proxy to increase the success of not being blocked.![Proxies example app screenshot.](/img/app/proxy-setting.png)### Headless Browser ###If you want pages that require JavaScript to be executed the headless browser config is for you. Enabling will run all request through a real Chrome Browser for JavaScript required rendering pages.![Headless browser example app screenshot.](/img/app/headless-browser.png)### Crawl Budget Limits ###One of the key things you may need to do before getting into the crawl is setting up crawl-budgets.Crawl budgets allows you to determine how many pages you are going to crawl for a website.Determining the budget will save you costs when dealing with large websites that you only want certain data points from. The example below shows adding a asterisk (\\\\*) to determine all routes with a limit of 50 pages maximum. The settings can be overwritten by the website configuration or parameters if using the API.![Crawl budget example screenshot](/img/app/edit-budget.png)Crawling and Scraping Websites----------Collecting data can be done in many ways and for many reasons. Leveraging our state-of-the-art technology allows you to create fast workloads that can process content from multiple locations. At the time of writing, we have started to focus on our data processing API instead of the dashboard. The API has much more flexibility than the UI for performing advanced workloads like batching, formatting, and so on.![Dashboard UI for Spider displaying data collecting from www.napster.com, jeffmendez.com, rsseau.rs, and www.drake.com](/img/app/ui-crawl.png)### Transforming Data ###The API has more features for gathering the content in different formats and transforming the HTML as needed. You can transform the content from HTML to Markdown and feed it to a LLM for better handling the learning aspect. The API is the first class citizen for the application. The UI will have the features provided by the API eventually as the need arises.#### Leveraging Open Source ####One of the reasons Spider is the ultimate data-curation service for scraping is from the power of Open-Source. The core of the engine is completly available on [Github](https://github.com/spider-rs/spider) under [MIT](https://opensource.org/license/mit/) to show what is in store. We are constantly working on the crawler features including performance with plans to maintain the project for the long run.Subscription and Spider Credits----------The platform allows purchasing credits that gives you the ability to crawl at any time.When you purchase credits a crawl subscription is created that allows you to continue to usethe platform when your credits deplete. The limits provided coralate with the amount of creditspurchased, an example would be if you bought $5 in credits you would have about $40 in spending limit - $10 in credit gives $80 and so on.The highest purchase of credits directly determines how much is allowed on the platform. 
You can view your usage and credits on the [usage limits page](/account/usage).Written on: 1/2/2024[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='44bff527-c7f3-4346-a2f8-1454c52e1b01', embedding=None, metadata={'description': 'Generate API keys that allow access to the system programmatically anywhere. Full management access for your Spider API journey.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 28770, 'keywords': None, 'pathname': '/api-keys', 'resource_type': 'html', 'title': 'API Keys Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/api-keys.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=\"API Keys Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) API Keys==========Generate API keys that allow access to the system programmatically anywhere. Full management access for your Spider API journey. Key Management----------Your secret API keys are listed below. Please note that we do not display your secret API keys again after you generate them.Do not share your API key with others, or expose it in the browser or other client-side code. In order to protect the security of your account, Spider may also automatically disable any API key that we've found has leaked publicly.Filter Name...Columns| Name |Key|Created|Last Used| ||-----------|---|-------|---------|---||No results.| | | | |0 of 0 row(s) selected.PreviousNext[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='e577c57a-2376-452f-8c39-04d1e284595c', embedding=None, metadata={'description': 'Explore your usage and set limits that work with your budget.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 21195, 'keywords': None, 'pathname': '/account/usage', 'resource_type': 'html', 'title': 'Usage - Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/account*_*usage.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=\"Usage - Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider) Usage limit==========Below you'll find a summary of usage for your account. The data may be delayed up to 5 minutes.Credits----------### Pay as you go ###### Approved usage limit ### The maximum usage Spider allows for your organization each month. Ask for increase.### Set a monthly budget ###When your organization reaches this usage threshold each month, subsequent requests will be rejected. 
Data may be deleted if payments are rejected.[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)\", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), Document(id_='e3eb1e3c-5080-4590-94e8-fd2ef4f6d3c6', embedding=None, metadata={'description': 'Adjust your spider settings to adjust your crawl settings.', 'domain': 'spider.cloud', 'extracted_data': None, 'file_size': 18322, 'keywords': None, 'pathname': '/account/settings', 'resource_type': 'html', 'title': 'Settings - Spider', 'url': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48/spider.cloud/account*_*settings.html', 'user_id': '48f1bc3c-3fbb-408a-865b-c191a1bb1f48'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Settings - Spider[Spider v1 Logo Spider ](/) [Credits](/credits/new)[GitHubGithub637](https://github.com/spider-rs/spider)[API](/docs/api) [Pricing](/credits/new) [Guides](/guides) [About](/about) [Docs](https://docs.rs/spider/latest/spider/) [Privacy](/privacy) [Terms](/eula)© 2024 Spider from A11yWatchTheme Light Dark Toggle Theme [GitHubGithub](https://github.com/spider-rs/spider)', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n')]\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Crawl the domain, following subpages\n",
+ "from llama_index.readers.web import SpiderWebReader\n",
+ "\n",
+ "spider_reader = SpiderWebReader(\n",
+ " api_key=\"YOUR_API_KEY\",\n",
+ " mode=\"crawl\",\n",
+ " # params={} # Optional parameters see more on https://spider.cloud/docs/api\n",
+ ")\n",
+ "\n",
+ "documents = spider_reader.load_data(url=\"https://spider.cloud\")\n",
+ "print(documents)"
+ ]
+ },
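+  {
+   "cell_type": "markdown",
+   "id": "9a7c21e4",
+   "metadata": {},
+   "source": [
+    "The crawled `documents` are regular LlamaIndex `Document` objects, so they can be indexed like any other data. Below is a minimal sketch, assuming an OpenAI API key is configured for the default embedding model and LLM."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d2f8b1a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.core import VectorStoreIndex\n",
+    "\n",
+    "# Build an in-memory index over the crawled pages and ask a question about them.\n",
+    "# Assumes OPENAI_API_KEY is set so the default embedding model and LLM can be used.\n",
+    "index = VectorStoreIndex.from_documents(documents)\n",
+    "query_engine = index.as_query_engine()\n",
+    "response = query_engine.query(\"What does Spider return for each crawled page?\")\n",
+    "print(response)"
+   ]
+  },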
+ {
+ "cell_type": "markdown",
+ "id": "36f671f6",
+ "metadata": {},
+ "source": [
+    "For guides and documentation, visit [Spider](https://spider.cloud/docs/api)."
+ ]
+ },
{
"cell_type": "markdown",
"id": "005d14cd",
diff --git a/docs/docs/examples/retrievers/bedrock_retriever.ipynb b/docs/docs/examples/retrievers/bedrock_retriever.ipynb
index 016e93ef2ac9f..fbefc4c52085c 100644
--- a/docs/docs/examples/retrievers/bedrock_retriever.ipynb
+++ b/docs/docs/examples/retrievers/bedrock_retriever.ipynb
@@ -71,7 +71,7 @@
"outputs": [],
"source": [
"query = \"How big is Milky Way as compared to the entire universe?\"\n",
- "retrieved_results = retriever._retrieve(query)\n",
+ "retrieved_results = retriever.retrieve(query)\n",
"\n",
"# Prints the first retrieved result\n",
"print(retrieved_results[0].get_content())"
diff --git a/docs/docs/examples/vector_stores/VespaIndexDemo.ipynb b/docs/docs/examples/vector_stores/VespaIndexDemo.ipynb
new file mode 100644
index 0000000000000..9d3c3e59fd194
--- /dev/null
+++ b/docs/docs/examples/vector_stores/VespaIndexDemo.ipynb
@@ -0,0 +1,537 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "d792fb5b",
+ "metadata": {},
+ "source": [
+ "\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "23cf319b",
+ "metadata": {},
+ "source": [
+ "\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "307804a3-c02b-4a57-ac0d-172c30ddc851",
+ "metadata": {},
+ "source": [
+ "# Vespa Vector Store demo\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "5508d8ac",
+ "metadata": {},
+ "source": [
+ "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0beb6603",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%pip install llama-index-vector-stores-vespa llama-index pyvespa"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "f7010b1d-d1bb-4f08-9309-a328bb4ea396",
+ "metadata": {},
+ "source": [
+ "#### Setting up API key\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "08ad68ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import openai\n",
+ "\n",
+ "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\"\n",
+ "openai.api_key = os.environ[\"OPENAI_API_KEY\"]"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "8ee4473a-094f-4d0a-a825-e1213db07240",
+ "metadata": {},
+ "source": [
+ "#### Load documents, build the VectorStoreIndex\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0a2bcc07",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core import VectorStoreIndex\n",
+ "from llama_index.vector_stores.vespa import VespaVectorStore\n",
+ "from IPython.display import Markdown, display"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "3a41a70d",
+ "metadata": {},
+ "source": [
+ "## Defining some sample data\n",
+ "\n",
+ "Let's insert some documents.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "df6b6d46",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core.schema import TextNode\n",
+ "\n",
+ "nodes = [\n",
+ " TextNode(\n",
+ " text=\"The Shawshank Redemption\",\n",
+ " metadata={\n",
+ " \"author\": \"Stephen King\",\n",
+ " \"theme\": \"Friendship\",\n",
+ " \"year\": 1994,\n",
+ " },\n",
+ " ),\n",
+ " TextNode(\n",
+ " text=\"The Godfather\",\n",
+ " metadata={\n",
+ " \"director\": \"Francis Ford Coppola\",\n",
+ " \"theme\": \"Mafia\",\n",
+ " \"year\": 1972,\n",
+ " },\n",
+ " ),\n",
+ " TextNode(\n",
+ " text=\"Inception\",\n",
+ " metadata={\n",
+ " \"director\": \"Christopher Nolan\",\n",
+ " \"theme\": \"Fiction\",\n",
+ " \"year\": 2010,\n",
+ " },\n",
+ " ),\n",
+ " TextNode(\n",
+ " text=\"To Kill a Mockingbird\",\n",
+ " metadata={\n",
+ " \"author\": \"Harper Lee\",\n",
+ " \"theme\": \"Mafia\",\n",
+ " \"year\": 1960,\n",
+ " },\n",
+ " ),\n",
+ " TextNode(\n",
+ " text=\"1984\",\n",
+ " metadata={\n",
+ " \"author\": \"George Orwell\",\n",
+ " \"theme\": \"Totalitarianism\",\n",
+ " \"year\": 1949,\n",
+ " },\n",
+ " ),\n",
+ " TextNode(\n",
+ " text=\"The Great Gatsby\",\n",
+ " metadata={\n",
+ " \"author\": \"F. Scott Fitzgerald\",\n",
+ " \"theme\": \"The American Dream\",\n",
+ " \"year\": 1925,\n",
+ " },\n",
+ " ),\n",
+ " TextNode(\n",
+ " text=\"Harry Potter and the Sorcerer's Stone\",\n",
+ " metadata={\n",
+ " \"author\": \"J.K. Rowling\",\n",
+ " \"theme\": \"Fiction\",\n",
+ " \"year\": 1997,\n",
+ " },\n",
+ " ),\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "31fe4378",
+ "metadata": {},
+ "source": [
+    "### Initializing the VespaVectorStore\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a0be7d09",
+ "metadata": {},
+ "source": [
+ "To make it really simple to get started, we provide a template Vespa application that will be deployed upon initializing the vector store.\n",
+ "\n",
+    "This hides a lot of detail, and there are endless opportunities to tailor and customize the Vespa application to your needs. For now, let's keep it simple and initialize with the default template.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "30b0b2e3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core import StorageContext\n",
+ "\n",
+ "vector_store = VespaVectorStore()\n",
+ "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
+ "index = VectorStoreIndex(nodes, storage_context=storage_context)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "71a4a3ec",
+ "metadata": {},
+ "source": [
+ "### Deleting documents\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f4637a79",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "node_to_delete = nodes[0].node_id\n",
+ "node_to_delete"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "84a97903",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vector_store.delete(ref_doc_id=node_to_delete)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "03315550",
+ "metadata": {},
+ "source": [
+ "## Querying\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "74cabf95",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core.vector_stores.types import (\n",
+ " VectorStoreQuery,\n",
+ " VectorStoreQueryMode,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d6401e25",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = VectorStoreQuery(\n",
+ " query_str=\"Great Gatsby\",\n",
+ " mode=VectorStoreQueryMode.TEXT_SEARCH,\n",
+ " similarity_top_k=1,\n",
+ ")\n",
+ "result = vector_store.query(query)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "09f1bf81",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result"
+ ]
+ },
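+  {
+   "cell_type": "markdown",
+   "id": "7f3e9c02",
+   "metadata": {},
+   "source": [
+    "The returned `VectorStoreQueryResult` exposes the matching nodes and, depending on the query mode, their scores. A small sketch of unpacking it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c8b5d19",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Unpack the raw query result; similarities may be None for some query modes\n",
+    "for i, node in enumerate(result.nodes or []):\n",
+    "    score = result.similarities[i] if result.similarities else None\n",
+    "    print(score, node.text)"
+   ]
+  },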
+ {
+ "cell_type": "markdown",
+ "id": "77d2528e",
+ "metadata": {},
+ "source": [
+ "## As retriever\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a8d7aca1",
+ "metadata": {},
+ "source": [
+ "### Default query mode (text search)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5a71818e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "retriever = index.as_retriever(vector_store_query_mode=\"default\")\n",
+ "results = retriever.retrieve(\"Who directed inception?\")\n",
+ "display(Markdown(f\"**Retrieved nodes:**\\n {results}\"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bfe83ebf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "retriever = index.as_retriever(vector_store_query_mode=\"semantic_hybrid\")\n",
+ "results = retriever.retrieve(\"Who wrote Harry Potter?\")\n",
+ "display(Markdown(f\"**Retrieved nodes:**\\n {results}\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8aa36e8",
+ "metadata": {},
+ "source": [
+ "### As query engine\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c1bd18f8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_engine = index.as_query_engine()\n",
+ "response = query_engine.query(\"Who directed inception?\")\n",
+ "display(Markdown(f\"**Response:** {response}\"))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aede9cf6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_engine = index.as_query_engine(\n",
+ " vector_store_query_mode=\"semantic_hybrid\", verbose=True\n",
+ ")\n",
+ "response = query_engine.query(\n",
+ " \"When was the book about the wizard boy published and what was it called?\"\n",
+ ")\n",
+ "display(Markdown(f\"**Response:** {response}\"))\n",
+ "display(Markdown(f\"**Sources:** {response.source_nodes}\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "90081efd",
+ "metadata": {},
+ "source": [
+ "## Using metadata filters\n",
+ "\n",
+    "**NOTE**: This metadata filtering is done by LlamaIndex, outside of Vespa. For native and much more performant filtering, you should use Vespa's own filtering capabilities.\n",
+ "\n",
+ "See [Vespa's documentation](https://docs.vespa.ai/en/reference/query-language-reference.html) for more information.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0663ab38",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from llama_index.core.vector_stores import (\n",
+ " FilterOperator,\n",
+ " FilterCondition,\n",
+ " MetadataFilter,\n",
+ " MetadataFilters,\n",
+ ")\n",
+ "\n",
+    "# Define a filter that only allows nodes that have the theme \"Fiction\" OR were published after 1997\n",
+ "\n",
+ "filters = MetadataFilters(\n",
+ " filters=[\n",
+ " MetadataFilter(key=\"theme\", value=\"Fiction\"),\n",
+ " MetadataFilter(key=\"year\", value=1997, operator=FilterOperator.GT),\n",
+ " ],\n",
+ " condition=FilterCondition.OR,\n",
+ ")\n",
+ "\n",
+ "retriever = index.as_retriever(filters=filters)\n",
+ "result = retriever.retrieve(\"Harry Potter\")\n",
+ "display(Markdown(f\"**Result:** {result}\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "414e6d78",
+ "metadata": {},
+ "source": [
+ "## Abstraction level of this integration\n",
+ "\n",
+    "To make it really simple to get started, we provide a template Vespa application that will be deployed upon initializing the vector store. This removes some of the complexity of setting up Vespa for the first time, but for serious use cases, we strongly recommend that you read the [Vespa documentation](https://docs.vespa.ai) and tailor the application to your needs.\n",
+ "\n",
+ "### The template\n",
+ "\n",
+ "The provided template Vespa application can be seen below:\n",
+ "\n",
+ "```python\n",
+ "from vespa.package import (\n",
+ " ApplicationPackage,\n",
+ " Field,\n",
+ " Schema,\n",
+ " Document,\n",
+ " HNSW,\n",
+ " RankProfile,\n",
+ " Component,\n",
+ " Parameter,\n",
+ " FieldSet,\n",
+ " GlobalPhaseRanking,\n",
+ " Function,\n",
+ ")\n",
+ "\n",
+ "hybrid_template = ApplicationPackage(\n",
+ " name=\"hybridsearch\",\n",
+ " schema=[\n",
+ " Schema(\n",
+ " name=\"doc\",\n",
+ " document=Document(\n",
+ " fields=[\n",
+ " Field(name=\"id\", type=\"string\", indexing=[\"summary\"]),\n",
+ " Field(name=\"metadata\", type=\"string\", indexing=[\"summary\"]),\n",
+ " Field(\n",
+ " name=\"text\",\n",
+ " type=\"string\",\n",
+ " indexing=[\"index\", \"summary\"],\n",
+ " index=\"enable-bm25\",\n",
+ " bolding=True,\n",
+ " ),\n",
+ " Field(\n",
+ " name=\"embedding\",\n",
+ " type=\"tensor(x[384])\",\n",
+ " indexing=[\n",
+ " \"input text\",\n",
+ " \"embed\",\n",
+ " \"index\",\n",
+ " \"attribute\",\n",
+ " ],\n",
+ " ann=HNSW(distance_metric=\"angular\"),\n",
+ " is_document_field=False,\n",
+ " ),\n",
+ " ]\n",
+ " ),\n",
+ " fieldsets=[FieldSet(name=\"default\", fields=[\"text\", \"metadata\"])],\n",
+ " rank_profiles=[\n",
+ " RankProfile(\n",
+ " name=\"bm25\",\n",
+ " inputs=[(\"query(q)\", \"tensor(x[384])\")],\n",
+ " functions=[Function(name=\"bm25sum\", expression=\"bm25(text)\")],\n",
+ " first_phase=\"bm25sum\",\n",
+ " ),\n",
+ " RankProfile(\n",
+ " name=\"semantic\",\n",
+ " inputs=[(\"query(q)\", \"tensor(x[384])\")],\n",
+ " first_phase=\"closeness(field, embedding)\",\n",
+ " ),\n",
+ " RankProfile(\n",
+ " name=\"fusion\",\n",
+ " inherits=\"bm25\",\n",
+ " inputs=[(\"query(q)\", \"tensor(x[384])\")],\n",
+ " first_phase=\"closeness(field, embedding)\",\n",
+ " global_phase=GlobalPhaseRanking(\n",
+ " expression=\"reciprocal_rank_fusion(bm25sum, closeness(field, embedding))\",\n",
+ " rerank_count=1000,\n",
+ " ),\n",
+ " ),\n",
+ " ],\n",
+ " )\n",
+ " ],\n",
+ " components=[\n",
+ " Component(\n",
+ " id=\"e5\",\n",
+ " type=\"hugging-face-embedder\",\n",
+ " parameters=[\n",
+ " Parameter(\n",
+ " \"transformer-model\",\n",
+ " {\n",
+ " \"url\": \"https://github.com/vespa-engine/sample-apps/raw/master/simple-semantic-search/model/e5-small-v2-int8.onnx\"\n",
+ " },\n",
+ " ),\n",
+ " Parameter(\n",
+ " \"tokenizer-model\",\n",
+ " {\n",
+ " \"url\": \"https://raw.githubusercontent.com/vespa-engine/sample-apps/master/simple-semantic-search/model/tokenizer.json\"\n",
+ " },\n",
+ " ),\n",
+ " ],\n",
+ " )\n",
+ " ],\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "Note that the fields `id`, `metadata`, `text`, and `embedding` are required for the integration to work.\n",
+ "The schema name must also be `doc`, and the rank profiles must be named `bm25`, `semantic`, and `fusion`.\n",
+ "\n",
+    "Other than that, you are free to modify the application as you see fit: switch out the embedding model, add more fields, or change the ranking expressions.\n",
+ "\n",
+ "For more details, check out this Pyvespa example notebook on [hybrid search](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html).\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/docs/module_guides/models/embeddings.md b/docs/docs/module_guides/models/embeddings.md
index 82d47ecafc2bc..cd6e73d041a65 100644
--- a/docs/docs/module_guides/models/embeddings.md
+++ b/docs/docs/module_guides/models/embeddings.md
@@ -24,6 +24,7 @@ Then:
```python
from llama_index.embeddings.openai import OpenAIEmbedding
+from llama_index.core import VectorStoreIndex
from llama_index.core import Settings
# global
diff --git a/docs/docs/module_guides/storing/vector_stores.md b/docs/docs/module_guides/storing/vector_stores.md
index d5c23be013ebb..b5ff6229fb110 100644
--- a/docs/docs/module_guides/storing/vector_stores.md
+++ b/docs/docs/module_guides/storing/vector_stores.md
@@ -38,7 +38,7 @@ We are actively adding more integrations and improving feature coverage for each
| Metal | cloud | ✓ | | ✓ | ✓ | |
| MongoDB Atlas | self-hosted / cloud | ✓ | | ✓ | ✓ | |
| MyScale | cloud | ✓ | ✓ | ✓ | ✓ | |
-| Milvus / Zilliz | self-hosted / cloud | ✓ | | ✓ | ✓ | |
+| Milvus / Zilliz | self-hosted / cloud | ✓ | ✓ | ✓ | ✓ | |
| Neo4jVector | self-hosted / cloud | ✓ | | ✓ | ✓ | |
| OpenSearch | self-hosted / cloud | ✓ | ✓ | ✓ | ✓ | ✓ |
| Pinecone | cloud | ✓ | ✓ | ✓ | ✓ | |
@@ -56,6 +56,7 @@ We are actively adding more integrations and improving feature coverage for each
| Typesense | self-hosted / cloud | ✓ | | ✓ | ✓ | |
| Upstash | cloud | | | | ✓ | |
| Vearch | self-hosted | ✓ | | ✓ | ✓ | |
+| Vertex AI Vector Search | cloud | ✓ | | ✓ | ✓ | |
| Weaviate | self-hosted / cloud | ✓ | ✓ | ✓ | ✓ | |
For more details, see [Vector Store Integrations](../../community/integrations/vector_stores.md).
@@ -82,6 +83,7 @@ For more details, see [Vector Store Integrations](../../community/integrations/v
- [Lantern](../../examples/vector_stores/LanternIndexDemo.ipynb)
- [Metal](../../examples/vector_stores/MetalIndexDemo.ipynb)
- [Milvus](../../examples/vector_stores/MilvusIndexDemo.ipynb)
+- [Milvus Hybrid Search](../../examples/vector_stores/MilvusHybridIndexDemo.ipynb)
- [MyScale](../../examples/vector_stores/MyScaleIndexDemo.ipynb)
- [ElsaticSearch](../../examples/vector_stores/ElasticsearchIndexDemo.ipynb)
- [FAISS](../../examples/vector_stores/FaissIndexDemo.ipynb)
@@ -104,6 +106,7 @@ For more details, see [Vector Store Integrations](../../community/integrations/v
- [Timesacle](../../examples/vector_stores/Timescalevector.ipynb)
- [Upstash](../../examples/vector_stores/UpstashVectorDemo.ipynb)
- [Vearch](../../examples/vector_stores/VearchDemo.ipynb)
+- [Vertex AI Vector Search](../../examples/vector_stores/VertexAIVectorSearchDemo.ipynb)
- [Weaviate](../../examples/vector_stores/WeaviateIndexDemo.ipynb)
- [Weaviate Hybrid Search](../../examples/vector_stores/WeaviateIndexDemo-Hybrid.ipynb)
- [Zep](../../examples/vector_stores/ZepIndexDemo.ipynb)
diff --git a/docs/docs/optimizing/advanced_retrieval/advanced_retrieval.md b/docs/docs/optimizing/advanced_retrieval/advanced_retrieval.md
index 039ba0604b69a..2bd3b4b59e759 100644
--- a/docs/docs/optimizing/advanced_retrieval/advanced_retrieval.md
+++ b/docs/docs/optimizing/advanced_retrieval/advanced_retrieval.md
@@ -47,3 +47,4 @@ Here are some third-party resources on advanced retrieval strategies.
- [DeepMemory (Activeloop)](../../examples/retrievers/deep_memory.ipynb)
- [Weaviate Hybrid Search](../../examples/vector_stores/WeaviateIndexDemo-Hybrid.ipynb)
- [Pinecone Hybrid Search](../../examples/vector_stores/PineconeIndexDemo-Hybrid.ipynb)
+- [Milvus Hybrid Search](../../examples/vector_stores/MilvusHybridIndexDemo.ipynb)
diff --git a/docs/docs/optimizing/basic_strategies/basic_strategies.md b/docs/docs/optimizing/basic_strategies/basic_strategies.md
index ad5db5e9febfd..3e7abbbb610c6 100644
--- a/docs/docs/optimizing/basic_strategies/basic_strategies.md
+++ b/docs/docs/optimizing/basic_strategies/basic_strategies.md
@@ -84,6 +84,7 @@ Relevant guides with both approaches can be found below:
- [Reciprocal Rerank Query Fusion](../../examples/retrievers/reciprocal_rerank_fusion.ipynb)
- [Weaviate Hybrid Search](../../examples/vector_stores/WeaviateIndexDemo-Hybrid.ipynb)
- [Pinecone Hybrid Search](../../examples/vector_stores/PineconeIndexDemo-Hybrid.ipynb)
+- [Milvus Hybrid Search](../../examples/vector_stores/MilvusHybridIndexDemo.ipynb)
## Metadata Filters
diff --git a/docs/docs/use_cases/agents.md b/docs/docs/use_cases/agents.md
index d8d616b406bd6..a4d7ec9b21875 100644
--- a/docs/docs/use_cases/agents.md
+++ b/docs/docs/use_cases/agents.md
@@ -19,7 +19,7 @@ LlamaIndex provides a comprehensive framework for building agents. This includes
The scope of possible use cases for agents is vast and ever-expanding. That said, here are some practical use cases that can deliver immediate value.
-- **Agentic RAG**: Build a context-augmented research assistant over your data that not only answers simple questions, but complex research tasks. Here are two resources ([resource 1](../understanding/putting_it_all_together/agents.md), [resource 2](./optimizing/agentic_strategies/agentic_strategies.md)) to help you get started.
+- **Agentic RAG**: Build a context-augmented research assistant over your data that not only answers simple questions but also tackles complex research tasks. Here are two resources ([resource 1](../understanding/putting_it_all_together/agents.md), [resource 2](../optimizing/agentic_strategies/agentic_strategies.md)) to help you get started.
- **SQL Agent**: A subset of the above is a "text-to-SQL assistant" that can interact with a structured database. Check out [this guide](https://docs.llamaindex.ai/en/stable/examples/agent/agent_runner/query_pipeline_agent/?h=sql+agent#setup-simple-retry-agent-pipeline-for-text-to-sql) to see how to build an agent from scratch.
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 986eda9979761..13e846ed7b87c 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -20,7 +20,7 @@ idna==3.7
ipykernel==6.29.3
ipython==8.22.2
jedi==0.19.1
-Jinja2==3.1.3
+Jinja2==3.1.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter_client==8.6.1
diff --git a/llama-index-core/llama_index/core/composability/joint_qa_summary.py b/llama-index-core/llama_index/core/composability/joint_qa_summary.py
index db67d457ed759..c3ac03922b3ab 100644
--- a/llama-index-core/llama_index/core/composability/joint_qa_summary.py
+++ b/llama-index-core/llama_index/core/composability/joint_qa_summary.py
@@ -74,7 +74,7 @@ def __init__(
)
self._storage_context = storage_context or StorageContext.from_defaults()
- self._service_context = service_context or ServiceContext.from_defaults()
+ self._service_context = service_context
self._summary_text = summary_text
self._qa_text = qa_text
diff --git a/llama-index-core/llama_index/core/query_engine/router_query_engine.py b/llama-index-core/llama_index/core/query_engine/router_query_engine.py
index 9f01599015661..194ba29cf2865 100644
--- a/llama-index-core/llama_index/core/query_engine/router_query_engine.py
+++ b/llama-index-core/llama_index/core/query_engine/router_query_engine.py
@@ -157,6 +157,7 @@ def from_defaults(
return cls(
selector,
query_engine_tools,
+ llm=llm,
service_context=service_context,
summarizer=summarizer,
**kwargs,
diff --git a/llama-index-core/llama_index/core/utilities/sql_wrapper.py b/llama-index-core/llama_index/core/utilities/sql_wrapper.py
index dbe74bdc0a553..6d9a0c275d038 100644
--- a/llama-index-core/llama_index/core/utilities/sql_wrapper.py
+++ b/llama-index-core/llama_index/core/utilities/sql_wrapper.py
@@ -217,6 +217,7 @@ def run_sql(self, command: str) -> Tuple[str, Dict]:
try:
if self._schema:
command = command.replace("FROM ", f"FROM {self._schema}.")
+ command = command.replace("JOIN ", f"JOIN {self._schema}.")
cursor = connection.execute(text(command))
except (ProgrammingError, OperationalError) as exc:
raise NotImplementedError(
diff --git a/llama-index-core/poetry.lock b/llama-index-core/poetry.lock
index 78559fb3f7fd0..0dbf429538007 100644
--- a/llama-index-core/poetry.lock
+++ b/llama-index-core/poetry.lock
@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.6.1 and should not be changed by hand.
+# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
[[package]]
name = "accelerate"
@@ -2203,13 +2203,13 @@ testing = ["Django", "attrs", "colorama", "docopt", "pytest (<7.0.0)"]
[[package]]
name = "jinja2"
-version = "3.1.3"
+version = "3.1.4"
description = "A very fast and expressive template engine."
optional = false
python-versions = ">=3.7"
files = [
- {file = "Jinja2-3.1.3-py3-none-any.whl", hash = "sha256:7d6d50dd97d52cbc355597bd845fabfbac3f551e1f99619e39a35ce8c370b5fa"},
- {file = "Jinja2-3.1.3.tar.gz", hash = "sha256:ac8bd6544d4bb2c9792bf3a159e80bba8fda7f07e81bc3aed565432d5925ba90"},
+ {file = "jinja2-3.1.4-py3-none-any.whl", hash = "sha256:bc5dd2abb727a5319567b7a813e6a2e7318c39f4f487cfe6c89c6f9c7d25197d"},
+ {file = "jinja2-3.1.4.tar.gz", hash = "sha256:4a3aee7acbbe7303aede8e9648d13b8bf88a429282aa6122a993f0ac800cb369"},
]
[package.dependencies]
@@ -2954,7 +2954,6 @@ files = [
{file = "lxml-5.2.1-cp37-cp37m-musllinux_1_2_x86_64.whl", hash = "sha256:9e2addd2d1866fe112bc6f80117bcc6bc25191c5ed1bfbcf9f1386a884252ae8"},
{file = "lxml-5.2.1-cp37-cp37m-win32.whl", hash = "sha256:f51969bac61441fd31f028d7b3b45962f3ecebf691a510495e5d2cd8c8092dbd"},
{file = "lxml-5.2.1-cp37-cp37m-win_amd64.whl", hash = "sha256:b0b58fbfa1bf7367dde8a557994e3b1637294be6cf2169810375caf8571a085c"},
- {file = "lxml-5.2.1-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:3e183c6e3298a2ed5af9d7a356ea823bccaab4ec2349dc9ed83999fd289d14d5"},
{file = "lxml-5.2.1-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:804f74efe22b6a227306dd890eecc4f8c59ff25ca35f1f14e7482bbce96ef10b"},
{file = "lxml-5.2.1-cp38-cp38-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:08802f0c56ed150cc6885ae0788a321b73505d2263ee56dad84d200cab11c07a"},
{file = "lxml-5.2.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0f8c09ed18ecb4ebf23e02b8e7a22a05d6411911e6fabef3a36e4f371f4f2585"},
@@ -4640,6 +4639,7 @@ files = [
{file = "psycopg2_binary-2.9.9-cp311-cp311-win32.whl", hash = "sha256:dc4926288b2a3e9fd7b50dc6a1909a13bbdadfc67d93f3374d984e56f885579d"},
{file = "psycopg2_binary-2.9.9-cp311-cp311-win_amd64.whl", hash = "sha256:b76bedd166805480ab069612119ea636f5ab8f8771e640ae103e05a4aae3e417"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:8532fd6e6e2dc57bcb3bc90b079c60de896d2128c5d9d6f24a63875a95a088cf"},
+ {file = "psycopg2_binary-2.9.9-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:b0605eaed3eb239e87df0d5e3c6489daae3f7388d455d0c0b4df899519c6a38d"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8f8544b092a29a6ddd72f3556a9fcf249ec412e10ad28be6a0c0d948924f2212"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2d423c8d8a3c82d08fe8af900ad5b613ce3632a1249fd6a223941d0735fce493"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2e5afae772c00980525f6d6ecf7cbca55676296b580c0e6abb407f15f3706996"},
@@ -4648,6 +4648,8 @@ files = [
{file = "psycopg2_binary-2.9.9-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:cb16c65dcb648d0a43a2521f2f0a2300f40639f6f8c1ecbc662141e4e3e1ee07"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-musllinux_1_1_ppc64le.whl", hash = "sha256:911dda9c487075abd54e644ccdf5e5c16773470a6a5d3826fda76699410066fb"},
{file = "psycopg2_binary-2.9.9-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:57fede879f08d23c85140a360c6a77709113efd1c993923c59fde17aa27599fe"},
+ {file = "psycopg2_binary-2.9.9-cp312-cp312-win32.whl", hash = "sha256:64cf30263844fa208851ebb13b0732ce674d8ec6a0c86a4e160495d299ba3c93"},
+ {file = "psycopg2_binary-2.9.9-cp312-cp312-win_amd64.whl", hash = "sha256:81ff62668af011f9a48787564ab7eded4e9fb17a4a6a74af5ffa6a457400d2ab"},
{file = "psycopg2_binary-2.9.9-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:2293b001e319ab0d869d660a704942c9e2cce19745262a8aba2115ef41a0a42a"},
{file = "psycopg2_binary-2.9.9-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:03ef7df18daf2c4c07e2695e8cfd5ee7f748a1d54d802330985a78d2a5a6dca9"},
{file = "psycopg2_binary-2.9.9-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0a602ea5aff39bb9fac6308e9c9d82b9a35c2bf288e184a816002c9fae930b77"},
@@ -5267,6 +5269,7 @@ files = [
{file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
+ {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"},
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
{file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
{file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},
@@ -7144,13 +7147,13 @@ files = [
[[package]]
name = "tqdm"
-version = "4.66.2"
+version = "4.66.3"
description = "Fast, Extensible Progress Meter"
optional = false
python-versions = ">=3.7"
files = [
- {file = "tqdm-4.66.2-py3-none-any.whl", hash = "sha256:1ee4f8a893eb9bef51c6e35730cebf234d5d0b6bd112b0271e10ed7c24a02bd9"},
- {file = "tqdm-4.66.2.tar.gz", hash = "sha256:6cd52cdf0fef0e0f543299cfc96fec90d7b8a7e88745f411ec33eb44d5ed3531"},
+ {file = "tqdm-4.66.3-py3-none-any.whl", hash = "sha256:4f41d54107ff9a223dca80b53efe4fb654c67efaba7f47bada3ee9d50e05bd53"},
+ {file = "tqdm-4.66.3.tar.gz", hash = "sha256:23097a41eba115ba99ecae40d06444c15d1c0c698d527a01c6c8bd1c5d0647e5"},
]
[package.dependencies]
diff --git a/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py b/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py
index 54dee30a68ca1..5461847597be8 100644
--- a/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py
+++ b/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/llama_index/embeddings/azure_openai/base.py
@@ -46,7 +46,7 @@ def __init__(
# azure specific
azure_endpoint: Optional[str] = None,
azure_deployment: Optional[str] = None,
- azure_ad_token_provider: AzureADTokenProvider = None,
+ azure_ad_token_provider: Optional[AzureADTokenProvider] = None,
deployment_name: Optional[str] = None,
max_retries: int = 10,
reuse_client: bool = True,
@@ -60,6 +60,8 @@ def __init__(
"azure_endpoint", azure_endpoint, "AZURE_OPENAI_ENDPOINT", ""
)
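+        # Resolve the API key from the explicit parameter or the AZURE_OPENAI_API_KEY env var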
+ api_key = get_from_param_or_env("api_key", api_key, "AZURE_OPENAI_API_KEY")
+
azure_deployment = resolve_from_aliases(
azure_deployment,
deployment_name,
diff --git a/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/pyproject.toml b/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/pyproject.toml
index f734e8983ab7d..4b631b838d811 100644
--- a/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/pyproject.toml
+++ b/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/pyproject.toml
@@ -27,7 +27,7 @@ exclude = ["**/BUILD"]
license = "MIT"
name = "llama-index-embeddings-azure-openai"
readme = "README.md"
-version = "0.1.8"
+version = "0.1.9"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
diff --git a/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/tests/test_azure_openai.py b/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/tests/test_azure_openai.py
index a4ffcbf450ae4..38fc4bf79ea63 100644
--- a/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/tests/test_azure_openai.py
+++ b/llama-index-integrations/embeddings/llama-index-embeddings-azure-openai/tests/test_azure_openai.py
@@ -11,7 +11,7 @@ def test_custom_http_client(azure_openai_mock: MagicMock) -> None:
Should get passed on to the implementation from OpenAI.
"""
custom_http_client = httpx.Client()
- embedding = AzureOpenAIEmbedding(http_client=custom_http_client)
+ embedding = AzureOpenAIEmbedding(http_client=custom_http_client, api_key="mock")
embedding._get_client()
azure_openai_mock.assert_called()
kwargs = azure_openai_mock.call_args.kwargs
diff --git a/llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py b/llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py
index 35ef8e9a63d34..481e80a348b43 100644
--- a/llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py
+++ b/llama-index-integrations/llms/llama-index-llms-azure-openai/llama_index/llms/azure_openai/base.py
@@ -93,7 +93,7 @@ def __init__(
# azure specific
azure_endpoint: Optional[str] = None,
azure_deployment: Optional[str] = None,
- azure_ad_token_provider: AzureADTokenProvider = None,
+ azure_ad_token_provider: Optional[AzureADTokenProvider] = None,
use_azure_ad: bool = False,
callback_manager: Optional[CallbackManager] = None,
# aliases for engine
@@ -186,6 +186,16 @@ def _get_credential_kwargs(self, **kwargs: Any) -> Dict[str, Any]:
if self.use_azure_ad:
self._azure_ad_token = refresh_openai_azuread_token(self._azure_ad_token)
self.api_key = self._azure_ad_token.token
+ else:
+ import os
+
+ self.api_key = self.api_key or os.getenv("AZURE_OPENAI_API_KEY")
+
+ if self.api_key is None:
+ raise ValueError(
+ "You must set an `api_key` parameter. "
+ "Alternatively, you can set the AZURE_OPENAI_API_KEY env var OR set `use_azure_ad=True`."
+ )
return {
"api_key": self.api_key,
diff --git a/llama-index-integrations/llms/llama-index-llms-azure-openai/pyproject.toml b/llama-index-integrations/llms/llama-index-llms-azure-openai/pyproject.toml
index 3e2242f309c9e..0924c6a43c3fc 100644
--- a/llama-index-integrations/llms/llama-index-llms-azure-openai/pyproject.toml
+++ b/llama-index-integrations/llms/llama-index-llms-azure-openai/pyproject.toml
@@ -29,7 +29,7 @@ exclude = ["**/BUILD"]
license = "MIT"
name = "llama-index-llms-azure-openai"
readme = "README.md"
-version = "0.1.6"
+version = "0.1.7"
[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
diff --git a/llama-index-integrations/llms/llama-index-llms-azure-openai/tests/test_azure_openai.py b/llama-index-integrations/llms/llama-index-llms-azure-openai/tests/test_azure_openai.py
index 7529bad7ebec8..9e66ee8f4236b 100644
--- a/llama-index-integrations/llms/llama-index-llms-azure-openai/tests/test_azure_openai.py
+++ b/llama-index-integrations/llms/llama-index-llms-azure-openai/tests/test_azure_openai.py
@@ -40,7 +40,9 @@ def test_custom_http_client(sync_azure_openai_mock: MagicMock) -> None:
mock_instance = sync_azure_openai_mock.return_value
# Valid mocked result required to not run into another error
mock_instance.chat.completions.create.return_value = mock_chat_completion_v1()
- azure_openai = AzureOpenAI(engine="foo bar", http_client=custom_http_client)
+ azure_openai = AzureOpenAI(
+ engine="foo bar", http_client=custom_http_client, api_key="mock"
+ )
azure_openai.complete("test prompt")
sync_azure_openai_mock.assert_called()
kwargs = sync_azure_openai_mock.call_args.kwargs
diff --git a/llama-index-integrations/llms/llama-index-llms-friendli/poetry.lock b/llama-index-integrations/llms/llama-index-llms-friendli/poetry.lock
index 06d5aef2aaf52..f22240540dad3 100644
--- a/llama-index-integrations/llms/llama-index-llms-friendli/poetry.lock
+++ b/llama-index-integrations/llms/llama-index-llms-friendli/poetry.lock
@@ -4762,13 +4762,13 @@ files = [
[[package]]
name = "tqdm"
-version = "4.66.2"
+version = "4.66.3"
description = "Fast, Extensible Progress Meter"
optional = false
python-versions = ">=3.7"
files = [
- {file = "tqdm-4.66.2-py3-none-any.whl", hash = "sha256:1ee4f8a893eb9bef51c6e35730cebf234d5d0b6bd112b0271e10ed7c24a02bd9"},
- {file = "tqdm-4.66.2.tar.gz", hash = "sha256:6cd52cdf0fef0e0f543299cfc96fec90d7b8a7e88745f411ec33eb44d5ed3531"},
+ {file = "tqdm-4.66.3-py3-none-any.whl", hash = "sha256:4f41d54107ff9a223dca80b53efe4fb654c67efaba7f47bada3ee9d50e05bd53"},
+ {file = "tqdm-4.66.3.tar.gz", hash = "sha256:23097a41eba115ba99ecae40d06444c15d1c0c698d527a01c6c8bd1c5d0647e5"},
]
[package.dependencies]
diff --git a/llama-index-integrations/llms/llama-index-llms-mistralai/llama_index/llms/mistralai/base.py b/llama-index-integrations/llms/llama-index-llms-mistralai/llama_index/llms/mistralai/base.py
index 536363ff4caca..1943d1a3f3720 100644
--- a/llama-index-integrations/llms/llama-index-llms-mistralai/llama_index/llms/mistralai/base.py
+++ b/llama-index-integrations/llms/llama-index-llms-mistralai/llama_index/llms/mistralai/base.py
@@ -324,7 +324,7 @@ async def astream_chat(
messages = to_mistral_chatmessage(messages)
all_kwargs = self._get_all_kwargs(**kwargs)
- response = await self._aclient.chat_stream(messages=messages, **all_kwargs)
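+        # chat_stream returns an async generator directly, so it must not be awaited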
+ response = self._aclient.chat_stream(messages=messages, **all_kwargs)
async def gen() -> ChatResponseAsyncGen:
content = ""
diff --git a/llama-index-integrations/llms/llama-index-llms-mistralai/pyproject.toml b/llama-index-integrations/llms/llama-index-llms-mistralai/pyproject.toml
index 70acb9160a3bc..517d836ceadbe 100644
--- a/llama-index-integrations/llms/llama-index-llms-mistralai/pyproject.toml
+++ b/llama-index-integrations/llms/llama-index-llms-mistralai/pyproject.toml
@@ -27,7 +27,7 @@ exclude = ["**/BUILD"]
license = "MIT"
name = "llama-index-llms-mistralai"
readme = "README.md"
-version = "0.1.11"
+version = "0.1.12"
[tool.poetry.dependencies]
python = ">=3.9,<4.0"
diff --git a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py
index de09db051f650..fa822c8f61db0 100644
--- a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py
+++ b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py
@@ -29,6 +29,9 @@
from llama_index.readers.web.sitemap.base import (
SitemapReader,
)
+from llama_index.readers.web.spider_web.base import (
+ SpiderWebReader,
+)
from llama_index.readers.web.trafilatura_web.base import (
TrafilaturaWebReader,
)
@@ -53,6 +56,7 @@
"RssNewsReader",
"SimpleWebPageReader",
"SitemapReader",
+ "SpiderWebReader",
"TrafilaturaWebReader",
"UnstructuredURLLoader",
"WholeSiteReader",
diff --git a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/BUILD b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/BUILD
new file mode 100644
index 0000000000000..0a271ceaa061d
--- /dev/null
+++ b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/BUILD
@@ -0,0 +1,5 @@
+python_requirements(
+ name="reqs",
+)
+
+python_sources()
diff --git a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/README.md b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/README.md
new file mode 100644
index 0000000000000..c054c367addfa
--- /dev/null
+++ b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/README.md
@@ -0,0 +1,43 @@
+# Spider Web Reader
+
+[Spider](https://spider.cloud/?ref=llama_index) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results) crawler. It converts any website into pure HTML, markdown, metadata, or text while letting you crawl with custom actions using AI.
+
+Spider also offers high-performance proxies to avoid detection, caching of AI actions, webhooks for crawl status, scheduled crawls, and more.
+
+## Prerequisites
+
+You need a Spider API key to use this loader. You can get one on [spider.cloud](https://spider.cloud).
+
+```bash
+pip install llama-index-readers-web spider-client
+```
+
+```python
+# Scrape single URL
+from llama_index.readers.web import SpiderWebReader
+
+spider_reader = SpiderWebReader(
+ api_key="YOUR_API_KEY", # Get one at https://spider.cloud
+ mode="scrape",
+ # params={} # Optional parameters see more on https://spider.cloud/docs/api
+)
+
+documents = spider_reader.load_data(url="https://spider.cloud")
+print(documents)
+```
+
+```python
+# Crawl the domain, following subpages
+from llama_index.readers.web import SpiderWebReader
+
+spider_reader = SpiderWebReader(
+ api_key="YOUR_API_KEY",
+ mode="crawl",
+ # params={} # Optional parameters see more on https://spider.cloud/docs/api
+)
+
+documents = spider_reader.load_data(url="https://spider.cloud")
+print(documents)
+```
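+
+Optional `params` are passed through to the Spider API unchanged. As a rough sketch, you can cap a crawl and request markdown plus metadata (parameter names follow the [Spider API docs](https://spider.cloud/docs/api), so double-check them there):
+
+```python
+# Crawl with explicit Spider API parameters (names per https://spider.cloud/docs/api)
+from llama_index.readers.web import SpiderWebReader
+
+spider_reader = SpiderWebReader(
+    api_key="YOUR_API_KEY",
+    mode="crawl",
+    params={"limit": 5, "return_format": "markdown", "metadata": True},
+)
+
+documents = spider_reader.load_data(url="https://spider.cloud")
+print(documents)
+```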
+
+For guides and documentation, visit [Spider](https://spider.cloud/docs/api).
diff --git a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/__init__.py b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/__init__.py
new file mode 100644
index 0000000000000..e69de29bb2d1d
diff --git a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/base.py b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/base.py
new file mode 100644
index 0000000000000..a1081ebf704bf
--- /dev/null
+++ b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/base.py
@@ -0,0 +1,70 @@
+from typing import List, Optional, Literal
+
+from llama_index.core.readers.base import BasePydanticReader
+from llama_index.core.schema import Document
+
+
+class SpiderWebReader(BasePydanticReader):
+ """
+    Scrapes a URL and returns LLM-ready data with `Spider.cloud`.
+
+ Must have the Python package `spider-client` installed and a Spider API key.
+ See https://spider.cloud for more.
+
+ Args:
+        api_key (Optional[str]): The Spider API key; get one at https://spider.cloud.
+        mode (Literal["scrape", "crawl"]): "scrape" a single URL (default) or "crawl" it, following all subpages.
+        params (Optional[dict]): Additional parameters to pass to the Spider API.
+ """
+
+ class Config:
+ use_enum_values = True
+ extra = "allow"
+
+ def __init__(
+ self,
+ *,
+ api_key: Optional[str] = None,
+ mode: Literal["scrape", "crawl"] = "scrape",
+ params: Optional[dict] = None,
+ ) -> None:
+        # Apply the default Spider parameters before initializing the Pydantic fields
+        if params is None:
+            params = {"return_format": "markdown", "metadata": True}
+        super().__init__(api_key=api_key, mode=mode, params=params)
+
+ try:
+ from spider import Spider
+ except ImportError:
+ raise ImportError(
+ "`spider-client` package not found, please run `pip install spider-client`"
+ )
+ self.spider = Spider(api_key=api_key)
+ self.mode = mode
+ self.params = params
+
+ def load_data(self, url: str) -> List[Document]:
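+        """
+        Scrape or crawl the given URL and return the results as Documents.
+
+        Args:
+            url (str): The URL to scrape (single page) or crawl (including subpages).
+
+        Returns:
+            List[Document]: Documents containing the returned content and metadata.
+        """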
+        if self.mode not in ("scrape", "crawl"):
+            raise ValueError(
+                "Unknown `mode` parameter; only `scrape` and `crawl` modes are allowed"
+            )
+ action = (
+ self.spider.scrape_url if self.mode == "scrape" else self.spider.crawl_url
+ )
+ spider_docs = action(url=url, params=self.params)
+
+ if not spider_docs:
+            return [Document(text="", metadata={})]
+
+ documents = []
+ if isinstance(spider_docs, list):
+ for doc in spider_docs:
+ text = doc.get("content", "")
+ if text is not None:
+ documents.append(
+ Document(
+ text=text,
+ metadata=doc.get("metadata", {}),
+ )
+ )
+
+ return documents
diff --git a/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/requirements.txt b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/requirements.txt
new file mode 100644
index 0000000000000..c7a53cfa5d5c9
--- /dev/null
+++ b/llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/spider_web/requirements.txt
@@ -0,0 +1 @@
+spider-client
diff --git a/llama-index-integrations/readers/llama-index-readers-web/pyproject.toml b/llama-index-integrations/readers/llama-index-readers-web/pyproject.toml
index f0a62accb6128..fa729f6dbb994 100644
--- a/llama-index-integrations/readers/llama-index-readers-web/pyproject.toml
+++ b/llama-index-integrations/readers/llama-index-readers-web/pyproject.toml
@@ -22,6 +22,7 @@ RssNewsReader = "ruze00"
RssReader = "bborn"
SimpleWebPageReader = "thejessezhang"
SitemapReader = "selamanse"
+SpiderReader = "WilliamEspegren"
TrafilaturaWebReader = "NA"
UnstructuredURLLoader = "kravetsmic"
WholeSiteReader = "an-bluecat"
@@ -55,6 +56,7 @@ requests = "^2.31.0"
urllib3 = ">=1.1.0"
playwright = ">=1.30,<2.0"
newspaper3k = "^0.2.8"
+spider-client = "^0.0.11"
[tool.poetry.group.dev.dependencies]
ipython = "8.10.0"
diff --git a/llama-index-integrations/readers/llama-index-readers-web/requirements.txt b/llama-index-integrations/readers/llama-index-readers-web/requirements.txt
index 36508c193c590..35c83209ef7e0 100644
--- a/llama-index-integrations/readers/llama-index-readers-web/requirements.txt
+++ b/llama-index-integrations/readers/llama-index-readers-web/requirements.txt
@@ -7,3 +7,4 @@ playwright~=1.42.0
newspaper3k
selenium
chromedriver-autoinstaller
+spider-client
diff --git a/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/BUILD b/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/BUILD
new file mode 100644
index 0000000000000..dabf212d7e716
--- /dev/null
+++ b/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/BUILD
@@ -0,0 +1 @@
+python_tests()
diff --git a/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/__init__.py b/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/__init__.py
new file mode 100644
index 0000000000000..e69de29bb2d1d
diff --git a/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/test_base.py b/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/test_base.py
new file mode 100644
index 0000000000000..4f5d40300e092
--- /dev/null
+++ b/llama-index-integrations/readers/llama-index-readers-web/tests/readers/web/spider_web/test_base.py
@@ -0,0 +1,7 @@
+from llama_index.core.readers.base import BaseReader
+from llama_index.readers.web.spider_web.base import SpiderWebReader
+
+
+def test_class():
+ names_of_base_classes = [b.__name__ for b in SpiderWebReader.__mro__]
+ assert BaseReader.__name__ in names_of_base_classes
diff --git a/llama-index-integrations/retrievers/llama-index-retrievers-bedrock/README.md b/llama-index-integrations/retrievers/llama-index-retrievers-bedrock/README.md
index 3c85b689acd32..a024284bce2c5 100644
--- a/llama-index-integrations/retrievers/llama-index-retrievers-bedrock/README.md
+++ b/llama-index-integrations/retrievers/llama-index-retrievers-bedrock/README.md
@@ -10,10 +10,36 @@
> Knowledge base can be configured through [AWS Console](https://aws.amazon.com/console/) or by using [AWS SDKs](https://aws.amazon.com/developer/tools/).
-### Notebook
+## Installation
-Explore the retriever using Notebook present at:
+```
+pip install llama-index-retrievers-bedrock
+```
+
+## Usage
```
-docs/docs/examples/retrievers/bedrock_retriever.ipynb
+from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever
+
+retriever = AmazonKnowledgeBasesRetriever(
+ knowledge_base_id="",
+ retrieval_config={
+ "vectorSearchConfiguration": {
+ "numberOfResults": 4,
+ "overrideSearchType": "HYBRID",
+ "filter": {"equals": {"key": "tag", "value": "space"}},
+ }
+ },
+)
+
+query = "How big is Milky Way as compared to the entire universe?"
+retrieved_results = retriever.retrieve(query)
+
+# Prints the first retrieved result
+print(retrieved_results[0].get_content())
```
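+
+The retriever can also be plugged into a query engine for end-to-end RAG. This is a minimal sketch, assuming an LLM has been configured (for example via `OPENAI_API_KEY`):
+
+```python
+from llama_index.core.query_engine import RetrieverQueryEngine
+
+query_engine = RetrieverQueryEngine.from_args(retriever=retriever)
+response = query_engine.query(query)
+print(response)
+```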
+
+## Notebook
+
+Explore the retriever using the notebook at:
+https://docs.llamaindex.ai/en/latest/examples/retrievers/bedrock_retriever/
diff --git a/llama-index-integrations/tools/llama-index-tools-finance/poetry.lock b/llama-index-integrations/tools/llama-index-tools-finance/poetry.lock
index 2d9f32483fec4..9257483933761 100644
--- a/llama-index-integrations/tools/llama-index-tools-finance/poetry.lock
+++ b/llama-index-integrations/tools/llama-index-tools-finance/poetry.lock
@@ -4105,13 +4105,13 @@ files = [
[[package]]
name = "tqdm"
-version = "4.66.2"
+version = "4.66.3"
description = "Fast, Extensible Progress Meter"
optional = false
python-versions = ">=3.7"
files = [
- {file = "tqdm-4.66.2-py3-none-any.whl", hash = "sha256:1ee4f8a893eb9bef51c6e35730cebf234d5d0b6bd112b0271e10ed7c24a02bd9"},
- {file = "tqdm-4.66.2.tar.gz", hash = "sha256:6cd52cdf0fef0e0f543299cfc96fec90d7b8a7e88745f411ec33eb44d5ed3531"},
+ {file = "tqdm-4.66.3-py3-none-any.whl", hash = "sha256:4f41d54107ff9a223dca80b53efe4fb654c67efaba7f47bada3ee9d50e05bd53"},
+ {file = "tqdm-4.66.3.tar.gz", hash = "sha256:23097a41eba115ba99ecae40d06444c15d1c0c698d527a01c6c8bd1c5d0647e5"},
]
[package.dependencies]
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/.gitignore b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/.gitignore
new file mode 100644
index 0000000000000..990c18de22908
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/.gitignore
@@ -0,0 +1,153 @@
+llama_index/_static
+.DS_Store
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+bin/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+etc/
+include/
+lib/
+lib64/
+parts/
+sdist/
+share/
+var/
+wheels/
+pip-wheel-metadata/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+.ruff_cache
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+notebooks/
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+.python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+pyvenv.cfg
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# Jetbrains
+.idea
+modules/
+*.swp
+
+# VsCode
+.vscode
+
+# pipenv
+Pipfile
+Pipfile.lock
+
+# pyright
+pyrightconfig.json
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/BUILD b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/BUILD
new file mode 100644
index 0000000000000..72b97f383ad9b
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/BUILD
@@ -0,0 +1,4 @@
+poetry_requirements(
+ name="poetry",
+ module_mapping={"pyvespa": ["vespa"]},
+)
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/CHANGELOG.md b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/CHANGELOG.md
new file mode 100644
index 0000000000000..633b40f215c09
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/CHANGELOG.md
@@ -0,0 +1,9 @@
+# CHANGELOG — llama-index-vector-stores-vespa
+
+## [0.0.1]
+
+- Initial release of the Vespa vector store integration (`VespaVectorStore`)
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/Makefile b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/Makefile
new file mode 100644
index 0000000000000..b9eab05aa3706
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/Makefile
@@ -0,0 +1,17 @@
+GIT_ROOT ?= $(shell git rev-parse --show-toplevel)
+
+help: ## Show all Makefile targets.
+ @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'
+
+format: ## Run code autoformatters (black).
+ pre-commit install
+ git ls-files | xargs pre-commit run black --files
+
+lint: ## Run linters: pre-commit (black, ruff, codespell) and mypy
+ pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files
+
+test: ## Run tests via pytest.
+ pytest tests
+
+watch-docs: ## Build and watch documentation.
+ sphinx-autobuild docs/ docs/_build/html --open-browser --watch $(GIT_ROOT)/llama_index/
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/README.md b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/README.md
new file mode 100644
index 0000000000000..0956439e74517
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/README.md
@@ -0,0 +1,143 @@
+# LlamaIndex Vector_Stores Integration: Vespa
+
+[Vespa.ai](https://vespa.ai/) is an open-source big data serving engine. It is designed for low-latency and high-throughput serving of data and models. Vespa.ai is used by many companies to serve search results, recommendations, and rankings for billions of documents and users, with response times in the milliseconds.
+
+This integration allows you to use Vespa.ai as a vector store for LlamaIndex. Vespa has integrated support for [embedding inference](https://docs.vespa.ai/en/embedding.html), so you don't need to run a separate service for these tasks.
+
+Hugging Face 🤗 embedders are supported, as well as SPLADE and ColBERT.
+
+## Abstraction level of this integration
+
+To make it simple to get started, we provide a template Vespa application that is deployed when the vector store is initialized. This removes some of the complexity of setting up Vespa for the first time, but for serious use cases, we strongly recommend that you read the [Vespa documentation](https://docs.vespa.ai/) and tailor the application to your needs.
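+
+A minimal sketch of using the default template with a local Docker deployment (this assumes Docker is running and that `nodes` have been created elsewhere, e.g. by a node parser):
+
+```python
+from llama_index.core import VectorStoreIndex, StorageContext
+from llama_index.vector_stores.vespa import VespaVectorStore
+
+# Deploys the bundled hybrid template to a local Vespa Docker container
+vector_store = VespaVectorStore(deployment_target="local")
+storage_context = StorageContext.from_defaults(vector_store=vector_store)
+index = VectorStoreIndex(nodes, storage_context=storage_context)
+
+retriever = index.as_retriever()
+retriever.retrieve("Who directed Inception?")
+```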
+
+## The template
+
+The provided template Vespa application can be seen below:
+
+```python
+from vespa.package import (
+ ApplicationPackage,
+ Field,
+ Schema,
+ Document,
+ HNSW,
+ RankProfile,
+ Component,
+ Parameter,
+ FieldSet,
+ GlobalPhaseRanking,
+ Function,
+)
+
+hybrid_template = ApplicationPackage(
+ name="hybridsearch",
+ schema=[
+ Schema(
+ name="doc",
+ document=Document(
+ fields=[
+ Field(name="id", type="string", indexing=["summary"]),
+ Field(
+ name="metadata", type="string", indexing=["summary"]
+ ),
+ Field(
+ name="text",
+ type="string",
+ indexing=["index", "summary"],
+ index="enable-bm25",
+ bolding=True,
+ ),
+ Field(
+ name="embedding",
+ type="tensor(x[384])",
+ indexing=[
+ "input text",
+ "embed",
+ "index",
+ "attribute",
+ ],
+ ann=HNSW(distance_metric="angular"),
+ is_document_field=False,
+ ),
+ ]
+ ),
+ fieldsets=[FieldSet(name="default", fields=["text", "metadata"])],
+ rank_profiles=[
+ RankProfile(
+ name="bm25",
+ inputs=[("query(q)", "tensor(x[384])")],
+ functions=[
+ Function(name="bm25sum", expression="bm25(text)")
+ ],
+ first_phase="bm25sum",
+ ),
+ RankProfile(
+ name="semantic",
+ inputs=[("query(q)", "tensor(x[384])")],
+ first_phase="closeness(field, embedding)",
+ ),
+ RankProfile(
+ name="fusion",
+ inherits="bm25",
+ inputs=[("query(q)", "tensor(x[384])")],
+ first_phase="closeness(field, embedding)",
+ global_phase=GlobalPhaseRanking(
+ expression="reciprocal_rank_fusion(bm25sum, closeness(field, embedding))",
+ rerank_count=1000,
+ ),
+ ),
+ ],
+ )
+ ],
+ components=[
+ Component(
+ id="e5",
+ type="hugging-face-embedder",
+ parameters=[
+ Parameter(
+ "transformer-model",
+ {
+ "url": "https://github.com/vespa-engine/sample-apps/raw/master/simple-semantic-search/model/e5-small-v2-int8.onnx"
+ },
+ ),
+ Parameter(
+ "tokenizer-model",
+ {
+ "url": "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/simple-semantic-search/model/tokenizer.json"
+ },
+ ),
+ ],
+ )
+ ],
+)
+```
+
+Note that the fields `id`, `metadata`, `text`, and `embedding` are required for the integration to work.
+The schema name must also be `doc`, and the rank profiles must be named `bm25`, `semantic`, and `fusion`.
+
+Beyond that, you are free to modify the application as you see fit: switch out the embedding model, add more fields, or change the ranking expressions.
+
+For more details, check out this Pyvespa example notebook on [hybrid search](https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html).
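+
+At query time, the vector store picks a rank profile based on the LlamaIndex query mode: the default and `text_search` modes use `bm25`, while `hybrid` and `semantic_hybrid` use `fusion`. A minimal retrieval sketch, assuming the `index` built in the snippet above:
+
+```python
+from llama_index.core.vector_stores.types import VectorStoreQueryMode
+
+# Hybrid retrieval: BM25 and nearest-neighbor scores combined with reciprocal rank fusion
+retriever = index.as_retriever(
+    vector_store_query_mode=VectorStoreQueryMode.SEMANTIC_HYBRID,
+    similarity_top_k=5,
+)
+nodes = retriever.retrieve("Who directed Inception?")
+```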
+
+## Going to production
+
+If you are ready to graduate to a production setup, we highly recommend checking out the [Vespa Cloud](https://cloud.vespa.ai/) service, where we manage all infrastructure and operations for you. Free trials are available.
+
+## Next steps
+
+There are many powerful features in Vespa that are not exposed directly in this integration; check out the [Pyvespa examples](https://pyvespa.readthedocs.io/en/latest/examples/pyvespa-examples.html) for inspiration on what else you can do with Vespa.
+
+Teasers:
+
+- Binary + Matryoshka embeddings.
+- ColBERT.
+- ONNX models.
+- XGBoost and LightGBM models for ranking.
+- Multivector indexing.
+- and much more.
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/BUILD b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/BUILD
new file mode 100644
index 0000000000000..db46e8d6c978c
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/BUILD
@@ -0,0 +1 @@
+python_sources()
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/__init__.py b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/__init__.py
new file mode 100644
index 0000000000000..7472eed62961b
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/__init__.py
@@ -0,0 +1,6 @@
+from llama_index.vector_stores.vespa.base import (
+ VespaVectorStore,
+)
+from llama_index.vector_stores.vespa.templates import hybrid_template
+
+__all__ = ["VespaVectorStore", "hybrid_template"]
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/base.py b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/base.py
new file mode 100644
index 0000000000000..74ec8d657dc06
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/base.py
@@ -0,0 +1,505 @@
+"""Vespa vector store."""
+
+from typing import Any, List, Optional, Callable
+
+
+from llama_index.core.schema import BaseNode, MetadataMode
+from llama_index.core.vector_stores.types import (
+ VectorStore,
+ VectorStoreQuery,
+ VectorStoreQueryMode,
+ VectorStoreQueryResult,
+)
+from llama_index.core.vector_stores.utils import (
+ node_to_metadata_dict,
+ metadata_dict_to_node,
+)
+
+from llama_index.vector_stores.vespa.templates import hybrid_template
+
+import asyncio
+import logging
+import json
+import sys
+
+try:
+ from vespa.application import Vespa
+ from vespa.package import ApplicationPackage
+ from vespa.io import VespaResponse
+ from vespa.deployment import VespaCloud, VespaDocker
+except ImportError:
+ raise ModuleNotFoundError(
+ "pyvespa not installed. Please install it via `pip install pyvespa`"
+ )
+
+logger = logging.getLogger(__name__)
+handler = logging.StreamHandler(stream=sys.stdout)
+logger.addHandler(handler)
+logger.setLevel(logging.INFO)
+
+
+def callback(response: VespaResponse, id: str):
+ if not response.is_successful():
+ logger.debug(
+ f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
+ )
+
+
+class VespaVectorStore(VectorStore):
+ """
+ Vespa vector store.
+
+ Can be initialized in several ways:
+ 1. (Default) Initialize Vespa vector store with default hybrid template and local (docker) deployment.
+ 2. Initialize by providing an application package created in pyvespa (can be deployed locally or to Vespa cloud).
+ 3. Initialize from previously deployed Vespa application by providing URL. (Local or cloud deployment).
+
+ The application must be set up with the following fields:
+ - id: Document id
+ - text: Text field
+ - embedding: Field to store embedding vectors.
+ - metadata: Metadata field (all metadata will be stored here)
+
+ The application must be set up with the following rank profiles:
+ - bm25: For text search
+ - semantic: For semantic search
+ - fusion: For semantic hybrid search
+
+ When creating a VectorStoreIndex from VespaVectorStore, the index will add documents to the Vespa application.
+    Be aware that the Vespa container is reused if it is not deleted between deployments; delete it between runs to avoid data duplication.
+ During query time, the index queries the Vespa application to get the top k most relevant hits.
+
+ Args:
+ application_package (ApplicationPackage): Application package
+ deployment_target (str): Deployment target, either `local` or `cloud`
+ port (int): Port that Vespa application will run on. Only applicable if deployment_target is `local`
+ default_schema_name (str): Schema name in Vespa application
+ namespace (str): Namespace in Vespa application. See https://docs.vespa.ai/en/documents.html#namespace. Defaults to `default`.
+        embeddings_outside_vespa (bool): Whether embeddings are created outside Vespa and provided on the nodes, rather than computed by Vespa at feed time.
+ url (Optional[str]): URL of deployed Vespa application.
+ groupname (Optional[str]): Group name in Vespa application, only applicable in `streaming` mode, see https://pyvespa.readthedocs.io/en/latest/examples/scaling-personal-ai-assistants-with-streaming-mode-cloud.html#A-summary-of-Vespa-streaming-mode
+ tenant (Optional[str]): Tenant for Vespa application. Applicable only if deployment_target is `cloud`
+ key_location (Optional[str]): Location of the control plane key used for signing HTTP requests to the Vespa Cloud.
+ key_content (Optional[str]): Content of the control plane key used for signing HTTP requests to the Vespa Cloud. Use only when key file is not available.
+ auth_client_token_id (Optional[str]): Use token based data plane authentication. This is the token name configured in the Vespa Cloud Console. This is used to configure Vespa services.xml. The token is given read and write permissions.
+ kwargs (Any): Additional kwargs for Vespa application
+
+ Examples:
+ `pip install llama-index-vector-stores-vespa`
+
+ ```python
+        from llama_index.core import VectorStoreIndex, StorageContext
+ from llama_index.vector_stores.vespa import VespaVectorStore
+
+ vector_store = VespaVectorStore()
+ storage_context = StorageContext.from_defaults(vector_store=vector_store)
+ index = VectorStoreIndex(nodes, storage_context=storage_context)
+ retriever = index.as_retriever()
+        retriever.retrieve("Who directed Inception?")
+        ```
+ """
+
+ stores_text: bool = True
+ is_embedding_query: bool = False
+ flat_metadata: bool = True
+
+ def __init__(
+ self,
+ application_package: ApplicationPackage = hybrid_template,
+ namespace: str = "default",
+ default_schema_name: str = "doc",
+ deployment_target: str = "local", # "local" or "cloud"
+ port: int = 8080,
+ embeddings_outside_vespa: bool = False,
+ url: Optional[str] = None,
+ groupname: Optional[str] = None,
+ tenant: Optional[str] = None,
+ application: Optional[str] = "hybridsearch",
+ key_location: Optional[str] = None,
+ key_content: Optional[str] = None,
+ auth_client_token_id: Optional[str] = None,
+ **kwargs: Any,
+ ) -> None:
+ # Verify that application_package is an instance of ApplicationPackage
+ if not isinstance(application_package, ApplicationPackage):
+ raise ValueError(
+ "application_package must be an instance of vespa.package.ApplicationPackage"
+ )
+ if application_package == hybrid_template:
+ logger.info(
+ "Using default hybrid template. Please make sure that the Vespa application is set up with the correct schema and rank profile."
+ )
+ # Initialize all parameters
+ self.application_package = application_package
+ self.deployment_target = deployment_target
+ self.default_schema_name = default_schema_name
+ self.namespace = namespace
+ self.embeddings_outside_vespa = embeddings_outside_vespa
+ self.port = port
+ self.url = url
+ self.groupname = groupname
+ self.tenant = tenant
+ self.application = application
+ self.key_location = key_location
+ self.key_content = key_content
+ self.auth_client_token_id = auth_client_token_id
+ self.kwargs = kwargs
+ if self.url is None:
+ self.app = self._deploy()
+ else:
+ self.app = self._try_get_running_app()
+
+ @property
+ def client(self) -> Vespa:
+ """Get client."""
+ return self.app
+
+ def _try_get_running_app(self) -> Vespa:
+ app = Vespa(url=f"{self.url}:{self.port}")
+ status = app.get_application_status()
+ if status.status_code == 200:
+ return app
+ else:
+ raise ConnectionError(
+ f"Vespa application not running on url {self.url} and port {self.port}. Please start Vespa application first."
+ )
+
+ def _deploy(self) -> Vespa:
+ if self.deployment_target == "cloud":
+ app = self._deploy_app_cloud()
+ elif self.deployment_target == "local":
+ app = self._deploy_app_local()
+ else:
+ raise ValueError(
+ f"Deployment target {self.deployment_target} not supported. Please choose either `local` or `cloud`."
+ )
+ return app
+
+ def _deploy_app_local(self) -> Vespa:
+ logger.info(f"Deploying Vespa application {self.application} to Vespa Docker.")
+        return VespaDocker(port=self.port).deploy(self.application_package)
+
+ def _deploy_app_cloud(self) -> Vespa:
+ logger.info(f"Deploying Vespa application {self.application} to Vespa Cloud.")
+ return VespaCloud(
+ tenant=self.tenant,
+ application=self.application,
+ application_package=self.application_package,
+ key_location=self.key_location,
+ key_content=self.key_content,
+ auth_client_token_id=self.auth_client_token_id,
+ **self.kwargs,
+ ).deploy()
+
+ def add(
+ self,
+ nodes: List[BaseNode],
+ schema: Optional[str] = None,
+ callback: Optional[Callable[[VespaResponse, str], None]] = callback,
+ ) -> List[str]:
+ """
+ Add nodes to vector store.
+
+ Args:
+ nodes (List[BaseNode]): List of nodes to add
+ schema (Optional[str]): Schema name in Vespa application to add nodes to. Defaults to `default_schema_name`.
+ """
+ # Create vespa iterable from nodes
+ ids = []
+ data_to_insert = []
+ for node in nodes:
+ metadata = node_to_metadata_dict(
+ node, remove_text=False, flat_metadata=self.flat_metadata
+ )
+ logger.debug(f"Metadata: {metadata}")
+ entry = {
+ "id": node.node_id,
+ "fields": {
+ "id": node.node_id,
+ "text": node.get_content(metadata_mode=MetadataMode.NONE) or "",
+ "metadata": json.dumps(metadata),
+ },
+ }
+ if self.embeddings_outside_vespa:
+ entry["fields"]["embedding"] = node.get_embedding()
+ data_to_insert.append(entry)
+ ids.append(node.node_id)
+
+ self.app.feed_iterable(
+ data_to_insert,
+ schema=schema or self.default_schema_name,
+ namespace=self.namespace,
+ operation_type="feed",
+ callback=callback,
+ )
+ return ids
+
+ async def async_add(
+ self,
+ nodes: List[BaseNode],
+ schema: Optional[str] = None,
+ callback: Optional[Callable[[VespaResponse, str], None]] = callback,
+ max_connections: int = 10,
+ num_concurrent_requests: int = 1000,
+ total_timeout: int = 60,
+ **kwargs: Any,
+ ) -> List[str]:
+ """
+ Add nodes to vector store asynchronously.
+
+ Args:
+ nodes (List[BaseNode]): List of nodes to add
+ schema (Optional[str]): Schema name in Vespa application to add nodes to. Defaults to `default_schema_name`.
+ max_connections (int): Maximum number of connections to Vespa application
+ num_concurrent_requests (int): Maximum number of concurrent requests
+ total_timeout (int): Total timeout for all requests
+ kwargs (Any): Additional kwargs for Vespa application
+ """
+        semaphore = asyncio.Semaphore(num_concurrent_requests)
+        tasks: List[asyncio.Task] = []
+ ids = []
+ data_to_insert = []
+ for node in nodes:
+ metadata = node_to_metadata_dict(
+ node, remove_text=False, flat_metadata=self.flat_metadata
+ )
+ logger.debug(f"Metadata: {metadata}")
+ entry = {
+ "id": node.node_id,
+ "fields": {
+ "id": node.node_id,
+ "text": node.get_content(metadata_mode=MetadataMode.NONE) or "",
+ "metadata": json.dumps(metadata),
+ },
+ }
+ if self.embeddings_outside_vespa:
+ entry["fields"]["embedding"] = node.get_embedding()
+ data_to_insert.append(entry)
+ ids.append(node.node_id)
+
+ async with self.app.asyncio(
+ connections=max_connections, total_timeout=total_timeout
+ ) as async_app:
+ for doc in data_to_insert:
+ async with semaphore:
+ task = asyncio.create_task(
+ async_app.feed_data_point(
+ data_id=doc["id"],
+ fields=doc["fields"],
+ schema=schema or self.default_schema_name,
+ namespace=self.namespace,
+ timeout=10,
+ )
+ )
+ tasks.append(task)
+
+            done, _ = await asyncio.wait(tasks, return_when=asyncio.ALL_COMPLETED)
+            for task in done:
+                if task.exception():
+                    raise task.exception()
+ return ids
+
+ def delete(
+ self,
+ ref_doc_id: str,
+ namespace: Optional[str] = None,
+ **delete_kwargs: Any,
+ ) -> None:
+ """
+        Delete the node with the given ref_doc_id.
+ """
+ response: VespaResponse = self.app.delete_data(
+ schema=self.default_schema_name,
+ namespace=namespace or self.namespace,
+ data_id=ref_doc_id,
+            **delete_kwargs,
+ )
+ if not response.is_successful():
+ raise ValueError(
+ f"Delete request failed: {response.status_code}, response payload: {response.json}"
+ )
+ logger.info(f"Deleted node with id {ref_doc_id}")
+
+ async def adelete(
+ self,
+ ref_doc_id: str,
+ namespace: Optional[str] = None,
+ **delete_kwargs: Any,
+ ) -> None:
+ """
+        Delete the node with the given ref_doc_id.
+ NOTE: this is not implemented for all vector stores. If not implemented,
+ it will just call delete synchronously.
+ """
+ logger.info("Async delete not implemented. Will call delete synchronously.")
+        self.delete(ref_doc_id, namespace=namespace, **delete_kwargs)
+
+ def _create_query_body(
+ self,
+ query: VectorStoreQuery,
+ sources_str: str,
+ rank_profile: Optional[str] = None,
+ create_embedding: bool = True,
+ vector_top_k: int = 10,
+ ) -> dict:
+ """
+ Create query parameters for Vespa.
+
+ Args:
+ query (VectorStoreQuery): VectorStoreQuery object
+ sources_str (str): Sources string
+ rank_profile (Optional[str]): Rank profile to use. If not provided, default rank profile is used.
+ create_embedding (bool): Whether to create embedding
+ vector_top_k (int): Number of top k vectors to return
+
+ Returns:
+ dict: Query parameters
+ """
+ logger.info(f"Query: {query}")
+ if query.filters:
+ logger.warning("Filter support not implemented yet. Will be ignored.")
+ if query.alpha:
+ logger.warning(
+ "Alpha support not implemented. Must be defined in Vespa rank profile. "
+ "See for example https://pyvespa.readthedocs.io/en/latest/examples/evaluating-with-snowflake-arctic-embed.html"
+ )
+
+ if query.query_embedding is None and not create_embedding:
+ raise ValueError(
+ "Input embedding must be provided if embeddings are not created outside Vespa"
+ )
+
+ base_params = {
+ "hits": query.similarity_top_k,
+ "ranking.profile": rank_profile
+ or self._get_default_rank_profile(query.mode),
+ "query": query.query_str,
+ "tracelevel": 9,
+ }
+ logger.debug(query.mode)
+ if query.mode in [
+ VectorStoreQueryMode.TEXT_SEARCH,
+ VectorStoreQueryMode.DEFAULT,
+ ]:
+ query_params = {"yql": f"select * from {sources_str} where userQuery()"}
+ elif query.mode in [
+ VectorStoreQueryMode.SEMANTIC_HYBRID,
+ VectorStoreQueryMode.HYBRID,
+ ]:
+            embedding_field = query.embedding_field
+            if not embedding_field:
+                embedding_field = "embedding"
+                logger.warning(
+                    f"Embedding field not provided. Using default embedding field {embedding_field}"
+                )
+ query_params = {
+ "yql": f"select * from {sources_str} where {self._build_query_filter(query.mode, embedding_field, vector_top_k, query.similarity_top_k)}",
+ "input.query(q)": (
+ f"embed({query.query_str})"
+ if create_embedding
+ else query.query_embedding
+ ),
+ }
+ else:
+ raise NotImplementedError(
+ f"Query mode {query.mode} not implemented for Vespa yet. Contributions are welcome!"
+ )
+
+ return {**base_params, **query_params}
+
+ def _get_default_rank_profile(self, mode):
+ return {
+ VectorStoreQueryMode.TEXT_SEARCH: "bm25",
+ VectorStoreQueryMode.SEMANTIC_HYBRID: "fusion",
+ VectorStoreQueryMode.HYBRID: "fusion",
+ VectorStoreQueryMode.DEFAULT: "bm25",
+ }.get(mode)
+
+ def _build_query_filter(
+ self, mode, embedding_field, vector_top_k, similarity_top_k
+ ):
+ """
+ Build query filter for Vespa query.
+ The part after "select * from {sources_str} where" in the query.
+ """
+ if mode in [
+ VectorStoreQueryMode.SEMANTIC_HYBRID,
+ VectorStoreQueryMode.HYBRID,
+ ]:
+ return f"rank({{targetHits:{vector_top_k}}}nearestNeighbor({embedding_field},q), userQuery()) limit {similarity_top_k}"
+ else:
+ raise ValueError(f"Query mode {mode} not supported.")
+
+ def query(
+ self,
+ query: VectorStoreQuery,
+ sources: Optional[List[str]] = None,
+ rank_profile: Optional[str] = None,
+ vector_top_k: int = 10,
+ **kwargs: Any,
+ ) -> VectorStoreQueryResult:
+ """Query vector store."""
+ logger.debug(f"Query: {query}")
+ sources_str = ",".join(sources) if sources else "sources *"
+ mode = query.mode
+ body = self._create_query_body(
+ query=query,
+ sources_str=sources_str,
+ rank_profile=rank_profile,
+ create_embedding=not self.embeddings_outside_vespa,
+ vector_top_k=vector_top_k,
+ )
+ logger.info(f"Vespa Query body:\n {body}")
+ with self.app.syncio() as session:
+ response = session.query(
+ body=body,
+ )
+ if not response.is_successful():
+ raise ValueError(
+ f"Query request failed: {response.status_code}, response payload: {response.get_json()}"
+ )
+ logger.debug("Response:")
+ logger.debug(response.json)
+ logger.debug("Hits:")
+ logger.debug(response.hits)
+ nodes = []
+ ids: List[str] = []
+ similarities: List[float] = []
+ for hit in response.hits:
+ response_fields: dict = hit.get("fields", {})
+ metadata = response_fields.get("metadata", {})
+ metadata = json.loads(metadata)
+ logger.debug(f"Metadata: {metadata}")
+ node = metadata_dict_to_node(metadata)
+ text = response_fields.get("body", "")
+ node.set_content(text)
+ nodes.append(node)
+ ids.append(response_fields.get("id"))
+ similarities.append(hit["relevance"])
+ return VectorStoreQueryResult(nodes=nodes, ids=ids, similarities=similarities)
+
+ async def aquery(
+ self,
+ query: VectorStoreQuery,
+ sources: Optional[List[str]] = None,
+ rank_profile: Optional[str] = None,
+ vector_top_k: int = 10,
+ **kwargs: Any,
+ ) -> VectorStoreQueryResult:
+ """
+ Asynchronously query vector store.
+ NOTE: this is not implemented for all vector stores. If not implemented,
+ it will just call query synchronously.
+ """
+ logger.info("Async query not implemented. Will call query synchronously.")
+ return self.query(
+ query=query,
+ sources=sources,
+ rank_profile=rank_profile,
+ vector_top_k=vector_top_k,
+ **kwargs,
+ )
+
+ def persist(
+ self,
+ ) -> None:
+        raise NotImplementedError("Persist is not implemented for VespaVectorStore.")
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/templates.py b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/templates.py
new file mode 100644
index 0000000000000..18570bd9e620c
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/llama_index/vector_stores/vespa/templates.py
@@ -0,0 +1,96 @@
+try:
+ from vespa.package import (
+ ApplicationPackage,
+ Field,
+ Schema,
+ Document,
+ HNSW,
+ RankProfile,
+ Component,
+ Parameter,
+ FieldSet,
+ GlobalPhaseRanking,
+ Function,
+ )
+except ImportError:
+ raise ModuleNotFoundError(
+ "pyvespa not installed. Please install it via `pip install pyvespa`"
+ )
+
+hybrid_template = ApplicationPackage(
+ name="hybridsearch",
+ schema=[
+ Schema(
+ name="doc",
+ document=Document(
+ fields=[
+ Field(name="id", type="string", indexing=["summary"]),
+ Field(name="metadata", type="string", indexing=["summary"]),
+ Field(
+ name="text",
+ type="string",
+ indexing=["index", "summary"],
+ index="enable-bm25",
+ bolding=True,
+ ),
+ Field(
+ name="embedding",
+ type="tensor(x[384])",
+ indexing=[
+ "input text",
+ "embed",
+ "index",
+ "attribute",
+ ],
+ ann=HNSW(distance_metric="angular"),
+ is_document_field=False,
+ ),
+ ]
+ ),
+ fieldsets=[FieldSet(name="default", fields=["text", "metadata"])],
+ rank_profiles=[
+ RankProfile(
+ name="bm25",
+ inputs=[("query(q)", "tensor(x[384])")],
+ functions=[Function(name="bm25sum", expression="bm25(text)")],
+ first_phase="bm25sum",
+ ),
+ RankProfile(
+ name="semantic",
+ inputs=[("query(q)", "tensor(x[384])")],
+ first_phase="closeness(field, embedding)",
+ ),
+ RankProfile(
+ name="fusion",
+ inherits="bm25",
+ inputs=[("query(q)", "tensor(x[384])")],
+ first_phase="closeness(field, embedding)",
+ global_phase=GlobalPhaseRanking(
+ expression="reciprocal_rank_fusion(bm25sum, closeness(field, embedding))",
+ rerank_count=1000,
+ ),
+ ),
+ ],
+ )
+ ],
+ components=[
+ Component(
+ id="e5",
+ type="hugging-face-embedder",
+ parameters=[
+ Parameter(
+ "transformer-model",
+ {
+ "url": "https://github.com/vespa-engine/sample-apps/raw/master/simple-semantic-search/model/e5-small-v2-int8.onnx"
+ },
+ ),
+ Parameter(
+ "tokenizer-model",
+ {
+ "url": "https://raw.githubusercontent.com/vespa-engine/sample-apps/master/simple-semantic-search/model/tokenizer.json"
+ },
+ ),
+ ],
+ )
+ ],
+)
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/pyproject.toml b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/pyproject.toml
new file mode 100644
index 0000000000000..9830a544a05c4
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/pyproject.toml
@@ -0,0 +1,65 @@
+[build-system]
+build-backend = "poetry.core.masonry.api"
+requires = ["poetry-core"]
+
+[tool.codespell]
+check-filenames = true
+check-hidden = true
+skip = "*.csv,*.html,*.json,*.jsonl,*.pdf,*.txt,*.ipynb"
+
+[tool.llamahub]
+contains_example = true
+import_path = "llama_index.vector_stores.vespa"
+
+[tool.llamahub.class_authors]
+VespaVectorStore = "thomasht86"
+
+[tool.mypy]
+disallow_untyped_defs = true
+exclude = ["_static", "build", "examples", "notebooks", "venv"]
+ignore_missing_imports = true
+python_version = "3.8"
+
+[tool.poetry]
+authors = ["Thomas Thoresen "]
+description = "llama-index vector_stores vespa integration"
+exclude = ["**/BUILD"]
+license = "MIT"
+name = "llama-index-vector-stores-vespa"
+readme = "README.md"
+version = "0.0.1"
+
+[tool.poetry.dependencies]
+python = ">=3.8.1,<4.0"
+llama-index-core = "^0.10.1"
+pyvespa = "^0.40.0"
+
+[tool.poetry.group.dev.dependencies]
+ipython = "8.10.0"
+jupyter = "^1.0.0"
+mypy = "0.991"
+pre-commit = "3.2.0"
+pylint = "2.15.10"
+pytest = "7.2.1"
+pytest-asyncio = "0.23.6"
+pytest-mock = "3.11.1"
+pyvespa = "0.40.0"
+ruff = "0.0.292"
+tree-sitter-languages = "^1.8.0"
+types-Deprecated = ">=0.1.0"
+types-PyYAML = "^6.0.12.12"
+types-protobuf = "^4.24.0.4"
+types-redis = "4.5.5.0"
+types-requests = "2.28.11.8"
+types-setuptools = "67.1.0.0"
+
+[tool.poetry.group.dev.dependencies.black]
+extras = ["jupyter"]
+version = "<=23.9.1,>=23.7.0"
+
+[tool.poetry.group.dev.dependencies.codespell]
+extras = ["toml"]
+version = ">=v2.2.6"
+
+[[tool.poetry.packages]]
+include = "llama_index/"
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/BUILD b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/BUILD
new file mode 100644
index 0000000000000..dabf212d7e716
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/BUILD
@@ -0,0 +1 @@
+python_tests()
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/__init__.py b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/__init__.py
new file mode 100644
index 0000000000000..e69de29bb2d1d
diff --git a/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/test_vespavectorstore.py b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/test_vespavectorstore.py
new file mode 100644
index 0000000000000..44bbc6b90d5b4
--- /dev/null
+++ b/llama-index-integrations/vector_stores/llama-index-vector-stores-vespa/tests/test_vespavectorstore.py
@@ -0,0 +1,154 @@
+import asyncio
+import pytest
+
+from llama_index.core.schema import TextNode
+from llama_index.core.vector_stores.types import (
+ VectorStoreQuery,
+ VectorStoreQueryMode,
+)
+from vespa.application import ApplicationPackage
+from llama_index.vector_stores.vespa import VespaVectorStore, hybrid_template
+
+try:
+ # Should be installed as pyvespa-dependency
+ import docker
+
+ client = docker.from_env()
+ docker_available = client.ping()
+except Exception:
+ docker_available = False
+
+
+# Assuming Vespa services are mocked or local Vespa Docker is used
+@pytest.fixture(scope="session")
+def vespa_app():
+ app_package: ApplicationPackage = hybrid_template
+ return VespaVectorStore(application_package=app_package, deployment_target="local")
+
+
+@pytest.fixture(scope="session")
+def nodes() -> list:
+ return [
+ TextNode(
+ text="The Shawshank Redemption",
+ metadata={
+ "id": "1",
+ "author": "Stephen King",
+ "theme": "Friendship",
+ "year": 1994,
+ },
+ ),
+ TextNode(
+ text="The Godfather",
+ metadata={
+ "id": "2",
+ "director": "Francis Ford Coppola",
+ "theme": "Mafia",
+ "year": 1972,
+ },
+ ),
+ TextNode(
+ text="Inception",
+ metadata={
+ "id": "3",
+ "director": "Christopher Nolan",
+ "theme": "Fiction",
+ "year": 2010,
+ },
+ ),
+ TextNode(
+ text="To Kill a Mockingbird",
+ metadata={
+ "id": "4",
+ "author": "Harper Lee",
+ "theme": "Mafia",
+ "year": 1960,
+ },
+ ),
+ TextNode(
+ text="1984",
+ metadata={
+ "id": "5",
+ "author": "George Orwell",
+ "theme": "Totalitarianism",
+ "year": 1949,
+ },
+ ),
+ TextNode(
+ text="The Great Gatsby",
+ metadata={
+ "id": "6",
+ "author": "F. Scott Fitzgerald",
+ "theme": "The American Dream",
+ "year": 1925,
+ },
+ ),
+ TextNode(
+ text="Harry Potter and the Sorcerer's Stone",
+ metadata={
+ "id": "7",
+ "author": "J.K. Rowling",
+ "theme": "Fiction",
+ "year": 1997,
+ },
+ ),
+ ]
+
+
+@pytest.fixture(scope="session")
+def added_node_ids(vespa_app, nodes):
+    # The IDs returned by `add` match the order of `nodes`
+    return vespa_app.add(nodes)
+
+
+@pytest.mark.skipif(not docker_available, reason="Docker not available")
+def test_query_text_search(vespa_app, added_node_ids):
+ query = VectorStoreQuery(
+ query_str="Inception", # Ensure the query matches the case used in the nodes
+ mode="text_search",
+ similarity_top_k=1,
+ )
+ result = vespa_app.query(query)
+ assert len(result.nodes) == 1
+ node_metadata = result.nodes[0].metadata
+ assert node_metadata["id"] == "3", "Expected Inception node"
+
+
+@pytest.mark.skipif(not docker_available, reason="Docker not available")
+def test_query_vector_search(vespa_app, added_node_ids):
+ query = VectorStoreQuery(
+ query_str="magic, wizardry",
+ mode="semantic_hybrid",
+ similarity_top_k=1,
+ )
+ result = vespa_app.query(query)
+ assert len(result.nodes) == 1, "Expected 1 result"
+ node_metadata = result.nodes[0].metadata
+ print(node_metadata)
+ assert node_metadata["id"] == "7", "Expected Harry Potter node"
+
+
+@pytest.mark.skipif(not docker_available, reason="Docker not available")
+def test_delete_node(vespa_app, added_node_ids):
+ # Testing the deletion of a node
+ vespa_app.delete(ref_doc_id=added_node_ids[1])
+ query = VectorStoreQuery(
+ query_str="Godfather",
+ mode=VectorStoreQueryMode.TEXT_SEARCH,
+ similarity_top_k=1,
+ )
+ result = vespa_app.query(query)
+ assert (
+ len(result.nodes) == 0
+ ), f"Deleted node still present in the vector store: {result.nodes}"
+
+
+@pytest.mark.skipif(not docker_available, reason="Docker not available")
+@pytest.mark.asyncio()
+async def test_async_add_and_query(vespa_app, nodes):
+ # Testing async add and query
+ await asyncio.gather(*[vespa_app.async_add(nodes)])
+ query = VectorStoreQuery(query_str="Harry Potter", similarity_top_k=1)
+ result = await vespa_app.aquery(query)
+ assert len(result.nodes) == 1
+ assert result.nodes[0].node_id == "7"
diff --git a/llama-index-networks/poetry.lock b/llama-index-networks/poetry.lock
index b9c899e7c38ec..4791d19423d05 100644
--- a/llama-index-networks/poetry.lock
+++ b/llama-index-networks/poetry.lock
@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.6.1 and should not be changed by hand.
+# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
[[package]]
name = "aiohttp"
@@ -1536,13 +1536,13 @@ testing = ["Django", "attrs", "colorama", "docopt", "pytest (<7.0.0)"]
[[package]]
name = "jinja2"
-version = "3.1.3"
+version = "3.1.4"
description = "A very fast and expressive template engine."
optional = false
python-versions = ">=3.7"
files = [
- {file = "Jinja2-3.1.3-py3-none-any.whl", hash = "sha256:7d6d50dd97d52cbc355597bd845fabfbac3f551e1f99619e39a35ce8c370b5fa"},
- {file = "Jinja2-3.1.3.tar.gz", hash = "sha256:ac8bd6544d4bb2c9792bf3a159e80bba8fda7f07e81bc3aed565432d5925ba90"},
+ {file = "jinja2-3.1.4-py3-none-any.whl", hash = "sha256:bc5dd2abb727a5319567b7a813e6a2e7318c39f4f487cfe6c89c6f9c7d25197d"},
+ {file = "jinja2-3.1.4.tar.gz", hash = "sha256:4a3aee7acbbe7303aede8e9648d13b8bf88a429282aa6122a993f0ac800cb369"},
]
[package.dependencies]
@@ -3459,6 +3459,7 @@ files = [
{file = "PyYAML-6.0.1-cp311-cp311-win_amd64.whl", hash = "sha256:bf07ee2fef7014951eeb99f56f39c9bb4af143d8aa3c21b1677805985307da34"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:855fb52b0dc35af121542a76b9a84f8d1cd886ea97c84703eaa6d88e37a2ad28"},
{file = "PyYAML-6.0.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:40df9b996c2b73138957fe23a16a4f0ba614f4c0efce1e9406a184b6d07fa3a9"},
+ {file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a08c6f0fe150303c1c6b71ebcd7213c2858041a7e01975da3a99aed1e7a378ef"},
{file = "PyYAML-6.0.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6c22bec3fbe2524cde73d7ada88f6566758a8f7227bfbf93a408a9d86bcc12a0"},
{file = "PyYAML-6.0.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:8d4e9c88387b0f5c7d5f281e55304de64cf7f9c0021a3525bd3b1c542da3b0e4"},
{file = "PyYAML-6.0.1-cp312-cp312-win32.whl", hash = "sha256:d483d2cdf104e7c9fa60c544d92981f12ad66a457afae824d146093b8c294c54"},
@@ -4306,13 +4307,13 @@ files = [
[[package]]
name = "tqdm"
-version = "4.66.2"
+version = "4.66.3"
description = "Fast, Extensible Progress Meter"
optional = false
python-versions = ">=3.7"
files = [
- {file = "tqdm-4.66.2-py3-none-any.whl", hash = "sha256:1ee4f8a893eb9bef51c6e35730cebf234d5d0b6bd112b0271e10ed7c24a02bd9"},
- {file = "tqdm-4.66.2.tar.gz", hash = "sha256:6cd52cdf0fef0e0f543299cfc96fec90d7b8a7e88745f411ec33eb44d5ed3531"},
+ {file = "tqdm-4.66.3-py3-none-any.whl", hash = "sha256:4f41d54107ff9a223dca80b53efe4fb654c67efaba7f47bada3ee9d50e05bd53"},
+ {file = "tqdm-4.66.3.tar.gz", hash = "sha256:23097a41eba115ba99ecae40d06444c15d1c0c698d527a01c6c8bd1c5d0647e5"},
]
[package.dependencies]
diff --git a/llama-index-packs/llama-index-packs-diff-private-simple-dataset/poetry.lock b/llama-index-packs/llama-index-packs-diff-private-simple-dataset/poetry.lock
index dca3dbecfd509..c119d2b4b8db6 100644
--- a/llama-index-packs/llama-index-packs-diff-private-simple-dataset/poetry.lock
+++ b/llama-index-packs/llama-index-packs-diff-private-simple-dataset/poetry.lock
@@ -1394,13 +1394,13 @@ testing = ["Django", "attrs", "colorama", "docopt", "pytest (<7.0.0)"]
[[package]]
name = "jinja2"
-version = "3.1.3"
+version = "3.1.4"
description = "A very fast and expressive template engine."
optional = false
python-versions = ">=3.7"
files = [
- {file = "Jinja2-3.1.3-py3-none-any.whl", hash = "sha256:7d6d50dd97d52cbc355597bd845fabfbac3f551e1f99619e39a35ce8c370b5fa"},
- {file = "Jinja2-3.1.3.tar.gz", hash = "sha256:ac8bd6544d4bb2c9792bf3a159e80bba8fda7f07e81bc3aed565432d5925ba90"},
+ {file = "jinja2-3.1.4-py3-none-any.whl", hash = "sha256:bc5dd2abb727a5319567b7a813e6a2e7318c39f4f487cfe6c89c6f9c7d25197d"},
+ {file = "jinja2-3.1.4.tar.gz", hash = "sha256:4a3aee7acbbe7303aede8e9648d13b8bf88a429282aa6122a993f0ac800cb369"},
]
[package.dependencies]
@@ -3950,13 +3950,13 @@ files = [
[[package]]
name = "tqdm"
-version = "4.66.2"
+version = "4.66.3"
description = "Fast, Extensible Progress Meter"
optional = false
python-versions = ">=3.7"
files = [
- {file = "tqdm-4.66.2-py3-none-any.whl", hash = "sha256:1ee4f8a893eb9bef51c6e35730cebf234d5d0b6bd112b0271e10ed7c24a02bd9"},
- {file = "tqdm-4.66.2.tar.gz", hash = "sha256:6cd52cdf0fef0e0f543299cfc96fec90d7b8a7e88745f411ec33eb44d5ed3531"},
+ {file = "tqdm-4.66.3-py3-none-any.whl", hash = "sha256:4f41d54107ff9a223dca80b53efe4fb654c67efaba7f47bada3ee9d50e05bd53"},
+ {file = "tqdm-4.66.3.tar.gz", hash = "sha256:23097a41eba115ba99ecae40d06444c15d1c0c698d527a01c6c8bd1c5d0647e5"},
]
[package.dependencies]
diff --git a/llama-index-packs/llama-index-packs-koda-retriever/poetry.lock b/llama-index-packs/llama-index-packs-koda-retriever/poetry.lock
index 66525f1d81913..e0e07d418e30c 100644
--- a/llama-index-packs/llama-index-packs-koda-retriever/poetry.lock
+++ b/llama-index-packs/llama-index-packs-koda-retriever/poetry.lock
@@ -1394,13 +1394,13 @@ testing = ["Django", "attrs", "colorama", "docopt", "pytest (<7.0.0)"]
[[package]]
name = "jinja2"
-version = "3.1.3"
+version = "3.1.4"
description = "A very fast and expressive template engine."
optional = false
python-versions = ">=3.7"
files = [
- {file = "Jinja2-3.1.3-py3-none-any.whl", hash = "sha256:7d6d50dd97d52cbc355597bd845fabfbac3f551e1f99619e39a35ce8c370b5fa"},
- {file = "Jinja2-3.1.3.tar.gz", hash = "sha256:ac8bd6544d4bb2c9792bf3a159e80bba8fda7f07e81bc3aed565432d5925ba90"},
+ {file = "jinja2-3.1.4-py3-none-any.whl", hash = "sha256:bc5dd2abb727a5319567b7a813e6a2e7318c39f4f487cfe6c89c6f9c7d25197d"},
+ {file = "jinja2-3.1.4.tar.gz", hash = "sha256:4a3aee7acbbe7303aede8e9648d13b8bf88a429282aa6122a993f0ac800cb369"},
]
[package.dependencies]
@@ -3898,13 +3898,13 @@ files = [
[[package]]
name = "tqdm"
-version = "4.66.2"
+version = "4.66.3"
description = "Fast, Extensible Progress Meter"
optional = false
python-versions = ">=3.7"
files = [
- {file = "tqdm-4.66.2-py3-none-any.whl", hash = "sha256:1ee4f8a893eb9bef51c6e35730cebf234d5d0b6bd112b0271e10ed7c24a02bd9"},
- {file = "tqdm-4.66.2.tar.gz", hash = "sha256:6cd52cdf0fef0e0f543299cfc96fec90d7b8a7e88745f411ec33eb44d5ed3531"},
+ {file = "tqdm-4.66.3-py3-none-any.whl", hash = "sha256:4f41d54107ff9a223dca80b53efe4fb654c67efaba7f47bada3ee9d50e05bd53"},
+ {file = "tqdm-4.66.3.tar.gz", hash = "sha256:23097a41eba115ba99ecae40d06444c15d1c0c698d527a01c6c8bd1c5d0647e5"},
]
[package.dependencies]
diff --git a/llama-index-packs/llama-index-packs-resume-screener/llama_index/packs/resume_screener/base.py b/llama-index-packs/llama-index-packs-resume-screener/llama_index/packs/resume_screener/base.py
index 249aa09181524..01b45bbad9072 100644
--- a/llama-index-packs/llama-index-packs-resume-screener/llama_index/packs/resume_screener/base.py
+++ b/llama-index-packs/llama-index-packs-resume-screener/llama_index/packs/resume_screener/base.py
@@ -7,7 +7,7 @@
from llama_index.core.schema import NodeWithScore
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PDFReader
-from pydantic import BaseModel, Field
+from llama_index.core.bridge.pydantic import BaseModel, Field
# backwards compatibility
try:
diff --git a/llama-index-packs/llama-index-packs-tables/llama_index/packs/tables/chain_of_table/base.py b/llama-index-packs/llama-index-packs-tables/llama_index/packs/tables/chain_of_table/base.py
index 8889c6d4637f8..98524e1b021df 100644
--- a/llama-index-packs/llama-index-packs-tables/llama_index/packs/tables/chain_of_table/base.py
+++ b/llama-index-packs/llama-index-packs-tables/llama_index/packs/tables/chain_of_table/base.py
@@ -647,7 +647,7 @@ def __init__(
super().__init__(table=table, llm=llm, verbose=verbose, **kwargs)
def custom_query(self, query_str: str) -> Response:
- """Run chain of thought query engine."""
+ """Run chain of table query engine."""
op_chain = []
dynamic_plan_parser = FnComponent(fn=_dynamic_plan_parser)