From 6d2e8006773100be0d75fab179e048562f45b0e9 Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Fri, 2 Aug 2024 11:39:01 -0700 Subject: [PATCH 1/8] docs[minor]: Update CSV doc (#6345) * docs[minor]: Update CSV doc * drop old file --- .../document_loaders/file_loaders/csv.ipynb | 226 ++++++++++++++++++ .../document_loaders/file_loaders/csv.mdx | 90 ------- 2 files changed, 226 insertions(+), 90 deletions(-) create mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.ipynb delete mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.mdx diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.ipynb b/docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.ipynb new file mode 100644 index 000000000000..5f0f34c143d5 --- /dev/null +++ b/docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.ipynb @@ -0,0 +1,226 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "sidebar_label: CSV\n", + "sidebar_class_name: node-only\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# CSVLoader\n", + "\n", + "```{=mdx}\n", + "\n", + ":::tip Compatibility\n", + "\n", + "Only available on Node.js.\n", + "\n", + ":::\n", + "\n", + "```\n", + "\n", + "This notebook provides a quick overview for getting started with `CSVLoader` [document loaders](/docs/concepts/#document-loaders). For detailed documentation of all `CSVLoader` features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_csv.CSVLoader.html).\n", + "\n", + "This example goes over how to load data from CSV files. The second argument is the `column` name to extract from the CSV file. One document will be created for each row in the CSV file. 
When `column` is not specified, each row is converted into key/value pairs, with each pair written on its own line of the document's `pageContent`. When `column` is specified, one document is created for each row, and the value of the specified column is used as the document's `pageContent`.\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Compatibility | Local | [PY support](https://python.langchain.com/docs/integrations/document_loaders/csv)| \n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [CSVLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_csv.CSVLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_fs_csv.html) | Node-only | ✅ | ✅ |\n", + "\n", + "## Setup\n", + "\n", + "To access the `CSVLoader` document loader, you'll need to install the `@langchain/community` integration, along with the `d3-dsv@2` peer dependency.\n", + "\n", + "### Installation\n", + "\n", + "The LangChain CSVLoader integration lives in the `@langchain/community` integration package.\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " @langchain/community d3-dsv@2\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our loader object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import { CSVLoader } from \"@langchain/community/document_loaders/fs/csv\"\n", + "\n", + "const exampleCsvPath = \"../../../../../../langchain/src/document_loaders/tests/example_data/example_separator.csv\";\n", + "\n", + "const loader = new CSVLoader(exampleCsvPath)" + ] + }, + { + "cell_type": "markdown", + "metadata": 
{}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: 'id|html: 1|\"Corruption discovered at the core of the Banking Clan!\"',\n", + " metadata: {\n", + " source: '../../../../../../langchain/src/document_loaders/tests/example_data/example_separator.csv',\n", + " line: 1\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "const docs = await loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " source: '../../../../../../langchain/src/document_loaders/tests/example_data/example_separator.csv',\n", + " line: 1\n", + "}\n" + ] + } + ], + "source": [ + "console.log(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Usage, extracting a single column\n", + "\n", + "Example CSV file:\n", + "\n", + "```csv\n", + "id|html\n", + "1|\"Corruption discovered at the core of the Banking Clan!\"\n", + "2|\"Reunited, Rush Clovis and Senator Amidala\"\n", + "3|\"discover the full extent of the deception.\"\n", + "4|\"Anakin Skywalker is sent to the rescue!\"\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: 'Corruption discovered at the core of the Banking Clan!',\n", + " metadata: {\n", + " source: '../../../../../../langchain/src/document_loaders/tests/example_data/example_separator.csv',\n", + " line: 1\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "import { CSVLoader } from \"@langchain/community/document_loaders/fs/csv\";\n", + "\n", + "const singleColumnLoader = new CSVLoader(\n", + " exampleCsvPath,\n", + " {\n", + " 
column: \"html\",\n", + " separator:\"|\"\n", + " }\n", + ");\n", + "\n", + "const singleColumnDocs = await singleColumnLoader.load();\n", + "console.log(singleColumnDocs[0]);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all CSVLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_csv.CSVLoader.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.mdx b/docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.mdx deleted file mode 100644 index e9adf18540a8..000000000000 --- a/docs/core_docs/docs/integrations/document_loaders/file_loaders/csv.mdx +++ /dev/null @@ -1,90 +0,0 @@ -# CSV files - -This example goes over how to load data from CSV files. The second argument is the `column` name to extract from the CSV file. One document will be created for each row in the CSV file. When `column` is not specified, each row is converted into a key/value pair with each key/value pair outputted to a new line in the document's `pageContent`. When `column` is specified, one document is created for each row, and the value of the specified column is used as the document's pageContent. - -## Setup - -```bash npm2yarn -npm install d3-dsv@2 -``` - -## Usage, extracting all columns - -Example CSV file: - -```csv -id,text -1,This is a sentence. -2,This is another sentence. 
-``` - -Example code: - -```typescript -import { CSVLoader } from "@langchain/community/document_loaders/fs/csv"; - -const loader = new CSVLoader("src/document_loaders/example_data/example.csv"); - -const docs = await loader.load(); -/* -[ - Document { - "metadata": { - "line": 1, - "source": "src/document_loaders/example_data/example.csv", - }, - "pageContent": "id: 1 -text: This is a sentence.", - }, - Document { - "metadata": { - "line": 2, - "source": "src/document_loaders/example_data/example.csv", - }, - "pageContent": "id: 2 -text: This is another sentence.", - }, -] -*/ -``` - -## Usage, extracting a single column - -Example CSV file: - -```csv -id,text -1,This is a sentence. -2,This is another sentence. -``` - -Example code: - -```typescript -import { CSVLoader } from "@langchain/community/document_loaders/fs/csv"; - -const loader = new CSVLoader( - "src/document_loaders/example_data/example.csv", - "text" -); - -const docs = await loader.load(); -/* -[ - Document { - "metadata": { - "line": 1, - "source": "src/document_loaders/example_data/example.csv", - }, - "pageContent": "This is a sentence.", - }, - Document { - "metadata": { - "line": 2, - "source": "src/document_loaders/example_data/example.csv", - }, - "pageContent": "This is another sentence.", - }, -] -*/ -``` From 668588eb6acca3c3af1a467494ba5a04a88c18f1 Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Fri, 2 Aug 2024 11:41:32 -0700 Subject: [PATCH 2/8] docs[minor]: Update fs pdf loader doc (#6342) * docs[minor]: Update fs pdf loader doc * cr --- .../document_loaders/file_loaders/pdf.ipynb | 502 ++++++++++++++++++ .../document_loaders/file_loaders/pdf.mdx | 72 --- 2 files changed, 502 insertions(+), 72 deletions(-) create mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.ipynb delete mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.mdx diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.ipynb 
b/docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.ipynb new file mode 100644 index 000000000000..ac0092586134 --- /dev/null +++ b/docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.ipynb @@ -0,0 +1,502 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "sidebar_label: PDFLoader\n", + "sidebar_class_name: node-only\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# PDFLoader\n", + "\n", + "```{=mdx}\n", + "\n", + ":::tip Compatibility\n", + "\n", + "Only available on Node.js.\n", + "\n", + ":::\n", + "\n", + "```\n", + "\n", + "This notebook provides a quick overview for getting started with `PDFLoader` [document loaders](/docs/concepts/#document-loaders). For detailed documentation of all `PDFLoader` features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Compatibility | Local | PY support | \n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [PDFLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_fs_pdf.html) | Node-only | ✅ | 🟠 (See note below) |\n", + "\n", + "> The Python package has many PDF loaders to choose from. 
See [this link](https://python.langchain.com/docs/integrations/document_loaders/) for a full list of Python document loaders.\n", + "\n", + "## Setup\n", + "\n", + "To access the `PDFLoader` document loader, you'll need to install the `@langchain/community` integration, along with the `pdf-parse` package.\n", + "\n", + "### Credentials\n", + "\n", + "### Installation\n", + "\n", + "The LangChain PDFLoader integration lives in the `@langchain/community` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " @langchain/community pdf-parse\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our loader object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "import { PDFLoader } from \"@langchain/community/document_loaders/fs/pdf\"\n", + "\n", + "const nike10kPdfPath = \"../../../../data/nke-10k-2023.pdf\"\n", + "\n", + "const loader = new PDFLoader(nike10kPdfPath)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: 'Table of Contents\\n' +\n", + " 'UNITED STATES\\n' +\n", + " 'SECURITIES AND EXCHANGE COMMISSION\\n' +\n", + " 'Washington, D.C. 
20549\\n' +\n", + " 'FORM 10-K\\n' +\n", + " '(Mark One)\\n' +\n", + " '☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\\n' +\n", + " 'FOR THE FISCAL YEAR ENDED MAY 31, 2023\\n' +\n", + " 'OR\\n' +\n", + " '☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\\n' +\n", + " 'FOR THE TRANSITION PERIOD FROM TO .\\n' +\n", + " 'Commission File No. 1-10635\\n' +\n", + " 'NIKE, Inc.\\n' +\n", + " '(Exact name of Registrant as specified in its charter)\\n' +\n", + " 'Oregon93-0584541\\n' +\n", + " '(State or other jurisdiction of incorporation)(IRS Employer Identification No.)\\n' +\n", + " 'One Bowerman Drive, Beaverton, Oregon 97005-6453\\n' +\n", + " '(Address of principal executive offices and zip code)\\n' +\n", + " '(503) 671-6453\\n' +\n", + " \"(Registrant's telephone number, including area code)\\n\" +\n", + " 'SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:\\n' +\n", + " 'Class B Common StockNKENew York Stock Exchange\\n' +\n", + " '(Title of each class)(Trading symbol)(Name of each exchange on which registered)\\n' +\n", + " 'SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:\\n' +\n", + " 'NONE\\n' +\n", + " 'Indicate by check mark:YESNO\\n' +\n", + " '•if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.þ ̈\\n' +\n", + " '•if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. 
̈þ\\n' +\n", + " '•whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding\\n' +\n", + " '12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the\\n' +\n", + " 'past 90 days.\\n' +\n", + " 'þ ̈\\n' +\n", + " '•whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T\\n' +\n", + " '(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).\\n' +\n", + " 'þ ̈\\n' +\n", + " '•whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company or an emerging growth company. See the definitions of “large accelerated filer,”\\n' +\n", + " '“accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\\n' +\n", + " 'Large accelerated filerþAccelerated filer☐Non-accelerated filer☐Smaller reporting company☐Emerging growth company☐\\n' +\n", + " '•if an emerging growth company, if the registrant has elected not to use the extended transition period for complying with any new or revised financial\\n' +\n", + " 'accounting standards provided pursuant to Section 13(a) of the Exchange Act.\\n' +\n", + " ' ̈\\n' +\n", + " \"•whether the registrant has filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial\\n\" +\n", + " 'reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 
7262(b)) by the registered public accounting firm that prepared or issued its audit\\n' +\n", + " 'report.\\n' +\n", + " 'þ\\n' +\n", + " '•if securities are registered pursuant to Section 12(b) of the Act, whether the financial statements of the registrant included in the filing reflect the\\n' +\n", + " 'correction of an error to previously issued financial statements.\\n' +\n", + " ' ̈\\n' +\n", + " '•whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the\\n' +\n", + " \"registrant's executive officers during the relevant recovery period pursuant to § 240.10D-1(b).\\n\" +\n", + " ' ̈\\n' +\n", + " '•\\n' +\n", + " 'whether the registrant is a shell company (as defined in Rule 12b-2 of the Act).☐þ\\n' +\n", + " \"As of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:\\n\" +\n", + " 'Class A$7,831,564,572 \\n' +\n", + " 'Class B136,467,702,472 \\n' +\n", + " '$144,299,267,044 ',\n", + " metadata: {\n", + " source: '../../../../data/nke-10k-2023.pdf',\n", + " pdf: {\n", + " version: '1.10.100',\n", + " info: [Object],\n", + " metadata: null,\n", + " totalPages: 107\n", + " },\n", + " loc: { pageNumber: 1 }\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "const docs = await loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " source: '../../../../data/nke-10k-2023.pdf',\n", + " pdf: {\n", + " version: '1.10.100',\n", + " info: {\n", + " PDFFormatVersion: '1.4',\n", + " IsAcroFormPresent: false,\n", + " IsXFAPresent: false,\n", + " Title: '0000320187-23-000039',\n", + " Author: 'EDGAR Online, a division of Donnelley Financial Solutions',\n", + " Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',\n", + " Keywords: 
'0000320187-23-000039; ; 10-K',\n", + " Creator: 'EDGAR Filing HTML Converter',\n", + " Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',\n", + " CreationDate: \"D:20230720162200-04'00'\",\n", + " ModDate: \"D:20230720162208-04'00'\"\n", + " },\n", + " metadata: null,\n", + " totalPages: 107\n", + " },\n", + " loc: { pageNumber: 1 }\n", + "}\n" + ] + } + ], + "source": [ + "console.log(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Usage, one document per file" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Table of Contents\n", + "UNITED STATES\n", + "SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 20549\n", + "FORM 10-K\n", + "\n" + ] + } + ], + "source": [ + "import { PDFLoader } from \"@langchain/community/document_loaders/fs/pdf\";\n", + "\n", + "const singleDocPerFileLoader = new PDFLoader(nike10kPdfPath, {\n", + " splitPages: false,\n", + "});\n", + "\n", + "const singleDoc = await singleDocPerFileLoader.load();\n", + "console.log(singleDoc[0].pageContent.slice(0, 100))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Usage, custom `pdfjs` build\n", + "\n", + "By default we use the `pdfjs` build bundled with `pdf-parse`, which is compatible with most environments, including Node.js and modern browsers. 
If you want to use a more recent version of `pdfjs-dist` or if you want to use a custom build of `pdfjs-dist`, you can do so by providing a custom `pdfjs` function that returns a promise that resolves to the `PDFJS` object.\n", + "\n", + "In the following example we use the \"legacy\" (see [pdfjs docs](https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#which-browsersenvironments-are-supported)) build of `pdfjs-dist`, which includes several polyfills not included in the default build.\n", + "\n", + "```{=mdx}\n", + "\n", + " pdfjs-dist\n", + "\n", + "\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import { PDFLoader } from \"@langchain/community/document_loaders/fs/pdf\";\n", + "\n", + "const customBuildLoader = new PDFLoader(nike10kPdfPath, {\n", + " // you may need to add `.then(m => m.default)` to the end of the import\n", + " // @lc-ts-ignore\n", + " pdfjs: () => import(\"pdfjs-dist/legacy/build/pdf.js\"),\n", + "});" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eliminating extra spaces\n", + "\n", + "PDFs come in many varieties, which makes reading them a challenge. The loader parses individual text elements and joins them together with a space by default, but\n", + "if you are seeing excessive spaces, this may not be the desired behavior. 
In that case, you can override the separator with an empty string like this:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Mark One)\n", + "☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\n", + "FOR THE FISCAL YEAR ENDED MAY 31, 2023\n", + "OR\n", + "☐ TRANSITI\n" + ] + } + ], + "source": [ + "import { PDFLoader } from \"@langchain/community/document_loaders/fs/pdf\";\n", + "\n", + "const noExtraSpacesLoader = new PDFLoader(nike10kPdfPath, {\n", + " parsedItemSeparator: \"\",\n", + "});\n", + "\n", + "const noExtraSpacesDocs = await noExtraSpacesLoader.load();\n", + "console.log(noExtraSpacesDocs[0].pageContent.slice(100, 250))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading directories" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt\n", + "Unknown file type: example.txt\n", + "Unknown file type: notion.md\n", + "Unknown file type: bad_frontmatter.md\n", + "Unknown file type: frontmatter.md\n", + "Unknown file type: no_frontmatter.md\n", + "Unknown file type: no_metadata.md\n", + "Unknown file type: tags_and_frontmatter.md\n", + "Unknown file type: test.mp3\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\\n' +\n", + " 'Satoshi Nakamoto\\n' +\n", + " 'satoshin@gmx.com\\n' +\n", + " 'www.bitcoin.org\\n' +\n", + " 'Abstract. A purely peer-to-peer version of electronic cash would allow online \\n' +\n", + " 'payments to be sent directly from one party to another without going through a \\n' +\n", + " 'financial institution. 
Digital signatures provide part of the solution, but the main \\n' +\n", + " 'benefits are lost if a trusted third party is still required to prevent double-spending. \\n' +\n", + " 'We propose a solution to the double-spending problem using a peer-to-peer network. \\n' +\n", + " 'The network timestamps transactions by hashing them into an ongoing chain of \\n' +\n", + " 'hash-based proof-of-work, forming a record that cannot be changed without redoing \\n' +\n", + " 'the proof-of-work. The longest chain not only serves as proof of the sequence of \\n' +\n", + " 'events witnessed, but proof that it came from the largest pool of CPU power. As \\n' +\n", + " 'long as a majority of CPU power is controlled by nodes that are not cooperating to \\n' +\n", + " \"attack the network, they'll generate the longest chain and outpace attackers. The \\n\" +\n", + " 'network itself requires minimal structure. Messages are broadcast on a best effort \\n' +\n", + " 'basis, and nodes can leave and rejoin the network at will, accepting the longest \\n' +\n", + " 'proof-of-work chain as proof of what happened while they were gone.\\n' +\n", + " '1.Introduction\\n' +\n", + " 'Commerce on the Internet has come to rely almost exclusively on financial institutions serving as \\n' +\n", + " 'trusted third parties to process electronic payments. While the system works well enough for \\n' +\n", + " 'most transactions, it still suffers from the inherent weaknesses of the trust based model. \\n' +\n", + " 'Completely non-reversible transactions are not really possible, since financial institutions cannot \\n' +\n", + " 'avoid mediating disputes. The cost of mediation increases transaction costs, limiting the \\n' +\n", + " 'minimum practical transaction size and cutting off the possibility for small casual transactions, \\n' +\n", + " 'and there is a broader cost in the loss of ability to make non-reversible payments for non-\\n' +\n", + " 'reversible services. 
With the possibility of reversal, the need for trust spreads. Merchants must \\n' +\n", + " 'be wary of their customers, hassling them for more information than they would otherwise need. \\n' +\n", + " 'A certain percentage of fraud is accepted as unavoidable. These costs and payment uncertainties \\n' +\n", + " 'can be avoided in person by using physical currency, but no mechanism exists to make payments \\n' +\n", + " 'over a communications channel without a trusted party.\\n' +\n", + " 'What is needed is an electronic payment system based on cryptographic proof instead of trust, \\n' +\n", + " 'allowing any two willing parties to transact directly with each other without the need for a trusted \\n' +\n", + " 'third party. Transactions that are computationally impractical to reverse would protect sellers \\n' +\n", + " 'from fraud, and routine escrow mechanisms could easily be implemented to protect buyers. In \\n' +\n", + " 'this paper, we propose a solution to the double-spending problem using a peer-to-peer distributed \\n' +\n", + " 'timestamp server to generate computational proof of the chronological order of transactions. The \\n' +\n", + " 'system is secure as long as honest nodes collectively control more CPU power than any \\n' +\n", + " 'cooperating group of attacker nodes.\\n' +\n", + " '1',\n", + " metadata: {\n", + " source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/bitcoin.pdf',\n", + " pdf: {\n", + " version: '1.10.100',\n", + " info: [Object],\n", + " metadata: null,\n", + " totalPages: 9\n", + " },\n", + " loc: { pageNumber: 1 }\n", + " },\n", + " id: undefined\n", + "}\n", + "Document {\n", + " pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System\\n' +\n", + " 'Satoshi Nakamoto\\n' +\n", + " 'satoshin@gmx.com\\n' +\n", + " 'www.bitcoin.org\\n' +\n", + " 'Abstract. 
A purely peer-to-peer version of electronic cash would allow online \\n' +\n", + " 'payments to be sent directly from one party to another without going through a \\n' +\n", + " 'financial institution. Digital signatures provide part of the solution, but the main \\n' +\n", + " 'benefits are lost if a trusted third party is still required to prevent double-spending. \\n' +\n", + " 'We propose a solution to the double-spending problem using a peer-to-peer network. \\n' +\n", + " 'The network timestamps transactions by hashing them into an ongoing chain of \\n' +\n", + " 'hash-based proof-of-work, forming a record that cannot be changed without redoing \\n' +\n", + " 'the proof-of-work. The longest chain not only serves as proof of the sequence of \\n' +\n", + " 'events witnessed, but proof that it came from the largest pool of CPU power. As \\n' +\n", + " 'long as a majority of CPU power is controlled by nodes that are not cooperating to',\n", + " metadata: {\n", + " source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/bitcoin.pdf',\n", + " pdf: {\n", + " version: '1.10.100',\n", + " info: [Object],\n", + " metadata: null,\n", + " totalPages: 9\n", + " },\n", + " loc: { pageNumber: 1, lines: [Object] }\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "import { DirectoryLoader } from \"langchain/document_loaders/fs/directory\";\n", + "import { PDFLoader } from \"@langchain/community/document_loaders/fs/pdf\";\n", + "import { RecursiveCharacterTextSplitter } from \"@langchain/textsplitters\";\n", + "\n", + "const exampleDataPath = \"../../../../../../examples/src/document_loaders/example_data/\";\n", + "\n", + "/* Load all PDFs within the specified directory */\n", + "const directoryLoader = new DirectoryLoader(\n", + " exampleDataPath,\n", + " {\n", + " \".pdf\": (path: string) => new PDFLoader(path),\n", + " }\n", + ");\n", + "\n", + "const directoryDocs = await directoryLoader.load();\n", + 
"\n", + "console.log(directoryDocs[0]);\n", + "\n", + "/* Additional steps : Split text into chunks with any TextSplitter. You can then use it as context or save it to memory afterwards. */\n", + "const textSplitter = new RecursiveCharacterTextSplitter({\n", + " chunkSize: 1000,\n", + " chunkOverlap: 200,\n", + "});\n", + "\n", + "const splitDocs = await textSplitter.splitDocuments(directoryDocs);\n", + "console.log(splitDocs[0]);\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all PDFLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_pdf.PDFLoader.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.mdx b/docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.mdx deleted file mode 100644 index 9e92902d452a..000000000000 --- a/docs/core_docs/docs/integrations/document_loaders/file_loaders/pdf.mdx +++ /dev/null @@ -1,72 +0,0 @@ -# PDF files - -This example goes over how to load data from PDF files. By default, one document will be created for each page in the PDF file, you can change this behavior by setting the `splitPages` option to `false`. 
- -## Setup - -```bash npm2yarn -npm install pdf-parse -``` - -## Usage, one document per page - -```typescript -import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf"; - -const loader = new PDFLoader("src/document_loaders/example_data/example.pdf"); - -const docs = await loader.load(); -``` - -## Usage, one document per file - -```typescript -import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf"; - -const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", { - splitPages: false, -}); - -const docs = await loader.load(); -``` - -## Usage, custom `pdfjs` build - -By default we use the `pdfjs` build bundled with `pdf-parse`, which is compatible with most environments, including Node.js and modern browsers. If you want to use a more recent version of `pdfjs-dist` or if you want to use a custom build of `pdfjs-dist`, you can do so by providing a custom `pdfjs` function that returns a promise that resolves to the `PDFJS` object. - -In the following example we use the "legacy" (see [pdfjs docs](https://github.com/mozilla/pdf.js/wiki/Frequently-Asked-Questions#which-browsersenvironments-are-supported)) build of `pdfjs-dist`, which includes several polyfills not included in the default build. - -```bash npm2yarn -npm install pdfjs-dist -``` - -```typescript -import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf"; - -const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", { - // you may need to add `.then(m => m.default)` to the end of the import - pdfjs: () => import("pdfjs-dist/legacy/build/pdf.js"), -}); -``` - -## Eliminating extra spaces - -PDFs come in many varieties, which makes reading them a challenge. The loader parses individual text elements and joins them together with a space by default, but -if you are seeing excessive spaces, this may not be the desired behavior. 
In that case, you can override the separator with an empty string like this: - -```typescript -import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf"; - -const loader = new PDFLoader("src/document_loaders/example_data/example.pdf", { - parsedItemSeparator: "", -}); - -const docs = await loader.load(); -``` - -## Loading directories - -import CodeBlock from "@theme/CodeBlock"; -import MemoryExample from "@examples/document_loaders/pdf_directory.ts"; - -{MemoryExample} From 866e8e0fcbb4eed6874f5c3c5dc1203b789ca574 Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Fri, 2 Aug 2024 11:52:57 -0700 Subject: [PATCH 3/8] docs[minor]: updated DirectoryLoader docs (#6347) * docs[minor]: updated DirectoryLoader docs * cr --- .../file_loaders/directory.ipynb | 192 ++++++++++++++++++ .../file_loaders/directory.mdx | 42 ---- 2 files changed, 192 insertions(+), 42 deletions(-) create mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.ipynb delete mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.mdx diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.ipynb b/docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.ipynb new file mode 100644 index 000000000000..3d19d94677d2 --- /dev/null +++ b/docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.ipynb @@ -0,0 +1,192 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "sidebar_label: DirectoryLoader\n", + "sidebar_class_name: node-only\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# DirectoryLoader\n", + "\n", + "```{=mdx}\n", + "\n", + ":::tip Compatibility\n", + "\n", + "Only available on Node.js.\n", + "\n", + ":::\n", + "\n", + "```\n", + "\n", + "This notebook provides a quick overview for getting started with `DirectoryLoader` [document loaders](/docs/concepts/#document-loaders). 
For detailed documentation of all `DirectoryLoader` features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_document_loaders_fs_directory.DirectoryLoader.html).\n", + "\n", + "This example goes over how to load data from folders with multiple files. The second argument is a map of file extensions to loader factories. Each file will be passed to the matching loader, and the resulting documents will be concatenated together.\n", + "\n", + "Example folder:\n", + "\n", + "```text\n", + "src/document_loaders/example_data/example/\n", + "├── example.json\n", + "├── example.jsonl\n", + "├── example.txt\n", + "└── example.csv\n", + "```\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Compatibility | Local | PY support | \n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [DirectoryLoader](https://api.js.langchain.com/classes/langchain_document_loaders_fs_directory.DirectoryLoader.html) | [langchain](https://api.js.langchain.com/modules/langchain_document_loaders_fs_directory.html) | Node-only | ✅ | ✅ |\n", + "\n", + "## Setup\n", + "\n", + "To access the `DirectoryLoader` document loader, you'll need to install the `langchain` package.\n", + "\n", + "### Installation\n", + "\n", + "The LangChain DirectoryLoader integration lives in the `langchain` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " langchain\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our loader object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import { DirectoryLoader } from \"langchain/document_loaders/fs/directory\";\n", + "import {\n", + "
JSONLoader,\n", + " JSONLinesLoader,\n", + "} from \"langchain/document_loaders/fs/json\";\n", + "import { TextLoader } from \"langchain/document_loaders/fs/text\";\n", + "import { CSVLoader } from \"@langchain/community/document_loaders/fs/csv\";\n", + "\n", + "const loader = new DirectoryLoader(\n", + " \"../../../../../../examples/src/document_loaders/example_data\",\n", + " {\n", + " \".json\": (path) => new JSONLoader(path, \"/texts\"),\n", + " \".jsonl\": (path) => new JSONLinesLoader(path, \"/html\"),\n", + " \".txt\": (path) => new TextLoader(path),\n", + " \".csv\": (path) => new CSVLoader(path, \"text\"),\n", + " }\n", + ");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: 'Foo\\nBar\\nBaz\\n\\n',\n", + " metadata: {\n", + " source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/example.txt'\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "const docs = await loader.load()\n", + "// disable console.warn calls\n", + "console.warn = () => {}\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " source: '/Users/bracesproul/code/lang-chain-ai/langchainjs/examples/src/document_loaders/example_data/example.txt'\n", + "}\n" + ] + } + ], + "source": [ + "console.log(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all DirectoryLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_document_loaders_fs_directory.DirectoryLoader.html" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.mdx b/docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.mdx deleted file mode 100644 index a0c3f67ad700..000000000000 --- a/docs/core_docs/docs/integrations/document_loaders/file_loaders/directory.mdx +++ /dev/null @@ -1,42 +0,0 @@ ---- -sidebar_position: 1 -hide_table_of_contents: true ---- - -# Folders with multiple files - -This example goes over how to load data from folders with multiple files. The second argument is a map of file extensions to loader factories. Each file will be passed to the matching loader, and the resulting documents will be concatenated together. 
- -Example folder: - -```text -src/document_loaders/example_data/example/ -├── example.json -├── example.jsonl -├── example.txt -└── example.csv -``` - -Example code: - -```typescript -import { DirectoryLoader } from "langchain/document_loaders/fs/directory"; -import { - JSONLoader, - JSONLinesLoader, -} from "langchain/document_loaders/fs/json"; -import { TextLoader } from "langchain/document_loaders/fs/text"; -import { CSVLoader } from "@langchain/community/document_loaders/fs/csv"; - -const loader = new DirectoryLoader( - "src/document_loaders/example_data/example", - { - ".json": (path) => new JSONLoader(path, "/texts"), - ".jsonl": (path) => new JSONLinesLoader(path, "/html"), - ".txt": (path) => new TextLoader(path), - ".csv": (path) => new CSVLoader(path, "text"), - } -); -const docs = await loader.load(); -console.log({ docs }); -``` From 27d1d6fe4a1f1134e69e8af27ef55406b7738e19 Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Fri, 2 Aug 2024 12:22:47 -0700 Subject: [PATCH 4/8] docs[minor]: Update unstructured doc loader (#6344) * docs[minor]: Update unstructured doc loader * cr --- .../file_loaders/unstructured.ipynb | 243 ++++++++++++++++++ .../file_loaders/unstructured.mdx | 32 --- 2 files changed, 243 insertions(+), 32 deletions(-) create mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.ipynb delete mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.mdx diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.ipynb b/docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.ipynb new file mode 100644 index 000000000000..6004fabb0f8a --- /dev/null +++ b/docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.ipynb @@ -0,0 +1,243 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "sidebar_label: Unstructured\n", + 
"sidebar_class_name: node-only\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# UnstructuredLoader\n", + "\n", + "```{=mdx}\n", + "\n", + ":::tip Compatibility\n", + "\n", + "Only available on Node.js.\n", + "\n", + ":::\n", + "\n", + "```\n", + "\n", + "This notebook provides a quick overview for getting started with `UnstructuredLoader` [document loaders](/docs/concepts/#document-loaders). For detailed documentation of all `UnstructuredLoader` features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Compatibility | Local | [PY support](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) | \n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [UnstructuredLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_fs_unstructured.html) | Node-only | ✅ | ✅ |\n", + "\n", + "## Setup\n", + "\n", + "To access `UnstructuredLoader` document loader you'll need to install the `@langchain/community` integration package, and create an Unstructured account and get an API key.\n", + "\n", + "### Local\n", + "\n", + "You can run Unstructured locally in your computer using Docker. To do so, you need to have Docker installed. 
You can find the instructions to install Docker [here](https://docs.docker.com/get-docker/).\n", + "\n", + "```bash\n", + "docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0\n", + "```\n", + "\n", + "### Credentials\n", + "\n", + "Head to [unstructured.io](https://unstructured.io/api-key-hosted) to sign up for Unstructured and generate an API key. Once you've done this, set the `UNSTRUCTURED_API_KEY` environment variable:\n", + "\n", + "```bash\n", + "export UNSTRUCTURED_API_KEY=\"your-api-key\"\n", + "```\n", + "\n", + "### Installation\n", + "\n", + "The LangChain UnstructuredLoader integration lives in the `@langchain/community` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " @langchain/community\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our loader object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import { UnstructuredLoader } from \"@langchain/community/document_loaders/fs/unstructured\"\n", + "\n", + "const loader = new UnstructuredLoader(\"../../../../../../examples/src/document_loaders/example_data/notion.md\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: '# Testing the notion markdownloader',\n", + " metadata: {\n", + " filename: 'notion.md',\n", + " languages: [ 'eng' ],\n", + " filetype: 'text/plain',\n", + " category: 'NarrativeText'\n", + " },\n", + " id: 
undefined\n", + "}\n" + ] + } + ], + "source": [ + "const docs = await loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " filename: 'notion.md',\n", + " languages: [ 'eng' ],\n", + " filetype: 'text/plain',\n", + " category: 'NarrativeText'\n", + "}\n" + ] + } + ], + "source": [ + "console.log(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Directories\n", + "\n", + "You can also load all of the files in the directory using [`UnstructuredDirectoryLoader`](https://v02.api.js.langchain.com/classes/langchain_document_loaders_fs_unstructured.UnstructuredDirectoryLoader.html), which inherits from [`DirectoryLoader`](/docs/integrations/document_loaders/file_loaders/directory):\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt\n", + "Unknown file type: test.mp3\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "directoryDocs.length: 247\n", + "Document {\n", + " pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System',\n", + " metadata: {\n", + " filetype: 'application/pdf',\n", + " languages: [ 'eng' ],\n", + " page_number: 1,\n", + " filename: 'bitcoin.pdf',\n", + " category: 'Title'\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "import { UnstructuredDirectoryLoader } from \"@langchain/community/document_loaders/fs/unstructured\";\n", + "\n", + "const directoryLoader = new UnstructuredDirectoryLoader(\n", + " \"../../../../../../examples/src/document_loaders/example_data/\",\n", + " {}\n", + ");\n", + "const directoryDocs = await directoryLoader.load();\n", + "console.log(\"directoryDocs.length: \", directoryDocs.length);\n", + 
"console.log(directoryDocs[0])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all UnstructuredLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.mdx b/docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.mdx deleted file mode 100644 index 7c82029f16de..000000000000 --- a/docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.mdx +++ /dev/null @@ -1,32 +0,0 @@ ---- -hide_table_of_contents: true ---- - -# Unstructured - -This example covers how to use [Unstructured.io](https://unstructured.io/) to load files of many types. Unstructured currently supports loading of text files, powerpoints, html, pdfs, images, and more. - -## Setup - -You can run Unstructured locally in your computer using Docker. To do so, you need to have Docker installed. You can find the instructions to install Docker [here](https://docs.docker.com/get-docker/). - -```bash -docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0 -``` - -## Usage - -Once Unstructured is running, you can use it to load files from your computer. You can use the following code to load a file from your computer. 
- -import CodeBlock from "@theme/CodeBlock"; -import Example from "@examples/document_loaders/unstructured.ts"; - -{Example} - -## Directories - -You can also load all of the files in the directory using [`UnstructuredDirectoryLoader`](https://v02.api.js.langchain.com/classes/langchain_document_loaders_fs_unstructured.UnstructuredDirectoryLoader.html), which inherits from [`DirectoryLoader`](/docs/integrations/document_loaders/file_loaders/directory): - -import DirectoryExample from "@examples/document_loaders/unstructured_directory.ts"; - -{DirectoryExample} From 4ffa2121e7cdaf87589bd0b224f0b67f836aa1dc Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Fri, 2 Aug 2024 12:22:53 -0700 Subject: [PATCH 5/8] docs[minor]: updated TextLoader doc (#6343) * docs[minor]: updated TextLoader doc * fix url * cr --- .../document_loaders/file_loaders/text.ipynb | 164 ++++++++++++++++++ .../document_loaders/file_loaders/text.mdx | 15 -- 2 files changed, 164 insertions(+), 15 deletions(-) create mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/text.ipynb delete mode 100644 docs/core_docs/docs/integrations/document_loaders/file_loaders/text.mdx diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/text.ipynb b/docs/core_docs/docs/integrations/document_loaders/file_loaders/text.ipynb new file mode 100644 index 000000000000..bf6c6de8d823 --- /dev/null +++ b/docs/core_docs/docs/integrations/document_loaders/file_loaders/text.ipynb @@ -0,0 +1,164 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "sidebar_label: TextLoader\n", + "sidebar_class_name: node-only\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# TextLoader\n", + "\n", + "```{=mdx}\n", + "\n", + ":::tip Compatibility\n", + "\n", + "Only available on Node.js.\n", + "\n", + ":::\n", + "\n", + "```\n", + "\n", + "This notebook provides a quick overview for getting started with `TextLoader` [document 
loaders](/docs/concepts/#document-loaders). For detailed documentation of all `TextLoader` features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_document_loaders_fs_text.TextLoader.html).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Compatibility | Local | PY support | \n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [TextLoader](https://api.js.langchain.com/classes/langchain_document_loaders_fs_text.TextLoader.html) | [langchain](https://api.js.langchain.com/modules/langchain_document_loaders_fs_text.html) | Node-only | ✅ | ❌ |\n", + "\n", + "## Setup\n", + "\n", + "To access the `TextLoader` document loader, you'll need to install the `langchain` package.\n", + "\n", + "### Installation\n", + "\n", + "The LangChain TextLoader integration lives in the `langchain` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " langchain\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our loader object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "import { TextLoader } from \"langchain/document_loaders/fs/text\"\n", + "\n", + "const loader = new TextLoader(\"../../../../../../examples/src/document_loaders/example_data/example.txt\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document {\n", + " pageContent: 'Foo\\nBar\\nBaz\\n\\n',\n", + " metadata: {\n", + " source: 
'../../../../../../examples/src/document_loaders/example_data/example.txt'\n", + " },\n", + " id: undefined\n", + "}\n" + ] + } + ], + "source": [ + "const docs = await loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " source: '../../../../../../examples/src/document_loaders/example_data/example.txt'\n", + "}\n" + ] + } + ], + "source": [ + "console.log(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all TextLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_document_loaders_fs_text.TextLoader.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/core_docs/docs/integrations/document_loaders/file_loaders/text.mdx b/docs/core_docs/docs/integrations/document_loaders/file_loaders/text.mdx deleted file mode 100644 index d20d7c1942d2..000000000000 --- a/docs/core_docs/docs/integrations/document_loaders/file_loaders/text.mdx +++ /dev/null @@ -1,15 +0,0 @@ ---- -hide_table_of_contents: true ---- - -# Text files - -This example goes over how to load data from text files. 
- -```typescript -import { TextLoader } from "langchain/document_loaders/fs/text"; - -const loader = new TextLoader("src/document_loaders/example_data/example.txt"); - -const docs = await loader.load(); -``` From 3f76d142a07afdda0a2d26fbb0679da99bcf8ae5 Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Fri, 2 Aug 2024 13:47:26 -0700 Subject: [PATCH 6/8] docs[minor]: Updated AWS Knowledge retriever doc (#6352) --- .../retrievers/bedrock-knowledge-bases.ipynb | 273 ++++++++++++++++++ .../retrievers/bedrock-knowledge-bases.mdx | 26 -- 2 files changed, 273 insertions(+), 26 deletions(-) create mode 100644 docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.ipynb delete mode 100644 docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.mdx diff --git a/docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.ipynb b/docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.ipynb new file mode 100644 index 000000000000..fbf57c6eb66a --- /dev/null +++ b/docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.ipynb @@ -0,0 +1,273 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "afaf8039", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "sidebar_label: Knowledge Bases for Amazon Bedrock\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "e49f1e0d", + "metadata": {}, + "source": [ + "# Knowledge Bases for Amazon Bedrock\n", + "\n", + "## Overview\n", + "\n", + "This will help you get started with the [AmazonKnowledgeBaseRetriever](/docs/concepts/#retrievers). 
For detailed documentation of all AmazonKnowledgeBaseRetriever features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_aws.AmazonKnowledgeBaseRetriever.html).\n", + "\n", + "Knowledge Bases for Amazon Bedrock offers fully managed support for end-to-end RAG workflows, provided by Amazon Web Services (AWS).\n", + "It provides an entire ingestion workflow of converting your documents into embeddings (vector) and storing the embeddings in a specialized vector database.\n", + "Knowledge Bases for Amazon Bedrock supports popular databases for vector storage, including vector engine for Amazon OpenSearch Serverless, Pinecone, Redis Enterprise Cloud, Amazon Aurora (coming soon), and MongoDB (coming soon).\n", + "\n", + "### Integration details\n", + "\n", + "| Retriever | Self-host | Cloud offering | Package | [Py support](https://python.langchain.com/docs/integrations/retrievers/bedrock/) |\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [AmazonKnowledgeBaseRetriever](https://api.js.langchain.com/classes/langchain_aws.AmazonKnowledgeBaseRetriever.html) | 🟠 (see details below) | ✅ | @langchain/aws | ✅ |\n", + "\n", + "> AWS Knowledge Base Retriever can be 'self hosted' in the sense that you can run it on your own AWS infrastructure. However, it is not possible to run it on another cloud provider or on-premises.\n", + "\n", + "## Setup\n", + "\n", + "In order to use the AmazonKnowledgeBaseRetriever, you need to have an AWS account, where you can manage your indexes and documents. 
Once you've set up your account, set the following environment variables:\n", + "\n", + "```bash\n", + "export AWS_KNOWLEDGE_BASE_ID=your-knowledge-base-id\n", + "export AWS_ACCESS_KEY_ID=your-access-key-id\n", + "export AWS_SECRET_ACCESS_KEY=your-secret-access-key\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "72ee0c4b-9764-423a-9dbf-95129e185210", + "metadata": {}, + "source": [ + "If you want to get automated tracing from individual queries, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a15d341e-3e26-4ca3-830b-5aab30ed66de", + "metadata": {}, + "outputs": [], + "source": [ + "// process.env.LANGSMITH_API_KEY = \"\";\n", + "// process.env.LANGSMITH_TRACING = \"true\";" + ] + }, + { + "cell_type": "markdown", + "id": "0730d6a1-c893-4840-9817-5e5251676d5d", + "metadata": {}, + "source": [ + "### Installation\n", + "\n", + "This retriever lives in the `@langchain/aws` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " @langchain/aws\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "a38cde65-254d-4219-a441-068766c0d4b5", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our retriever:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "70cc8e65-2a02-408a-bbc6-8ef649057d82", + "metadata": {}, + "outputs": [], + "source": [ + "import { AmazonKnowledgeBaseRetriever } from \"@langchain/aws\";\n", + "\n", + "const retriever = new AmazonKnowledgeBaseRetriever({\n", + " topK: 10,\n", + " knowledgeBaseId: process.env.AWS_KNOWLEDGE_BASE_ID,\n", + " region: \"us-east-2\",\n", + " clientOptions: {\n", + " credentials: {\n", + " accessKeyId: process.env.AWS_ACCESS_KEY_ID,\n", + "
secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,\n", + " },\n", + " },\n", + "});" + ] + }, + { + "cell_type": "markdown", + "id": "5c5f2839-4020-424e-9fc9-07777eede442", + "metadata": {}, + "source": [ + "## Usage" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51a60dbe-9f2e-4e04-bb62-23968f17164a", + "metadata": {}, + "outputs": [], + "source": [ + "const query = \"...\"\n", + "\n", + "await retriever.invoke(query);" + ] + }, + { + "cell_type": "markdown", + "id": "dfe8aad4-8626-4330-98a9-7ea1ca5d2e0e", + "metadata": {}, + "source": [ + "## Use within a chain\n", + "\n", + "Like other retrievers, AmazonKnowledgeBaseRetriever can be incorporated into LLM applications via [chains](/docs/how_to/sequence/).\n", + "\n", + "We will need a LLM or chat model:\n", + "\n", + "```{=mdx}\n", + "import ChatModelTabs from \"@theme/ChatModelTabs\";\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25b647a3-f8f2-4541-a289-7a241e43f9df", + "metadata": {}, + "outputs": [], + "source": [ + "// @ls-docs-hide-cell\n", + "\n", + "import { ChatOpenAI } from \"@langchain/openai\";\n", + "\n", + "const llm = new ChatOpenAI({\n", + " model: \"gpt-4o-mini\",\n", + " temperature: 0,\n", + "});" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23e11cc9-abd6-4855-a7eb-799f45ca01ae", + "metadata": {}, + "outputs": [], + "source": [ + "import { ChatPromptTemplate } from \"@langchain/core/prompts\";\n", + "import { RunnablePassthrough, RunnableSequence } from \"@langchain/core/runnables\";\n", + "import { StringOutputParser } from \"@langchain/core/output_parsers\";\n", + "\n", + "import type { Document } from \"@langchain/core/documents\";\n", + "\n", + "const prompt = ChatPromptTemplate.fromTemplate(`\n", + "Answer the question based only on the context provided.\n", + "\n", + "Context: {context}\n", + "\n", + "Question: {question}`);\n", + "\n", + "const formatDocs = (docs: Document[]) => 
{\n", + " return docs.map((doc) => doc.pageContent).join(\"\\n\\n\");\n", + "}\n", + "\n", + "// See https://js.langchain.com/v0.2/docs/tutorials/rag\n", + "const ragChain = RunnableSequence.from([\n", + " {\n", + " context: retriever.pipe(formatDocs),\n", + " question: new RunnablePassthrough(),\n", + " },\n", + " prompt,\n", + " llm,\n", + " new StringOutputParser(),\n", + "]);" + ] + }, + { + "cell_type": "markdown", + "id": "22b1d6f8", + "metadata": {}, + "source": [ + "```{=mdx}\n", + "\n", + ":::tip\n", + "\n", + "See [our RAG tutorial](/docs/tutorials/rag) for more information and examples of `RunnableSequence`s like the one above.\n", + "\n", + ":::\n", + "\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d47c37dd-5c11-416c-a3b6-bec413cd70e8", + "metadata": {}, + "outputs": [], + "source": [ + "await ragChain.invoke(\"...\")" + ] + }, + { + "cell_type": "markdown", + "id": "3a5bb5ca-c3ae-4a58-be67-2cd18574b9a3", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all AmazonKnowledgeBaseRetriever features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_aws.AmazonKnowledgeBaseRetriever.html)." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "typescript", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 + } + \ No newline at end of file diff --git a/docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.mdx b/docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.mdx deleted file mode 100644 index 01cb48c6e1af..000000000000 --- a/docs/core_docs/docs/integrations/retrievers/bedrock-knowledge-bases.mdx +++ /dev/null @@ -1,26 +0,0 @@ ---- -hide_table_of_contents: true ---- - -# Knowledge Bases for Amazon Bedrock - -Knowledge Bases for Amazon Bedrock is a fully managed support for end-to-end RAG workflow provided by Amazon Web Services (AWS). -It provides an entire ingestion workflow of converting your documents into embeddings (vector) and storing the embeddings in a specialized vector database. -Knowledge Bases for Amazon Bedrock supports popular databases for vector storage, including vector engine for Amazon OpenSearch Serverless, Pinecone, Redis Enterprise Cloud, Amazon Aurora (coming soon), and MongoDB (coming soon). 
- -## Setup - -import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx"; - - - -```bash npm2yarn -npm i @langchain/aws -``` - -## Usage - -import CodeBlock from "@theme/CodeBlock"; -import Example from "@examples/retrievers/amazon_knowledge_bases.ts"; - -{Example} From dead673f9f252882a06ea3c05952c9fd5a1c4fc5 Mon Sep 17 00:00:00 2001 From: "marsal.sans" Date: Fri, 2 Aug 2024 22:51:08 +0200 Subject: [PATCH 7/8] google-common[patch]: Add `method` property to GoogleAISafetySetting interface (#6310) * feat: Add method property to GoogleAISafetySetting interface * fix make param optional --- libs/langchain-google-common/src/types.ts | 1 + 1 file changed, 1 insertion(+) diff --git a/libs/langchain-google-common/src/types.ts b/libs/langchain-google-common/src/types.ts index a07c32e67555..305b7a7eda78 100644 --- a/libs/langchain-google-common/src/types.ts +++ b/libs/langchain-google-common/src/types.ts @@ -45,6 +45,7 @@ export interface GoogleConnectionParams export interface GoogleAISafetySetting { category: string; threshold: string; + method?: string; } export type GoogleAIResponseMimeType = "text/plain" | "application/json"; From 3f77b2136f885029ce3a7aa7f16b11cb4a8f1887 Mon Sep 17 00:00:00 2001 From: Jacob Lee Date: Fri, 2 Aug 2024 14:04:18 -0700 Subject: [PATCH 8/8] docs[patch]: Adds text embeddings template (#6348) * Adds text embeddings template * Copy * update cli templatea and add openai * additional cells and drop old doc * chore: lint files * cr --------- Co-authored-by: bracesproul --- .../integrations/text_embedding/openai.ipynb | 418 ++++++++++++++++++ .../integrations/text_embedding/openai.mdx | 77 ---- .../src/cli/docs/embeddings.ts | 186 ++++++++ libs/langchain-scripts/src/cli/docs/index.ts | 10 +- .../cli/docs/templates/text_embedding.ipynb | 228 ++++++++++ 5 files changed, 841 insertions(+), 78 deletions(-) create mode 100644 docs/core_docs/docs/integrations/text_embedding/openai.ipynb delete mode 100644 
docs/core_docs/docs/integrations/text_embedding/openai.mdx create mode 100644 libs/langchain-scripts/src/cli/docs/embeddings.ts create mode 100644 libs/langchain-scripts/src/cli/docs/templates/text_embedding.ipynb diff --git a/docs/core_docs/docs/integrations/text_embedding/openai.ipynb b/docs/core_docs/docs/integrations/text_embedding/openai.ipynb new file mode 100644 index 000000000000..3c820eec3d55 --- /dev/null +++ b/docs/core_docs/docs/integrations/text_embedding/openai.ipynb @@ -0,0 +1,418 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "afaf8039", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "sidebar_label: OpenAI\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "9a3d6f34", + "metadata": {}, + "source": [ + "# OpenAI\n", + "\n", + "This will help you get started with OpenAIEmbeddings [embedding models](/docs/concepts#embedding-models) using LangChain. For detailed documentation on `OpenAIEmbeddings` features and configuration options, please refer to the [API reference](https://api.js.langchain.com/classes/langchain_openai.OpenAIEmbeddings.html).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | [Py support](https://python.langchain.com/docs/integrations/text_embedding/openai/) | Package downloads | Package latest |\n", + "| :--- | :--- | :---: | :---: | :---: | :---: |\n", + "| [OpenAIEmbeddings](https://api.js.langchain.com/classes/langchain_openai.OpenAIEmbeddings.html) | [@langchain/openai](https://api.js.langchain.com/modules/langchain_openai.html) | ❌ | ✅ | ![NPM - Downloads](https://img.shields.io/npm/dm/@langchain/openai?style=flat-square&label=%20&) | ![NPM - Version](https://img.shields.io/npm/v/@langchain/openai?style=flat-square&label=%20&) |\n", + "\n", + "## Setup\n", + "\n", + "To access OpenAIEmbeddings embedding models you'll need to create an OpenAI account, get an API key, and install the `@langchain/openai` integration 
package.\n", + "\n", + "### Credentials\n", + "\n", + "Head to [platform.openai.com](https://platform.openai.com) to sign up to OpenAI and generate an API key. Once you've done this set the `OPENAI_API_KEY` environment variable:\n", + "\n", + "```bash\n", + "export OPENAI_API_KEY=\"your-api-key\"\n", + "```\n", + "\n", + "If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:\n", + "\n", + "```bash\n", + "# export LANGCHAIN_TRACING_V2=\"true\"\n", + "# export LANGCHAIN_API_KEY=\"your-api-key\"\n", + "```\n", + "\n", + "### Installation\n", + "\n", + "The LangChain OpenAIEmbeddings integration lives in the `@langchain/openai` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " @langchain/openai\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "45dd1724", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our model object and embed text:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "9ea7a09b", + "metadata": {}, + "outputs": [], + "source": [ + "import { OpenAIEmbeddings } from \"@langchain/openai\";\n", + "\n", + "const embeddings = new OpenAIEmbeddings({\n", + " apiKey: \"YOUR-API-KEY\", // In Node.js defaults to process.env.OPENAI_API_KEY\n", + " batchSize: 512, // Default value if omitted is 512. Max is 2048\n", + " model: \"text-embedding-3-large\",\n", + "});" + ] + }, + { + "cell_type": "markdown", + "id": "fb4153d3", + "metadata": {}, + "source": [ + "If you're part of an organization, you can set `process.env.OPENAI_ORGANIZATION` to your OpenAI organization id, or pass it in as `organization` when\n", + "initializing the model."
+ ] + }, + { + "cell_type": "markdown", + "id": "77d271b6", + "metadata": {}, + "source": [ + "## Indexing and Retrieval\n", + "\n", + "Embedding models are often used in retrieval-augmented generation (RAG) flows, both as part of indexing data as well as later retrieving it. For more detailed instructions, please see our RAG tutorials under the [working with external knowledge tutorials](/docs/tutorials/#working-with-external-knowledge).\n", + "\n", + "Below, see how to index and retrieve data using the `embeddings` object we initialized above. In this example, we will index and retrieve a sample document using the demo [`MemoryVectorStore`](/docs/integrations/vectorstores/memory)." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d817716b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "LangChain is the framework for building context-aware reasoning applications\n" + ] + } + ], + "source": [ + "// Create a vector store with a sample text\n", + "import { MemoryVectorStore } from \"langchain/vectorstores/memory\";\n", + "\n", + "const text = \"LangChain is the framework for building context-aware reasoning applications\";\n", + "\n", + "const vectorstore = await MemoryVectorStore.fromDocuments(\n", + " [{ pageContent: text, metadata: {} }],\n", + " embeddings,\n", + ");\n", + "\n", + "// Use the vector store as a retriever that returns a single document\n", + "const retriever = vectorstore.asRetriever(1);\n", + "\n", + "// Retrieve the most similar text\n", + "const retrievedDocuments = await retriever.invoke(\"What is LangChain?\");\n", + "\n", + "retrievedDocuments[0].pageContent;" + ] + }, + { + "cell_type": "markdown", + "id": "e02b9855", + "metadata": {}, + "source": [ + "## Direct Usage\n", + "\n", + "Under the hood, the vectorstore and retriever implementations are calling `embeddings.embedDocuments(...)` and `embeddings.embedQuery(...)` to create embeddings for the text(s) used in
`fromDocuments` and the retriever's `invoke` operations, respectively.\n", + "\n", + "You can directly call these methods to get embeddings for your own use cases.\n", + "\n", + "### Embed single texts\n", + "\n", + "You can embed queries for search with `embedQuery`. This generates a vector representation specific to the query:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "0d2befcd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[\n", + " -0.01927683, 0.0037708976, -0.032942563, 0.0037671267, 0.008175306,\n", + " -0.012511838, -0.009713832, 0.021403614, -0.015377721, 0.0018684798,\n", + " 0.020574018, 0.022399133, -0.02322873, -0.01524951, -0.00504169,\n", + " -0.007375876, -0.03448109, 0.00015130726, 0.021388533, -0.012564631,\n", + " -0.020031009, 0.027406884, -0.039217334, 0.03036327, 0.030393435,\n", + " -0.021750538, 0.032610722, -0.021162277, -0.025898525, 0.018869571,\n", + " 0.034179416, -0.013371604, 0.0037652412, -0.02146395, 0.0012641934,\n", + " -0.055688616, 0.05104287, 0.0024982197, -0.019095825, 0.0037369595,\n", + " 0.00088757504, 0.025189597, -0.018779071, 0.024978427, 0.016833287,\n", + " -0.0025868358, -0.011727491, -0.0021154736, -0.017738303, 0.0013839195,\n", + " -0.0131151825, -0.05405959, 0.029729757, -0.003393808, 0.019774588,\n", + " 0.028885076, 0.004355387, 0.026094612, 0.06479911, 0.038040817,\n", + " -0.03478276, -0.012594799, -0.024767255, -0.0031430433, 0.017874055,\n", + " -0.015294761, 0.005709139, 0.025355516, 0.044798266, 0.02549127,\n", + " -0.02524993, 0.00014553308, -0.019427665, -0.023545485, 0.008748483,\n", + " 0.019850006, -0.028417485, -0.001860938, -0.02318348, -0.010799851,\n", + " 0.04793565, -0.0048983963, 0.02193154, -0.026411368, 0.026426451,\n", + " -0.012149832, 0.035355937, -0.047814984, -0.027165547, -0.008228099,\n", + " -0.007737882, 0.023726488, -0.046487626, -0.007783133, -0.019638835,\n", + " 0.01793439, -0.018024892, 
0.0030336871, -0.019578502, 0.0042837397\n", + "]\n" + ] + } + ], + "source": [ + "const singleVector = await embeddings.embedQuery(text);\n", + "\n", + "console.log(singleVector.slice(0, 100));" + ] + }, + { + "cell_type": "markdown", + "id": "1b5a7d03", + "metadata": {}, + "source": [ + "### Embed multiple texts\n", + "\n", + "You can embed multiple texts for indexing with `embedDocuments`. The internals used for this method may (but do not have to) differ from embedding queries:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "2f4d6e97", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[\n", + " -0.01927683, 0.0037708976, -0.032942563, 0.0037671267, 0.008175306,\n", + " -0.012511838, -0.009713832, 0.021403614, -0.015377721, 0.0018684798,\n", + " 0.020574018, 0.022399133, -0.02322873, -0.01524951, -0.00504169,\n", + " -0.007375876, -0.03448109, 0.00015130726, 0.021388533, -0.012564631,\n", + " -0.020031009, 0.027406884, -0.039217334, 0.03036327, 0.030393435,\n", + " -0.021750538, 0.032610722, -0.021162277, -0.025898525, 0.018869571,\n", + " 0.034179416, -0.013371604, 0.0037652412, -0.02146395, 0.0012641934,\n", + " -0.055688616, 0.05104287, 0.0024982197, -0.019095825, 0.0037369595,\n", + " 0.00088757504, 0.025189597, -0.018779071, 0.024978427, 0.016833287,\n", + " -0.0025868358, -0.011727491, -0.0021154736, -0.017738303, 0.0013839195,\n", + " -0.0131151825, -0.05405959, 0.029729757, -0.003393808, 0.019774588,\n", + " 0.028885076, 0.004355387, 0.026094612, 0.06479911, 0.038040817,\n", + " -0.03478276, -0.012594799, -0.024767255, -0.0031430433, 0.017874055,\n", + " -0.015294761, 0.005709139, 0.025355516, 0.044798266, 0.02549127,\n", + " -0.02524993, 0.00014553308, -0.019427665, -0.023545485, 0.008748483,\n", + " 0.019850006, -0.028417485, -0.001860938, -0.02318348, -0.010799851,\n", + " 0.04793565, -0.0048983963, 0.02193154, -0.026411368, 0.026426451,\n", + " -0.012149832, 0.035355937, 
-0.047814984, -0.027165547, -0.008228099,\n", + " -0.007737882, 0.023726488, -0.046487626, -0.007783133, -0.019638835,\n", + " 0.01793439, -0.018024892, 0.0030336871, -0.019578502, 0.0042837397\n", + "]\n", + "[\n", + " -0.010181213, 0.023419594, -0.04215527, -0.0015320902, -0.023573855,\n", + " -0.0091644935, -0.014893179, 0.019016149, -0.023475688, 0.0010219777,\n", + " 0.009255648, 0.03996757, -0.04366983, -0.01640774, -0.020194141,\n", + " 0.019408813, -0.027977299, -0.022017224, 0.013539891, -0.007769135,\n", + " 0.032647192, -0.015089511, -0.022900717, 0.023798235, 0.026084099,\n", + " -0.024625633, 0.035003178, -0.017978394, -0.049615882, 0.013364594,\n", + " 0.031132633, 0.019142363, 0.023195215, -0.038396914, 0.005584942,\n", + " -0.031946007, 0.053682756, -0.0036356465, 0.011240003, 0.0056690844,\n", + " -0.0062791156, 0.044146635, -0.037387207, 0.01300699, 0.018946031,\n", + " 0.0050415234, 0.029618073, -0.021750772, -0.000649473, 0.00026951815,\n", + " -0.014710871, -0.029814405, 0.04204308, -0.014710871, 0.0039616977,\n", + " -0.021512369, 0.054608323, 0.021484323, 0.02790718, -0.010573876,\n", + " -0.023952495, -0.035143413, -0.048802506, -0.0075798146, 0.023279356,\n", + " -0.022690361, -0.016590048, 0.0060477243, 0.014100839, 0.005476258,\n", + " -0.017221114, -0.0100059165, -0.017922299, -0.021989176, 0.01830094,\n", + " 0.05516927, 0.001033372, 0.0017310516, -0.00960624, -0.037864015,\n", + " 0.013063084, 0.006591143, -0.010160177, 0.0011394264, 0.04953174,\n", + " 0.004806626, 0.029421741, -0.037751824, 0.003618117, 0.007162609,\n", + " 0.027696826, -0.0021070621, -0.024485396, -0.0042141243, -0.02801937,\n", + " -0.019605145, 0.016281527, -0.035143413, 0.01640774, 0.042323552\n", + "]\n" + ] + } + ], + "source": [ + "const text2 = \"LangGraph is a library for building stateful, multi-actor applications with LLMs\";\n", + "\n", + "const vectors = await embeddings.embedDocuments([text, text2]);\n", + "\n", + "console.log(vectors[0].slice(0, 
100));\n", + "console.log(vectors[1].slice(0, 100));" + ] + }, + { + "cell_type": "markdown", + "id": "2b1a3527", + "metadata": {}, + "source": [ + "## Specifying dimensions\n", + "\n", + "With the `text-embedding-3` class of models, you can specify the size of the embeddings you want returned. For example by default `text-embedding-3-large` returns embeddings of dimension 3072:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a611fe1a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "3072\n" + ] + } + ], + "source": [ + "import { OpenAIEmbeddings } from \"@langchain/openai\";\n", + "\n", + "const embeddingsDefaultDimensions = new OpenAIEmbeddings({\n", + " model: \"text-embedding-3-large\",\n", + "});\n", + "\n", + "const vectorsDefaultDimensions = await embeddingsDefaultDimensions.embedDocuments([\"some text\"]);\n", + "console.log(vectorsDefaultDimensions[0].length);" + ] + }, + { + "cell_type": "markdown", + "id": "08efe771", + "metadata": {}, + "source": [ + "But by passing in `dimensions: 1024` we can reduce the size of our embeddings to 1024:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "19667fdb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1024\n" + ] + } + ], + "source": [ + "import { OpenAIEmbeddings } from \"@langchain/openai\";\n", + "\n", + "const embeddings1024 = new OpenAIEmbeddings({\n", + " model: \"text-embedding-3-large\",\n", + " dimensions: 1024,\n", + "});\n", + "\n", + "const vectors1024 = await embeddings1024.embedDocuments([\"some text\"]);\n", + "console.log(vectors1024[0].length);" + ] + }, + { + "cell_type": "markdown", + "id": "6b84c0df", + "metadata": {}, + "source": [ + "## Custom URLs\n", + "\n", + "You can customize the base URL the SDK sends requests to by passing a `configuration` parameter like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"3bfa20a6", + "metadata": {}, + "outputs": [], + "source": [ + "import { OpenAIEmbeddings } from \"@langchain/openai\";\n", + "\n", + "const model = new OpenAIEmbeddings({\n", + " configuration: {\n", + " baseURL: \"https://your_custom_url.com\",\n", + " },\n", + "});" + ] + }, + { + "cell_type": "markdown", + "id": "ac3cac9b", + "metadata": {}, + "source": [ + "You can also pass other `ClientOptions` parameters accepted by the official SDK.\n", + "\n", + "If you are hosting on Azure OpenAI, see the [dedicated page instead](/docs/integrations/text_embedding/azure_openai)." + ] + }, + { + "cell_type": "markdown", + "id": "8938e581", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all OpenAIEmbeddings features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_openai.OpenAIEmbeddings.html" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/core_docs/docs/integrations/text_embedding/openai.mdx b/docs/core_docs/docs/integrations/text_embedding/openai.mdx deleted file mode 100644 index bba94b0777ee..000000000000 --- a/docs/core_docs/docs/integrations/text_embedding/openai.mdx +++ /dev/null @@ -1,77 +0,0 @@ ---- -keywords: [openaiembeddings] ---- - -# OpenAI - -The `OpenAIEmbeddings` class uses the OpenAI API to generate embeddings for a given text. By default it strips new line characters from the text, as recommended by OpenAI, but you can disable this by passing `stripNewLines: false` to the constructor. 
- -import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx"; - - - -```bash npm2yarn -npm install @langchain/openai -``` - -```typescript -import { OpenAIEmbeddings } from "@langchain/openai"; - -const embeddings = new OpenAIEmbeddings({ - apiKey: "YOUR-API-KEY", // In Node.js defaults to process.env.OPENAI_API_KEY - batchSize: 512, // Default value if omitted is 512. Max is 2048 - model: "text-embedding-3-large", -}); -``` - -If you're part of an organization, you can set `process.env.OPENAI_ORGANIZATION` to your OpenAI organization id, or pass it in as `organization` when -initializing the model. - -## Specifying dimensions - -With the `text-embedding-3` class of models, you can specify the size of the embeddings you want returned. For example by default `text-embedding-3-large` returns embeddings of dimension 3072: - -```typescript -const embeddings = new OpenAIEmbeddings({ - model: "text-embedding-3-large", -}); - -const vectors = await embeddings.embedDocuments(["some text"]); -console.log(vectors[0].length); -``` - -``` -3072 -``` - -But by passing in `dimensions: 1024` we can reduce the size of our embeddings to 1024: - -```typescript -const embeddings1024 = new OpenAIEmbeddings({ - model: "text-embedding-3-large", - dimensions: 1024, -}); - -const vectors2 = await embeddings1024.embedDocuments(["some text"]); -console.log(vectors2[0].length); -``` - -``` -1024 -``` - -## Custom URLs - -You can customize the base URL the SDK sends requests to by passing a `configuration` parameter like this: - -```typescript -const model = new OpenAIEmbeddings({ - configuration: { - baseURL: "https://your_custom_url.com", - }, -}); -``` - -You can also pass other `ClientOptions` parameters accepted by the official SDK. - -If you are hosting on Azure OpenAI, see the [dedicated page instead](/docs/integrations/text_embedding/azure_openai). 
diff --git a/libs/langchain-scripts/src/cli/docs/embeddings.ts b/libs/langchain-scripts/src/cli/docs/embeddings.ts new file mode 100644 index 000000000000..03e1cc96f99e --- /dev/null +++ b/libs/langchain-scripts/src/cli/docs/embeddings.ts @@ -0,0 +1,186 @@ +import * as path from "node:path"; +import * as fs from "node:fs"; +import { + boldText, + getUserInput, + greenText, + redBackground, +} from "../utils/get-input.js"; + +const PACKAGE_NAME_PLACEHOLDER = "__package_name__"; +const MODULE_NAME_PLACEHOLDER = "__ModuleName__"; +const SIDEBAR_LABEL_PLACEHOLDER = "__sidebar_label__"; +const FULL_IMPORT_PATH_PLACEHOLDER = "__full_import_path__"; +const LOCAL_PLACEHOLDER = "__local__"; +const PY_SUPPORT_PLACEHOLDER = "__py_support__"; +const ENV_VAR_NAME_PLACEHOLDER = "__env_var_name__"; +const API_REF_MODULE_PLACEHOLDER = "__api_ref_module__"; +const API_REF_PACKAGE_PLACEHOLDER = "__api_ref_package__"; +const PYTHON_DOC_URL_PLACEHOLDER = "__python_doc_url__"; + +const TEMPLATE_PATH = path.resolve( + "./src/cli/docs/templates/text_embedding.ipynb" +); +const INTEGRATIONS_DOCS_PATH = path.resolve( + "../../docs/core_docs/docs/integrations/text_embedding" +); + +const fetchAPIRefUrl = async (url: string): Promise<boolean> => { + try { + const res = await fetch(url); + if (res.status !== 200) { + throw new Error(`API Reference URL ${url} not found.`); + } + return true; + } catch (_) { + return false; + } +}; + +type ExtraFields = { + local: boolean; + pySupport: boolean; + packageName: string; + fullImportPath?: string; + envVarName: string; +}; + +async function promptExtraFields(fields: { + envVarGuess: string; + packageNameGuess: string; + isCommunity: boolean; +}): Promise<ExtraFields> { + const { envVarGuess, packageNameGuess, isCommunity } = fields; + const canRunLocally = await getUserInput( + "Does this embeddings model support local usage? (y/n) ", + undefined, + true + ); + const hasPySupport = await getUserInput( + "Does this integration have Python support?
(y/n) ", + undefined, + true + ); + + let packageName = packageNameGuess; + if (!isCommunity) { + // If it's not community, get the package name. + + const isOtherPackageName = await getUserInput( + `Is this integration part of the ${packageNameGuess} package? (y/n) ` + ); + if (isOtherPackageName.toLowerCase() === "n") { + packageName = await getUserInput( + "What is the name of the package this integration is located in? (e.g @langchain/openai) ", + undefined, + true + ); + if ( + !packageName.startsWith("@langchain/") && + !packageName.startsWith("langchain/") + ) { + packageName = await getUserInput( + "Packages must start with either '@langchain/' or 'langchain/'. Please enter a valid package name: ", + undefined, + true + ); + } + } + } + + // If it's community or langchain, ask for the full import path + let fullImportPath: string | undefined; + if ( + packageName.startsWith("@langchain/community") || + packageName.startsWith("langchain/") + ) { + fullImportPath = await getUserInput( + "What is the full import path of the package? (e.g '@langchain/community/embeddings/togetherai') ", + undefined, + true + ); + } + + const envVarName = await getUserInput( + `Is the environment variable for the API key named ${envVarGuess}? If it is, reply with 'y', else reply with the correct name: `, + undefined, + true + ); + + return { + local: canRunLocally.toLowerCase() === "y", + pySupport: hasPySupport.toLowerCase() === "y", + packageName, + fullImportPath, + envVarName: + envVarName.toLowerCase() === "y" ? 
envVarGuess : envVarName.toUpperCase(), + }; +} + +export async function fillEmbeddingsIntegrationDocTemplate(fields: { + packageName: string; + moduleName: string; + isCommunity: boolean; +}) { + const sidebarLabel = fields.moduleName.replace("Embeddings", ""); + const pyDocUrl = `https://python.langchain.com/docs/integrations/text_embedding/${sidebarLabel.toLowerCase()}/`; + let envVarName = `${sidebarLabel.toUpperCase()}_API_KEY`; + const extraFields = await promptExtraFields({ + packageNameGuess: `@langchain/${fields.packageName}`, + envVarGuess: envVarName, + isCommunity: fields.isCommunity, + }); + envVarName = extraFields.envVarName; + const { pySupport } = extraFields; + const localSupport = extraFields.local; + const { packageName } = extraFields; + const fullImportPath = extraFields.fullImportPath ?? extraFields.packageName; + + const apiRefModuleUrl = `https://api.js.langchain.com/classes/${fullImportPath + .replace("@", "") + .replaceAll("/", "_") + .replaceAll("-", "_")}.${fields.moduleName}.html`; + const apiRefPackageUrl = apiRefModuleUrl + .replace("/classes/", "/modules/") + .replace(`.${fields.moduleName}.html`, ".html"); + + const apiRefUrlSuccesses = await Promise.all([ + fetchAPIRefUrl(apiRefModuleUrl), + fetchAPIRefUrl(apiRefPackageUrl), + ]); + if (apiRefUrlSuccesses.find((s) => !s)) { + console.warn( + "API ref URLs invalid. Please manually ensure they are correct." + ); + } + + const docTemplate = (await fs.promises.readFile(TEMPLATE_PATH, "utf-8")) + .replaceAll(PACKAGE_NAME_PLACEHOLDER, packageName) + .replaceAll(MODULE_NAME_PLACEHOLDER, fields.moduleName) + .replaceAll(SIDEBAR_LABEL_PLACEHOLDER, sidebarLabel) + .replaceAll(FULL_IMPORT_PATH_PLACEHOLDER, fullImportPath) + .replaceAll(LOCAL_PLACEHOLDER, localSupport ? "✅" : "❌") + .replaceAll(PY_SUPPORT_PLACEHOLDER, pySupport ? 
"✅" : "❌") + .replaceAll(ENV_VAR_NAME_PLACEHOLDER, envVarName) + .replaceAll(API_REF_MODULE_PLACEHOLDER, apiRefModuleUrl) + .replaceAll(API_REF_PACKAGE_PLACEHOLDER, apiRefPackageUrl) + .replaceAll(PYTHON_DOC_URL_PLACEHOLDER, pyDocUrl); + + const docFileName = fullImportPath.split("/").pop(); + const docPath = path.join(INTEGRATIONS_DOCS_PATH, `${docFileName}.ipynb`); + await fs.promises.writeFile(docPath, docTemplate); + const prettyDocPath = docPath.split("docs/core_docs/")[1]; + + const updatePythonDocUrlText = ` ${redBackground( + "- Update the Python documentation URL with the proper URL." + )}`; + const successText = `\nSuccessfully created new chat model integration doc at ${prettyDocPath}.`; + + console.log( + `${greenText(successText)}\n +${boldText("Next steps:")} +${extraFields?.pySupport ? updatePythonDocUrlText : ""} + - Run all code cells in the generated doc to record the outputs. + - Add extra sections on integration specific features.\n` + ); +} diff --git a/libs/langchain-scripts/src/cli/docs/index.ts b/libs/langchain-scripts/src/cli/docs/index.ts index d664a220a240..1baba8b900bf 100644 --- a/libs/langchain-scripts/src/cli/docs/index.ts +++ b/libs/langchain-scripts/src/cli/docs/index.ts @@ -5,6 +5,7 @@ import { Command } from "commander"; import { fillChatIntegrationDocTemplate } from "./chat.js"; import { fillDocLoaderIntegrationDocTemplate } from "./document_loaders.js"; import { fillLLMIntegrationDocTemplate } from "./llms.js"; +import { fillEmbeddingsIntegrationDocTemplate } from "./embeddings.js"; type CLIInput = { package: string; @@ -57,9 +58,16 @@ async function main() { isCommunity, }); break; + case "embeddings": + await fillEmbeddingsIntegrationDocTemplate({ + packageName, + moduleName, + isCommunity, + }); + break; default: console.error( - `Invalid type: ${type}.\nOnly 'chat', 'llm' and 'doc_loader' are supported at this time.` + `Invalid type: ${type}.\nOnly 'chat', 'llm', 'embeddings' and 'doc_loader' are supported at this time.` ); 
process.exit(1); } diff --git a/libs/langchain-scripts/src/cli/docs/templates/text_embedding.ipynb b/libs/langchain-scripts/src/cli/docs/templates/text_embedding.ipynb new file mode 100644 index 000000000000..48d85f8a3afc --- /dev/null +++ b/libs/langchain-scripts/src/cli/docs/templates/text_embedding.ipynb @@ -0,0 +1,228 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "afaf8039", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "sidebar_label: __sidebar_label__\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "9a3d6f34", + "metadata": {}, + "source": [ + "# __ModuleName__\n", + "\n", + "- [ ] TODO: Make sure API reference link is correct\n", + "\n", + "This will help you get started with __ModuleName__ [embedding models](/docs/concepts#embedding-models) using LangChain. For detailed documentation on `__ModuleName__` features and configuration options, please refer to the [API reference](__api_ref_module__).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "- TODO: Fill in table features.\n", + "- TODO: Remove Py support link if not relevant, otherwise ensure link is correct.\n", + "- TODO: Make sure API reference links are correct.\n", + "\n", + "| Class | Package | Local | [Py support](__python_doc_url__) | Package downloads | Package latest |\n", + "| :--- | :--- | :---: | :---: | :---: | :---: |\n", + "| [__ModuleName__](__api_ref_module__) | [__package_name__](__api_ref_package__) | __local__ | __py_support__ | ![NPM - Downloads](https://img.shields.io/npm/dm/__package_name__?style=flat-square&label=%20&) | ![NPM - Version](https://img.shields.io/npm/v/__package_name__?style=flat-square&label=%20&) |\n", + "\n", + "## Setup\n", + "\n", + "- [ ] TODO: Update with relevant info.\n", + "\n", + "To access __sidebar_label__ embedding models you'll need to create a/an __ModuleName__ account, get an API key, and install the `__package_name__` integration package.\n", + "\n", + "###
Credentials\n", + "\n", + "- TODO: Update with relevant info.\n", + "\n", + "Head to (TODO: link) to sign up to `__sidebar_label__` and generate an API key. Once you've done this set the `__env_var_name__` environment variable:\n", + "\n", + "```bash\n", + "export __env_var_name__=\"your-api-key\"\n", + "```\n", + "\n", + "If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:\n", + "\n", + "```bash\n", + "# export LANGCHAIN_TRACING_V2=\"true\"\n", + "# export LANGCHAIN_API_KEY=\"your-api-key\"\n", + "```\n", + "\n", + "### Installation\n", + "\n", + "The LangChain __ModuleName__ integration lives in the `__package_name__` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " __package_name__\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "45dd1724", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our model object and embed text:\n", + "\n", + "- TODO: Update model instantiation with relevant params." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ea7a09b", + "metadata": {}, + "outputs": [], + "source": [ + "import { __ModuleName__ } from \"__full_import_path__\";\n", + "\n", + "const embeddings = new __ModuleName__({\n", + " model: \"model-name\",\n", + " // ...\n", + "});" + ] + }, + { + "cell_type": "markdown", + "id": "77d271b6", + "metadata": {}, + "source": [ + "## Indexing and Retrieval\n", + "\n", + "Embedding models are often used in retrieval-augmented generation (RAG) flows, both as part of indexing data as well as later retrieving it.
For more detailed instructions, please see our RAG tutorials under the [working with external knowledge tutorials](/docs/tutorials/#working-with-external-knowledge).\n", + "\n", + "Below, see how to index and retrieve data using the `embeddings` object we initialized above. In this example, we will index and retrieve a sample document using the demo [`MemoryVectorStore`](/docs/integrations/vectorstores/memory)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d817716b", + "metadata": {}, + "outputs": [], + "source": [ + "// Create a vector store with a sample text\n", + "import { MemoryVectorStore } from \"langchain/vectorstores/memory\";\n", + "\n", + "const text = \"LangChain is the framework for building context-aware reasoning applications\";\n", + "\n", + "const vectorstore = await MemoryVectorStore.fromDocuments(\n", + " [{ pageContent: text, metadata: {} }],\n", + " embeddings,\n", + ");\n", + "\n", + "// Use the vector store as a retriever that returns a single document\n", + "const retriever = vectorstore.asRetriever(1);\n", + "\n", + "// Retrieve the most similar text\n", + "const retrievedDocuments = await retriever.invoke(\"What is LangChain?\");\n", + "\n", + "retrievedDocuments[0].pageContent;" + ] + }, + { + "cell_type": "markdown", + "id": "e02b9855", + "metadata": {}, + "source": [ + "## Direct Usage\n", + "\n", + "Under the hood, the vectorstore and retriever implementations are calling `embeddings.embedDocuments(...)` and `embeddings.embedQuery(...)` to create embeddings for the text(s) used in `fromDocuments` and the retriever's `invoke` operations, respectively.\n", + "\n", + "You can directly call these methods to get embeddings for your own use cases.\n", + "\n", + "### Embed single texts\n", + "\n", + "You can embed queries for search with `embedQuery`.
This generates a vector representation specific to the query:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d2befcd", + "metadata": {}, + "outputs": [], + "source": [ + "const singleVector = await embeddings.embedQuery(text);\n", + "\n", + "console.log(singleVector.slice(0, 100));" + ] + }, + { + "cell_type": "markdown", + "id": "1b5a7d03", + "metadata": {}, + "source": [ + "### Embed multiple texts\n", + "\n", + "You can embed multiple texts for indexing with `embedDocuments`. The internals used for this method may (but do not have to) differ from embedding queries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f4d6e97", + "metadata": {}, + "outputs": [], + "source": [ + "const text2 = \"LangGraph is a library for building stateful, multi-actor applications with LLMs\";\n", + "\n", + "const vectors = await embeddings.embedDocuments([text, text2]);\n", + "\n", + "console.log(vectors[0].slice(0, 100));\n", + "console.log(vectors[1].slice(0, 100));" + ] + }, + { + "cell_type": "markdown", + "id": "8938e581", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all __ModuleName__ features and configurations head to the API reference: __api_ref_module__" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "typescript", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}