I recently ran into the same issue that was reported in #2467. The documentation and examples suggest chaining `CheerioWebBaseLoader` with `MozillaReadabilityTransformer` to load and transform HTML documents. However, `CheerioWebBaseLoader` uses [Cheerio's `text` method to extract the text](https://github.com/langchain-ai/langchainjs/blob/05e5813715150cd69d9e384924818562e3b7c1fa/libs/langchain-community/src/document_loaders/web/cheerio.ts#L144C34-L144C40) from the HTML document or from the provided selector. That is fine if you only want the text, but `MozillaReadabilityTransformer` needs to operate on HTML. I have added an `HTMLWebBaseLoader` that simply uses `fetch` to retrieve an HTML document and returns the full HTML content. I have also updated the `MozillaReadabilityTransformer` example to use `HTMLWebBaseLoader` instead of `CheerioWebBaseLoader`.
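To illustrate why a text-only loader cannot feed an HTML-aware transformer, here is a minimal, self-contained sketch. The functions `extractText` and `findArticle` are hypothetical stand-ins written for this illustration; they are not the Cheerio or Readability APIs, and the sample markup is invented.

```typescript
// Hypothetical sample page; not taken from the commit.
const html =
  "<html><body><nav>Home | About</nav>" +
  "<article><h1>Title</h1><p>Body text.</p></article></body></html>";

// Crude stand-in for text extraction: drop every tag, collapse whitespace.
function extractText(markup: string): string {
  return markup.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Toy stand-in for an HTML-aware step: return the inner markup of the
// first <article> element, or null when no such element is present.
function findArticle(markup: string): string | null {
  const match = /<article[^>]*>([\s\S]*?)<\/article>/i.exec(markup);
  return match ? match[1] : null;
}

// With the raw HTML, the structure is still there to be found.
const fromHtml = findArticle(html);

// After text extraction, the tags are gone, so nothing can be located.
const textOnly = extractText(html);
const fromText = findArticle(textOnly);

console.log(fromHtml);
console.log(textOnly);
console.log(fromText);
```

In this toy setup the navigation text ends up mixed into the extracted string, and the HTML-aware step has no markup left to work with, which is the same kind of mismatch the description above attributes to the Cheerio loader and the Readability transformer.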
Showing 11 changed files with 129 additions and 29 deletions.
22 changes: 22 additions & 0 deletions
libs/langchain-community/src/document_loaders/tests/html.int.test.ts
@@ -0,0 +1,22 @@
import { expect, test } from "@jest/globals";
import { HTMLWebBaseLoader } from "../web/html.js";

test("Test HTML web scraper loader", async () => {
  const loader = new HTMLWebBaseLoader(
    "https://news.ycombinator.com/item?id=34817881"
  );
  const docs = await loader.load();
  expect(docs[0].pageContent).toEqual(expect.stringContaining("What Lights the Universe’s Standard Candles?"))
});

test("Test HTML web scraper loader with textDecoder", async () => {
  const loader = new HTMLWebBaseLoader(
    "https://corp.163.com/gb/about/management.html",
    {
      textDecoder: new TextDecoder("gbk"),
    }
  );

  const docs = await loader.load();
  expect(docs[0].pageContent.trim()).toEqual(expect.stringContaining("网易"));
});
@@ -0,0 +1,38 @@
import {
  AsyncCaller,
} from "@langchain/core/utils/async_caller";
import { BaseDocumentLoader } from "@langchain/core/document_loaders/base";
import { Document } from "@langchain/core/documents";
import { WebBaseLoaderParams, WebBaseLoader } from "./web_base_loader.js";

export class HTMLWebBaseLoader extends BaseDocumentLoader implements WebBaseLoader {
  timeout: number;

  caller: AsyncCaller;

  textDecoder?: TextDecoder;

  headers?: HeadersInit;

  constructor(public webPath: string, fields?: WebBaseLoaderParams) {
    super();
    const { timeout, textDecoder, headers, ...rest } = fields ?? {};
    this.timeout = timeout ?? 10000;
    this.caller = new AsyncCaller(rest);
    this.textDecoder = textDecoder;
    this.headers = headers;
  }

  async load(): Promise<Document[]> {
    const response = await this.caller.call(fetch, this.webPath, {
      signal: this.timeout ? AbortSignal.timeout(this.timeout) : undefined,
      headers: this.headers,
    });

    const html =
      this.textDecoder?.decode(await response.arrayBuffer()) ??
      (await response.text());

    return [new Document({ pageContent: html })];
  }
}
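The decode-or-fallback step in `load()` can be tried in isolation. The sketch below is an illustration only, not part of the commit: `decodeBody` is a hypothetical helper that works on raw bytes instead of a `Response`, and it uses a UTF-8 `TextDecoder` as a stand-in for `response.text()`, which decodes the body as UTF-8.

```typescript
// Hypothetical helper mirroring the fallback in load(): use the supplied
// decoder when there is one, otherwise decode the bytes as UTF-8.
function decodeBody(bytes: Uint8Array, textDecoder?: TextDecoder): string {
  return textDecoder?.decode(bytes) ?? new TextDecoder("utf-8").decode(bytes);
}

// "hé" encoded as UTF-16LE: h = 0x68 0x00, é = 0xE9 0x00.
const utf16Bytes = new Uint8Array([0x68, 0x00, 0xe9, 0x00]);
const withDecoder = decodeBody(utf16Bytes, new TextDecoder("utf-16le"));

// "hé" encoded as UTF-8: h = 0x68, é = 0xC3 0xA9.
const utf8Bytes = new Uint8Array([0x68, 0xc3, 0xa9]);
const withDefault = decodeBody(utf8Bytes);

console.log(withDecoder, withDefault);
```

The sketch uses UTF-16LE rather than the `gbk` decoder from the integration test because support for legacy encodings such as GBK depends on how the runtime was built.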
36 changes: 36 additions & 0 deletions
libs/langchain-community/src/document_loaders/web/web_base_loader.ts
@@ -0,0 +1,36 @@
import {
  AsyncCaller,
  AsyncCallerParams,
} from "@langchain/core/utils/async_caller";
import type { DocumentLoader } from "@langchain/core/document_loaders/base";

/**
 * Represents the parameters for configuring WebBaseLoaders. It extends the
 * AsyncCallerParams interface and adds additional parameters specific to
 * web-based loaders.
 */
export interface WebBaseLoaderParams extends AsyncCallerParams {
  /**
   * The timeout in milliseconds for the fetch request. Defaults to 10s.
   */
  timeout?: number;

  /**
   * The text decoder to use to decode the response. Defaults to UTF-8.
   */
  textDecoder?: TextDecoder;
  /**
   * The headers to use in the fetch request.
   */
  headers?: HeadersInit;
}

export interface WebBaseLoader extends DocumentLoader {
  timeout: number;

  caller: AsyncCaller;

  textDecoder?: TextDecoder;

  headers?: HeadersInit;
}
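As a rough illustration of how a loader might split these parameters, the sketch below separates the loader-specific fields from the remaining caller options and applies the 10 second default. The names `CallerParamsSketch`, `LoaderParamsSketch`, and `splitLoaderParams` are hypothetical and exist only for this example; `headers` is simplified to a plain record rather than `HeadersInit`.

```typescript
// Hypothetical, trimmed-down stand-in for the caller options.
interface CallerParamsSketch {
  maxConcurrency?: number;
  maxRetries?: number;
}

// Hypothetical stand-in for the loader parameters described above.
interface LoaderParamsSketch extends CallerParamsSketch {
  timeout?: number;
  textDecoder?: TextDecoder;
  headers?: Record<string, string>;
}

// Pull out the loader-specific fields; whatever remains is passed on
// as caller options. A missing timeout falls back to 10000 ms.
function splitLoaderParams(fields?: LoaderParamsSketch) {
  const { timeout, textDecoder, headers, ...callerParams } = fields ?? {};
  return { timeout: timeout ?? 10000, textDecoder, headers, callerParams };
}

const defaults = splitLoaderParams();
const custom = splitLoaderParams({ timeout: 500, maxRetries: 2 });

console.log(defaults.timeout, custom.timeout, custom.callerParams);
```

Because the default is applied with `??`, an explicit `timeout: 0` is preserved rather than replaced; in the `load()` method from this commit a falsy timeout means no abort signal is attached to the request.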