langchain[patch],community[patch]: add in-memory option for unstructured loader #5581

Closed
46 changes: 36 additions & 10 deletions langchain/src/document_loaders/fs/unstructured.ts
@@ -126,6 +126,11 @@ type UnstructuredDirectoryLoaderOptions = UnstructuredLoaderOptions & {
unknown?: UnknownHandling;
};

+ type UnstructuredMemoryLoaderOptions = {
+ buffer: Buffer;
+ fileName: string;
+ };
+
/**
* @deprecated - Import from "@langchain/community/document_loaders/fs/unstructured" instead. This entrypoint will be removed in 0.3.0.
*
@@ -139,6 +144,10 @@ type UnstructuredDirectoryLoaderOptions = UnstructuredLoaderOptions & {
export class UnstructuredLoader extends BaseDocumentLoader {
public filePath: string;

+ private buffer?: Buffer;
+
+ private fileName?: string;
+
private apiUrl = "https://api.unstructured.io/general/v0/general";

private apiKey?: string;
@@ -175,19 +184,30 @@ export class UnstructuredLoader extends BaseDocumentLoader {
private maxCharacters?: number;

constructor(
- filePathOrLegacyApiUrl: string,
+ filePathOrLegacyApiUrlOrMemoryBuffer:

Review comment (Collaborator): We really need to clean this up 😬

+ | string
+ | UnstructuredMemoryLoaderOptions,
optionsOrLegacyFilePath: UnstructuredLoaderOptions | string = {}
) {
super();

// Temporary shim to avoid breaking existing users
// Remove when API keys are enforced by Unstructured and existing code will break anyway
const isLegacySyntax = typeof optionsOrLegacyFilePath === "string";
- if (isLegacySyntax) {
+ const isMemorySyntax =
+ typeof filePathOrLegacyApiUrlOrMemoryBuffer === "object";
+
+ if (isMemorySyntax) {
+ this.buffer = filePathOrLegacyApiUrlOrMemoryBuffer.buffer;
+ this.fileName = filePathOrLegacyApiUrlOrMemoryBuffer.fileName;
+ } else if (isLegacySyntax) {
this.filePath = optionsOrLegacyFilePath;
- this.apiUrl = filePathOrLegacyApiUrl;
+ this.apiUrl = filePathOrLegacyApiUrlOrMemoryBuffer;
} else {
- this.filePath = filePathOrLegacyApiUrl;
+ this.filePath = filePathOrLegacyApiUrlOrMemoryBuffer;
+ }
+
+ if (!isLegacySyntax) {
const options = optionsOrLegacyFilePath;
this.apiKey = options.apiKey;
this.apiUrl = options.apiUrl ?? this.apiUrl;
@@ -209,14 +229,20 @@ export class UnstructuredLoader extends BaseDocumentLoader {
}

async _partition() {
- const { readFile, basename } = await this.imports();
+ let { buffer } = this;
+ let { fileName } = this;
+
+ if (!buffer) {
+ const { readFile, basename } = await this.imports();

- const buffer = await readFile(this.filePath);
- const fileName = basename(this.filePath);
+ buffer = await readFile(this.filePath);
+ fileName = basename(this.filePath);

+ // I'm aware this reads the file into memory first, but we have lots of work
+ // to do on then consuming Documents in a streaming fashion anyway, so not
+ // worried about this for now.
+ }
+
- // I'm aware this reads the file into memory first, but we have lots of work
- // to do on then consuming Documents in a streaming fashion anyway, so not
- // worried about this for now.
const formData = new FormData();
formData.append("files", new Blob([buffer]), fileName);
formData.append("strategy", this.strategy);
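
Note on the constructor change above: the first argument now selects between three call forms (in-memory object, legacy `(apiUrl, filePath)` strings, or a plain file path). The following is a minimal standalone sketch of that discrimination, not the loader's actual code. The names `MemorySource`, `LoaderOptions`, `ResolvedSource`, and `resolveSource` are hypothetical and exist only for illustration, and `Uint8Array` stands in for `Buffer` so the snippet has no Node-specific type dependency.

```typescript
// Hypothetical types for illustration; they are not part of the library.
type MemorySource = { buffer: Uint8Array; fileName: string };
type LoaderOptions = { apiKey?: string; apiUrl?: string };

type ResolvedSource =
  | { kind: "memory"; buffer: Uint8Array; fileName: string }
  | { kind: "legacy"; filePath: string; apiUrl: string }
  | { kind: "path"; filePath: string };

// Mirrors the order of checks in the diff: memory form first, then legacy form.
function resolveSource(
  first: string | MemorySource,
  second: LoaderOptions | string = {}
): ResolvedSource {
  // An object as the first argument selects the in-memory form.
  if (typeof first === "object") {
    return { kind: "memory", buffer: first.buffer, fileName: first.fileName };
  }
  // A string as the second argument selects the legacy (apiUrl, filePath) form.
  if (typeof second === "string") {
    return { kind: "legacy", filePath: second, apiUrl: first };
  }
  // Otherwise the first argument is a file path and the second is an options object.
  return { kind: "path", filePath: first };
}
```

Returning a tagged union keeps each call form explicit, which may be one way to approach the cleanup the reviewer asks for.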
29 changes: 29 additions & 0 deletions langchain/src/document_loaders/tests/unstructured.int.test.ts
@@ -3,6 +3,7 @@

import * as url from "node:url";
import * as path from "node:path";
+ import { readFile } from "node:fs/promises";
import { test, expect } from "@jest/globals";
import {
UnstructuredDirectoryLoader,
@@ -29,6 +30,34 @@ test.skip("Test Unstructured base loader", async () => {
}

Review comment: Hey team, just a heads up that I've flagged a change in the PR related to accessing an environment variable using process.env. This is for your review to ensure proper handling of environment variables. Keep up the great work!

Reply from @andrewdoro (Contributor Author), May 28, 2024: That's right, I've run the tests locally using a Docker instance of Unstructured.

});

+ test.skip("Test Unstructured base loader with buffer", async () => {
+ const filePath = path.resolve(
+ path.dirname(url.fileURLToPath(import.meta.url)),
+ "./example_data/example.txt"
+ );
+
+ const options = {
+ apiKey: process.env.UNSTRUCTURED_API_KEY!,
+ };
+
+ const buffer = await readFile(filePath);
+ const fileName = "example.txt";
+
+ const loader = new UnstructuredLoader(
+ {
+ buffer,
+ fileName,
+ },
+ options
+ );
+ const docs = await loader.load();
+
+ expect(docs.length).toBe(3);
+ for (const doc of docs) {
+ expect(typeof doc.pageContent).toBe("string");
+ }
+ });
+
test.skip("Test Unstructured base loader with fast strategy", async () => {
const filePath = path.resolve(
path.dirname(url.fileURLToPath(import.meta.url)),
46 changes: 36 additions & 10 deletions libs/langchain-community/src/document_loaders/fs/unstructured.ts
@@ -120,6 +120,11 @@
unknown?: UnknownHandling;
};

+ type UnstructuredMemoryLoaderOptions = {
+ buffer: Buffer;
+ fileName: string;
+ };
+
/**
* A document loader that uses the Unstructured API to load unstructured
* documents. It supports both the new syntax with options object and the
@@ -131,6 +136,10 @@
export class UnstructuredLoader extends BaseDocumentLoader {
public filePath: string;

+ private buffer?: Buffer;
+
+ private fileName?: string;
+
private apiUrl = "https://api.unstructured.io/general/v0/general";

private apiKey?: string;
@@ -167,19 +176,30 @@
private maxCharacters?: number;

constructor(
- filePathOrLegacyApiUrl: string,
+ filePathOrLegacyApiUrlOrMemoryBuffer:
+ | string
+ | UnstructuredMemoryLoaderOptions,
optionsOrLegacyFilePath: UnstructuredLoaderOptions | string = {}
) {
super();

// Temporary shim to avoid breaking existing users
// Remove when API keys are enforced by Unstructured and existing code will break anyway
const isLegacySyntax = typeof optionsOrLegacyFilePath === "string";
- if (isLegacySyntax) {
+ const isMemorySyntax =
+ typeof filePathOrLegacyApiUrlOrMemoryBuffer === "object";
+
+ if (isMemorySyntax) {
+ this.buffer = filePathOrLegacyApiUrlOrMemoryBuffer.buffer;
+ this.fileName = filePathOrLegacyApiUrlOrMemoryBuffer.fileName;
+ } else if (isLegacySyntax) {
this.filePath = optionsOrLegacyFilePath;
- this.apiUrl = filePathOrLegacyApiUrl;
+ this.apiUrl = filePathOrLegacyApiUrlOrMemoryBuffer;
} else {
- this.filePath = filePathOrLegacyApiUrl;
+ this.filePath = filePathOrLegacyApiUrlOrMemoryBuffer;
+ }
+
+ if (!isLegacySyntax) {
const options = optionsOrLegacyFilePath;
this.apiKey =
options.apiKey ?? getEnvironmentVariable("UNSTRUCTURED_API_KEY");
@@ -205,14 +225,20 @@
}

async _partition() {
- const { readFile, basename } = await this.imports();
+ let buffer = this.buffer;

Check failure on line 228 in libs/langchain-community/src/document_loaders/fs/unstructured.ts (GitHub Actions / Check linting): Use object destructuring

+ let fileName = this.fileName;

Check failure on line 229 in libs/langchain-community/src/document_loaders/fs/unstructured.ts (GitHub Actions / Check linting): Use object destructuring

+
+ if (!buffer) {
+ const { readFile, basename } = await this.imports();

- const buffer = await readFile(this.filePath);
- const fileName = basename(this.filePath);
+ buffer = await readFile(this.filePath);
+ fileName = basename(this.filePath);

+ // I'm aware this reads the file into memory first, but we have lots of work
+ // to do on then consuming Documents in a streaming fashion anyway, so not
+ // worried about this for now.
+ }
+
- // I'm aware this reads the file into memory first, but we have lots of work
- // to do on then consuming Documents in a streaming fashion anyway, so not
- // worried about this for now.
const formData = new FormData();
formData.append("files", new Blob([buffer]), fileName);
formData.append("strategy", this.strategy);
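
Note on the `_partition` change above: the method now prefers a caller-supplied buffer and only reads from disk when none was given. Below is a rough sketch of that fallback under simplifying assumptions. The helper `resolveUpload` and the `Upload` type are hypothetical names, the reader is injected and synchronous (the real method awaits `readFile` and `basename` obtained from `this.imports()`), and `Uint8Array` stands in for `Buffer`.

```typescript
// Hypothetical shape of what gets uploaded; not a library type.
type Upload = { buffer: Uint8Array; fileName: string };

// Prefer in-memory contents; fall back to reading the file at filePath.
// The reader is injected so the sketch stays synchronous and easy to test.
function resolveUpload(
  inMemory: Upload | undefined,
  filePath: string,
  readFile: (path: string) => Uint8Array
): Upload {
  if (inMemory) {
    // No file system access is needed when the caller already holds the bytes.
    return inMemory;
  }
  // Take the last path segment, accepting both "/" and "\" separators.
  const fileName = filePath.split(/[\\/]/).pop() || filePath;
  return { buffer: readFile(filePath), fileName };
}
```

Deferring the file system imports to the fallback branch, as the diff does, means the in-memory path never touches `fs` at all, which is presumably the point of the option for environments without a usable file system.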
@@ -3,6 +3,7 @@

import * as url from "node:url";
import * as path from "node:path";
+ import { readFile } from "node:fs/promises";
import { test, expect } from "@jest/globals";
import {
UnstructuredDirectoryLoader,
@@ -29,6 +30,34 @@ test.skip("Test Unstructured base loader", async () => {
}
});

+ test.skip("Test Unstructured base loader with buffer", async () => {
+ const filePath = path.resolve(
+ path.dirname(url.fileURLToPath(import.meta.url)),
+ "./example_data/example.txt"
+ );
+
+ const options = {
+ apiKey: process.env.UNSTRUCTURED_API_KEY!,
+ };
+
+ const buffer = await readFile(filePath);
+ const fileName = "example.txt";
+
+ const loader = new UnstructuredLoader(
+ {
+ buffer,
+ fileName,
+ },
+ options
+ );
+ const docs = await loader.load();
+
+ expect(docs.length).toBe(3);
+ for (const doc of docs) {
+ expect(typeof doc.pageContent).toBe("string");
+ }
+ });
+
test.skip("Test Unstructured base loader with fast strategy", async () => {
const filePath = path.resolve(
path.dirname(url.fileURLToPath(import.meta.url)),