community[patch]: Add pgvector index using HNSW #5564

Merged
merged 10 commits into from
Jun 25, 2024
21 changes: 21 additions & 0 deletions docs/core_docs/docs/integrations/vectorstores/pgvector.mdx
Original file line number Diff line number Diff line change
@@ -74,3 +74,24 @@ before using the constructor.
import ConnectionReuseExample from "@examples/indexes/vector_stores/pgvector_vectorstore/pgvector_pool.ts";

<CodeBlock language="typescript">{ConnectionReuseExample}</CodeBlock>

### Create HNSW Index

By default, the extension performs a sequential scan, which gives 100% recall. To speed up `similaritySearchVectorWithScore`, consider creating an HNSW index for approximate nearest neighbor (ANN) search, which trades some recall for speed. To create the HNSW index on your vector column, use the `createHnswIndex()` method.

The method parameters include:

**dimensions**: Defines the number of dimensions in your vector data type, up to 2000. For example, use 1536 for OpenAI's `text-embedding-ada-002` and Amazon's `amazon.titan-embed-text-v1` models.

**m?**: The max number of connections per layer (16 by default). Index build time improves with smaller values, while higher values can speed up search queries.

**efConstruction?**: The size of the dynamic candidate list for constructing the graph (64 by default). A higher value can potentially improve the index quality at the cost of index build time.

**distanceFunction?**: The name of the distance function to use for the index. It is selected automatically based on the `distanceStrategy`.

More info is available at the [pgvector GitHub project](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw) and in the HNSW paper by Yu. A. Malkov and D. A. Yashunin (2020), [Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs](https://arxiv.org/pdf/1603.09320).

import HnswExample from "@examples/indexes/vector_stores/pgvector_vectorstore/pgvector_hnsw.ts";

<CodeBlock language="typescript">{HnswExample}</CodeBlock>
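
To make the parameter mapping concrete, here is a minimal sketch of how the options above could translate into a `CREATE INDEX` statement. It mirrors the query built in this PR, but `buildHnswIndexSql`, `HnswIndexOptions`, and the local `DistanceStrategy` type are hypothetical names used only for illustration; they are not part of `PGVectorStore`.

```typescript
// Illustrative stand-in for the store's distance strategies (assumption:
// the three values listed in the example config of this PR).
type DistanceStrategy = "cosine" | "innerProduct" | "euclidean";

// pgvector operator class used for each distance strategy.
const OPERATOR_CLASS: Record<DistanceStrategy, string> = {
  cosine: "vector_cosine_ops",
  innerProduct: "vector_ip_ops",
  euclidean: "vector_l2_ops",
};

// Hypothetical options bag; field names are chosen for this sketch only.
interface HnswIndexOptions {
  tableName: string;
  vectorColumnName: string;
  dimensions: number;
  distanceStrategy: DistanceStrategy;
  m?: number;
  efConstruction?: number;
}

// Builds the SQL text only; nothing is sent to a database here.
function buildHnswIndexSql(opts: HnswIndexOptions): string {
  const m = opts.m ?? 16; // default number of connections per layer
  const efConstruction = opts.efConstruction ?? 64; // default candidate list size
  const opClass = OPERATOR_CLASS[opts.distanceStrategy];
  return [
    `CREATE INDEX IF NOT EXISTS ${opts.vectorColumnName}_embedding_hnsw_idx`,
    `ON ${opts.tableName} USING hnsw`,
    `((${opts.vectorColumnName}::vector(${opts.dimensions})) ${opClass})`,
    `WITH (m=${m}, ef_construction=${efConstruction});`,
  ].join(" ");
}

console.log(
  buildHnswIndexSql({
    tableName: "testlangchain",
    vectorColumnName: "vector",
    dimensions: 1536,
    distanceStrategy: "cosine",
  })
);
```

Because the identifiers are interpolated directly into the statement, a helper like this should only ever receive table and column names from trusted configuration.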

Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
import { OpenAIEmbeddings } from "@langchain/openai";
import {
DistanceStrategy,
PGVectorStore,
} from "@langchain/community/vectorstores/pgvector";
import { PoolConfig } from "pg";

// First, follow set-up instructions at
// https://js.langchain.com/docs/modules/indexes/vector_stores/integrations/pgvector

const config = {
postgresConnectionOptions: {
type: "postgres",
host: "127.0.0.1",
port: 5433,
user: "myuser",
password: "ChangeMe",
database: "api",
} as PoolConfig,
tableName: "testlangchain",
columns: {
idColumnName: "id",
vectorColumnName: "vector",
contentColumnName: "content",
metadataColumnName: "metadata",
},
// supported distance strategies: cosine (default), innerProduct, or euclidean
distanceStrategy: "cosine" as DistanceStrategy,
};

const pgvectorStore = await PGVectorStore.initialize(
new OpenAIEmbeddings(),
config
);

// create the index
await pgvectorStore.createHnswIndex({
dimensions: 1536,
efConstruction: 64,
m: 16,
});

await pgvectorStore.addDocuments([
{ pageContent: "what's this", metadata: { a: 2, b: ["tag1", "tag2"] } },
{ pageContent: "Cat drinks milk", metadata: { a: 1, b: ["tag2"] } },
]);

const model = new OpenAIEmbeddings();
const query = await model.embedQuery("water");
const results = await pgvectorStore.similaritySearchVectorWithScore(query, 1);

console.log(results);

await pgvectorStore.end();
51 changes: 51 additions & 0 deletions libs/langchain-community/src/vectorstores/pgvector.ts
@@ -677,4 +677,55 @@ export class PGVectorStore extends VectorStore {
this.client?.release();
return this.pool.end();
}

/**
* Creates an HNSW index on the vector column.
*
* @param config - Options for the index.
* @param config.dimensions - The number of dimensions in your vector data type, up to 2000. For example, use 1536 for OpenAI's text-embedding-ada-002 and Amazon's amazon.titan-embed-text-v1 models.
* @param config.m - The max number of connections per layer (16 by default). Index build time improves with smaller values, while higher values can speed up search queries.
* @param config.efConstruction - The size of the dynamic candidate list for constructing the graph (64 by default). A higher value can potentially improve the index quality at the cost of index build time.
* @param config.distanceFunction - The name of the distance function to use; it is selected automatically based on the distanceStrategy.
* @returns Promise that resolves once the index creation query has finished. Failures are logged rather than thrown.
*/
async createHnswIndex(config: {
dimensions: number;
m?: number;
efConstruction?: number;
distanceFunction?: string;
}): Promise<void> {
let idxDistanceFunction = config?.distanceFunction || "vector_cosine_ops";

switch (this.distanceStrategy) {
case "cosine":
idxDistanceFunction = "vector_cosine_ops";
break;
case "innerProduct":
idxDistanceFunction = "vector_ip_ops";
break;
case "euclidean":
idxDistanceFunction = "vector_l2_ops";
break;
default:
throw new Error(`Unknown distance strategy: ${this.distanceStrategy}`);
}

const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${
this.vectorColumnName
}_embedding_hnsw_idx
ON ${this.computedTableName} USING hnsw ((${
this.vectorColumnName
}::vector(${config.dimensions})) ${idxDistanceFunction})
WITH (
m=${config?.m || 16},
ef_construction=${config?.efConstruction || 64}
);`;

try {
await this.pool.query(createIndexQuery);
} catch (e) {
console.error(
`Failed to create HNSW index on table ${this.computedTableName}, error: ${e}`
);
}
}
}
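
Since `createHnswIndex` interpolates `m` and `efConstruction` into the SQL as given and only logs a failure, a caller may want to check the values up front. The sketch below is one way to do that. `validateHnswParams` and `HnswParams` are hypothetical names, and the numeric limits are assumptions taken from pgvector's HNSW documentation as I understand it (`m` between 2 and 100, `ef_construction` between 4 and 1000 and at least `2 * m`), plus the 2000-dimension limit stated in the doc comment above. Verify them against the pgvector version you run.

```typescript
// Hypothetical pre-flight check; not part of PGVectorStore.
interface HnswParams {
  dimensions: number;
  m?: number;
  efConstruction?: number;
}

// Assumed limits (see the note above); adjust for your pgvector version.
const HNSW_LIMITS = {
  maxDimensions: 2000,
  m: { min: 2, max: 100 },
  efConstruction: { min: 4, max: 1000 },
};

// Returns a list of human-readable problems; an empty list means the
// parameters passed every check in this sketch.
function validateHnswParams(params: HnswParams): string[] {
  const errors: string[] = [];
  const m = params.m ?? 16;
  const efConstruction = params.efConstruction ?? 64;

  if (
    !Number.isInteger(params.dimensions) ||
    params.dimensions < 1 ||
    params.dimensions > HNSW_LIMITS.maxDimensions
  ) {
    errors.push(
      `dimensions must be an integer between 1 and ${HNSW_LIMITS.maxDimensions}`
    );
  }
  if (!Number.isInteger(m) || m < HNSW_LIMITS.m.min || m > HNSW_LIMITS.m.max) {
    errors.push(
      `m must be an integer between ${HNSW_LIMITS.m.min} and ${HNSW_LIMITS.m.max}`
    );
  }
  if (
    !Number.isInteger(efConstruction) ||
    efConstruction < HNSW_LIMITS.efConstruction.min ||
    efConstruction > HNSW_LIMITS.efConstruction.max
  ) {
    errors.push(
      `efConstruction must be an integer between ${HNSW_LIMITS.efConstruction.min} and ${HNSW_LIMITS.efConstruction.max}`
    );
  }
  if (efConstruction < 2 * m) {
    errors.push("efConstruction must be at least 2 * m");
  }
  return errors;
}

// Usage: only proceed to index creation when no problems were found.
const problems = validateHnswParams({ dimensions: 1536, m: 16, efConstruction: 64 });
console.log(problems.length === 0 ? "parameters look valid" : problems.join("; "));
```

Returning a list instead of throwing on the first problem lets the caller report every issue at once.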
Original file line number Diff line number Diff line change
@@ -2,21 +2,30 @@ import { expect, test } from "@jest/globals";
import pg, { PoolConfig } from "pg";
import { OpenAIEmbeddings } from "@langchain/openai";
import { PGVectorStore, PGVectorStoreArgs } from "../../pgvector.js";
// import { BedrockEmbeddings } from "../../../embeddings/bedrock.js";

const embeddingsEngine = new OpenAIEmbeddings();

// const embeddingsEngine = new BedrockEmbeddings({
// region: "us-east-1",
// });

const postgresConnectionOptions = {
type: "postgres",
host: "127.0.0.1",
port: 5432,
user: "myuser",
password: "ChangeMe",
database: "api",
} as PoolConfig;

describe("PGVectorStore", () => {
let pgvectorVectorStore: PGVectorStore;
const tableName = "testlangchain";

beforeAll(async () => {
- const config = {
- postgresConnectionOptions: {
- type: "postgres",
- host: "127.0.0.1",
- port: 5432,
- user: "myuser",
- password: "ChangeMe",
- database: "api",
- } as PoolConfig,
+ const config: PGVectorStoreArgs = {
+ postgresConnectionOptions,
tableName: "testlangchain",
// collectionTableName: "langchain_pg_collection",
// collectionName: "langchain",
@@ -29,7 +38,7 @@ describe("PGVectorStore", () => {
};

pgvectorVectorStore = await PGVectorStore.initialize(
- new OpenAIEmbeddings(),
+ embeddingsEngine,
config
);
});
@@ -297,14 +306,7 @@ describe("PGVectorStore with collection", () => {

beforeAll(async () => {
const config = {
- postgresConnectionOptions: {
- type: "postgres",
- host: "127.0.0.1",
- port: 5432,
- user: "myuser",
- password: "ChangeMe",
- database: "api",
- } as PoolConfig,
+ postgresConnectionOptions,
tableName,
collectionTableName,
collectionName: "langchain",
@@ -317,7 +319,7 @@ describe("PGVectorStore with collection", () => {
};

pgvectorVectorStore = await PGVectorStore.initialize(
- new OpenAIEmbeddings(),
+ embeddingsEngine,
config
);
});
@@ -535,13 +537,7 @@ describe("PGVectorStore with schema", () => {
let pool: pg.Pool;

beforeAll(async () => {
- pool = new pg.Pool({
- host: "127.0.0.1",
- port: 5432,
- user: "myuser",
- password: "ChangeMe",
- database: "api",
- });
+ pool = new pg.Pool(postgresConnectionOptions);

const config: PGVectorStoreArgs = {
pool,
@@ -560,7 +556,7 @@ describe("PGVectorStore with schema", () => {
await pool.query(`CREATE SCHEMA IF NOT EXISTS ${schema}`);

pgvectorVectorStore = await PGVectorStore.initialize(
- new OpenAIEmbeddings(),
+ embeddingsEngine,
config
);
computedTableName = pgvectorVectorStore.computedTableName;
@@ -773,3 +769,74 @@ describe("PGVectorStore with schema", () => {
}
});
});

describe("PGVectorStore with HNSW index", () => {
let pgvectorVectorStore: PGVectorStore;
const tableName = "testlangchain";

beforeAll(async () => {
const config: PGVectorStoreArgs = {
postgresConnectionOptions,
tableName: "testlangchain",
columns: {
idColumnName: "id",
vectorColumnName: "vector",
contentColumnName: "content",
metadataColumnName: "metadata",
},
distanceStrategy: "cosine",
};

pgvectorVectorStore = await PGVectorStore.initialize(
embeddingsEngine,
config
);

// Create the index
await pgvectorVectorStore.createHnswIndex({ dimensions: 1536 });
});

afterEach(async () => {
// Drop table, then recreate it for the next test.
await pgvectorVectorStore.pool.query(`DROP TABLE "${tableName}"`);
await pgvectorVectorStore.ensureTableInDatabase();
await pgvectorVectorStore.createHnswIndex({ dimensions: 1536 });
});

afterAll(async () => {
await pgvectorVectorStore.end();
});

test("Ensure table has HNSW index", async () => {
const result = await pgvectorVectorStore.pool.query(
`SELECT indexname, tablename, indexdef FROM pg_indexes where indexname='vector_embedding_hnsw_idx';`
);
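// Assert the row count first so a missing index fails with a clear message
// instead of a TypeError from destructuring an undefined row.
expect(result.rowCount).toBe(1);
const { indexdef } = result.rows[0];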
expect(indexdef.includes("USING hnsw")).toBe(true);
});

test("Test embeddings creation", async () => {
const documents = [
{
pageContent: "hello",
metadata: { a: 1 },
},
{
pageContent: "Cat drinks milk",
metadata: { a: 2 },
},
{ pageContent: "hi", metadata: { a: 1 } },
];
await pgvectorVectorStore.addDocuments(documents);

const query = await embeddingsEngine.embedQuery("milk");
const results = await pgvectorVectorStore.similaritySearchVectorWithScore(
query,
1
);

expect(results).toHaveLength(1);
expect(results[0][0].pageContent).toEqual("Cat drinks milk");
});
});
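
The index test above only checks that `indexdef` contains `USING hnsw`. If a test should also confirm the `m` and `ef_construction` values, the `WITH (...)` clause of the definition can be parsed. The sketch below assumes `pg_indexes.indexdef` ends with a clause such as `WITH (m='16', ef_construction='64')`; the exact quoting may differ between PostgreSQL versions, so the pattern accepts both quoted and unquoted integers. `parseHnswOptions` is a hypothetical helper, and the sample string is illustrative rather than captured output.

```typescript
// Hypothetical helper: extracts integer storage parameters from the
// trailing WITH (...) clause of an index definition string.
function parseHnswOptions(indexdef: string): Record<string, number> {
  const options: Record<string, number> = {};
  const withClause = /WITH\s*\(([^)]*)\)\s*;?\s*$/i.exec(indexdef);
  if (!withClause) {
    return options; // no storage parameters present
  }
  for (const pair of withClause[1].split(",")) {
    // Accepts name=16 as well as name='16'.
    const match = /^\s*(\w+)\s*=\s*'?(\d+)'?\s*$/.exec(pair);
    if (match) {
      options[match[1]] = Number(match[2]);
    }
  }
  return options;
}

// Illustrative definition string (assumed shape, not real query output).
const sampleIndexdef =
  "CREATE INDEX vector_embedding_hnsw_idx ON public.testlangchain " +
  "USING hnsw (((vector)::vector(1536)) vector_cosine_ops) " +
  "WITH (m='16', ef_construction='64')";

console.log(parseHnswOptions(sampleIndexdef));
```

In a Jest test, the parsed values could then be compared with the options that were passed to `createHnswIndex`.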