community[patch]: Add pgvector index using HNSW #5564

jl4nz · 2024-05-28T01:57:38Z

Add support for optionally creating an HNSW index for pgvector.

jl4nz · 2024-05-28T01:58:53Z

Adding this for reference: https://python.langchain.com/v0.1/docs/integrations/vectorstores/pgembedding/#create-hnsw-index

jacoblee93 · 2024-05-28T23:25:31Z

libs/langchain-community/src/vectorstores/pgvector.ts

+
+    const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${
+      this.computedTableName
+    }_embedding_idx


Should this name be differentiated based on the number of dimensions and/or column?

Could maybe see a case where someone has multiple embedding fields in a given table?

> Should this name be differentiated based on the number of dimensions and/or column?

As far as I can see, there's only support for creating one embedding field per table. I'll change the index name to use the vectorColumnName that's computed in the class, for consistency.

> Could maybe see a case where someone has multiple embedding fields in a given table?

I haven't seen this pattern yet. I guess it's possible, but IMO it's easier to handle one vector per table plus metadata.

I think this works for now

jacoblee93 · 2024-05-28T23:25:50Z

Very cool!

jacoblee93 · 2024-05-31T22:54:49Z

Hey apologies for the delay - @eyurtsev is on vacation but I'd really like him to take a look. Might be a few more days.

eyurtsev · 2024-06-07T15:07:19Z

libs/langchain-community/src/vectorstores/pgvector.ts

+   * @returns Promise that resolves with the query response of creating the index.
+   */
+  async createHnswIndex(config?: {
+    dims?: number;


Should we require the dimension instead of assigning a default? Users likely use lots of different embedding models?

I refactored it to be mandatory. I also think that assigning the value explicitly is less error-prone.
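To illustrate the point about requiring the dimension, here is a minimal sketch of what a mandatory field plus a range check could look like. The names `HnswIndexConfig` and `assertValidDimensions` are hypothetical and are not taken from the merged code; the 2000 ceiling is the limit quoted in the doc comment reviewed in this PR.

```typescript
// Hypothetical config shape (names are illustrative, not the merged API):
// `dimensions` is required, while the HNSW tuning parameters stay optional.
interface HnswIndexConfig {
  dimensions: number;
  m?: number;
  efConstruction?: number;
}

// Returns `dimensions` unchanged when it is an integer in [1, 2000];
// the 2000 ceiling comes from the doc comment under review in this PR.
function assertValidDimensions(dimensions: number): number {
  if (!Number.isInteger(dimensions) || dimensions < 1 || dimensions > 2000) {
    throw new Error(
      `dimensions must be an integer between 1 and 2000, got ${dimensions}`
    );
  }
  return dimensions;
}

// With a required field, forgetting the dimension is a compile-time error
// rather than a silent fallback to 1536.
const config: HnswIndexConfig = { dimensions: 1024 };
const dims = assertValidDimensions(config.dimensions);
```

Failing fast on a bad value keeps a mismatch between the embedding model and the index from surfacing later as a database error.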

eyurtsev · 2024-06-07T15:07:52Z

libs/langchain-community/src/vectorstores/pgvector.ts

+   * Method to create the HNSW index on the vector column.
+   *
+   * @param dims - Defines the number of dimensions in your vector data, max: 2000. For example, use 1536 for OpenAI's text-embedding-ada-002 model and 1024 for amazon.titan-embed-text-v2:0
+   * @param m - The max number of connections per layer (16 by default)


Any reference material we could link to that would help someone figure out how to choose values for m?

I added a reference to the paper in the docs (pgvector.mdx) and a bit more context in the comments for the m and efConstruction parameters.

It looks like this:

> More info at the pgvector GitHub project and the HNSW paper from Malkov Yu A. and Yashunin D. A. 2020. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs
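As a rough sketch of where `m` and `efConstruction` end up, the statement below uses pgvector's `WITH (m = ..., ef_construction = ...)` index options together with the cast expression shown in this PR's diff. The helper `buildHnswIndexSql`, its option names, and the index name are assumptions made for illustration; the identifiers are interpolated as-is, so the sketch assumes they are already trusted.

```typescript
// A subset of the operator classes that pgvector documents for the `vector` type.
type VectorOps = "vector_l2_ops" | "vector_ip_ops" | "vector_cosine_ops";

interface HnswSqlOptions {
  table: string; // assumed to be a trusted, already-validated identifier
  column: string; // same assumption as `table`
  dimensions: number;
  ops: VectorOps;
  m?: number; // max number of connections per layer
  efConstruction?: number; // candidate list size while building the graph
}

// Builds the statement text only; nothing is sent to a database here.
function buildHnswIndexSql(opts: HnswSqlOptions): string {
  // 16 and 64 are pgvector's documented defaults for m and ef_construction.
  const { table, column, dimensions, ops, m = 16, efConstruction = 64 } = opts;
  return [
    `CREATE INDEX IF NOT EXISTS ${table}_${column}_idx`,
    `ON ${table} USING hnsw ((${column}::vector(${dimensions})) ${ops})`,
    `WITH (m = ${m}, ef_construction = ${efConstruction});`,
  ].join("\n");
}

const sql = buildHnswIndexSql({
  table: "documents",
  column: "embedding",
  dimensions: 1536,
  ops: "vector_cosine_ops",
});
```

According to the pgvector README, a larger `ef_construction` tends to improve recall at the cost of index build time and insert speed, which is the trade-off to keep in mind when moving away from the defaults.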

eyurtsev · 2024-06-07T15:24:22Z

libs/langchain-community/src/vectorstores/pgvector.ts

+    }_embedding_idx
+        ON ${this.computedTableName} USING hnsw ((${
+      this.vectorColumnName
+    }::vector(${config?.dims || 1536})) ${idxDistanceFunction})


Do we have any utilities to sanitize interpolated values? This code should only be exposed to developers, but if we have existing sanitization code, it won't hurt to add some more defensive code.

I've added the pg-format lib (https://www.npmjs.com/package/pg-format) to format the SQL literals.
It should provide some guards for this.

Can we actually remove this (for now)? It'll make existing users install a new package.

In the near-ish term, we should factor this out into a @langchain/pgvector package that can have this as a hard dep.
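For context, one dependency-free way to add a guard is an allowlist check on identifiers before they are interpolated. This is only a sketch under that assumption, not what the PR ships; `assertSafeIdentifier` and `hnswIndexName` are hypothetical names.

```typescript
// Conservative allowlist: letters, digits, and underscores, not starting with a digit.
const SAFE_IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Returns the name unchanged when it matches the allowlist, otherwise throws.
function assertSafeIdentifier(name: string): string {
  if (!SAFE_IDENTIFIER.test(name)) {
    throw new Error(`Unsafe SQL identifier: ${JSON.stringify(name)}`);
  }
  return name;
}

// Hypothetical helper: derives the index name from validated parts only.
function hnswIndexName(table: string, column: string): string {
  return `${assertSafeIdentifier(table)}_${assertSafeIdentifier(column)}_idx`;
}

const indexName = hnswIndexName("documents", "embedding");
```

The allowlist is deliberately strict: it would also reject schema-qualified or quoted names, so a real implementation would need to validate each part separately or quote identifiers properly.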

eyurtsev · 2024-06-07T15:24:27Z

libs/langchain-community/src/vectorstores/pgvector.ts

+
+    const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${
+      this.computedTableName
+    }_embedding_idx


I think this works for now

jacoblee93 · 2024-06-13T21:33:03Z

Hey @jl4nz, can we revert that new pg-format package addition? Will break people who are using the current version. We can add a TODO to add it later.

jl4nz · 2024-06-13T22:18:30Z

> Hey @jl4nz, can we revert that new pg-format package addition? Will break people who are using the current version. We can add a TODO to add it later.

Got it, done.

jl4nz · 2024-06-24T06:16:19Z

Hey @jacoblee93, just pinging to check if this would be good to go.
Thanks!

jacoblee93 · 2024-06-25T17:28:47Z

Yes, thank you!

dosubot bot added the size:L label ("This PR changes 100-499 lines, ignoring generated files.") May 28, 2024

jl4nz changed the title ~~Add pgvector hnsw~~ Add pgvector index using HNSW May 28, 2024

dosubot bot added the auto:documentation ("Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder") and auto:enhancement ("A large net-new component, integration, or chain. Use sparingly. The largest features") labels May 28, 2024

jacoblee93 reviewed May 28, 2024

jacoblee93 changed the title ~~Add pgvector index using HNSW~~ community[patch]: Add pgvector index using HNSW May 28, 2024

eyurtsev reviewed Jun 7, 2024

jl4nz added 9 commits June 14, 2024 06:58

Add pgvector hnsw (f246445)
Update hnsw index name to use column name (1418306)
Fix import typo (2717342)
Fix pgvector index test (ff9d5fe)
Refactor, set dimensions for hnsw mandatory, fix docs (25c0e07)
Add pg-format, refactor create hnsw index with sql identifiers pg-format (a295119)
Revert docs gitignore (3a7e85f)
Revert docs gitignore (fd05ce4)
Revert pg-format (7bfec42)

jl4nz force-pushed the pg-hnsw-create-index branch from 9adedd5 to 7bfec42 Compare June 13, 2024 22:08

Revert gitignore docs (1b65eae)

eyurtsev approved these changes Jun 25, 2024

dosubot bot added the lgtm label ("PRs that are ready to be merged as-is") Jun 25, 2024

jacoblee93 merged commit 43829d6 into langchain-ai:main Jun 25, 2024
3 checks passed
