Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community[patch]: Add pgvector index using HNSW #5564

Merged
merged 10 commits into from
Jun 25, 2024

Conversation

jl4nz
Copy link
Contributor

@jl4nz jl4nz commented May 28, 2024

Add support to optionally create a HNSW index on for pgvector

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 28, 2024
Copy link

vercel bot commented May 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchainjs-api-refs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 13, 2024 10:21pm
langchainjs-docs ✅ Ready (Inspect) Visit Preview Jun 13, 2024 10:21pm

@jl4nz jl4nz changed the title Add pgvector hnsw Add pgvector index using HNSW May 28, 2024
@dosubot dosubot bot added auto:documentation Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features labels May 28, 2024
@jl4nz
Copy link
Contributor Author

jl4nz commented May 28, 2024


const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${
this.computedTableName
}_embedding_idx
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this name be differentiated based on the number of dimensions and/or column?

Could maybe see a case where someone has multiple embedding fields in a given table?

Copy link
Contributor Author

@jl4nz jl4nz May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this name be differentiated based on the number of dimensions and/or column?

As far as I can see, there's only support to create one embedding field per table. I'll change the index name to use the vectorColumnName that's computed in the class for consistency.

Could maybe see a case where someone has multiple embedding fields in a given table?

Haven't seen this pattern yet... I guess is possible, but IMO its easier to handle one vector per table + metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works for now

@jacoblee93
Copy link
Collaborator

jacoblee93 commented May 28, 2024

Very cool!

@jacoblee93
Copy link
Collaborator

Hey apologies for the delay - @eyurtsev is on vacation but I'd really like him to take a look. Might be a few more days.

* @returns Promise that resolves with the query response of creating the index.
*/
async createHnswIndex(config?: {
dims?: number;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we require the dimension instead of assigning a default? Users likely use lots of different embedding models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored to be mandatory... I also think that assigning the value is less prone error.

* Method to create the HNSW index on the vector column.
*
* @param dims - Defines the number of dimensions in your vector data, max: 2000. For example, use 1536 for OpenAI's text-embedding-ada-002 model and 1024 for amazon.titan-embed-text-v2:0
* @param m - The max number of connections per layer (16 by default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reference material we could link to that would help someone figure out how to choose values for m?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a reference to the paper in the docs (pgvector.mdx) and added a bit more context in the comments for m and efConstruction parameters

Looks like this:

More info at the Pgvector GitHub project and the HNSW paper from Malkov Yu A. and Yashunin D. A.. 2020. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

}_embedding_idx
ON ${this.computedTableName} USING hnsw ((${
this.vectorColumnName
}::vector(${config?.dims || 1536})) ${idxDistanceFunction})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any utilities to sanitize interpolated values? This code should be only exposed to developers, but if we have existing sanitization code won't hurt to add some more defensive code

Copy link
Contributor Author

@jl4nz jl4nz Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the pg-format lib https://www.npmjs.com/package/pg-format to format the sql literals.
It should provide some guards for this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we actually remove this (for now)? It'll make existing users install a new package.

In the near-ish term, we should factor out into a @langchain/pgvector package that can have this as a hard dep.


const createIndexQuery = `CREATE INDEX IF NOT EXISTS ${
this.computedTableName
}_embedding_idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works for now

@jacoblee93
Copy link
Collaborator

Hey @jl4nz, can we revert that new pg-format package addition? Will break people who are using the current version. We can add a TODO to add it later.

@jl4nz jl4nz force-pushed the pg-hnsw-create-index branch from 9adedd5 to 7bfec42 Compare June 13, 2024 22:08
@jl4nz
Copy link
Contributor Author

jl4nz commented Jun 13, 2024

Hey @jl4nz, can we revert that new pg-format package addition? Will break people who are using the current version. We can add a TODO to add it later.

Got it, done.

@jl4nz
Copy link
Contributor Author

jl4nz commented Jun 24, 2024

Hey @jacoblee93, just pinging to check if this would be good to go.
Thanks!

@jacoblee93
Copy link
Collaborator

Yes, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto:documentation Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder auto:enhancement A large net-new component, integration, or chain. Use sparingly. The largest features lgtm PRs that are ready to be merged as-is size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants