Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parallel indexing is not giving the same recall percentage as serial indexing #108

Open
msk-apk opened this issue Jul 8, 2024 · 3 comments

Comments

@msk-apk
Copy link

msk-apk commented Jul 8, 2024

I executed the two following test cases to replicate the scale test as documented in pgvectorscale

  1. created a table search_item with a vector column of dimension 768. Created the diskann index with the following command.

CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);

After creating the index, connected 40 connections each inserting 500000 entries in the search_item table with copy command. All copy commands completed in 90 minutes. In this approach, 40 clients are concurrently inserting data and diskann index is updated parallally by 40 processes. Totally search_item has 20 M entries.

Tried querying search_item table with QUERY_RESULT_SIZE as 100. I am able to get the result quickly with in 50 ms but the recall percentage is very low its around 40%

  1. created a table search_item with a vector of dimension 768. Connected 40 clients to insert 20 M data into this table. Now created the index (i.e after updating the data to the table search_item) with the same command.

CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);

This time it took long time to build the index

DEBUG: Writing took 82451.193284036s or 0.0041225596642018s/tuple. Avg neighbors: 50
DEBUG: When pruned for cleanup: avg neighbors before/after 56/0 of 18461082 prunes
WARNING: Indexed 20000000 tuples
DEBUG: EventTriggerInvoke 16788

But when I search with the same QUERY_RESULT_SIZE, i got around 99% recall with 90 ms mean query time.

The question is why the recall is very low in the first test, is it possible to improve the recall percentage in the first scenario?

@msk-apk msk-apk changed the title inline indexing is not giving the same recall percentage as serial indexing parallel indexing is not giving the same recall percentage as serial indexing Jul 8, 2024
@msk-apk
Copy link
Author

msk-apk commented Jul 9, 2024

Any reasons for not getting the expected recall on doing parallel indexing? We cant build the index periodically right?

@cevian
Copy link
Collaborator

cevian commented Jul 16, 2024

@msk-apk I think the issue here is actually with the SBQ compression algorithm. It learns about the distribution of the data when the index is first built. So building the index on the empty table will produce much worse results than building it on a table that has data already in it.

There are many solutions we could try to implement but I think rebuilding the index periodically is the best approach for now.

@msk-apk
Copy link
Author

msk-apk commented Jul 17, 2024

yeah seems that is the case here. changing storage_type as plain, resolves the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants