You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I executed the two following test cases to replicate the scale test as documented in pgvectorscale
created a table search_item with a vector column of dimension 768. Created the diskann index with the following command.
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);
After creating the index, connected 40 connections each inserting 500000 entries in the search_item table with copy command. All copy commands completed in 90 minutes. In this approach, 40 clients are concurrently inserting data and diskann index is updated parallally by 40 processes. Totally search_item has 20 M entries.
Tried querying search_item table with QUERY_RESULT_SIZE as 100. I am able to get the result quickly with in 50 ms but the recall percentage is very low its around 40%
created a table search_item with a vector of dimension 768. Connected 40 clients to insert 20 M data into this table. Now created the index (i.e after updating the data to the table search_item) with the same command.
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);
This time it took long time to build the index
DEBUG: Writing took 82451.193284036s or 0.0041225596642018s/tuple. Avg neighbors: 50
DEBUG: When pruned for cleanup: avg neighbors before/after 56/0 of 18461082 prunes
WARNING: Indexed 20000000 tuples
DEBUG: EventTriggerInvoke 16788
But when I search with the same QUERY_RESULT_SIZE, i got around 99% recall with 90 ms mean query time.
The question is why the recall is very low in the first test, is it possible to improve the recall percentage in the first scenario?
The text was updated successfully, but these errors were encountered:
msk-apk
changed the title
inline indexing is not giving the same recall percentage as serial indexing
parallel indexing is not giving the same recall percentage as serial indexing
Jul 8, 2024
@msk-apk I think the issue here is actually with the SBQ compression algorithm. It learns about the distribution of the data when the index is first built. So building the index on the empty table will produce much worse results than building it on a table that has data already in it.
There are many solutions we could try to implement but I think rebuilding the index periodically is the best approach for now.
I executed the two following test cases to replicate the scale test as documented in pgvectorscale
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);
After creating the index, connected 40 connections each inserting 500000 entries in the search_item table with copy command. All copy commands completed in 90 minutes. In this approach, 40 clients are concurrently inserting data and diskann index is updated parallally by 40 processes. Totally search_item has 20 M entries.
Tried querying search_item table with QUERY_RESULT_SIZE as 100. I am able to get the result quickly with in 50 ms but the recall percentage is very low its around 40%
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);
This time it took long time to build the index
DEBUG: Writing took 82451.193284036s or 0.0041225596642018s/tuple. Avg neighbors: 50
DEBUG: When pruned for cleanup: avg neighbors before/after 56/0 of 18461082 prunes
WARNING: Indexed 20000000 tuples
DEBUG: EventTriggerInvoke 16788
But when I search with the same QUERY_RESULT_SIZE, i got around 99% recall with 90 ms mean query time.
The question is why the recall is very low in the first test, is it possible to improve the recall percentage in the first scenario?
The text was updated successfully, but these errors were encountered: