parallel indexing is not giving the same recall percentage as serial indexing #108

msk-apk · 2024-07-08T08:22:49Z

I ran the following two test cases to replicate the scale test documented for pgvectorscale.

Test 1: I created a table search_item with a vector column of dimension 768, then created the diskann index on the still-empty table with the following command.

CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);

After creating the index, I opened 40 connections, each inserting 500,000 rows into the search_item table with the COPY command. All COPY commands completed within 90 minutes. In this approach, 40 clients insert data concurrently and the diskann index is updated in parallel by 40 processes. In total, search_item has 20 M rows.

I then queried the search_item table with QUERY_RESULT_SIZE set to 100. Results come back quickly, within 50 ms, but recall is very low, around 40%.

Test 2: I created a table search_item with a vector column of dimension 768 and connected 40 clients to insert 20 M rows into it. Only after the data was loaded did I create the index, with the same command.

CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding);

This time the index took a long time to build (about 82,451 s, roughly 23 hours, according to the log below).

DEBUG: Writing took 82451.193284036s or 0.0041225596642018s/tuple. Avg neighbors: 50
DEBUG: When pruned for cleanup: avg neighbors before/after 56/0 of 18461082 prunes
WARNING: Indexed 20000000 tuples
DEBUG: EventTriggerInvoke 16788

But when I searched with the same QUERY_RESULT_SIZE, I got around 99% recall with a 90 ms mean query time.

My question is: why is recall so low in the first test, and is it possible to improve it in that scenario?

msk-apk · 2024-07-09T08:02:21Z

Is there any reason for not getting the expected recall with parallel indexing? We can't rebuild the index periodically, right?

cevian · 2024-07-16T18:22:54Z

@msk-apk I think the issue here is actually with the SBQ compression algorithm. It learns about the distribution of the data when the index is first built. So building the index on the empty table will produce much worse results than building it on a table that has data already in it.

There are many solutions we could try to implement but I think rebuilding the index periodically is the best approach for now.
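
A minimal sketch of such a periodic rebuild, using the index name from this issue. It assumes that the standard PostgreSQL REINDEX command works for diskann indexes, which is not confirmed in this thread. Dropping and recreating the index is shown as a fallback. Either way, the index is rebuilt from the rows currently in the table, so SBQ should see the real data distribution.

```sql
-- Rebuild the existing index from the data now in the table.
-- Assumption: plain REINDEX is supported for diskann indexes.
-- Note: REINDEX without CONCURRENTLY blocks writes to the table while it runs.
REINDEX INDEX document_embedding_idx;

-- Fallback: drop and recreate with the same definition used in this issue.
-- DROP INDEX document_embedding_idx;
-- CREATE INDEX document_embedding_idx ON document_embedding
-- USING diskann (embedding);
```

Given the build time reported above for 20 M rows, the rebuild would probably need to be scheduled in a maintenance window.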

msk-apk · 2024-07-17T04:29:44Z

Yes, that seems to be the case here. Changing storage_type to plain resolves the issue.
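
For reference, a sketch of what that index definition might look like. It assumes the option meant here is the one the pgvectorscale README lists as storage_layout, where plain stores vectors uncompressed instead of using SBQ. The table and index names are the ones from the original post.

```sql
-- Assumption: the option is spelled storage_layout, as in the pgvectorscale README.
-- 'plain' stores vectors uncompressed, so no SBQ training on the data is involved
-- and the index can be created before the table is populated.
CREATE INDEX document_embedding_idx ON document_embedding
USING diskann (embedding) WITH (storage_layout = 'plain');
```

The trade-off is a larger index, since the vectors are no longer compressed.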

msk-apk changed the title ~~inline indexing is not giving the same recall percentage as serial indexing~~ parallel indexing is not giving the same recall percentage as serial indexing Jul 8, 2024
