Adding Starling Index to Knowhere #907

aawang1999 · 2024-10-22T03:39:28Z

My development team is trying to add the Starling index to Knowhere. I understand that the process of adding indices is briefly outlined on the Milvus Deep Dive page (linked here), but I was wondering if more detailed instructions could be provided on how to modify the Knowhere code? Assistance would be greatly appreciated.

liliu-z · 2024-10-22T03:49:21Z

/assign @PwzXxm

liliu-z · 2024-10-22T03:55:13Z

> My development team is trying to add the Starling index to Knowhere. I understand that the process of adding indices is briefly outlined on the Milvus Deep Dive page (linked here), but I was wondering if more detailed instructions could be provided on how to modify the Knowhere code? Assistance would be greatly appreciated.

@PwzXxm is the author of Starling; he can help with this.

PwzXxm · 2024-10-22T06:23:27Z

Hi there, thanks for your interest in contributing to Knowhere. May I ask what the motivation is for adding Starling to Knowhere, so I can assist you better? Are you planning to add it to Milvus as well?

For adding an index to Knowhere alone, you might take a look at this example, which adds SCANN: https://github.com/zilliztech/knowhere/pull/1/files

Another feasible option might be to add parameters to the DiskANN index rather than adding a new index type to Knowhere.

aawang1999 · 2024-10-22T23:58:49Z

Thanks for the information! We will definitely look into those.

Our team was experimenting with different vector indices and found Starling. Since Starling was created by Milvus engineers, we felt it would be appropriate to integrate it into Milvus and run experiments in terms of performance, accuracy, and stability.

Quick follow-up question: Once an index is added to Knowhere, what does the larger process for registering it in Milvus look like? Is there an analogous pull request like this? Thanks!

PwzXxm · 2024-10-23T03:35:11Z

> Our team was experimenting with different vector indices and found Starling. Since Starling was created by Milvus engineers, we felt it would be appropriate to integrate it into Milvus and run experiments in terms of performance, accuracy, and stability.

I was wondering what the use case is. I assume you have already checked out the other in-memory indices and DiskANN?

> Quick follow-up question: Once an index is added to Knowhere, what does the larger process for registering it in Milvus look like? Is there an analogous pull request like this? Thanks!

Registering it in Milvus is not much work. For reference:
milvus-io/milvus#26099
milvus-io/milvus#27268
These PRs are quite old, and the parameter checks are moving into Knowhere, by the way.

gpailetnet · 2024-11-07T19:38:14Z

Hi @PwzXxm, I'm also a part of the team that @aawang1999 is in.

The use case would be a high-performance index at large scale. The goal is to leverage both the speed of Starling, which is much faster than DiskANN due to its optimizations, and its disk-based scalability. Some questions from looking into the different code repositories:

Are there any large differences between the DiskANN implementation in Knowhere and the version in the GitHub repo? I know the latter supports built-in filtering and streaming updates, through https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf and https://arxiv.org/pdf/2105.09613 respectively. I saw code for these features in Starling and was wondering whether they raise any issues for the implementation in Knowhere. To my knowledge, Milvus accumulates points in an open segment and then builds an index once on the sealed segment, but I want to make sure I'm not missing anything.

Are there any other 'environmental' differences I should be concerned about? DiskANN and Starling are set up to operate over the whole database, whereas in Milvus each segment has its own index, so I'd like to know what implementation constraints to meet.

PwzXxm · 2024-11-08T02:40:33Z

> Are there any large differences between the DiskANN implementation in Knowhere and the version in the GitHub repo? I know the latter supports built-in filtering and streaming updates, through https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf and https://arxiv.org/pdf/2105.09613 respectively. I saw code for these features in Starling and was wondering whether they raise any issues for the implementation in Knowhere. To my knowledge, Milvus accumulates points in an open segment and then builds an index once on the sealed segment, but I want to make sure I'm not missing anything.

Filtering is approached differently in Milvus than in Filtered-DiskANN. In Milvus, the filter condition is evaluated before the KNN search, so in Knowhere, DiskANN only sees a bitset stating which elements are valid. As for Fresh-DiskANN, it overlaps with our growing/sealed segment design, so we haven't updated the DiskANN in Knowhere for a while.
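To make the pre-filtering idea concrete, here is a minimal Python sketch. It is not Knowhere code: the function `filtered_knn`, the `valid` list, and the convention that `True` means "passed the filter" are all assumptions chosen for illustration, and a brute-force scan stands in for the graph traversal a real DiskANN search would perform.

```python
import heapq
import math


def filtered_knn(base, query, valid, k):
    """Return the segment offsets of the k nearest rows whose valid flag is set.

    The filter was already evaluated elsewhere; the search only sees `valid`.
    """
    candidates = (
        (math.dist(vec, query), offset)
        for offset, vec in enumerate(base)
        if valid[offset]  # rows rejected by the filter are never considered
    )
    return [offset for _, offset in heapq.nsmallest(k, candidates)]


base = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
valid = [False, True, True, True]  # assumed: row 0 failed the filter condition
result = filtered_knn(base, [0.0, 0.0], valid, k=2)
print(result)  # [1, 2]
```

Row 0 is the closest point to the query, but it never appears in the result because the search has no knowledge of the filter expression itself, only of the flags.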

> Are there any other 'environmental' differences I should be concerned about? DiskANN and Starling are set up to operate over the whole database, whereas in Milvus each segment has its own index, so I'd like to know what implementation constraints to meet.

Indices operate at the segment level, so I think it would be fine. Keep in mind that the offsets/IDs at the segment level need to be preserved, so if you re-lay them out via Starling, mappings are needed to return the correct IDs to Milvus.
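A minimal sketch of that mapping requirement, in Python. The names `relayout`, `internal_to_offset`, and `search` are hypothetical and do not come from Starling or Knowhere; the brute-force ranking is only a stand-in for the real search.

```python
import math


def relayout(vectors, new_order):
    """Store vectors in a new physical order.

    new_order[slot] is the original segment offset placed at internal slot `slot`.
    """
    stored = [vectors[offset] for offset in new_order]
    internal_to_offset = list(new_order)
    return stored, internal_to_offset


def search(stored, internal_to_offset, query, k):
    """Search over internal slots, then translate back to segment offsets."""
    ranked = sorted(range(len(stored)), key=lambda slot: math.dist(stored[slot], query))
    return [internal_to_offset[slot] for slot in ranked[:k]]


vectors = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0], [6.0, 6.0]]
# Assumed layout: nearby points are placed next to each other on disk.
stored, internal_to_offset = relayout(vectors, [0, 2, 1, 3])
ids = search(stored, internal_to_offset, [5.0, 5.0], k=2)
print(ids)  # [1, 3]
```

Without the final translation step, the search would report internal slots 2 and 3, which point at different vectors in the original segment.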

gpailetnet · 2024-11-22T09:29:14Z

Thank you so much for your input! One question about the DiskANN index creation process in Knowhere: I'm somewhat confused about the distinction between Build() and Deserialize(). To my understanding, Build() normally takes a dataset as input, creates a (usually in-memory) index from it, and can serialize that index to disk/object storage, from which it can later be deserialized and loaded into memory. However, since DiskANN is already disk-based, doesn't Build() already accomplish the purpose of deserialization/loading, since it's configured to read the requisite files from disk to create the index anyway? I'm unsure how the loading differs between the two methods. My assumption is that after Build() completes, the index can clear its in-memory structures, whereas Deserialize() must hold onto them for search. Since Starling has the in-memory graph, I assume that would be loaded in Deserialize().

gpailetnet · 2024-11-22T09:36:33Z

> Indices operate at the segment level, so I think it would be fine. Keep in mind that the offsets/IDs at the segment level need to be preserved, so if you re-lay them out via Starling, mappings are needed to return the correct IDs to Milvus.

Are you referring to anything in addition to the id-page and page-id mappings that Starling maintains? Going through the code, it seems that these mappings are loaded along with the graph partition data and then used for page search. Is this sufficient, or is there anything else one needs to do for the Milvus/Knowhere environment?

PwzXxm · 2024-11-25T02:51:43Z

> Thank you so much for your input! One question about the DiskANN index creation process in Knowhere: I'm somewhat confused about the distinction between Build() and Deserialize(). To my understanding, Build() normally takes a dataset as input, creates a (usually in-memory) index from it, and can serialize that index to disk/object storage, from which it can later be deserialized and loaded into memory. However, since DiskANN is already disk-based, doesn't Build() already accomplish the purpose of deserialization/loading, since it's configured to read the requisite files from disk to create the index anyway? I'm unsure how the loading differs between the two methods. My assumption is that after Build() completes, the index can clear its in-memory structures, whereas Deserialize() must hold onto them for search. Since Starling has the in-memory graph, I assume that would be loaded in Deserialize().

Build() and Deserialize() may be triggered on different physical machines: on an IndexNode and a QueryNode, if you are familiar with Milvus terms. The built index files need to be transferred to the other node, which then calls Deserialize() so that the index is ready to serve queries. For DiskANN, only the PQ data and the cache are put in memory; everything else remains on disk.

Yes, the loading of the In-Memory Graph would be in Deserialize().
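The split between the two calls can be sketched with a toy Python class. Everything here is hypothetical (`ToyDiskIndex`, the file names, and the rounded "summary" that stands in for PQ data); it only illustrates that building writes files, while deserializing loads a small in-memory part and leaves the rest on disk. A temporary directory plays the role of the storage through which files move between nodes.

```python
import json
import tempfile
from pathlib import Path


class ToyDiskIndex:
    def __init__(self):
        self.summary = None    # small structure held in memory after loading
        self.data_path = None  # full vectors stay on disk

    @staticmethod
    def build(vectors, index_dir):
        """Build step: only writes index files, keeps nothing in memory."""
        index_dir = Path(index_dir)
        (index_dir / "vectors.json").write_text(json.dumps(vectors))
        summary = [[round(x) for x in vec] for vec in vectors]
        (index_dir / "summary.json").write_text(json.dumps(summary))

    def deserialize(self, index_dir):
        """Load step: reads the small summary, remembers where the rest lives."""
        index_dir = Path(index_dir)
        self.summary = json.loads((index_dir / "summary.json").read_text())
        self.data_path = index_dir / "vectors.json"

    def fetch(self, offset):
        """Read one full vector from disk on demand."""
        return json.loads(self.data_path.read_text())[offset]


with tempfile.TemporaryDirectory() as shared_dir:
    ToyDiskIndex.build([[0.2, 0.9], [3.6, 4.1]], shared_dir)  # "build" side
    serving = ToyDiskIndex()                                   # "query" side
    serving.deserialize(shared_dir)
    in_memory = serving.summary
    from_disk = serving.fetch(1)
```

In this sketch, an in-memory graph such as Starling's would be loaded in `deserialize`, alongside the summary.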

PwzXxm · 2024-11-25T03:12:21Z

> Indices operate at the segment level, so I think it would be fine. Keep in mind that the offsets/IDs at the segment level need to be preserved, so if you re-lay them out via Starling, mappings are needed to return the correct IDs to Milvus.

> Are you referring to anything in addition to the id-page and page-id mappings that Starling maintains? Going through the code, it seems that these mappings are loaded along with the graph partition data and then used for page search. Is this sufficient, or is there anything else one needs to do for the Milvus/Knowhere environment?

Yes, it should have already been taken care of.

github-actions · 2024-12-26T02:02:44Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

sre-ci-robot assigned PwzXxm Oct 22, 2024

github-actions bot added the stale label Dec 26, 2024
