Adding Starling Index to Knowhere #907

Open
aawang1999 opened this issue Oct 22, 2024 · 12 comments

@aawang1999

My development team is trying to add the Starling index to Knowhere. I understand that the process of adding indices is briefly outlined on the Milvus Deep Dive page (linked here), but I was wondering if more detailed instructions could be provided on how to modify the Knowhere code? Assistance would be greatly appreciated.

@liliu-z
Collaborator

liliu-z commented Oct 22, 2024

/assign @PwzXxm

@liliu-z
Collaborator

liliu-z commented Oct 22, 2024

My development team is trying to add the Starling index to Knowhere. I understand that the process of adding indices is briefly outlined on the Milvus Deep Dive page (linked here), but I was wondering if more detailed instructions could be provided on how to modify the Knowhere code? Assistance would be greatly appreciated.

@PwzXxm is the author of Starling; he can help with this.

@PwzXxm
Collaborator

PwzXxm commented Oct 22, 2024

Hi there, thanks for your interest in contributing to Knowhere. May I ask what the motivation is for adding Starling to Knowhere, so I can assist you better? Are you planning to add it to Milvus as well?

To add an index to Knowhere alone, take a look at this example, which adds SCANN: https://github.com/zilliztech/knowhere/pull/1/files

Another feasible approach would be to add parameters to the existing DiskANN index rather than adding a new index type to Knowhere.

@aawang1999
Author

Thanks for the information! We will definitely look into those.

Our team was experimenting with different vector indices and found Starling. Since Starling was created by Milvus engineers, we felt it would be appropriate to integrate it into Milvus and run experiments in terms of performance, accuracy, and stability.

Quick follow-up question: Once an index is added to Knowhere, what does the larger process for registering it in Milvus look like? Is there an analogous pull request like this? Thanks!

@PwzXxm
Collaborator

PwzXxm commented Oct 23, 2024

Our team was experimenting with different vector indices and found Starling. Since Starling was created by Milvus engineers, we felt it would be appropriate to integrate it into Milvus and run experiments in terms of performance, accuracy, and stability.

I was wondering what the use case is. I assume you have already checked out the other in-memory indices or DiskANN?

Quick follow-up question: Once an index is added to Knowhere, what does the larger process for registering it in Milvus look like? Is there an analogous pull request like this? Thanks!

Registering it in Milvus is not much work. See:
milvus-io/milvus#26099
milvus-io/milvus#27268
These PRs are quite old; by the way, the parameter checks are moving into Knowhere.

@gpailetnet

Hi @PwzXxm, I'm also a part of the team that @aawang1999 is in.

The use case would be a high-performance index at large scale. The goal is to leverage both the capabilities of Starling, which is much faster than DiskANN thanks to its optimizations, and its disk-based scalability. Some questions from looking into the different code repositories:

  1. Are there any large differences between the DiskANN implementation on Knowhere versus the Github repo's version of DiskANN? I know the latter has support for in-house filtering as well as streaming support through https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf and https://arxiv.org/pdf/2105.09613 respectively - I saw code for these aspects in Starling and was wondering if there are any issues to concern with the implementation in Knowhere - to my knowledge, Milvus will accumulate points in an open segment and then build an index once on a sealed segment, then closing it, but I want to make sure if there's anything I am missing.
  2. Are there any other 'environmental' differences I should be concerned about between the setups that DiskANN and Starling work in in which they operate over the whole database as opposed to Milvus, in which each segment has its own index for knowing what implementation constraints to meet?

@PwzXxm
Collaborator

PwzXxm commented Nov 8, 2024

  1. Are there any large differences between the DiskANN implementation on Knowhere versus the Github repo's version of DiskANN? I know the latter has support for in-house filtering as well as streaming support through https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf and https://arxiv.org/pdf/2105.09613 respectively - I saw code for these aspects in Starling and was wondering if there are any issues to concern with the implementation in Knowhere - to my knowledge, Milvus will accumulate points in an open segment and then build an index once on a sealed segment, then closing it, but I want to make sure if there's anything I am missing.

Filtering is approached differently in Milvus than in Filtered-DiskANN. In Milvus, the filter condition is evaluated before the KNN search, so in Knowhere, DiskANN only sees a bitset indicating which elements are valid. As for Fresh-DiskANN, it overlaps with our growing/sealed segment design, so we haven't updated the DiskANN in Knowhere for a while.
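The pre-filtering flow described above can be sketched as follows. This is a minimal illustration in plain Python, not Knowhere code: `search_with_bitset`, the `valid` list, and the convention that `True` means "may be returned" are assumptions made for the example, and a brute-force scan stands in for the DiskANN graph search.

```python
import heapq
import math


def search_with_bitset(vectors, query, k, valid):
    """Return the k nearest (distance, offset) pairs among valid rows.

    `valid[i]` is the precomputed filter result for row i (hypothetical
    convention: True = row may be returned). The search never sees the
    filter expression itself, only this mask.
    """
    candidates = []
    for offset, vec in enumerate(vectors):
        if not valid[offset]:
            continue  # filtered out before any distance is computed
        candidates.append((math.dist(vec, query), offset))
    return heapq.nsmallest(k, candidates)


# The filter (here a made-up "price < 10") is evaluated first, outside the index.
prices = [5, 20, 7, 30]
vectors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.0, 0.2)]
valid = [p < 10 for p in prices]

hits = search_with_bitset(vectors, (0.0, 0.0), k=2, valid=valid)
print([offset for _, offset in hits])
```

Rows 1 and 3 are closer to the query than row 2, but they never appear in the result because the mask excludes them before the search runs.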

  2. Are there any other 'environmental' differences I should be concerned about between the setups that DiskANN and Starling work in in which they operate over the whole database as opposed to Milvus, in which each segment has its own index for knowing what implementation constraints to meet?

Indices operate at the segment level, so I think it would be fine. Keep in mind that the offsets/IDs at the segment level need to be preserved, so if you re-lay them out via Starling, mappings are needed to return the correct IDs to Milvus.
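As a rough sketch of the ID-preservation point: if the index stores vectors in a different order than the segment does, it has to translate internal positions back to the original segment offsets before returning results. The code below is illustrative only. The names `build_relayout`, `internal_to_original`, and `search` are hypothetical, the reordering is arbitrary rather than Starling's actual layout algorithm, and brute force replaces the graph search.

```python
import math


def build_relayout(vectors, new_order):
    """Store vectors in `new_order` and remember each row's original offset.

    `new_order[i]` is the original segment offset of the vector now stored
    at internal position i (a stand-in for a locality-aware relayout).
    """
    stored = [vectors[orig] for orig in new_order]
    internal_to_original = list(new_order)
    return stored, internal_to_original


def search(stored, internal_to_original, query, k):
    """Brute-force search over the stored layout, returning original offsets."""
    ranked = sorted(range(len(stored)), key=lambda i: math.dist(stored[i], query))
    # Translate internal positions back to segment offsets before returning.
    return [internal_to_original[i] for i in ranked[:k]]


vectors = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.1), (9.0, 9.0)]
stored, mapping = build_relayout(vectors, new_order=[2, 0, 3, 1])
result = search(stored, mapping, query=(0.0, 0.0), k=2)
print(result)
```

Without the final translation step, the search would return internal positions 1 and 0, which point at different rows in the segment's own ordering.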

@gpailetnet

Thank you so much for your input! One question for the DiskANN index creation process on Knowhere; I'm somewhat confused as to the distinction between the Build() process and the Deserialize() process: to my understanding, Build() normally takes a dataset as input, for which it creates a (usually in-memory) index and can serialize to disk/object storage, which it can then deserialize from to load into memory. However, since DiskANN is already disk-based, doesn't Build() already accomplish the purpose of deserialization/loading, since it's configured to read the requisite files from disk to create the index anyways? I'm a bit confused as to the difference between the loadings between the two methods, though I assume that it's that after Build() completes, the index structure can clear its in-memory structures, whereas Deserialize() implies it must hold onto those in-memory structures for search? Since Starling has the in-memory graph, I assume that would be loaded in Deserialize()

@gpailetnet

Indices operate at the segment level, so I think it would be fine. Keep in mind that the offsets/IDs at the segment level need to be preserved, so if you re-lay them out via Starling, mappings are needed to return the correct IDs to Milvus.

Are you referring to anything in addition to the id-page and page-id mappings that Starling does? Going through the code it seems that when loading the graph partition data these mappings are loaded and then used for page search; is this sufficient, or is there anything in addition one needs to do for the Milvus/Knowhere environment?

@PwzXxm
Collaborator

PwzXxm commented Nov 25, 2024

Thank you so much for your input! One question for the DiskANN index creation process on Knowhere; I'm somewhat confused as to the distinction between the Build() process and the Deserialize() process: to my understanding, Build() normally takes a dataset as input, for which it creates a (usually in-memory) index and can serialize to disk/object storage, which it can then deserialize from to load into memory. However, since DiskANN is already disk-based, doesn't Build() already accomplish the purpose of deserialization/loading, since it's configured to read the requisite files from disk to create the index anyways? I'm a bit confused as to the difference between the loadings between the two methods, though I assume that it's that after Build() completes, the index structure can clear its in-memory structures, whereas Deserialize() implies it must hold onto those in-memory structures for search? Since Starling has the in-memory graph, I assume that would be loaded in Deserialize()

Build() and Deserialize() may be triggered on different physical machines: on the IndexNode and the QueryNode, respectively, in Milvus terms. The built index files need to be transferred to the other node, which then calls Deserialize() so that the index is ready to serve queries. Deserialize() only puts the PQ data and cache in memory; the rest remains on disk.

Yes, the loading of the In-Memory Graph would be in Deserialize().
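The split between the two calls can be sketched as below. This is a toy model under stated assumptions, not the Knowhere API: `build`, `deserialize`, `LoadedIndex`, and the file names are hypothetical, two temporary directories stand in for the two nodes, a local copy stands in for the file transfer, and rounding is only a crude placeholder for real product quantization.

```python
import json
import shutil
import tempfile
from pathlib import Path


def build(vectors, out_dir):
    """Build side: write every index file to disk, keep nothing in memory."""
    out_dir = Path(out_dir)
    coarse = [[round(x) for x in v] for v in vectors]  # placeholder for PQ codes
    (out_dir / "pq.json").write_text(json.dumps(coarse))
    (out_dir / "disk_data.json").write_text(json.dumps(vectors))


class LoadedIndex:
    """Query side: small data in memory, large data left on disk."""

    def __init__(self, index_dir):
        index_dir = Path(index_dir)
        self.in_memory = json.loads((index_dir / "pq.json").read_text())
        self.disk_path = index_dir / "disk_data.json"  # opened only on demand

    def read_full_vector(self, offset):
        return json.loads(self.disk_path.read_text())[offset]


def deserialize(index_dir):
    return LoadedIndex(index_dir)


with tempfile.TemporaryDirectory() as index_node, \
        tempfile.TemporaryDirectory() as query_node:
    build([[0.2, 0.9], [3.6, 4.1]], index_node)
    # Stand-in for shipping the built files to the serving machine.
    shutil.copytree(index_node, query_node, dirs_exist_ok=True)
    index = deserialize(query_node)
    coarse_codes = index.in_memory
    full_vector = index.read_full_vector(1)

print(coarse_codes, full_vector)
```

In this model, an in-memory graph such as Starling's would be one more structure populated inside `LoadedIndex.__init__`, alongside the compressed codes.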

@PwzXxm
Collaborator

PwzXxm commented Nov 25, 2024

Indices operate at the segment level, so I think it would be fine. Keep in mind that the offsets/IDs at the segment level need to be preserved, so if you re-lay them out via Starling, mappings are needed to return the correct IDs to Milvus.

Are you referring to anything in addition to the id-page and page-id mappings that Starling does? Going through the code it seems that when loading the graph partition data these mappings are loaded and then used for page search; is this sufficient, or is there anything in addition one needs to do for the Milvus/Knowhere environment?

Yes, it should have already been taken care of.

Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@github-actions github-actions bot added the stale label Dec 26, 2024