[Feature]: Join two collections #35500

Open
1 task done
xiaofan-luan opened this issue Aug 15, 2024 · 4 comments
Assignees
Labels
kind/feature Issues related to feature request from users

Comments

@xiaofan-luan
Collaborator

Is there an existing issue for this?

  • I have searched the existing issues

Is your feature request related to a problem? Please describe.

In some use cases, users need to search one collection for the top-k nearest neighbors of every entity in another collection.

This can be called a kNN join or a semantic join.

I'm simply listing it here to invite more discussion.
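
To make the request concrete, here is a minimal brute-force sketch of what a kNN join computes. It is only an illustration of the semantics, not a proposed Milvus API: the function name `knn_join`, the dict-based "collections", and the squared-L2 metric are all assumptions made for this example.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def l2_squared(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_join(
    left: Dict[str, Vector], right: Dict[str, Vector], k: int
) -> Dict[str, List[Tuple[str, float]]]:
    """For every entity in `left`, return its k nearest entities in `right`."""
    result = {}
    for left_id, left_vec in left.items():
        scored = (
            (right_id, l2_squared(left_vec, right_vec))
            for right_id, right_vec in right.items()
        )
        # Keep only the k closest (id, distance) pairs, nearest first.
        result[left_id] = heapq.nsmallest(k, scored, key=lambda pair: pair[1])
    return result


# Hypothetical toy collections: every product is joined to its 2 closest reviews.
products = {"p1": [0.0, 0.0], "p2": [10.0, 10.0]}
reviews = {"r1": [0.0, 1.0], "r2": [9.0, 10.0], "r3": [5.0, 5.0]}
joined = knn_join(products, reviews, k=2)
```

The nested loop costs O(|left| x |right|) distance computations, which is why the batching and index ideas discussed in the comments matter at scale.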

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

@xiaofan-luan xiaofan-luan added the kind/feature Issues related to feature request from users label Aug 15, 2024
@xiaofan-luan xiaofan-luan self-assigned this Aug 15, 2024
@chasingegg
Contributor

We could add something like batched search to the vector search engine. This is especially helpful with IVF-based indexes: we can group the queries that probe the same posting lists and compute their distances as a single matrix operation to improve QPS.
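
A rough sketch of the grouping idea, assuming a toy IVF layout in plain Python (the names `group_queries_by_list` and `batched_ivf_search`, the data layout, and the squared-L2 metric are hypothetical and not taken from Milvus or Knowhere):

```python
import heapq
from collections import defaultdict


def l2_squared(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def group_queries_by_list(queries, centroids, nprobe):
    """Map each posting-list id to the indices of the queries that probe it."""
    groups = defaultdict(list)
    for qi, query in enumerate(queries):
        ranked = sorted(
            range(len(centroids)), key=lambda c: l2_squared(query, centroids[c])
        )
        for list_id in ranked[:nprobe]:
            groups[list_id].append(qi)
    return dict(groups)


def batched_ivf_search(queries, centroids, posting_lists, k, nprobe):
    """Visit each posting list once and score it against all queries that need it."""
    candidates = [[] for _ in queries]
    groups = group_queries_by_list(queries, centroids, nprobe)
    for list_id, query_ids in groups.items():
        rows = posting_lists[list_id]  # fetched once for the whole group
        # A real engine would replace this double loop with one matrix multiply
        # between the grouped queries and the vectors of the posting list.
        for qi in query_ids:
            candidates[qi].extend(
                (l2_squared(queries[qi], vec), vid) for vid, vec in rows
            )
    return [heapq.nsmallest(k, c) for c in candidates]


# Hypothetical toy index: two centroids, two posting lists of (id, vector) rows.
centroids = [[0.0, 0.0], [10.0, 10.0]]
posting_lists = {
    0: [(1, [0.0, 1.0]), (2, [1.0, 0.0])],
    1: [(3, [10.0, 9.0]), (4, [9.0, 9.0])],
}
queries = [[0.0, 0.0], [0.5, 0.5], [10.0, 10.0]]
groups = group_queries_by_list(queries, centroids, nprobe=1)
results = batched_ivf_search(queries, centroids, posting_lists, k=1, nprobe=1)
```

Here the first two queries land in the same group, so posting list 0 is read once and shared between them.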

@xiaofan-luan
Collaborator Author

That is exactly what I'm thinking.
To implement this, we need:

  1. LRU caching on segments (usually we don't need to load everything into main memory)
  2. Batch search across all segments (typically NQ == 100k)
  3. GPU or other batch optimizations in the index.
    In this mode, we don't really need to do batch insertion.
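
A minimal sketch of how points 1 and 2 could fit together, assuming a hypothetical `loader` callable that fetches a segment from storage (the class `SegmentLRU`, the function `batch_search`, and the data layout are illustrative names, not existing Milvus components):

```python
import heapq
from collections import OrderedDict


def l2_squared(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class SegmentLRU:
    """Keeps at most `capacity` segments in memory, evicting the least recently used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader  # hypothetical: segment_id -> list of (id, vector)
        self.loads = 0  # number of loads from storage, for illustration
        self._cache = OrderedDict()

    def get(self, segment_id):
        if segment_id in self._cache:
            self._cache.move_to_end(segment_id)  # mark as most recently used
            return self._cache[segment_id]
        rows = self.loader(segment_id)
        self.loads += 1
        self._cache[segment_id] = rows
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return rows


def batch_search(queries, segment_ids, cache, k):
    """Segment-major loop: each segment is loaded once for the whole query batch."""
    best = [[] for _ in queries]
    for segment_id in segment_ids:
        rows = cache.get(segment_id)
        for qi, query in enumerate(queries):
            hits = [(l2_squared(query, vec), vid) for vid, vec in rows]
            best[qi] = heapq.nsmallest(k, best[qi] + hits)
    return best


# Hypothetical storage with three one-row segments; only one fits in memory.
STORAGE = {
    "s1": [(1, [0.0, 0.0])],
    "s2": [(2, [5.0, 5.0])],
    "s3": [(3, [1.0, 1.0])],
}
cache = SegmentLRU(capacity=1, loader=STORAGE.__getitem__)
top = batch_search([[0.0, 0.0], [5.0, 5.0]], ["s1", "s2", "s3"], cache, k=2)
```

Iterating segments in the outer loop means a large NQ amortizes each segment load, so the cache can stay much smaller than the collection.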

@xiaofan-luan
Collaborator Author

@liliu-z @chasingegg thoughts on it?

@liliu-z
Member

liliu-z commented Aug 19, 2024

  1. An async/cron job API is needed.
  2. It is a general operation that can apply to any index and cache strategy (segment LRU, all in memory, etc.), though some combinations are preferable.
  3. It can follow a Map-Reduce pattern: we first run the batch searches and store the results on a cron job leader node (maybe the delegator), and then run a reduce step over them.
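
The Map-Reduce pattern in point 3 might look roughly like the sketch below. It assumes each map task returns a sorted partial top-k per query and that the leader merges them; `map_segment` and `reduce_partials` are hypothetical names, and the async job API from point 1 is not modeled here.

```python
import heapq
from itertools import islice


def l2_squared(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_segment(queries, rows, k):
    """Map step: sorted partial top-k of one segment for every query in the batch."""
    return [
        heapq.nsmallest(k, ((l2_squared(q, vec), vid) for vid, vec in rows))
        for q in queries
    ]


def reduce_partials(partials, k):
    """Reduce step: merge the sorted partial lists per query and keep the best k.

    Assumes at least one partial result is present.
    """
    num_queries = len(partials[0])
    return [
        list(islice(heapq.merge(*(p[qi] for p in partials)), k))
        for qi in range(num_queries)
    ]


# Hypothetical segments of (id, vector) rows; in practice each map task would
# run on the node that holds the segment and ship its partials to the leader.
segments = [
    [(1, [0.0, 0.0]), (2, [5.0, 5.0])],
    [(3, [1.0, 1.0]), (4, [6.0, 6.0])],
]
queries = [[0.0, 0.0], [6.0, 6.0]]
partials = [map_segment(queries, rows, k=2) for rows in segments]
final = reduce_partials(partials, k=2)
```

Because every partial list is already sorted, the reduce step only needs a k-way merge rather than a full sort of all candidates.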
