[Enhancement]: Milvus String Scalar filter is very slow #33538

Closed
1 task done
JackTan25 opened this issue Jun 2, 2024 · 10 comments
Labels
kind/enhancement: Issues or changes related to enhancement
stale: indicates no updates for 30 days

Comments

@JackTan25

Is there an existing issue for this?

  • I have searched the existing issues

What would you like to be added?

I need to know where the scalar string filter code is.

Why is this needed?

Performance improvement

Anything else?

No response

@JackTan25 JackTan25 added the kind/enhancement Issues or changes related to enhancement label Jun 2, 2024
@xiaofan-luan
Collaborator

Milvus string filtering is highly optimized with SIMD and is usually much faster than any other vector DB on the planet.

More information would help the investigation (and advice on how to optimize it would be even better):

  1. What is the expression? What is the average varchar size?
  2. What is the current data size, and what is the expected latency/QPS?
  3. What is the estimated cardinality of your data?

@JackTan25
Author

Well, can you give me a link to the string filter code? I would like to check the core. cc @xiaofan-luan

@JackTan25
Author

The data size is 330,922 rows. I use multi-vector search with a string filter (0.9 selectivity), but performance is slow: it takes 3.23s, while the pure multi-vector search (without the filter) takes 2.484s. I will upload the script. (By the way, the average string size in the data is 254, and the string filter I use is not like '%xxxxx%', with an average string size of 15.)
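To put the reported figures in perspective, here is a quick back-of-the-envelope calculation. It assumes, purely for illustration, that the whole latency difference is caused by the filter and that the filter touches each row exactly once; neither assumption is confirmed anywhere in this thread.

```python
# Figures reported in the comment above.
rows = 330_922
filtered_s = 3.23      # multi-vector search with the string filter
unfiltered_s = 2.484   # multi-vector search without the filter

# Extra time attributed to the filter (assumption: nothing else differs).
overhead_s = filtered_s - unfiltered_s
overhead_ratio = overhead_s / unfiltered_s

# Assumption: the filter is evaluated once per row.
per_row_us = overhead_s / rows * 1e6

print(f"filter overhead: {overhead_s:.3f} s ({overhead_ratio:.0%} over baseline)")
print(f"approximate filter cost per row: {per_row_us:.2f} microseconds")
```

Under these assumptions the filter adds roughly 0.75 s, or about a 30% increase over the unfiltered search.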

@JackTan25
Author

JackTan25 commented Jun 7, 2024

@xiaofan-luan Hi, can you tell me where the core string filter code is in the Milvus project? I would like to study it and try to optimize it.

@xiaofan-luan
Collaborator

> The data size is 330922 rows. And I use multi vector search with string filter (0.9 selectivity), but get slow performance. It costs me 3.23s, and the pure multi vector search (without filter) will cost 2.484s

A pure multi-vector search (without a filter) taking 2.484s doesn't seem to make any sense. If you create the right index, Milvus usually takes 10 ms in memory or 50 ms with disk.

My suggestion is to start not from the code but from profiling, to see which part takes most of your CPU.

May I also know what CPU you are running on? Intel or ARM? How many cores? What index did you use?
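As a first step before server-side CPU profiling, the latency of the filtered and unfiltered searches can be compared from the client. The helper below is only a sketch: `median_latency` is a hypothetical name, and it measures wall-clock time of whatever zero-argument callable you pass in, not where the server spends its CPU.

```python
import statistics
import time
from typing import Callable


def median_latency(run_query: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock latency, in seconds, of run_query()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Wrap your own search call, once with the filter and once without, in a lambda and pass each to `median_latency`; the median is less sensitive to a slow first (cold) request than the mean.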

@xiaofan-luan
Collaborator

What is the Milvus version? Honestly speaking, I don't expect anyone with knowhere can optimize Milvus in 1 month.
All the filtering code is under internal/core/src, in case you are interested.

@xiaofan-luan
Collaborator

You can try creating a Tantivy index, which may help speed up the like expression.
Like is processed as a regular expression, so it is slow anyway. Using an exact match could be much faster.

@JackTan25
Author

Well, I use not like '%xxxxxx%', and I can't find the API for creating a Tantivy index in the Milvus docs. Does Milvus generate a bitmap to do the filtered search? cc @xiaofan-luan

@xiaofan-luan
Collaborator

  1. What is the expression you are using?
  2. See https://milvus.io/docs/index-scalar-fields.md#Custom-indexing for the inverted index; here is an example: https://github.com/milvus-io/pymilvus/blob/2.4/examples/inverted_index_example.py
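For readers who do not want to follow the links, creating an inverted index with the pymilvus 2.4 `MilvusClient` API might look roughly like the sketch below. The helper name `create_inverted_index`, the index naming scheme, and the collection and field names in the usage comment are all placeholders, and whether this index actually speeds up a `not like '%...%'` filter is not established in this thread.

```python
def create_inverted_index(client, collection_name: str, field_name: str) -> None:
    """Create an INVERTED scalar index on a VARCHAR field.

    `client` is expected to be a pymilvus.MilvusClient (2.4 or later).
    """
    index_params = client.prepare_index_params()
    index_params.add_index(
        field_name=field_name,
        index_type="INVERTED",
        index_name=f"{field_name}_inverted",  # hypothetical naming scheme
    )
    client.create_index(collection_name=collection_name, index_params=index_params)


# Usage (requires a running Milvus instance; URI and names are placeholders):
#   from pymilvus import MilvusClient
#   client = MilvusClient(uri="http://localhost:19530")
#   create_inverted_index(client, "my_collection", "my_varchar_field")
```

Taking the client as a parameter keeps the helper free of connection details, so it can be reused across collections.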

Like is always slow, no matter what database you are using, especially when you don't specify a prefix.
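The point about prefixes can be illustrated with a generic sketch (this is not Milvus's implementation): with a sorted list of values, a prefix pattern such as `'ap%'` can jump to the first candidate by binary search, whereas an infix pattern such as `'%ap%'` has to inspect every value. The function names below are hypothetical.

```python
import bisect


def prefix_matches(sorted_values: list[str], prefix: str) -> list[str]:
    """Find values starting with `prefix` using binary search on a sorted list."""
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = lo
    # All values sharing the prefix are contiguous in sorted order.
    while hi < len(sorted_values) and sorted_values[hi].startswith(prefix):
        hi += 1
    return sorted_values[lo:hi]


def infix_matches(values: list[str], needle: str) -> list[str]:
    """Find values containing `needle`; every value must be inspected."""
    return [v for v in values if needle in v]


values = sorted(["apple", "apricot", "banana", "grape", "pineapple"])
print(prefix_matches(values, "ap"))
print(infix_matches(values, "ap"))
```

The prefix lookup only visits the matching run plus one extra element after the binary search, while the infix scan grows linearly with the number of rows regardless of how many match.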


stale bot commented Jul 14, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Jul 14, 2024
@stale stale bot closed this as completed Jul 23, 2024