This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Add nmslib's bit_hamming spaces into plugin #283

Open
luyuncheng opened this issue Dec 18, 2020 · 1 comment
Comments

@luyuncheng

I saw that #264 adds Hamming distance to custom scoring, which is great functionality. nmslib also has a bit_hamming space (space_bit_hamming), and I think we could add it to the plugin.

I looked at the space_bit_hamming and space_bit_hamming_test code. Maybe we could add "SpaceBitVector" to the plugin and support the bit_hamming space, which does not use an optimized index.
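For readers unfamiliar with the distance in question, here is a minimal sketch of a bit-level Hamming distance: the number of positions at which two bit vectors differ, computed as the population count of their XOR. The helper names `pack_bits` and `bit_hamming` are made up for illustration, and packing into a single Python integer is an assumption for brevity; this is not how nmslib lays out its data.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one int (first bit is most significant)."""
    value = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        value = (value << 1) | b
    return value


def bit_hamming(a, b):
    """Count the bit positions at which two packed vectors differ."""
    # XOR leaves a 1 exactly where the inputs disagree; count those 1s.
    return bin(a ^ b).count("1")


x = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
y = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
distance = bit_hamming(x, y)
print(distance)
```

The two example vectors disagree at three positions, so the printed distance is 3.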

I also looked at PR #161, which adds a non-optimized index for "negdotprod", and at nmslib's Python binding code (python_binding_nmslib). Maybe we could add a "save_data" option to the plugin so that it can store both the index and the dataset for a non-optimized index.

So I have submitted a PR for this.

@vamshin
Member

vamshin commented Jan 6, 2021

Hi @luyuncheng,

We have a few concerns about the non-optimized index for Hamming distance in nmslib. In Elasticsearch we store a serialized graph per segment, which means one additional file per knn_vector field. For non-optimized indices like Hamming we would end up with 2 files per segment: one for the graph and one for the data (the elements in the graph). For a large data set, the number of segments can grow large enough to exhaust the available file descriptors. PR #161, which you mentioned, is on hold for the same reason. We worked with the nmslib team on an optimized index for negative dot product so that it needs only one file per segment, and we will open a new PR that enables negative dot product with the optimized index.
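To make the file-count concern concrete, here is a rough back-of-the-envelope sketch. The function name and the segment and field counts are purely hypothetical; only the "one file versus two files per knn_vector field per segment" ratio comes from the comment above.

```python
def knn_files(segments, knn_fields, files_per_field):
    """Estimate how many k-NN files exist across the given segments."""
    return segments * knn_fields * files_per_field


# Hypothetical node: 500 segments in total, each with 2 knn_vector fields.
optimized = knn_files(segments=500, knn_fields=2, files_per_field=1)
non_optimized = knn_files(segments=500, knn_fields=2, files_per_field=2)

print(optimized, non_optimized)
```

Under these assumed numbers the non-optimized layout needs twice as many files as the optimized one, and every one of them may need an open file descriptor.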

There are a couple of suggestions:

  1. Enable optimized index support for Hamming in nmslib, and then incorporate the changes into the k-NN plugin.
  2. Use the custom scoring feature for Hamming.

How about you start with the 2nd approach and let us know if you see any performance concerns with custom scoring for Hamming?
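As a starting point for the second approach, a custom scoring request body might look roughly like the sketch below, built here as a Python dict. The field name `my_hash` and the query value are hypothetical, and the script parameters (`knn_score`, `space_type` set to `hammingbit`) are written from memory of the k-NN plugin documentation, so please check them against the plugin version you run.

```python
import json


def hamming_score_query(field, query_value, size=10):
    """Build a script_score request body that ranks documents by bit Hamming distance."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Score every document; swap in a filter to narrow the scan.
                "query": {"match_all": {}},
                "script": {
                    "lang": "knn",           # assumed: k-NN plugin script language
                    "source": "knn_score",   # assumed: custom scoring script name
                    "params": {
                        "field": field,
                        "query_value": query_value,
                        "space_type": "hammingbit",  # assumed space name
                    },
                },
            }
        },
    }


body = hamming_score_query("my_hash", 23)
print(json.dumps(body, indent=2))
```

Because custom scoring scans every document matched by the inner query, measuring latency with a realistic filter in place of `match_all` would give a fairer picture of any performance concerns.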

We could then decide whether to support an optimized or non-optimized Hamming index. Until then, we would like to keep your PR (#284) for Hamming support on hold.
