I know this is a reference implementation, but it's the one that uses the iscc package name on PyPI, so it will likely be the most widely used. With that in mind, here's a big performance suggestion.
The bottleneck when hashing long texts is minimum_hash. Below is a much faster version. Compile it with
CFLAGS="-O3" cythonize -a -i utils.pyx
This is the code:
# cython: language_level=3
from cython.view cimport array

MINHASH_PERMUTATIONS = [
    # Copy values from iscc/const.py
]

cpdef unsigned int[:] cython_minimum_hash(unsigned int[:] features, int n=64):
    cdef unsigned long max_int64 = (1 << 64) - 1
    cdef unsigned long mersenne_prime = (1 << 61) - 1
    cdef unsigned long max_hash = (1 << 32) - 1
    # Declare v as a C integer too, so the inner loop stays in pure C.
    cdef unsigned long a, b, v, min_val
    cdef unsigned int[:] result = array(shape=(n,), itemsize=sizeof(unsigned int), format="I")
    cdef int i, j

    for i in range(n):
        a, b = MINHASH_PERMUTATIONS[i]
        min_val = max_int64
        for j in range(features.shape[0]):
            # Universal hash (a * x + b) mod Mersenne prime, truncated to 32 bits.
            v = ((a * features[j] + b) & max_int64) % mersenne_prime
            min_val = min(min_val, v & max_hash)
        result[i] = <unsigned int>min_val
    return result
The last function is a good place to insert a comparison of the results with the slow version. I hashed hundreds of long documents to check that they are identical.
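A quick equivalence check along those lines could look like the sketch below. It assumes the pure-Python function is reachable as iscc.minimum_hash and that the compiled module is importable as utils; adjust the imports to your layout.

# Hypothetical equivalence check: run the compiled MinHash and the pure-Python
# one over random 32-bit feature vectors and assert the results match.
import random
from array import array

import iscc  # assumed to expose the slow minimum_hash
from utils import cython_minimum_hash

def check_equivalence(trials=100, num_features=5000):
    for _ in range(trials):
        features = [random.getrandbits(32) for _ in range(num_features)]
        slow = list(iscc.minimum_hash(features))
        fast = memoryview(cython_minimum_hash(array("I", features))).tolist()
        assert slow == fast, "MinHash mismatch between slow and fast versions"

if __name__ == "__main__":
    check_equivalence()
    print("identical results")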
We have added some basic optional Cython support there. Pull requests for performance improvement are very welcome. An up-to-date higher level lib with content extraction support can be found here: https://github.com/iscc/iscc-sdk.
You can also use setup.py to build the module:
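The setup.py itself isn't reproduced above; a minimal sketch of that approach (extension name and compiler flags are assumptions) could be:

# Hypothetical setup.py: build utils.pyx as a C extension with -O3.
from setuptools import Extension, setup
from Cython.Build import cythonize

extensions = [
    Extension("utils", ["utils.pyx"], extra_compile_args=["-O3"]),
]

setup(
    ext_modules=cythonize(extensions, compiler_directives={"language_level": "3"}),
)

Then build it in place with python setup.py build_ext --inplace.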
Until it's integrated, I monkey patch it like so:
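The exact patch isn't shown above; a minimal sketch of the idea, assuming the pure-Python function lives at iscc.minimum_hash and returns a list of ints, might be:

# Hypothetical monkey patch: rebind iscc.minimum_hash to a wrapper around the
# compiled version before generating any codes.
from array import array

import iscc
from utils import cython_minimum_hash

def _fast_minimum_hash(features, n=64):
    # Keep the original signature and return a plain list of ints.
    return memoryview(cython_minimum_hash(array("I", features), n)).tolist()

iscc.minimum_hash = _fast_minimum_hash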