MalwareClustering with ApiVector
Starting from pure Python, this study walks through multiprocessing, NumPy, Cython, and Dask, arriving at dask-cuda with CuPy, a NumPy-compatible array library accelerated by CUDA. The study also explored different places to store and retrieve data, such as Neo4j, MongoDB, and PostgreSQL, and different data formats, such as strings, NumPy vectors, and NumPy packbits vectors.
As of today, we get the best results using dask-cuda, CuPy, and Zarr.
A presentation with benchmarks and results is available here: https://ldo-cert.github.io/MISP-Summit-05/#/home
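
As a rough illustration of the first step in that progression (pure Python to NumPy), the sketch below compares two binary vectors both ways. It assumes Jaccard similarity as the comparison metric and a vector length of 1024 bits; neither is stated above, and the function names are hypothetical.

```python
import numpy as np

VECTOR_BITS = 1024  # assumed vector length, for illustration only


def jaccard_python(a, b):
    """Jaccard similarity of two equal-length 0/1 sequences, pure Python."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero vectors count as identical.
    return both / either if either else 1.0


def jaccard_numpy(a, b):
    """Same metric, vectorized with NumPy boolean operations."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    either = np.count_nonzero(a | b)
    if either == 0:
        return 1.0
    return np.count_nonzero(a & b) / either


# Random stand-ins for real vectors (not real malware data).
rng = np.random.default_rng(0)
vec_a = rng.integers(0, 2, VECTOR_BITS, dtype=np.uint8)
vec_b = rng.integers(0, 2, VECTOR_BITS, dtype=np.uint8)

print(jaccard_python(vec_a.tolist(), vec_b.tolist()))
print(jaccard_numpy(vec_a, vec_b))
```

Both functions return the same value; the NumPy version replaces the Python-level loop with array operations, which is the kind of change the benchmarks compare.
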

implementation | 1 vs 1 | 1 vs many | many vs many |
---|---|---|---|
python | x | x | x |
numpy | x | x | x |
numexpr | x | x | x |
numba | x | x | x |
pybind11 | x | x | x |
cython | x | x | x |
pythran | x | x | x |
dask | x | x | x |
tensorflow | x | x | x |
dask-cuda with cupy | x | x | x |
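
The three comparison modes in the table above can be sketched in NumPy as one pairwise routine plus two thin wrappers. This is a minimal sketch that assumes Jaccard similarity over 0/1 vectors (the metric is not stated above); all function names are hypothetical.

```python
import numpy as np


def jaccard_many_vs_many(left, right):
    """Pairwise Jaccard between the rows of two 0/1 matrices.

    Returns an array of shape (len(left), len(right)).
    """
    left = np.asarray(left, dtype=np.int32)
    right = np.asarray(right, dtype=np.int32)
    # For 0/1 rows, the dot product counts positions set in both vectors.
    intersection = left @ right.T
    union = left.sum(axis=1)[:, None] + right.sum(axis=1)[None, :] - intersection
    # Convention chosen here: two all-zero vectors count as identical.
    return np.where(union == 0, 1.0, intersection / np.maximum(union, 1))


def jaccard_one_vs_many(query, collection):
    """Compare one vector against every row of a collection."""
    return jaccard_many_vs_many(np.asarray(query)[None, :], collection)[0]


def jaccard_one_vs_one(a, b):
    """Compare two single vectors."""
    return float(jaccard_one_vs_many(a, np.asarray(b)[None, :])[0])


# Tiny made-up vectors, only to show the shapes involved.
samples = np.array(
    [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 1, 1]],
    dtype=np.uint8,
)

print(jaccard_one_vs_one(samples[0], samples[1]))    # scalar
print(jaccard_one_vs_many(samples[0], samples))      # shape (3,)
print(jaccard_many_vs_many(samples, samples).shape)  # (3, 3)
```

The many vs many result grows with the product of the two collection sizes, which is the case where chunked or GPU-backed arrays are most likely to matter.
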

data source | size | times |
---|---|---|
Neo4j | x | x |
MongoDB | x | x |
PostgreSQL | x | x |
Zarr | x | x |

data | data type | size | times |
---|---|---|---|
ApiScout | string | x | x |
numpy vector | binary | x | x |
numpy packbits vector | binary | x | x |
zarr arrays | binary | x | x |
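
To make the difference between the in-memory formats in the table above concrete, the sketch below stores the same bit vector as a string, as a NumPy `uint8` vector, and as a `np.packbits` vector, then compares two packed vectors without unpacking them. The 1024-bit length, the Jaccard metric, and the names `POPCOUNT` and `jaccard_packed` are assumptions made for illustration.

```python
import numpy as np

VECTOR_BITS = 1024  # assumed vector length, for illustration only

# Random stand-in for a real vector (not real malware data).
rng = np.random.default_rng(42)
bits = rng.integers(0, 2, VECTOR_BITS, dtype=np.uint8)

as_string = "".join(map(str, bits.tolist()))  # one character per bit
packed = np.packbits(bits)                    # eight bits per byte

print(len(as_string), bits.nbytes, packed.nbytes)  # 1024 1024 128

# Packing is lossless: unpacking restores the original vector.
restored = np.unpackbits(packed)[:VECTOR_BITS]
assert np.array_equal(restored, bits)

# Lookup table: number of set bits for every possible byte value.
POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(axis=1)


def jaccard_packed(p, q):
    """Jaccard similarity computed directly on two packbits vectors."""
    either = POPCOUNT[p | q].sum()
    if either == 0:
        return 1.0  # convention chosen here for two all-zero vectors
    return POPCOUNT[p & q].sum() / either


other = np.packbits(rng.integers(0, 2, VECTOR_BITS, dtype=np.uint8))
print(jaccard_packed(packed, other))
```

The packed form is eight times smaller than the `uint8` vector, and bitwise AND/OR plus a popcount table gives the same similarity value as the unpacked comparison.
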