A Comprehensive Survey and Experimental Comparison of Graph-based Approximate Nearest Neighbor Search

Introduction

Approcimate Nearest Neighbor Search (ANNS) is a fundamental building block in various application domains. Recently, graph-based algorithms have emerged as a very effective choice to implement ANNS. Our paper (arXiv link, PDF) provides a comprehensive comparative analysis and experimental evaluation of representative graph-based ANNS algorithms on carefully selected datasets of varying sizes and characteristics.

This project contains the code, dataset, optimal parameters, and other detailed information used for the experiments of our paper. It is worth noting that we reimplement all algorithms based on exactly the same design pattern, programming language (except for the hash table in IEH) and tricks, and experimental setup, which makes the comparison more fair.

Algorithms

we evaluate thirteen representative graph-based ANNS algorithms, and their papers and the original codes are given in the following table.

ALGORITHM	PAPER	CODE
KGraph	WWW'2011	C++/Python
FANNG	CVPR'2016	-
NSG	VLDB'2019	C++
NSSG	TPAMI'2021	C++/Python
DPG	TKDE'2019	C++
Vamana	NeurIPS'2019	-
EFANNA	arXiv'2016	C++/MATLAB
IEH	IEEE T CYBERNETICS'2014	-
NSW	IS'2014	C++/Python
HNSW	TPAMI'2018	C++/Python
NGT-panng	SISAP'2016	C++/Python
NGT-onng	arXiv'2018	C++/Python
SPTAG-KDT	ACM MM'2012; CVPR'2012; TPAMI'2014	C++
SPTAG-BKT	ACM MM'2012; CVPR'2012; TPAMI'2014	C++
HCNNG	PR'2019	-
KDRG	SIGKDD'2011	-

Datasets

Our experiment involves eight real-world datasets popularly deployed by existing works. All datasets are pre-split into base data and query data and come with groundtruth data in the form of the top 20 or 100 neighbors. Additional twelve synthetic datasets are used to test the scalability of each algorithm to the performance of different datasets.

Note that, all base data and query data are converted to fvecs format, and groundtruth data is converted to ivecs format. Please refer here for the description of fvecs and ivecs format. All datasets in this format can be downloaded from here.

Parameters

For the optimal parameters of each algorithm on all experimental datasets, see the parameters page.

Usage

Prerequisites

GCC 4.9+ with OpenMP
CMake 2.8+
Boost 1.55+

Compile on Linux

$ git clone https://github.com/Lsyhprum/WEAVESS.git
$ cd WEAVESS/
$ mkdir build && cd build/
$ cmake ..
$ make -j

Index construction evaluation

Before building index, you should set the root directory for the dataset in WEAVESS/test/main.cpp first. Then, you can run the following instructions for build graph index.

cd WEAVESS/build/test/
./main algorithm_name dataset_name build

After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. You can run the following command in current directory for getting other index information, such as average out-degree, graph quality, and the number of connected components.

./main algorithm_name dataset_name info

Search performance

With the index built, you can run the following commands to perform the search. Related information about the search such as search time, distance evaluation times, candidate set size, average query path length, memory load can be obtained or calculated according to the output log information.

cd WEAVESS/build/test/
./main algorithm_name dataset_name search

Components evaluation

Note that the default dev branch is the evaluation of the overall performance of all algorithms, and the evaluation of a certain component needs to be carried out under the test branch. For more details, please see our paper.

Machine learning-based optimizations

See here for details.

Reference

Please cite our work in your publications if it helps your research:

@misc{wang2021comprehensive,
    title={A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search},
    author={Mengzhao Wang and Xiaoliang Xu and Qiang Yue and Yuxiang Wang},
    year={2021},
    eprint={2101.12631},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}

Acknowledgements

Thanks to everyone who provided references for this project. Special thanks to Dr. Weijie Zhao, Dr. Mingjie Li, and Dr. Cong Fu for their assistance in the necessary implementation of this project.

Name		Name	Last commit message	Last commit date
Latest commit History 153 Commits
.idea		.idea
algo		algo
cmake/module		cmake/module
dataset		dataset
include/weavess		include/weavess
ml		ml
parameters		parameters
src		src
test		test
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Comprehensive Survey and Experimental Comparison of Graph-based Approximate Nearest Neighbor Search

Introduction

Algorithms

Datasets

Parameters

Usage

Prerequisites

Compile on Linux

Index construction evaluation

Search performance

Components evaluation

Machine learning-based optimizations

Reference

Acknowledgements

About

Releases

Packages

Languages

License

wenyangzz/WEAVESS

Folders and files

Latest commit

History

Repository files navigation

A Comprehensive Survey and Experimental Comparison of Graph-based Approximate Nearest Neighbor Search

Introduction

Algorithms

Datasets

Parameters

Usage

Prerequisites

Compile on Linux

Index construction evaluation

Search performance

Components evaluation

Machine learning-based optimizations

Reference

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages