How to make Faiss run faster
This page contains a few scattered performance tips for making Faiss run faster, including speed-vs-accuracy tradeoffs for cases where speed is more important.
Here is a summary of parameters to look at when configuring an index.
Start by picking the index for similarity search or for vector codec purposes.
Additionally, please read the information about additive quantizers.
It is very common to instantiate an index via a faiss.index_factory()
call, but this way of instantiating sets the index parameters to safe values, while there are many speed-related parameters. Many Faiss components may utilize:
- iterative algorithms used at encoding time; the more iterations are performed, the better the recall rate is.
  - HNSW: efConstruction
  - ResidualQuantizer: niter_codebook_refine
  - LocalSearchQuantizer: encode_ils_iters, icm_iters, train_iters, encode_ils_iters (for vector codecs)
- heuristics that limit the number of candidates considered during the training or the search process, such as beam search. The more candidates are evaluated, the better the recall rate is.
  - IndexBinaryIVF: nprobe, max_codes
  - IndexIVF: nprobe, max_codes
  - ResidualQuantizer: max_beam_size
  - HNSW: efSearch
- additional optional versions, such as LUT-based approximations; they allow trading speed for accuracy.
  - IndexIVFPQ: do_polysemous_training, polysemous_ht, use_precomputed_table
  - IndexIVFPQFastScan: use_precomputed_table
  - IndexPQ: do_polysemous_training, polysemous_ht, search_type
  - ResidualQuantizer: use_beam_LUT, approx_topk_mode, train_type
  - LocalSearchQuantizer: update_codebooks_with_double
- computations that are optimized for a particular case, such as big-batch IVF search.
- various internal buffers for processing input data in chunks; see below.
- random number generators; these are used to ensure that results can be reproduced.
The default index parameters should work reasonably well in the most general cases, but it is highly recommended to study the available parameters in order to find the optimal point for your particular speed-vs-accuracy-vs-RAM tradeoff.
The recipe: study the available parameters of a chosen index and its subindices.
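For example, here is a minimal Python sketch of setting a few of the speed-related parameters listed above after building an index with the index factory. The factory strings, parameter values and random data are arbitrary placeholders, not recommendations:

```python
import numpy as np
import faiss

d = 64
xt = np.random.rand(50000, d).astype('float32')

# The factory call uses conservative defaults; speed-related parameters are set afterwards.
index = faiss.index_factory(d, "IVF256,PQ16")
index.train(xt)

# IVF search-time parameter: more probed lists = better recall, slower search.
faiss.extract_index_ivf(index).nprobe = 16

# HNSW construction- and search-time parameters.
hnsw_index = faiss.IndexHNSWFlat(d, 32)
hnsw_index.hnsw.efConstruction = 80  # larger = better graph, slower add()
hnsw_index.hnsw.efSearch = 64        # larger = better recall, slower search
```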
K-means clustering is an often-used facility inside Faiss. By default, the k-means implementation in faiss/Clustering.h uses 25 iterations (niter parameter) and up to 256 samples from the input dataset per cluster (max_points_per_centroid parameter). The samples are chosen randomly. For example, the default PQx12 training is ~4x slower than PQx10 training and ~16x slower than PQx8 training because of the proportionally higher number of samples used from the input dataset.
IVF-based indices use 10 iterations of k-means by default (faiss/IndexIVF.h, faiss/IndexBinaryIVF.h, faiss/gpu/GpuIndexIVF.cu).
The recipe: play with the k-means clustering parameters niter and max_points_per_centroid.
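A minimal sketch of adjusting these clustering parameters on an IVF index (random data; the values below are arbitrary):

```python
import numpy as np
import faiss

d = 64
xt = np.random.rand(100000, d).astype('float32')

index = faiss.index_factory(d, "IVF1024,Flat")

# The coarse quantizer training exposes its ClusteringParameters as `cp`.
index_ivf = faiss.extract_index_ivf(index)
index_ivf.cp.niter = 25                     # more iterations = better centroids, slower training
index_ivf.cp.max_points_per_centroid = 128  # subsample the training set more aggressively

index.train(xt)
```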
Faiss is optimized for working with batches of samples, rather than processing samples one by one. Internally, Faiss parallelizes over the batch elements in a way that is more efficient than if parallelization was performed by the caller.
The recipe: try to process your samples in batches, if appropriate.
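For instance, a small illustrative comparison on random data:

```python
import numpy as np
import faiss

d, nb, nq, k = 64, 100000, 1000, 10
xb = np.random.rand(nb, d).astype('float32')
xq = np.random.rand(nq, d).astype('float32')

index = faiss.IndexFlatL2(d)
index.add(xb)

# Slow: one query at a time; each call pays a fixed overhead and parallelizes poorly.
for i in range(nq):
    D1, I1 = index.search(xq[i:i + 1], k)

# Fast: a single batched call; Faiss parallelizes over the queries internally.
D, I = index.search(xq, k)
```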
Consider using the Train_shared training mode for PQ, which trains a single codebook shared by all subquantizers instead of a separate one for every subquantizer. This significantly reduces the training time for, say, 10-bit or 12-bit PQ, but at the cost of some accuracy.
```cpp
faiss::IndexPQ indexPQ;
indexPQ.pq.train_type = faiss::ProductQuantizer::Train_shared;

faiss::IndexIVFPQ indexIVFPQ;
indexIVFPQ.pq.train_type = faiss::ProductQuantizer::Train_shared;
```
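A rough Python equivalent, assuming the enum constant is exposed as faiss.ProductQuantizer.Train_shared (the usual SWIG exposure of the nested C++ enum):

```python
import faiss

d = 64
index_pq = faiss.IndexPQ(d, 8, 12)  # 8 sub-quantizers, 12 bits each

# Train one shared codebook for all sub-quantizers instead of 8 separate ones.
index_pq.pq.train_type = faiss.ProductQuantizer.Train_shared
```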
These tips are about the environment of Faiss: threading, memory configuration, etc.
Most modern CPU architectures and operating systems support huge memory pages.
It is typical for Faiss to use a random access pattern when reading the needed data, such as codebooks. As a result, the TLB cache operates under very heavy pressure. It was observed that large memory pages help in certain cases. In particular, the use of 2M / 1G pages gained up to 20% speed in certain of our vector codec experiments on the x86-64 platform. Faiss does not contain built-in support for marking certain allocated memory blocks as ones where huge pages matter, so the simplest way to try out huge memory pages in your application is to use a memory allocator that supports them, such as mimalloc.
We have not conducted any experiments with huge memory pages on the ARM platform yet.
The recipe: try a memory allocator that supports huge memory pages and run your application with them enabled. Please be aware that huge memory pages are a scarce resource and need to be managed carefully.
Faiss does not have NUMA-aware code. This means that Faiss does not coordinate memory allocations in order to minimize the traffic between NUMA nodes. As an example, we observed that two runs of the same benchmark, one on a single 12C/24T NUMA node and one on four NUMA nodes of the same machine, yielded the same running time!
The recipe: try playing with the amount of CPU resources (cores, NUMA nodes) that you allocate for your application.
Faiss uses the Intel MKL library, if possible, and MKL uses the OpenMP library under the hood. According to the MKL manual, it runs best when the number of threads equals the number of physical cores, which is in agreement with our benchmarks. The default setting is to use the number of hyperthreads (twice the number of cores).
The recipe: try setting the number of OpenMP threads to the number of physical CPU cores, unless you are using a version of Faiss without any multithreading. For example, on a 12C/24T machine, run your application as OMP_NUM_THREADS=12 my-application.
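The thread count can also be set programmatically from Python; the value 12 below simply stands for the physical core count of that hypothetical machine:

```python
import faiss

# Limit Faiss / OpenMP to the number of physical cores instead of hyperthreads.
faiss.omp_set_num_threads(12)
```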
If a huge dataset is passed as input, it may be split into chunks and processed chunk-by-chunk under the hood. The overall processing can be sped up by increasing the chunk size, which makes caches and linear algebra operations more effective. On the other hand, larger chunks increase the amount of RAM needed for the allocated chunks and the time spent on those memory allocations. Here is a list of the various internal buffer sizes that one may vary (a small usage sketch follows the list):
- IndexBinaryFlat.query_batch_size
- LocalSearchQuantizer.chunk_size
- AdditiveQuantizer.max_mem_distances
- faiss.multi_index_quantizer_search_bs
- faiss.index2layer_sa_encode_bs
- faiss.product_quantizer_compute_codes_bs
- faiss.index_ivfpq_add_core_o_bs
- faiss.precomputed_table_max_bytes
- faiss.hamming_batch_size
- faiss.rowwise_minmax_sa_encode_bs
- faiss.rowwise_minmax_sa_decode_bs
- faiss.distance_compute_blas_threshold
- faiss.distance_compute_blas_query_bs
- faiss.distance_compute_blas_database_bs
- faiss.distance_compute_min_k_reservoir
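From Python, these global knobs are usually reachable through faiss.cvar, the SWIG container for C++ global variables. A small sketch, assuming that exposure (the values are arbitrary, not recommendations):

```python
import faiss

# Number of query vectors above which brute-force k-NN switches to the
# blocked BLAS-based implementation.
print(faiss.cvar.distance_compute_blas_threshold)
faiss.cvar.distance_compute_blas_threshold = 1024

# Block sizes used by the BLAS path; larger blocks may improve throughput
# at the cost of larger temporary buffers.
faiss.cvar.distance_compute_blas_query_bs = 8192
faiss.cvar.distance_compute_blas_database_bs = 2048
```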
Faiss uses custom kernels that rely on AVX2 or ARM NEON intrinsics. Currently, Faiss does not provide specialized AVX-512 kernels (except for one, which must be enabled manually). We are considering supporting AVX-512 more widely, but:
- Faiss may use the Intel MKL library (if available), which is capable of using AVX-512 under the hood (if MKL decides to). So, expensive GEMM-related operations are already accelerated via AVX-512.
- AVX-512 computations may lead to CPU downclocking, which may result in lower performance compared to the AVX2 version.
- AVX-512 kernels need developer time and dedication, and kernels may be tricky to create and benchmark.
- AVX-512 would increase the compilation time on our end, because it would introduce a new set of binaries.
As of today (Jan 2023):
- AVX-512 downclocking seems to be fixed in new generations of Intel CPUs.
- AMD Zen4 seems to provide support for AVX-512 instructions.
- bfloat16 instructions are coming to AVX-512 and could be used in certain cases, such as additive quantizers, but the corresponding support needs to be added and benchmarked first.
Faiss may use the Intel MKL library, which allows enabling or disabling certain instruction sets (for example, via the MKL_ENABLE_INSTRUCTIONS environment variable), if that fits your case and/or if the frequency downclocking is not beneficial.
The only custom AVX-512 kernel provided so far is a fused kernel that combines the L2 distance computation with the search for the nearest neighbor. This kernel needs to be enabled manually. It may provide some additional minor speedup for PQ training or k-means clustering for small dimensionalities (1-16).
(Faiss 1.7.3+) Transposed centroids allow trading a very small amount of RAM for a speedup of the pq.search() call (see the reference in the wiki). This can be used for CPU-based IndexPQ and IndexIVFPQ as well, for example:
```python
# train an IVFPQ CPU index
...
# create the tables
index.pq.sync_transposed_centroids()
# do the search
while app_is_running:
    D, I = index.search(xq, k)
```
Fast vector decoding facilities were added in the form of C++ templates for certain codecs.
To speed up training, do not use polysemous codes for vector codec purposes. For example, use PQ5x12np instead of PQ5x12.
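For example (a sketch on random data; the dimensionality, training set size and the PQ5x12 codec are purely illustrative):

```python
import numpy as np
import faiss

d = 20  # 5 sub-quantizers x 4 dimensions each
xt = np.random.rand(100000, d).astype('float32')  # small training set, just for illustration

# "np" suffix: skip polysemous training, which only matters for symmetric search,
# not for encoding/decoding vectors.
codec = faiss.index_factory(d, "PQ5x12np")
codec.train(xt)

codes = codec.sa_encode(xt[:1000])  # compressed codes, 5 x 12 bits per vector
recons = codec.sa_decode(codes)     # approximate reconstruction
```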
There are tradeoffs in terms of code size vs. runtime that can be adjusted at implementation time.
Faiss is implemented in C++ and heavily utilizes template dispatching for various cases. For example, this is the function from faiss/utils/distances_simd.cpp that computes inner products between a single d-dimensional vector x and ny d-dimensional vectors y.
```cpp
void fvec_inner_products_ny(
        float* dis,
        const float* x,
        const float* y,
        size_t d,
        size_t ny) {
#define DISPATCH(dval)                                  \
    case dval:                                          \
        fvec_op_ny_D##dval<ElementOpIP>(dis, x, y, ny); \
        return;

    switch (d) {
        DISPATCH(1)
        DISPATCH(2)
        DISPATCH(4)
        DISPATCH(8)
        DISPATCH(12)
        default:
            fvec_inner_products_ny_ref(dis, x, y, d, ny);
            return;
    }
#undef DISPATCH
}
```
As one may see, there are specialized cases for dimensionalities 1, 2, 4, 8 and 12 that dispatch to custom kernels; otherwise, the implementation relies on general-purpose code and the compiler.
Custom specialized kernels tend to show better performance than general-purpose code, but at the cost of binary size.
Please feel free to add some custom kernels that fit your case, if appropriate, and send us a pull request with your change! :)
Due to the rigid constraints of GPU computations, templates are more heavily used in the CUDA code than in the C++ code.
All feedback regarding Faiss execution speed is highly appreciated! The most valuable feedback is a reproducible standalone Python script or a small C++ program that exposes performance hot-spots inside Faiss.