Lab Test 2: 100-Million-Scale Vector Similarity Search

1. Prepare test data and scripts

The 100 million vectors used in this test are extracted from the SIFT1B dataset. The experiment was completed successfully on the following hardware configuration.

Component     Minimum Config
OS            Ubuntu LTS 18.04
CPU           Intel(R) Xeon(R) Platinum 8163 CP
GPU           Nvidia GeForce GTX 1060, 6GB GDDR5
GPU Driver    CUDA 10.2, Driver 440.100
Memory        755GB DDR4
Hard Disk     1.9T

Download the following data and scripts, and save them to a directory named milvus_sift100m.

When the download is complete, milvus_sift100m should contain the following files:

  1. The bvecs_data file containing 100 million vectors
  2. The query.npy file that has 10,000 query vectors
  3. The ground_truth.txt file with the top 1000 most similar results for each query vector
  4. The test script files: main.py, milvus_toolkit.py, milvus_load.py, config.py
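For orientation, the sketch below reads vectors in the .bvecs layout used by the SIFT1B distribution: each record is a 4-byte little-endian integer giving the dimension, followed by that many unsigned bytes. It assumes the bvecs_data file follows this layout; `iter_bvecs` is a hypothetical helper written for illustration and is not part of the test scripts.

```python
import io
import struct
from typing import BinaryIO, Iterator, List


def iter_bvecs(stream: BinaryIO) -> Iterator[List[int]]:
    """Yield vectors from a binary stream in the .bvecs layout.

    Each record is a 4-byte little-endian integer d (the dimension)
    followed by d unsigned bytes (the vector components).
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise ValueError("truncated dimension header")
        (dim,) = struct.unpack("<i", header)
        if dim <= 0:
            raise ValueError(f"invalid dimension: {dim}")
        payload = stream.read(dim)
        if len(payload) != dim:
            raise ValueError("truncated vector payload")
        yield list(payload)


if __name__ == "__main__":
    # Two synthetic 4-dimensional records, not real SIFT data.
    raw = struct.pack("<i4B", 4, 1, 2, 3, 4) + struct.pack("<i4B", 4, 5, 6, 7, 8)
    for vector in iter_bvecs(io.BytesIO(raw)):
        print(vector)
```

To try it on a real file, open the file in binary mode (`open(path, "rb")`) and pass the file object to `iter_bvecs`. Reading record by record keeps memory use low, which matters at this data size.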

Note: Please read the README carefully before running the test scripts, and change the script parameters to match your scenario.

2. Configure Milvus parameters

To optimize Milvus's performance, you can change system parameters to suit your requirements. In this test, a recall rate of 90% can be achieved with the recommended values in the table below.

Configuration file: /home/$USER/milvus/conf/server_config.yaml

Parameter               Recommended value
cache.cache_size        25
gpu.cache_size          4
gpu_search_threshold    1001
search_devices          -gpu0

Refer to Milvus Configuration for more information.

Use the default values for all other parameters. After setting the parameter values, restart the Milvus Docker container to apply the changes.

$ docker restart <container id>

3. Create a table and build indexes

Make sure Milvus is installed and running. For installation details, see Milvus Quick Start.

Before testing, modify the corresponding parameters according to the script instructions.

Go to milvus_sift100m, and run the following commands to create a table and build indexes:

$ python3 main.py --collection ann_100m_sq8 --dim 128 -c
$ python3 main.py --collection ann_100m_sq8 --index sq8 --build

Vectors imported in the next step are inserted into the table ann_100m_sq8 and indexed with the type selected by --index (sq8 in the command above).

To show the available tables and number of vectors in each table, use the following command:

# List the tables in the library
$ python3 main.py --show
# View the number of rows in table ann_100m_sq8
$ python3 main.py --collection ann_100m_sq8 --rows

4. Import data

Make sure table ann_100m_sq8 is successfully created.

Because of the large data volume, the downloaded dataset is stored in uint8 format. Before running the import, set the parameter IS_UINT8 in config.py to True.

Run the following command to import 100 million rows of data:

$ python3 main.py --collection=ann_100m_sq8 --load

All data is imported from the file in a single run.

Run the following command to check the number of rows in the table:

$ python3 main.py --collection=ann_100m_sq8 --rows

To make sure that indexes have been built for all data imported into Milvus, navigate to /home/$USER/milvus/db and enter the following command:

$ sqlite3 meta.sqlite

In sqlite3 CLI, enter the following command to check the current status:

sqlite> select * from collections;

Exit sqlite CLI:

sqlite> .quit

Go to milvus_sift100m and run the following command:

$ python3 main.py --collection=ann_100m_sq8 --index=sq8 --build 

After building the indexes manually, enter the sqlite CLI again and make sure that index building has completed for all shards. To understand the other columns, navigate to /home/$USER/milvus/db, open the database, and enter the .schema command in the sqlite CLI:

$ sqlite3 meta.sqlite
sqlite> .schema
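The same check can be scripted. The following is a minimal sketch that uses Python's standard sqlite3 module to print the column names and rows of the collections table. `dump_table` is a hypothetical helper, and the two columns in the demo database are placeholders: the real columns of meta.sqlite are whatever .schema reports.

```python
import os
import sqlite3
import tempfile
from typing import List, Tuple


def dump_table(db_path: str, table: str = "collections") -> Tuple[List[str], List[tuple]]:
    """Return (column names, rows) for one table of a SQLite database."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # Open read-only so the metadata file is never modified or created by accident.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        cursor = conn.execute(f"SELECT * FROM {table}")
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Throwaway database with placeholder columns, standing in for meta.sqlite.
    with tempfile.TemporaryDirectory() as tmp:
        demo_db = os.path.join(tmp, "meta.sqlite")
        setup = sqlite3.connect(demo_db)
        setup.execute("CREATE TABLE collections (id INTEGER, name TEXT)")
        setup.execute("INSERT INTO collections VALUES (1, 'demo')")
        setup.commit()
        setup.close()
        print(dump_table(demo_db))
```

Against a real deployment, pass the path to meta.sqlite under /home/$USER/milvus/db instead of the demo database.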

5. Accuracy test

SIFT1B provides not only a set of 10,000 query vectors, but also the top 1000 ground-truth results for each query vector, which makes it easy to calculate the accuracy of a search. The vector search accuracy of Milvus can be represented as follows:

Accuracy = Number of shared vectors (between Milvus search results and Ground truth) / (query_records * top_k)
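As a minimal sketch of this formula, the function below counts the IDs shared between each query's search results and its ground truth, then divides by query_records * top_k. The function name and the representation of results as lists of integer IDs are assumptions for illustration; the test scripts may compute the value differently.

```python
from typing import Sequence


def accuracy(results: Sequence[Sequence[int]],
             ground_truth: Sequence[Sequence[int]],
             top_k: int) -> float:
    """Shared IDs between results and ground truth, divided by query_records * top_k."""
    if len(results) != len(ground_truth):
        raise ValueError("results and ground_truth must cover the same queries")
    if not results or top_k <= 0:
        raise ValueError("need at least one query and a positive top_k")
    shared = 0
    for found, truth in zip(results, ground_truth):
        # Only the first top_k entries of each list take part in the comparison.
        shared += len(set(found[:top_k]) & set(truth[:top_k]))
    return shared / (len(results) * top_k)


if __name__ == "__main__":
    # Two queries with top_k = 3: the first shares 2 IDs, the second shares 1.
    demo_results = [[1, 2, 3], [4, 5, 6]]
    demo_truth = [[1, 3, 9], [7, 8, 4]]
    print(accuracy(demo_results, demo_truth, top_k=3))  # 3 shared / 6 = 0.5
```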

Run query script

Before the accuracy test, manually create the directories recall_result and recall_compare_out to save the test results. To test the search accuracy of the top 1, top 10, top 100, and top 200 results for 500 vectors randomly chosen from the 10,000 query vectors, go to the directory milvus_sift100m and run this command:

$ python3 main.py --collection=ann_100m_sq8 --search_param 128 --recall

Note: search_param is the nprobe value. nprobe affects both search accuracy and performance: the greater the value, the higher the accuracy, but the lower the performance.

After the command finishes, a text file named ann_sift1m_sq8_128_500_recall.txt is generated in the recall_result folder. It records the IDs and distances of the 200 most similar vectors for each of the 500 query vectors; every 200 lines in the file correspond to the result of one query. In addition, several text files are generated in the recall_compare_out folder. For example, ann_sift1m_sq8_128_500_100 records the accuracy for each of the 500 query vectors and their overall average accuracy when topk = 100.
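The 200-lines-per-query layout can be split back into per-query blocks with a small helper. This is a sketch only: `group_by_query` is a hypothetical name, and because the exact format of each line is not specified here, lines are kept as raw strings rather than parsed into IDs and distances.

```python
from typing import Iterable, Iterator, List


def group_by_query(lines: Iterable[str], per_query: int = 200) -> Iterator[List[str]]:
    """Yield consecutive blocks of per_query lines, one block per query."""
    if per_query <= 0:
        raise ValueError("per_query must be positive")
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == per_query:
            yield block
            block = []
    if block:
        # A leftover partial block means the file does not match the expected layout.
        raise ValueError(f"trailing block has {len(block)} lines, expected {per_query}")


if __name__ == "__main__":
    # Placeholder lines with 2 results per query instead of 200.
    demo = ["a\n", "b\n", "c\n", "d\n"]
    for query_index, block in enumerate(group_by_query(demo, per_query=2)):
        print(query_index, block)
```

With a real result file, pass the open file object as `lines` and leave `per_query` at 200.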

Accuracy is positively correlated with the search parameter nprobe (the number of sub-spaces searched). In this test, accuracy exceeds 90% when nprobe = 64. However, search time increases as nprobe grows.

Therefore, tune nprobe according to your data distribution and business scenario to balance accuracy against search time.

6. Performance test

To test search performance, go to the directory milvus_sift100m and run the following script:

$ python3 main.py --collection=ann_100m_sq8 --search_param 128 --performance

When execution completes, a performance folder is generated. It contains ann_100m_sq8h_32_output.csv, which records the running time for each combination of topk and nq.

  • nq - the number of query vectors
  • topk - the top k most similar vectors for the query vectors
  • total_time - the total query elapsed time (in seconds)
  • avg_time - the average time to query one vector (in seconds)
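The fields above can be turned into a throughput figure, since queries per second is nq divided by total_time. The sketch below assumes the CSV has a header row with the column names nq, topk, total_time and avg_time; that header and the helper name `queries_per_second` are assumptions, and the values in the demo are placeholders, not measurements.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def queries_per_second(rows: Iterable[Dict[str, str]]) -> List[Tuple[int, int, float]]:
    """Return (nq, topk, queries per second) for each row of timing data."""
    summary = []
    for row in rows:
        nq = int(row["nq"])
        topk = int(row["topk"])
        total_time = float(row["total_time"])  # seconds for the whole batch
        summary.append((nq, topk, nq / total_time))
    return summary


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo_csv = "nq,topk,total_time,avg_time\n100,10,2.0,0.02\n"
    for nq, topk, qps in queries_per_second(csv.DictReader(io.StringIO(demo_csv))):
        print(f"nq={nq} topk={topk} qps={qps:.1f}")
```

For the real output, replace `io.StringIO(demo_csv)` with the opened CSV file, and adjust the column names if the header differs.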

Note:

  1. In milvus_toolkit.py, nq is set to be 1, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, respectively, and topk is set to be 1, 20, 50, 100, 300, 500, 800, 1000, respectively.
  2. The first vector search takes extra time because the data must be loaded from disk into memory.
  3. It is recommended to run several performance tests back to back and use the search time of the second run. If the tests are run intermittently, the Intel CPU may drop to its base clock speed.
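The back-to-back timing advice can be sketched as a small harness that runs a search several times without pausing and reports each duration, so the second run can be read off directly. `time_runs` and the `search` callable are hypothetical; in practice `search` would wrap whatever call issues the query.

```python
import time
from typing import Callable, List


def time_runs(search: Callable[[], object], runs: int = 3) -> List[float]:
    """Run search back to back and return the elapsed seconds of each run."""
    if runs < 2:
        raise ValueError("need at least two runs to skip the warm-up run")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with the real search call.
    timings = time_runs(lambda: sum(range(100_000)), runs=3)
    # timings[0] includes warm-up cost, so report the second run.
    print(f"second run: {timings[1]:.6f} s")
```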