This repository contains a set of Jupyter notebooks that walk you through the necessary steps to compare two search engines (a BM25-based one and a vector-based one) against each other.
This example compares an Elasticsearch-based e-commerce search setting ("Chorus - the Elasticsearch edition") with a Jina AI-based vector search setting.
The methodology is explained in the blog post "How to Compare Vector Search to Traditional Search for E-Commerce".
- Jupyter
- Docker
- JCloud account with Jina AI (registration was free at the time this repository was created, see https://jina.ai/)
- trec_eval command line tool (see https://aldolipani.com/trec_eval-installation-usage-and-behaviour/ for installation help)
The notebooks in the folder notebooks contain all necessary code with additional explanations. They are numbered in the order you should execute them to follow the methodology.
This notebook deploys a flow to JCloud, downloads the product data, creates embeddings and indexes it.
Queries are extracted from the provided ratings file and the ratings are transformed to the format trec_eval can work with.
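As a rough illustration of that transformation, the sketch below builds qrels lines in the four-column layout trec_eval reads (query id, iteration, document id, relevance). The column names `query`, `doc_id` and `rating`, the sample rows, and the helper `ratings_to_qrels` are assumptions made for this example; the actual ratings file in the notebooks may be structured differently.

```python
import csv
import io


def ratings_to_qrels(rows):
    """Turn (query, doc_id, rating) rows into trec_eval qrels lines."""
    query_ids = {}  # query string -> numeric id, in order of first appearance
    lines = []
    for query, doc_id, rating in rows:
        qid = query_ids.setdefault(query, len(query_ids) + 1)
        # qrels columns: query id, iteration (conventionally 0), doc id, relevance
        lines.append(f"{qid} 0 {doc_id} {int(rating)}")
    return query_ids, lines


# Hypothetical ratings data: the real file's columns may differ.
sample = io.StringIO(
    "query,doc_id,rating\nlaptop,p1,3\nlaptop,p2,0\nusb cable,p9,2\n"
)
reader = csv.DictReader(sample)
query_ids, qrels = ratings_to_qrels(
    (row["query"], row["doc_id"], row["rating"]) for row in reader
)
print("\n".join(qrels))
```

The returned `query_ids` mapping doubles as the list of extracted queries, so the same ids can be reused when result lists are written later.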
Results from the Jina AI setting are retrieved from JCloud.
Results from the Elasticsearch-based setting are retrieved from Chorus, the Elasticsearch edition.
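Whichever engine the results come from, trec_eval expects them in a run file with six columns per line: query id, the literal `Q0`, document id, rank, score and a run tag. The following is a minimal sketch of that formatting step, not the notebooks' actual code; the helper `to_run_lines`, the document ids and the run tag are hypothetical.

```python
def to_run_lines(qid, doc_ids, run_tag):
    """Format one ranked result list as trec_eval run lines."""
    lines = []
    for rank, doc_id in enumerate(doc_ids, start=1):
        # Assumption: a synthetic score that decreases with rank, used here
        # because the example has no engine scores to pass through.
        score = 1.0 / rank
        lines.append(f"{qid} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical ranked product ids for query 1 from the BM25 engine.
run = to_run_lines(1, ["p1", "p7"], "bm25")
print("\n".join(run))
```

If the engine returns its own relevance scores, writing those instead of the synthetic `1.0 / rank` keeps the run file closer to what was actually retrieved.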
The two result sets are compared via trec_eval. The metric used is nDCG@10.
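To make the metric concrete, here is a small sketch of nDCG@10 for a single query. It uses one common formulation (the rating itself as gain, a log2 rank discount) and is meant as an illustration only; the numbers that count are the ones trec_eval reports. The function names and sample ratings are made up for this example.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_judged_gains, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Ratings of the returned products in rank order (unjudged products count as 0),
# followed by all ratings known for the query.
score = ndcg_at_k([0, 3], [3, 0])
print(round(score, 4))
```

In this example the only relevant product is returned at rank 2 instead of rank 1, so the score drops below 1.0 by exactly the rank-2 discount.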
Check the trec_eval results for statistical significance.
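One way such a check could look is a paired randomization (sign-flip) test on per-query nDCG@10 scores, for example the per-query values printed by `trec_eval -q`. This is a sketch under the assumption that both engines were scored on the same queries in the same order; the notebook may use a different test, and the scores below are invented for illustration.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the mean per-query difference between two systems."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each per-query difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    return hits / trials


# Invented per-query nDCG@10 scores for ten queries.
vector_scores = [0.8, 0.7, 0.9, 0.6, 0.75, 0.85, 0.9, 0.65, 0.7, 0.8]
bm25_scores = [0.5, 0.4, 0.6, 0.3, 0.45, 0.55, 0.6, 0.35, 0.4, 0.5]
p_value = paired_randomization_test(vector_scores, bm25_scores)
print(p_value)
```

A small p-value suggests the difference in mean nDCG@10 is unlikely to be an artifact of which queries happened to be in the sample; with only a handful of queries the test has little power either way.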
- Clone this repository
git clone https://github.com/o19s/vector-search-evaluation.git
- Clone the Chorus (Elasticsearch edition) repository
git clone https://github.com/querqy/chorus-elasticsearch-edition.git
- Run the quickstart script to have Chorus up and running
cd chorus-elasticsearch-edition
./quickstart.sh
Alternatively, you can run the quickstart with the option -lab and go through the first Kata to get familiar with the Chorus stack. It guides you through the steps to optimize a query via search management:
./quickstart.sh -lab
- Run Jupyter in the repository directory
cd ../vector-search-evaluation
jupyter notebook
- Access the first notebook and run through the cells
By default, Jupyter runs on port 8888. Open Jupyter in your browser and navigate to the first notebook (http://localhost:8888/notebooks/notebooks/1.%20JCloud%20Deployment.ipynb if Jupyter is running on the default port).