Multilingual Search with multilingual embeddings

This sample application demonstrates multilingual search using multilingual embeddings.

Read the blog post.

Quick start

The following is a recipe for getting this application up and running.

  • Docker Desktop installed and running. A minimum of 4 GB memory available for Docker is recommended. Refer to Docker memory for details and troubleshooting.
  • Alternatively, deploy using Vespa Cloud.
  • Operating system: Linux, macOS, or Windows 10 Pro (Docker requirement).
  • Architecture: x86_64 or arm64.
  • Homebrew to install the Vespa CLI, or download a Vespa CLI release from GitHub releases.

Validate the Docker resource settings, which should be a minimum of 4 GB:

$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"

Install Vespa CLI:

$ brew install vespa-cli

For local deployment using the Docker image:

$ vespa config set target local

Pull and start the Vespa Docker container image:

$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa
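
If you use Podman (as in the memory check above), the equivalent is expected to be the following; note the fully qualified image name. This command is an untested sketch:

$ podman run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  docker.io/vespaengine/vespa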

Verify that the configuration service (deploy API) is ready:

$ vespa status deploy --wait 300

Download this sample application:

$ vespa clone multilingual-search my-app && cd my-app

The embedder configuration in this sample app's services.xml points to a quantized version of the embedding model.

Alternatively, export your own model; see also the export script in the simple-semantic-search sample app.
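
As an illustration, a Hugging Face embedder component in services.xml looks roughly like the sketch below; the component id and the model URLs are placeholders, not copied from this app:

<container id="default" version="1.0">
    <!-- Sketch of a hugging-face-embedder component; id and URLs are placeholders -->
    <component id="e5" type="hugging-face-embedder">
        <transformer-model url="https://example.com/multilingual-e5-small-quantized.onnx"/>
        <tokenizer-model url="https://example.com/tokenizer.json"/>
    </component>
    <document-api/>
    <search/>
</container>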

Deploy the application:

$ vespa deploy --wait 300
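
After deployment, you can also wait for the application's container endpoint to come up before feeding documents:

$ vespa status --wait 300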

Deployment note

It is possible to deploy this app to Vespa Cloud.

Evaluation

The following reproduces the results reported on the MIRACL Swahili (sw) dataset.

Install trec_eval:

$ git clone --depth 1 --branch v9.0.8 https://github.com/usnistgov/trec_eval && cd trec_eval && make install && cd ..

Index the dataset; this also embeds the texts and is compute-intensive. On an M1 laptop, this step takes about 1052 seconds (125 operations/s).

$ zstdcat ext/sw-feed.jsonl.zst | vespa feed -
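
To see what the feed format looks like before indexing everything, you can print a single document from the compressed feed; the field names shown are defined by this app's schema:

$ zstdcat ext/sw-feed.jsonl.zst | head -1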

The evaluation script queries Vespa; it requires the pandas and requests libraries:

$ pip3 install pandas requests

E5 multilingual embedding model

Using the multilingual embedding model:

$ python3 ext/evaluate.py --endpoint http://localhost:8080/search/ \
 --query_file ext/topics.miracl-v1.0-sw-dev.tsv \
 --ranking semantic --hits 100 --language sw
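
The script issues nearestNeighbor queries over the document embeddings. A hand-written equivalent with the Vespa CLI could look roughly like the sketch below; the document type doc, the embedding field and query tensor names, and the "query: " prefix expected by the E5 models are assumptions about this app, and the query text is illustrative:

$ vespa query 'yql=select * from doc where {targetHits: 100}nearestNeighbor(embedding, q)' \
  'input.query(q)=embed("query: mji mkuu wa Kenya")' \
  'ranking=semantic' 'language=sw' 'hits=100'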
 

Compute NDCG@10 using trec_eval with the dev relevance judgments:

$ trec_eval -mndcg_cut.10 ext/qrels.miracl-v1.0-sw-dev.tsv semantic.run

This should produce the following:

ndcg_cut_10           	all 	0.6848

BM25

Using traditional keyword search with BM25 ranking:

$ python3 ext/evaluate.py --endpoint http://localhost:8080/search/ \
 --query_file ext/topics.miracl-v1.0-sw-dev.tsv \
 --ranking bm25 --hits 100 --language sw
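
A hand-written equivalent keyword query uses userQuery() with the bm25 rank profile; as in the sketch above, the document type and query text are illustrative:

$ vespa query 'yql=select * from doc where userQuery()' \
  'query=mji mkuu wa Kenya' \
  'ranking=bm25' 'language=sw' 'hits=100'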
 

Compute NDCG@10 using trec_eval with the same relevance judgments:

$ trec_eval -mndcg_cut.10 ext/qrels.miracl-v1.0-sw-dev.tsv bm25.run
ndcg_cut_10           	all	0.424

Cleanup

Tear down the running container:

$ docker rm -f vespa