After fine-tuning the model, it is essential to evaluate its performance. To facilitate this process, we have provided scripts for assessing the model on various datasets. These datasets include: MTEB, BEIR, MSMARCO, MIRACL, MLDR, MKQA, AIR-Bench, and your custom datasets.
To evaluate the model on a specific dataset, you can find the corresponding bash scripts in the respective folders dedicated to each dataset. These scripts contain the necessary commands and configurations to run the evaluation process.
This document serves as an overview of the evaluation process and provides a brief introduction to each dataset.
In this section, we will first introduce the commonly used arguments across all datasets. Then, we will provide a more detailed explanation of the specific arguments used for each individual dataset.
Arguments for evaluation setup:

- `eval_name`: Name of the evaluation task (e.g., msmarco, beir, miracl).
- `dataset_dir`: Path to the dataset directory. This can be:
  - A local path to perform evaluation on your dataset (must exist). It should contain:
    - `corpus.jsonl`
    - `<split>_queries.jsonl`
    - `<split>_qrels.jsonl`
  - A path to store datasets downloaded via API. Provide `None` to use the cache directory.
- `force_redownload`: Set to `True` to force re-download of the dataset. Default is `False`.
- `dataset_names`: List of dataset names to evaluate, or `None` to evaluate all available datasets. This can be a dataset name (BEIR, etc.) or a language (MIRACL, etc.).
- `splits`: Dataset splits to evaluate. Default is `test`.
- `corpus_embd_save_dir`: Directory to save corpus embeddings. If `None`, embeddings will not be saved.
- `output_dir`: Directory to save evaluation results.
- `search_top_k`: Top-K results for initial retrieval. Default is `1000`.
- `rerank_top_k`: Top-K results for reranking. Default is `100`.
- `cache_path`: Cache directory for datasets. Default is `None`.
- `token`: Token used for accessing private data (datasets/models) on HF. Default is `None`, which means the environment variable `HF_TOKEN` will be used.
- `overwrite`: Set to `True` to overwrite existing evaluation results. Default is `False`.
- `ignore_identical_ids`: Set to `True` to ignore identical IDs in search results. Default is `False`.
- `k_values`: List of K values for evaluation (e.g., [1, 3, 5, 10, 100, 1000]). Default is `[1, 3, 5, 10, 100, 1000]`.
- `eval_output_method`: Format for outputting evaluation results (options: 'json', 'markdown'). Default is `markdown`.
- `eval_output_path`: Path to save the evaluation output.
- `eval_metrics`: Metrics used for evaluation (e.g., ['ndcg_at_10', 'recall_at_10']). Default is `[ndcg_at_10, recall_at_100]`.
Arguments for Model Configuration:

- `embedder_name_or_path`: The name or path of the embedder.
- `embedder_model_class`: Class of the model used for embedding (current options include 'encoder-only-base', 'encoder-only-m3', 'decoder-only-base', 'decoder-only-icl'). Default is `None`. For a custom model, you should set this argument.
- `normalize_embeddings`: Set to `True` to normalize embeddings.
- `pooling_method`: The pooling method of the embedder.
- `use_fp16`: Use FP16 precision for inference.
- `devices`: List of devices used for inference.
- `query_instruction_for_retrieval`, `query_instruction_format_for_retrieval`: Instruction and instruction format for queries during retrieval.
- `examples_for_task`, `examples_instruction_format`: Examples for the task and their instruction format.
- `trust_remote_code`: Set to `True` to trust remote code execution.
- `reranker_name_or_path`: Name or path of the reranker.
- `reranker_model_class`: Reranker model class (options include 'encoder-only-base', 'decoder-only-base', 'decoder-only-layerwise', 'decoder-only-lightweight'). Default is `None`. For a custom model, you should set this argument.
- `reranker_peft_path`: Path to the PEFT (parameter-efficient fine-tuning) adapter of the reranker.
- `use_bf16`: Use BF16 precision for inference.
- `query_instruction_for_rerank`, `query_instruction_format_for_rerank`: Instruction and instruction format for queries during reranking.
- `passage_instruction_for_rerank`, `passage_instruction_format_for_rerank`: Instruction and instruction format for passages during reranking.
- `cache_dir`: Cache directory for models.
- `embedder_batch_size`, `reranker_batch_size`: Batch sizes for embedding and reranking.
- `embedder_query_max_length`, `embedder_passage_max_length`: Maximum lengths for embedding queries and passages.
- `reranker_query_max_length`, `reranker_max_length`: Maximum length for reranking queries and maximum total length for reranking.
- `normalize`: Normalize the reranking scores.
- `prompt`: Prompt for the reranker.
- `cutoff_layers`, `compress_ratio`, `compress_layers`: Arguments for configuring the output and compression of layerwise or lightweight rerankers.
Notice: If you evaluate your own model, please set `embedder_model_class` and `reranker_model_class`.
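For example, evaluating a custom embedder and reranker on MSMARCO might look like the following. This is a minimal sketch, not an official recipe: the model paths are placeholders, and the class names are taken from the option lists above, assuming the custom embedder follows the BGE-M3 architecture and the custom reranker is a standard cross-encoder. Pick whichever classes match your models.

```shell
# The model paths below are placeholders for your own checkpoints;
# choose embedder_model_class / reranker_model_class from the options listed above.
python -m FlagEmbedding.evaluation.msmarco \
--eval_name msmarco \
--dataset_dir ./msmarco/data \
--dataset_names passage \
--splits dev \
--output_dir ./msmarco/search_results \
--embedder_name_or_path ./your_custom_embedder \
--embedder_model_class encoder-only-m3 \
--reranker_name_or_path ./your_custom_reranker \
--reranker_model_class encoder-only-base \
--devices cuda:0 \
--cache_dir ./cache/model
```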
You need to install `pytrec_eval` and `faiss` for evaluation:
pip install pytrec_eval
pip install https://github.com/kyamagu/faiss-wheels/releases/download/v1.7.3/faiss_gpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
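You can optionally check that both packages are importable before running an evaluation (a quick sanity check only):

```shell
# Both imports should succeed; this prints the installed faiss version.
python -c "import faiss, pytrec_eval; print(faiss.__version__)"
```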
For MTEB, we primarily use the official MTEB code, which only supports the assessment of embedders. Moreover, it restricts the output format of the evaluation results to JSON. We have introduced the following new arguments:
- `languages`: Languages to evaluate. Default: `eng`
- `tasks`: Tasks to evaluate. Default: `None`
- `task_types`: The task types to evaluate. Default: `None`
- `use_special_instructions`: Whether to use specific instructions in `prompts.py` for evaluation. Default: `False`
- `examples_path`: Path to the specific examples to use for evaluation. Default: `None`
Here is an example for evaluation:
pip install mteb==1.15.0
python -m FlagEmbedding.evaluation.mteb \
--eval_name mteb \
--output_dir ./data/mteb/search_results \
--languages eng \
--tasks NFCorpus BiorxivClusteringS2S SciDocsRR \
--eval_output_path ./mteb/mteb_eval_results.json \
--embedder_name_or_path BAAI/bge-m3 \
--devices cuda:7 \
--cache_dir ./cache/model
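If the embedder you are evaluating relies on task-specific instructions, you could additionally apply the instructions defined in `prompts.py`. The following is a sketch rather than an official recipe: the model path is a placeholder for your instruction-following embedder, and the boolean flag is assumed to be passed the same way as the other boolean arguments in the examples below (e.g., `--overwrite False`):

```shell
# The embedder path is a placeholder; --use_special_instructions enables the instructions in prompts.py.
python -m FlagEmbedding.evaluation.mteb \
--eval_name mteb \
--output_dir ./data/mteb/search_results \
--languages eng \
--tasks NFCorpus \
--use_special_instructions True \
--eval_output_path ./mteb/mteb_eval_results.json \
--embedder_name_or_path your_instruction_tuned_embedder \
--devices cuda:0 \
--cache_dir ./cache/model
```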
BEIR supports evaluations on datasets including `arguana`, `climate-fever`, `cqadupstack`, `dbpedia-entity`, `fever`, `fiqa`, `hotpotqa`, `msmarco`, `nfcorpus`, `nq`, `quora`, `scidocs`, `scifact`, `trec-covid`, and `webis-touche2020`, with `msmarco` as the dev set and all others as test sets. The following new argument has been introduced:

- `use_special_instructions`: Whether to use specific instructions in `prompts.py` for evaluation. Default: `False`
Here is an example for evaluation:
pip install beir
mkdir eval_beir
cd eval_beir
python -m FlagEmbedding.evaluation.beir \
--eval_name beir \
--dataset_dir ./beir/data \
--dataset_names fiqa arguana cqadupstack \
--splits test dev \
--corpus_embd_save_dir ./beir/corpus_embd \
--output_dir ./beir/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite False \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./beir/beir_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--ignore_identical_ids True \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
MSMARCO supports evaluations on both `passage` and `document`, providing the evaluation splits `dev`, `dl19`, and `dl20` for each.
Here is an example for evaluation:
python -m FlagEmbedding.evaluation.msmarco \
--eval_name msmarco \
--dataset_dir ./msmarco/data \
--dataset_names passage \
--splits dev dl19 dl20 \
--corpus_embd_save_dir ./msmarco/corpus_embd \
--output_dir ./msmarco/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite True \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./msmarco/msmarco_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
MIRACL supports evaluations in multiple languages. We utilize different languages as dataset names, including `ar`, `bn`, `en`, `es`, `fa`, `fi`, `fr`, `hi`, `id`, `ja`, `ko`, `ru`, `sw`, `te`, `th`, `zh`, `de`, and `yo`. For the languages `de` and `yo`, the supported split is `dev`, while for the rest, the supported splits are `train` and `dev`.
Here is an example for evaluation:
python -m FlagEmbedding.evaluation.miracl \
--eval_name miracl \
--dataset_dir ./miracl/data \
--dataset_names bn hi sw te th yo \
--splits dev \
--corpus_embd_save_dir ./miracl/corpus_embd \
--output_dir ./miracl/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite False \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./miracl/miracl_eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
MLDR supports evaluations in multiple languages. We use different languages as dataset names, including `ar`, `de`, `en`, `es`, `fr`, `hi`, `it`, `ja`, `ko`, `pt`, `ru`, `th`, and `zh`. The available splits are `train`, `dev`, and `test`.
Here is an example for evaluation:
python -m FlagEmbedding.evaluation.mldr \
--eval_name mldr \
--dataset_dir ./mldr/data \
--dataset_names hi \
--splits test \
--corpus_embd_save_dir ./mldr/corpus_embd \
--output_dir ./mldr/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite False \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./mldr/mldr_eval_results.md \
--eval_metrics ndcg_at_10 \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
MKQA supports cross-lingual retrieval evaluation (from the BGE-M3 paper), using different languages as dataset names, including `en`, `ar`, `fi`, `ja`, `ko`, `ru`, `es`, `sv`, `he`, `th`, `da`, `de`, `fr`, `it`, `nl`, `pl`, `pt`, `hu`, `vi`, `ms`, `km`, `no`, `tr`, `zh_cn`, `zh_hk`, and `zh_tw`. The supported split is `test`.
Here is an example for evaluation:
python -m FlagEmbedding.evaluation.mkqa \
--eval_name mkqa \
--dataset_dir ./mkqa/data \
--dataset_names en zh_cn \
--splits test \
--corpus_embd_save_dir ./mkqa/corpus_embd \
--output_dir ./mkqa/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite False \
--k_values 20 \
--eval_output_method markdown \
--eval_output_path ./mkqa/mkqa_eval_results.md \
--eval_metrics qa_recall_at_20 \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
AIR-Bench evaluation is primarily based on the official AIR-Bench repository and requires the use of its official evaluation code. Below are some important arguments:

- `benchmark_version`: Benchmark version.
- `task_types`: Task types to evaluate.
- `domains`: Domains to evaluate.
- `languages`: Languages to evaluate.
Here is an example for evaluation:
pip install air-benchmark
python -m FlagEmbedding.evaluation.air_bench \
--benchmark_version AIR-Bench_24.05 \
--task_types qa long-doc \
--domains arxiv \
--languages en \
--splits dev test \
--output_dir ./air_bench/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_dir ./cache/data \
--overwrite False \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--model_cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024
To evaluate your own custom dataset, prepare the data in the following format. The example data for `corpus.jsonl`:
{"id": "566392", "title": "", "text": "Have the check reissued to the proper payee."}
{"id": "65404", "title": "", "text": "Just have the associate sign the back and then deposit it. It's called a third party cheque and is perfectly legal. I wouldn't be surprised if it has a longer hold period and, as always, you don't get the money if the cheque doesn't clear. Now, you may have problems if it's a large amount or you're not very well known at the bank. In that case you can have the associate go to the bank and endorse it in front of the teller with some ID. You don't even technically have to be there. Anybody can deposit money to your account if they have the account number. He could also just deposit it in his account and write a cheque to the business."}
{"id": "325273", "title": "", "text": "Sure you can. You can fill in whatever you want in the From section of a money order, so your business name and address would be fine. The price only includes the money order itself. You can hand deliver it yourself if you want, but if you want to mail it, you'll have to provide an envelope and a stamp. Note that, since you won't have a bank record of this payment, you'll want to make sure you keep other records, such as the stub of the money order. You should probably also ask the contractor to give you a receipt."}
{"id": "88124", "title": "", "text": "You're confusing a lot of things here. Company B LLC will have it's sales run under Company A LLC, and cease operating as a separate entity These two are contradicting each other. If B LLC ceases to exist - it is not going to have it's sales run under A LLC, since there will be no sales to run for a non-existent company. What happens is that you merge B LLC into A LLC, and then convert A LLC into S Corp. So you're cancelling the EIN for B LLC, you're cancelling the EIN for A LLC - because both entities cease to exist. You then create a EIN for A Corp, which is the converted A LLC, and you create a DBA where A Corp DBA B Shop. You then go to the bank and open the account for A Corp DBA B Shop with the EIN you just created for A Corp. Get a better accountant. Before you convert to S-Corp."}
{"id": "285255", "title": "", "text": "\"I'm afraid the great myth of limited liability companies is that all such vehicles have instant access to credit. Limited liability on a company with few physical assets to underwrite the loan, or with insufficient revenue, will usually mean that the owners (or others) will be asked to stand surety on any credit. However, there is a particular form of \"\"credit\"\" available to businesses on terms with their clients. It is called factoring. Factoring is a financial transaction whereby a business sells its accounts receivable (i.e., invoices) to a third party (called a factor) at a discount in exchange for immediate money with which to finance continued business. Factoring differs from a bank loan in three main ways. First, the emphasis is on the value of the receivables (essentially a financial asset), not the firm’s credit worthiness. Secondly, factoring is not a loan – it is the purchase of a financial asset (the receivable). Finally, a bank loan involves two parties whereas factoring involves three. Recognise that this can be quite expensive. Most banks catering to small businesses will offer some form of factoring service, or will know of services that offer it. It isn't that different from cheque encashment services (pay-day services) where you offer a discount on future income for money now. An alternative is simply to ask his clients if they'll pay him faster if he offers a discount (since either of interest payments or factoring would reduce profitability anyway).\""}
{"id": "350819", "title": "", "text": "Banks will usually look at 2 years worth of tax returns for issuing business credit. If those aren't available (for instance, for recently formed businesses), they will look at the personal returns of the owners. Unfortunately, it sounds like your friend is in the latter category. Bringing in another partner isn't necessarily going to help, either; with only two partners / owners, the bank would probably look at both owners' personal tax returns and credit histories. It may be necessary to offer collateral. I'm sorry I can't offer any better solutions, but alternative funding such as personal loans from family & friends could be necessary. Perhaps making them partners in exchange for capital."}
The example data for `test_queries.jsonl`:
{"id": "8", "text": "How to deposit a cheque issued to an associate in my business into my business account?"}
{"id": "15", "text": "Can I send a money order from USPS as a business?"}
{"id": "18", "text": "1 EIN doing business under multiple business names"}
{"id": "26", "text": "Applying for and receiving business credit"}
The example data for `test_qrels.jsonl`:
{"qid": "8", "docid": "566392", "relevance": 1}
{"qid": "8", "docid": "65404", "relevance": 1}
{"qid": "15", "docid": "325273", "relevance": 1}
{"qid": "18", "docid": "88124", "relevance": 1}
{"qid": "26", "docid": "285255", "relevance": 1}
{"qid": "26", "docid": "350819", "relevance": 1}
Please put the above files (`corpus.jsonl`, `test_queries.jsonl`, `test_qrels.jsonl`) in `dataset_dir`, and then you can use the following command:
python -m FlagEmbedding.evaluation.custom \
--eval_name your_data_name \
--dataset_dir ./your_data_path \
--splits test \
--corpus_embd_save_dir ./your_data_name/corpus_embd \
--output_dir ./your_data_name/search_results \
--search_top_k 1000 \
--rerank_top_k 100 \
--cache_path ./cache/data \
--overwrite False \
--k_values 10 100 \
--eval_output_method markdown \
--eval_output_path ./your_data_name/eval_results.md \
--eval_metrics ndcg_at_10 recall_at_100 \
--embedder_name_or_path BAAI/bge-m3 \
--reranker_name_or_path BAAI/bge-reranker-v2-m3 \
--devices cuda:0 cuda:1 \
--cache_dir ./cache/model \
--reranker_query_max_length 512 \
--reranker_max_length 1024