- Recall: retrieve a few thousand relevant items from the whole item pool.
- Pre-Ranking: a lightweight model that narrows the pool to roughly a few hundred items.
- Ranking: sophisticated ranking models that select roughly the top 50 items.
- Reranking: for extra business flexibility and performance.
Recall, also known as sensitivity, measures the proportion of actual positive cases that the model correctly identified. The goal is to capture as many relevant documents as possible, even at the expense of retrieving some non-relevant ones. In simple terms, it answers the question: "Out of all the relevant items, how many did the system successfully find?"
Precision answers: "Out of all the items the system identified as relevant, how many were actually relevant?"
Pre-ranking, or coarse ranking, narrows the candidate set from potentially millions of items to a smaller subset (e.g., hundreds or thousands) that is more likely to be relevant. This step balances efficiency and effectiveness, keeping the computational heavy lifting of the later stages manageable.
- Dense Retrieval: This approach uses deep learning models to encode both queries and documents into dense vectors in the same embedding space. The relevance of a document to a query is then estimated by the similarity between their vectors, often calculated with dot product or cosine similarity (see the sketch after this list).
- Two-stage Retrieval Systems: These systems initially retrieve a broad set of documents using a simpler, faster model (like BM25 or a basic embedding model) and then apply a more complex model to refine this set. The first stage emphasizes recall, while the second stage focuses on precision.
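Below is a rough sketch of dense-retrieval scoring, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint as the bi-encoder; any bi-encoder that maps queries and documents into the same space would work the same way.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Bi-encoder: queries and documents are embedded independently into the same space.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "BM25 is a classic lexical retrieval function.",
    "Dense retrieval encodes queries and documents into embedding vectors.",
    "The capital of France is Paris.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)                               # (num_docs, dim)
query_emb = model.encode(["what is dense retrieval?"], normalize_embeddings=True)[0]  # (dim,)

# With L2-normalized vectors, the dot product equals cosine similarity.
scores = doc_emb @ query_emb
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```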
The ranking stage involves evaluating the relevance of each item in the pre-ranked subset to the user's query with more sophisticated algorithms, often employing machine learning models. This stage aims to order the items by relevance as accurately as possible.
- Multiple Negative Ranking Loss: This is a training strategy used in machine learning models for ranking. The model is trained on positive pairs (a query and its correct response) and several negative pairs (the same query paired with incorrect responses). The loss function is designed to increase the model's output for positive pairs and decrease it for negative pairs, effectively teaching the model to distinguish between relevant and irrelevant items more accurately.
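A minimal sketch of this loss in plain PyTorch, assuming each query in a batch comes with one positive passage and the other passages in the batch act as negatives (the scale factor of 20 is an arbitrary choice for illustration):

```python
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(query_emb, pos_emb, scale=20.0):
    """In-batch multiple-negatives ranking loss.

    query_emb, pos_emb: (batch, dim) embeddings of queries and their positive passages.
    For query i, passage i is the positive and every other passage in the batch
    serves as a negative; the loss is cross-entropy over each row of similarities.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    scores = query_emb @ pos_emb.T * scale                        # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)   # the diagonal holds the positives
    return F.cross_entropy(scores, labels)

# Toy usage: random tensors stand in for encoder outputs.
loss = multiple_negatives_ranking_loss(torch.randn(8, 384), torch.randn(8, 384))
```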
Re-ranking is the final adjustment of the results before presenting them to the user. It might take into account additional context, user preferences, diversity requirements, or freshness of the content. This stage fine-tunes the list of items to ensure the top results are the most relevant and appealing to the user.
- Relevance Scoring: In re-ranking, each item-query pair is assigned a relevance score, which may incorporate a wide range of signals, including user engagement metrics, semantic relevance, and personalization factors. Advanced machine learning models, such as gradient-boosted decision trees or deep neural networks, can be used for scoring.
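As a sketch of relevance scoring in a re-ranker, the snippet below uses sentence-transformers' CrossEncoder with the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint (one option among many); a gradient-boosted model over engagement and personalization features could fill the same role.

```python
from sentence_transformers import CrossEncoder

query = "how do rerankers work"
candidates = [
    "A cross-encoder scores a query-document pair jointly.",
    "Paris is the capital of France.",
    "Re-ranking reorders a short candidate list with a stronger model.",
]

# The cross-encoder reads each (query, document) pair together and outputs a relevance score.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

for doc, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```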
Discounted cumulative gain (DCG) is a measure of ranking quality (see Wikipedia):

$$ \mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)} $$

where $rel_i$ is the graded relevance of the result at position $i$. The normalized variant (nDCG) divides by the DCG of the ideal ordering:

$$ \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} $$

where $\mathrm{IDCG}_p$ is the DCG of the results sorted by decreasing relevance.
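A small sketch computing DCG and nDCG over a list of graded relevance labels (the labels are made up):

```python
import math

def dcg(relevances):
    """DCG for a ranked list of graded relevance labels (best-ranked item first)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """nDCG = DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))  # graded labels in the order the system ranked them
```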
MAP is a measure of precision averaged across all relevant documents and queries, accounting for the order of the retrieved documents (see tutorial).
For a single query, Average Precision (AP) is calculated as:

$$ \mathrm{AP} = \frac{\sum_{k=1}^{n} P(k) \cdot \mathrm{rel}(k)}{\text{number of relevant documents}} $$

where
- P(k) is the precision at cut-off k in the list
- rel(k) is an indicator function equaling 1 if the item at rank k is relevant, otherwise 0
- n is the number of retrieved documents.
For multiple queries, MAP is the mean of the per-query AP values (see tutorial).
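A minimal sketch of AP and MAP, assuming each query is represented by its ranked 0/1 relevance labels plus its total number of relevant documents:

```python
def average_precision(ranked_rels, num_relevant):
    """AP for one query. ranked_rels[k-1] is 1 if the item at rank k is relevant, else 0."""
    hits, ap = 0, 0.0
    for k, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            ap += hits / k          # P(k), accumulated only at relevant ranks
    return ap / num_relevant if num_relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked_rels, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

print(mean_average_precision([([1, 0, 1, 0], 2), ([0, 1, 0, 0], 3)]))  # 0.5
```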
Paper: <Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents>
Training
- Randomly sample 10K queries from the MS MARCO training set; each query is retrieved by BM25 with 20 candidate passages.
- Three types of instructions are used.
- To handle the context length problem, use a sliding window (see the sketch after this list).
- Training objectives: RankNet is a pairwise loss that measures the correctness of relative passage orders. From M candidates we can construct M(M − 1)/2 pairs.

$$ \mathcal{L}_{\text{RankNet}} = \sum_{i=1}^{M} \sum_{j=1}^{M} \mathbf{1}_{r_i < r_j} \log(1 + \exp(s_i - s_j)) $$
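Here is a sketch of the sliding-window strategy from the list above; `rerank_window` is a placeholder for whatever call (e.g. an LLM permutation-generation prompt) reorders a single window, and the window/step sizes are illustrative defaults.

```python
def sliding_window_rerank(passages, rerank_window, window_size=20, step=10):
    """Re-rank a long candidate list with a model that only sees `window_size` passages at a time.

    A single pass slides from the back of the list to the front, so promising passages
    can bubble up through the overlapping windows. `rerank_window` takes a list of
    passages and returns the same passages reordered.
    """
    passages = list(passages)
    end = len(passages)
    while True:
        start = max(0, end - window_size)
        passages[start:end] = rerank_window(passages[start:end])
        if start == 0:
            break
        end -= step
    return passages

# Toy usage: Python's `sorted` stands in for the LLM call, on a reversed list of 30 "passages".
print(sliding_window_rerank(list(range(30))[::-1], rerank_window=sorted))
```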
Benchmark
- TREC: (i) TREC-DL19 contains 43 queries, (ii) TREC-DL20 contains 54 queries
- BEIR: Covid, NFCorpus, Touche, DBPedia, SciFact, Signal, News, Robust04
- Mr.TyDi: uses the first 100 samples in the test set of each language
- topics: Queries used to evaluate IR system performance. Ex:
19335: {'title': 'anthropological definition of environment'}
- Qrels: short for "query relevance judgments". Ex:
{19335: {1017759: '0', 1082489: '0'}}
- 19335 is the query ID
- 1017759 and 1082489 are the document IDs
- '0'-'3' is the graded relevance score; '0' means not relevant
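To make the qrels format concrete, here is a small sketch (the extra document ID 8412684 is invented) that turns a ranked run into graded labels, which is exactly the input the nDCG and MAP snippets above expect:

```python
# Qrels map a query id to {doc id: graded relevance label} (labels are strings, as in the example above).
qrels = {19335: {1017759: '0', 1082489: '0', 8412684: '2'}}   # 8412684 is an invented relevant doc

def run_to_labels(qrels, query_id, ranked_doc_ids):
    """Turn a ranked run into graded labels; unjudged documents count as 0."""
    judged = qrels.get(query_id, {})
    return [int(judged.get(doc_id, 0)) for doc_id in ranked_doc_ids]

print(run_to_labels(qrels, 19335, [8412684, 1017759, 9999999]))  # [2, 0, 0]
```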
Paper: <RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses>
Paper: <TSRankLLM: A Two-Stage Adaptation of LLMs for Text Ranking>
- Continual pre-training (CPT) of LLMs on a large-scale weakly supervised corpus.
- SFT with RankLLaMA
- https://github.com/frutik/awesome-search
- https://github.com/snexus/llm-search
- Retrieval Augmented Generation with Huggingface Transformers and Ray
- https://www.pinecone.io/learn/series/rag/rerankers/
- https://github.com/ielab/llm-rankers
- https://www.sbert.net/examples/applications/retrieve_rerank/README.html#re-ranker-cross-encoder
- https://github.com/leptonai/search_with_lepton
- https://txt.cohere.com/using-llms-for-search/
- https://developers.google.com/machine-learning/recommendation
- https://github.com/SciPhi-AI/agent-search