MP4 Neural Ranking Methods

Due Nov 13th, 2022 at 11:59pm

In this MP you will explore how to use neural network based methods for ranking in information retrieval. The goal of this assignment is to get you familiar with some popular approaches for applying neural methods to information retrieval. For each of the two portions of the assignment you need to generate candidate rankings and a discussion in which you evaluate performance and the factors that may be impacting it. To complete this assignment you will submit your code, 3 candidate ranking files, and 2 discussion files. Details on what to include in each file can be found in the deliverables sections.

IMPORTANT: MAKE SURE TO READ THIS

For each of the candidate ranking files, create the ranking for the top 20 documents only! Otherwise your files will be very large! Each candidate ranking should match the TREC format, where each ranking is on its own line with the format #QUERY_ID\t0\tDOCUMENT_ID\tRANK\tSCORE\trun_id. Remember that query IDs start at 1, document IDs start at 0, and ranks start at 0. The score and run_id do not impact scoring but can be useful for debugging.
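For example, a single tab-separated line for the first query's top-ranked document (the document ID, score, and run_id here are hypothetical) might look like:

```
1	0	57	0	12.73	my-run
```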

Retrieval Using Bi-Encoders and a Vector Database

While term-based retrieval methods are effective, they commonly struggle to capture the semantic differences that are common in text. Bi-encoders (also called dual encoders) are a popular method because they are incredibly efficient. Using sentence transformers and the FAISS approximate nearest neighbor index, you will search the CS410 document corpus semantically. The goal of this portion of the assignment is to retrieve documents using a bi-encoder and a vector database and produce a candidate ranking. Please submit a few candidate rankings and associated scores for the queries and collection found in the data folder. What differences in relevance do you see? How wide are the variations? Your output ranking should match the TREC format (#QUERY_ID\t0\tDOCUMENT_ID\tRANK\tSCORE\trun_id), and you can use the TRECEVAL notebook to explore how well different models perform.

To help you explore how bi-encoders work and how they can be used with vector databases, check out the Bi-Encoder notebook. Details on how to format submissions and evaluate them can be found in the Evaluation notebook. To avoid long inference times we have gone ahead and generated a few sets of embeddings for the document and query corpus. If you want to explore further and look for other models, they can be found in the sentence transformers library.
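As a starting point, here is a minimal sketch of bi-encoder retrieval with sentence transformers and FAISS. The model name, toy in-memory corpus, and run_id are illustrative assumptions; in practice you would load the provided embeddings or the documents and queries from the data folder.

```python
# A minimal sketch, assuming the "all-MiniLM-L6-v2" model and a placeholder corpus.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["text of document 0", "text of document 1"]  # placeholder corpus
queries = ["example query"]                               # placeholder queries

# Encode and L2-normalize so inner product equals cosine similarity.
doc_emb = model.encode(documents, convert_to_numpy=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_numpy=True, normalize_embeddings=True)

# Build an exact inner-product index (an ANN index such as IndexIVFFlat is used the same way).
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb.astype(np.float32))

k = min(20, len(documents))  # top 20 documents only
scores, doc_ids = index.search(query_emb.astype(np.float32), k)

# Write one TREC-format line per retrieved document
# (query IDs start at 1, document IDs and ranks at 0).
with open("MP4.1-candidate-ranking.trec", "w") as f:
    for qid, (q_scores, q_docs) in enumerate(zip(scores, doc_ids), start=1):
        for rank, (score, doc_id) in enumerate(zip(q_scores, q_docs)):
            f.write(f"{qid}\t0\t{doc_id}\t{rank}\t{score}\tbi-encoder-run\n")
```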

Deliverables

For this portion of the assignment you will submit 2 things: a candidate ranking file matching the TREC format discussed above, and a brief paragraph in which you discuss which model or representation you used, its performance, and how its relevance varies with depth. For each of the candidate ranking files, create the ranking for the top 20 documents only! The files for this part of the submission should be named as follows:

  1. MP4.1-candidate-ranking.trec
  2. MP4.1-discussion.txt

Reranking using Cross Encoders

While bi-encoders can be effective, they do not always produce the most relevant candidate sets. A common approach to improve this is to use a cross encoder to rerank the candidate sets generated by either bi-encoders or term-based systems like BM25. A brief summary of how to use a cross encoder can be found in the CrossEncoder notebook. In part 1 you generated candidate rankings using bi-encoders. Using that set, and the BM25 candidates found in data/bm25-top1000, you will improve the ranking using a cross encoder. For this portion of the assignment you will need to write a reranking function that uses a cross encoder to rerank a given set of documents for a query. Running a cross encoder is more computationally expensive than running a bi-encoder, so pick a reasonable depth at which you will rerank the candidates. Try out some of the cross encoders found in the hub. How does performance differ? How does the size of the reranking set impact how well cross encoders work?
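The sketch below shows one possible shape for such a reranking function, using the sentence-transformers CrossEncoder class. The model name, candidate depth, and (doc_id, doc_text) candidate format are illustrative assumptions, not a prescribed implementation.

```python
# A minimal reranking sketch, assuming candidates are (doc_id, doc_text) pairs.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model from the hub

def rerank(query, candidate_docs, depth=100, top_k=20):
    """Rerank the top `depth` candidates for one query and keep the best `top_k`.

    `candidate_docs` is a list of (doc_id, doc_text) pairs ordered by the
    first-stage ranking (bi-encoder or BM25).
    """
    pool = candidate_docs[:depth]
    # Score every (query, document) pair jointly with the cross encoder.
    scores = model.predict([(query, text) for _, text in pool])
    reranked = sorted(zip(pool, scores), key=lambda x: x[1], reverse=True)
    return [(doc_id, float(score)) for (doc_id, _), score in reranked[:top_k]]

# Example usage with placeholder candidates:
# top20 = rerank("example query", [(0, "doc text"), (1, "another doc text")])
```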

Deliverables

For this portion of the assignment you will submit two candidate TREC ranking files (a bi-encoder rerank and a BM25 rerank) and a brief paragraph that discusses how you went about reranking inputs, what models you used, how reranking impacted performance, and any similarities or differences you see between reranking the bi-encoder and BM25 candidates. The files for this part of the submission should be named as follows:

  1. MP4.2-bi-encoder-candidate-reranking.trec
  2. MP4.2-bm25-candidate-reranking.trec
  3. MP4.2-discussion.txt

Tips and Suggestions

To learn more about how to use ANN retrieval we suggest checking out this demo. Another great resource is Pinecone.

To learn more about the models which can be used for semantic search check out SBERT

To learn more about TREC Eval and the ir_measures wrapper, check out their website.
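As a quick illustration, a TREC run file can be scored with ir_measures as sketched below; the qrels path is a placeholder for whatever the Evaluation notebook provides.

```python
# A small sketch, assuming TREC-formatted qrels and run files on disk.
import ir_measures
from ir_measures import AP, nDCG

qrels = ir_measures.read_trec_qrels("data/qrels.txt")            # placeholder path
run = ir_measures.read_trec_run("MP4.1-candidate-ranking.trec")  # your submission
print(ir_measures.calc_aggregate([nDCG@10, AP], qrels, run))
```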

The BM25 candidate set was generated using pyserini, one of the most popular Python-based search libraries out there. If you want to explore further experiments or how ranking methods work together, check out the examples in their repo.
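For reference, the sketch below shows how a BM25 run could be produced with pyserini. The index path and BM25 parameters are assumptions for illustration; the provided data/bm25-top1000 candidates were generated by the course staff, not necessarily with these exact settings.

```python
# A minimal sketch, assuming a local Lucene index built for the CS410 corpus.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("path/to/cs410-index")  # placeholder index path
searcher.set_bm25(k1=0.9, b=0.4)                  # common default BM25 parameters

hits = searcher.search("example query", k=1000)   # depth comparable to bm25-top1000
for rank, hit in enumerate(hits):
    print(f"{hit.docid}\t{rank}\t{hit.score}")
```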
