Benchmarking library and reproduction code for PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models
This repository contains the code and datasets for benchmarking paraphrase detection models, as described in our COLING 2025 paper PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models. It also provides the scripts needed to reproduce and extend the results reported in the paper.
To evaluate a model on the full PARAPHRASUS benchmark, you simply need to wrap it in a binary prediction method that accepts a list of text pairs and returns a list of boolean True/False predictions.
For example, given two dummy prediction methods, predict_method1 and predict_method2:
from benchmarking import bench

def predict_method1(pairs):
    return [False for _ in pairs]

def predict_method2(pairs):
    return [False for _ in pairs]

methods = {
    "m1": predict_method1,
    "m2": predict_method2
}

bench(methods, bench_id="mybench")
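Any real model can be wrapped in the same way. As an illustration (not part of the repository), the sketch below puts a Sentence-Transformers bi-encoder behind a cosine-similarity threshold; the model name all-MiniLM-L6-v2, the 0.8 cutoff, and the bench id are arbitrary example choices, not values used in the paper.

# Illustrative only: an embedding model turned into a binary paraphrase predictor.
# Assumes `pip install sentence-transformers`; model and threshold are example choices.
from sentence_transformers import SentenceTransformer, util
from benchmarking import bench

model = SentenceTransformer("all-MiniLM-L6-v2")

def predict_cosine_threshold(pairs, threshold=0.8):
    left = model.encode([t1 for t1, _ in pairs], convert_to_tensor=True)
    right = model.encode([t2 for _, t2 in pairs], convert_to_tensor=True)
    sims = util.cos_sim(left, right).diagonal()  # similarity of each aligned pair
    return [bool(s >= threshold) for s in sims]

bench({"minilm-cos-0.8": predict_cosine_threshold}, bench_id="mybench_minilm")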
Running from a configuration file is also supported. Assuming the dummy prediction functions above are located in the file my_funcs.py, a configuration file should look like this:
{
    "bench_id": "mybench",
    "methods": [
        {
            "name": "m1",
            "module": "my_funcs",
            "function": "predict_method1"
        },
        {
            "name": "m2",
            "module": "my_funcs",
            "function": "predict_method2"
        }
    ]
}
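Each "module"/"function" pair simply names an importable Python module and the prediction function inside it; benchmarking.py resolves these when running the configuration. For intuition only, here is a minimal sketch of such a resolution step (the actual loader in benchmarking.py may differ):

# Sketch of how config entries could map to callables; not the repository's loader.
import importlib
import json

def load_methods(config_path):
    with open(config_path) as f:
        config = json.load(f)
    # Import each named module and fetch the named prediction function from it.
    return {
        entry["name"]: getattr(importlib.import_module(entry["module"]), entry["function"])
        for entry in config["methods"]
    }

# load_methods("my_config.json") -> {"m1": predict_method1, "m2": predict_method2}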
Then, assuming the above configuration is saved as the local file my_config.json, you can run the benchmark like so:
python3 benchmarking.py my_config.json
Finally, the results can be extracted by running:
python3 extract_results.py my_config.json
which will save the error rates to benches/mybench/results.json.
- Overview
- Repository Organization
- Reproducing the Experiments
- Further Experimentation
- BibTeX Reference
- Datasets and Licenses
- Further Support
- About Impresso
This repository allows replication of the experiments from the paper "PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models" and can be extended for further experimentation and qualitative analysis. It includes:
- Original predictions generated using the models described in the paper, useful for further qualitative or quantitative analysis.
- Scripts and configuration files to reproduce the results.
- Utility scripts to reproduce statistics, plots, and similar outputs.
The repository is organized as follows:
├── original_reproduction_code
│ └── The initial version of the repository.
├── datasets_no_results
│ └── Contains the datasets used in the experiments, in JSON format; copied into every new benchmark.
├── models
│ └── Empty models directory used in the experiments.
├── benchmarking.py
│ └── Main benchmarking code, for running predictions on the datasets using specified methods.
├── extract_results.py
│ └── Script for extracting result of a benchmark.
├── lm_studio_templates
│   ├── templates.py
│   │   └── Sample functions for building prediction functions using LM Studio.
│   ├── paper_methods.py
│   │   └── Prediction methods used in the paper to run benchmarks via LM Studio.
│   └── l70b_methods.py
│       └── Prediction methods to run the benchmark using Llama3.3 70B Q8 and 8B Q4.
├── logger.py
│ └── Utility for managing logging: all events are logged both to stdout and to a local logs.log file.
├── paper_config.json
│ └── Benchmark configuration for the methods used in the paper.
└── llama3_3_70b_config.json
    └── Benchmark configuration for running the benchmark using Llama3.3 70B Q8 and 8B Q4.
Predictions using the LLMs in the paper can be run locally, provided LM Studio is running and serving the model meta-llama-3-8b-instruct (Meta-Llama-3-8B-Instruct-Q4_K_M.gguf), like so:
python3 benchmarking.py paper_config.json
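For orientation, the general shape of an LM Studio-backed prediction function is sketched below. This is a simplified illustration, not the code in lm_studio_templates/paper_methods.py: it assumes LM Studio's OpenAI-compatible server at its default address http://localhost:1234/v1, and the prompt is a placeholder rather than one of the paper's prompts P1-P3.

# Simplified sketch of an LM Studio-backed predictor; not the repository's implementation.
import requests

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio default endpoint

def predict_llm(pairs):
    predictions = []
    for text1, text2 in pairs:
        prompt = ("Are the following two sentences paraphrases? Answer Yes or No.\n"
                  f"1: {text1}\n2: {text2}")  # placeholder prompt, not a paper prompt
        response = requests.post(LM_STUDIO_URL, json={
            "model": "meta-llama-3-8b-instruct",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        })
        answer = response.json()["choices"][0]["message"]["content"]
        predictions.append(answer.strip().lower().startswith("yes"))
    return predictions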
The predictions of the methods mentioned in the paper are included as a benchmark with the identifier 'paper', which means the results (error rates) can be extracted like so:
python3 extract_results.py paper
Under benches/paper, the file results.json is generated:
{
"Classify!": {
"PAWSX": {
"XLM-RoBERTa-EN-ORIG": "15.2%",
"Llama3 zero-shot P1": "44.7%",
"Llama3 zero-shot P2": "40.7%",
"Llama3 zero-shot P3": "38.1%",
"Llama3 ICL_4 P1": "39.0%",
"Llama3 ICL_4 P2": "34.1%",
"Llama3 ICL_4 P3": "33.2%"
},
"STS-H": {
"XLM-RoBERTa-EN-ORIG": "54.1%",
"Llama3 zero-shot P1": "56.2%",
"Llama3 zero-shot P2": "37.6%",
"Llama3 zero-shot P3": "41.7%",
"Llama3 ICL_4 P1": "44.7%",
"Llama3 ICL_4 P2": "41.7%",
"Llama3 ICL_4 P3": "39.1%"
},
"MRPC": {
"XLM-RoBERTa-EN-ORIG": "33.4%",
"Llama3 zero-shot P1": "23.6%",
"Llama3 zero-shot P2": "45.9%",
"Llama3 zero-shot P3": "37.5%",
"Llama3 ICL_4 P1": "33.2%",
"Llama3 ICL_4 P2": "45.2%",
"Llama3 ICL_4 P3": "46.7%"
}
},
"Minimize!": {
"SNLI": {
"XLM-RoBERTa-EN-ORIG": "32.4%",
"Llama3 zero-shot P1": "7.3%",
"Llama3 zero-shot P2": "1.0%",
"Llama3 zero-shot P3": "1.3%",
"Llama3 ICL_4 P1": "1.9%",
"Llama3 ICL_4 P2": "0.8%",
"Llama3 ICL_4 P3": "0.5%"
},
"ANLI": {
"XLM-RoBERTa-EN-ORIG": "7.2%",
"Llama3 zero-shot P1": "13.0%",
"Llama3 zero-shot P2": "1.2%",
"Llama3 zero-shot P3": "1.7%",
"Llama3 ICL_4 P1": "2.0%",
"Llama3 ICL_4 P2": "0.8%",
"Llama3 ICL_4 P3": "0.8%"
},
"XNLI": {
"XLM-RoBERTa-EN-ORIG": "26.7%",
"Llama3 zero-shot P1": "12.3%",
"Llama3 zero-shot P2": "1.4%",
"Llama3 zero-shot P3": "1.3%",
"Llama3 ICL_4 P1": "2.8%",
"Llama3 ICL_4 P2": "0.3%",
"Llama3 ICL_4 P3": "0.3%"
},
"STS": {
"XLM-RoBERTa-EN-ORIG": "46.6%",
"Llama3 zero-shot P1": "12.9%",
"Llama3 zero-shot P2": "2.4%",
"Llama3 zero-shot P3": "3.5%",
"Llama3 ICL_4 P1": "3.5%",
"Llama3 ICL_4 P2": "3.1%",
"Llama3 ICL_4 P3": "2.4%"
},
"SICK": {
"XLM-RoBERTa-EN-ORIG": "37.0%",
"Llama3 zero-shot P1": "0.9%",
"Llama3 zero-shot P2": "0.1%",
"Llama3 zero-shot P3": "0.0%",
"Llama3 ICL_4 P1": "0.3%",
"Llama3 ICL_4 P2": "0.0%",
"Llama3 ICL_4 P3": "0.0%"
}
},
"Maximize!": {
"TRUE": {
"XLM-RoBERTa-EN-ORIG": "31.4%",
"Llama3 zero-shot P1": "9.0%",
"Llama3 zero-shot P2": "34.7%",
"Llama3 zero-shot P3": "35.3%",
"Llama3 ICL_4 P1": "29.9%",
"Llama3 ICL_4 P2": "40.1%",
"Llama3 ICL_4 P3": "50.9%"
},
"SIMP": {
"XLM-RoBERTa-EN-ORIG": "5.3%",
"Llama3 zero-shot P1": "14.7%",
"Llama3 zero-shot P2": "47.3%",
"Llama3 zero-shot P3": "37.5%",
"Llama3 ICL_4 P1": "33.3%",
"Llama3 ICL_4 P2": "42.3%",
"Llama3 ICL_4 P3": "45.5%"
}
},
"Averages": {
"XLM-RoBERTa-EN-ORIG": {
"Classify!": "34.2%",
"Minimize!": "30.0%",
"Maximize!": "18.3%",
"Overall Average": "27.5%"
},
"Llama3 zero-shot P1": {
"Classify!": "41.5%",
"Minimize!": "9.3%",
"Maximize!": "11.8%",
"Overall Average": "20.9%"
},
"Llama3 zero-shot P2": {
"Classify!": "41.4%",
"Minimize!": "1.2%",
"Maximize!": "41.0%",
"Overall Average": "27.9%"
},
"Llama3 zero-shot P3": {
"Classify!": "39.1%",
"Minimize!": "1.6%",
"Maximize!": "36.4%",
"Overall Average": "25.7%"
},
"Llama3 ICL_4 P1": {
"Classify!": "39.0%",
"Minimize!": "2.1%",
"Maximize!": "31.6%",
"Overall Average": "24.2%"
},
"Llama3 ICL_4 P2": {
"Classify!": "40.3%",
"Minimize!": "1.0%",
"Maximize!": "41.2%",
"Overall Average": "27.5%"
},
"Llama3 ICL_4 P3": {
"Classify!": "39.7%",
"Minimize!": "0.8%",
"Maximize!": "48.2%",
"Overall Average": "29.6%"
}
}
}
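The "Overall Average" entries correspond to the mean of the three category averages. As a small illustrative check (not a repository script), this can be recomputed directly from the extracted file:

# Illustrative check: recompute one "Overall Average" from the category averages.
import json

with open("benches/paper/results.json") as f:
    averages = json.load(f)["Averages"]["XLM-RoBERTa-EN-ORIG"]

categories = ["Classify!", "Minimize!", "Maximize!"]
overall = sum(float(averages[c].rstrip("%")) for c in categories) / len(categories)
print(f"{overall:.1f}%")  # 27.5%, matching the reported "Overall Average"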
You can run your own experiments using any prediction methods of your choosing.
If you would like to cite this project or the associated paper, here is the BibTeX entry:
@misc{michail2024paraphrasuscomprehensivebenchmark,
title = {PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models},
author = {Andrianos Michail and Simon Clematide and Juri Opitz},
year = {2024},
eprint = {2409.12060},
archivePrefix= {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2409.12060},
}
This repository inherits its license from the original release, and all datasets used are publicly available from the following links (among many others):
- PAWS-X
  Link: PAWS-X Dataset
- SICK-R
  Link: SICK-R Dataset
- MSRPC
  Link: Microsoft Research Paraphrase Corpus
- XNLI
  Link: XNLI Dataset
- ANLI
  Link: Adversarial NLI (ANLI)
- Stanford NLI (SNLI)
  Link: SNLI Dataset
- STS Benchmark
  Link: STS Benchmark
- OneStopEnglish Corpus
  Link: OneStopEnglish Corpus
Within this work, we introduce a new dataset (and an annotation of an existing one), which are also available in this repository under the same license as their source datasets:
- AMR True Paraphrases
  Source: AMR Guidelines
  Dataset: AMR-True-Paraphrases
- STS Benchmark (Scores 4-5) (STS-H) with Paraphrase Label
  Link: STS Hard
This repository's benchmarking codebase was voluntarily developed ad hoc by Andreas Loizides. In the future, we will work towards adding more datasets (including multilingual ones) and making the benchmark more compute-efficient. If you are interested in contributing or need further support reproducing, recreating, or extending the results, please reach out to [email protected].
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719 and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2024 The Impresso team.
This program is provided as open source under the GNU Affero General Public License v3 or later.