ARQMath-eval

This repository contains code that you can use to evaluate your system runs from the ARQMath competitions.

Description

Tasks

This repository evaluates the performance of your information retrieval system on a number of tasks:

task1-example – ARQMath Task1 example dataset,
task1-votes – ARQMath Task1 Math StackExchange user votes,
task1, task1-2020 – ARQMath Task1 final dataset,
task1-2021 – ARQMath-2 Task1 final dataset,
ntcir-11-math-2-main – NTCIR-11 Math-2 Task Main Subtask,
ntcir-12-mathir-arxiv-main – NTCIR-12 MathIR Task ArXiv Main Subtask,
ntcir-12-mathir-math-wiki-formula – NTCIR-12 MathIR Task MathWikiFormula Subtask,
task2, task2-2020 – ARQMath Task2 final dataset, and
task2-2021 – ARQMath-2 Task2 final dataset.

The main tasks are:

task1 – Use this task to evaluate your ARQMath task 1 system, and
task2 – Use this task to evaluate your ARQMath task 2 system.

Subsets

Each task comes with three subsets:

train – The training set, which you can use for supervised training of your system.
validation – The validation set, which you can use to compare the performance of your system with different parameters. The validation set is used to compute the leaderboards in this repository.
test – The test set, which you currently should not use at all. It will be used at the end to compare the systems that performed best on the validation set.

The task1 and task2 tasks also come with the all subset, which contains all relevance judgements. Use it to evaluate a system that has not been trained using subsets of the task1 and task2 tasks.

The task1 and task2 tasks also come with a different subset split used by the MIRMU and MSM teams in the ARQMath-2 competition submissions. This split is also used in the pv211-utils library:

train-pv211-utils – The training set, which you can use for supervised training of your system.
validation-pv211-utils – The validation set, which you can use for hyperparameter optimization or model selection.

The training set is further split into the smaller-train-pv211-utils and smaller-validation subsets in case you need two validation sets, e.g. one for hyperparameter optimization and one for model selection. If you use neither hyperparameter optimization nor model selection, you can use the bigger-train-pv211-utils subset, which combines the train-pv211-utils and validation-pv211-utils subsets.

test-pv211-utils – The test set, which you currently should only use for the final performance estimation of your system.
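
The way the pv211-utils subsets nest can be pictured with plain sets. The sketch below uses made-up topic identifiers and made-up split points, and it assumes that the subsets do not share topics; the actual topic assignment is fixed by the repository and is not reproduced here.

```python
# Toy illustration of how the pv211-utils subsets nest.
# The topic IDs and the split points are made up for this sketch.
topics = ['A.{}'.format(i) for i in range(1, 11)]

test = set(topics[8:])            # stands in for test-pv211-utils
validation = set(topics[6:8])     # stands in for validation-pv211-utils
train = set(topics[:6])           # stands in for train-pv211-utils

# The training set is further split into two smaller subsets.
smaller_train = set(topics[:4])
smaller_validation = set(topics[4:6])

# bigger-train combines the training and validation sets.
bigger_train = train | validation

assert smaller_train | smaller_validation == train
assert bigger_train.isdisjoint(test)
```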

Examples

Using the `train` subset to train your supervised system

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval
$ python
>>> from arqmath_eval import get_topics, get_judged_documents, get_ndcg
>>>
>>> task = 'task1'
>>> subset = 'train'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         # compute_similarity_score stands for your own system's scoring function
...         similarity_score = compute_similarity_score(topic, document)
...         results[topic][document] = similarity_score
...
>>> get_ndcg(results, task='task1-votes', subset='train', topn=1000)
0.5876
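
The example above calls `compute_similarity_score` without defining it; it appears to stand for whatever scoring your own system provides. As a minimal sketch of what such a function might look like, the block below scores a topic against a document by the Jaccard overlap of their tokens. The lookup tables `TOPIC_TEXTS` and `DOCUMENT_TEXTS`, the identifiers, and the texts are all hypothetical and only serve the illustration.

```python
# Hypothetical stand-in for compute_similarity_score from the example above.
# TOPIC_TEXTS and DOCUMENT_TEXTS are made-up lookup tables; a real system
# would load the topics and answers from the competition data instead.
TOPIC_TEXTS = {
    'A.1': 'how to integrate x squared',
}
DOCUMENT_TEXTS = {
    '101': 'integrate x squared by the power rule',
    '102': 'eigenvalues of a symmetric matrix',
}


def compute_similarity_score(topic, document):
    """Return the Jaccard similarity between the token sets of two texts."""
    topic_tokens = set(TOPIC_TEXTS[topic].lower().split())
    document_tokens = set(DOCUMENT_TEXTS[document].lower().split())
    union = topic_tokens | document_tokens
    if not union:
        return 0.0
    return len(topic_tokens & document_tokens) / len(union)


# Build the same nested {topic: {document: score}} structure as above.
results = {
    topic: {
        document: compute_similarity_score(topic, document)
        for document in DOCUMENT_TEXTS
    }
    for topic in TOPIC_TEXTS
}
```

Any function that returns a higher number for a better match would fit the same slot.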

Using the `validation` subset to compare various parameters of your system

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval
$ python
>>> from arqmath_eval import get_topics, get_judged_documents
>>>
>>> task = 'task1'
>>> subset = 'validation'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         # compute_similarity_score stands for your own system's scoring function
...         similarity_score = compute_similarity_score(topic, document)
...         results[topic][document] = similarity_score
...
>>> user = 'xnovot32'
>>> description = 'parameter1=value_parameter2=value'
>>> filename = '{}/{}/{}.tsv'.format(task, user, description)
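>>> import os
>>> os.makedirs(os.path.dirname(filename), exist_ok=True)  # create the output directory if missing
>>> with open(filename, 'wt') as f:
...     for topic, documents in results.items():
...         # keep the 1000 highest-scoring documents for each topic
...         top_documents = sorted(documents.items(), key=lambda x: x[1], reverse=True)[:1000]
...         for rank, (document, score) in enumerate(top_documents):
...             line = '{}\txxx\t{}\t{}\t{}\txxx'.format(topic, document, rank + 1, score)
...             print(line, file=f)
...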
$ git add task1/xnovot32/parameter1=value_parameter2=value.tsv  # track your new result with Git
$ python -m arqmath_eval.evaluate          # run the evaluation
$ git add -u                               # add the updated leaderboard to Git
$ git push                                 # publish your new result and the updated leaderboard
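
The result file written above has six tab-separated columns: the topic, a placeholder, the document, the rank, the score, and another placeholder. As a rough sketch, such a file can be read back into the nested dictionary with the standard `csv` module. The helper name `read_results` and the sample rows are made up for this illustration and are not part of the package.

```python
import csv
import io


def read_results(f):
    """Parse a six-column TSV result file into {topic: {document: score}}."""
    results = {}
    for row in csv.reader(f, delimiter='\t'):
        # Columns 2 and 6 are placeholders; the rank is implied by the score.
        topic, _, document, _rank, score, _ = row
        results.setdefault(topic, {})[document] = float(score)
    return results


# A two-line sample in the same layout; the IDs and scores are made up.
sample = io.StringIO(
    'A.1\txxx\t101\t1\t0.75\txxx\n'
    'A.1\txxx\t102\t2\t0.25\txxx\n'
)
parsed = read_results(sample)
```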

Using the `all` subset to compute the NDCG' score of an ARQMath submission

$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval
$ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all 2020
0.238, 95% CI: [0.198; 0.278]
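
NDCG' differs from plain NDCG in that unjudged documents are removed from the ranking before scoring. The sketch below shows one common formulation for a single topic, with the relevance grade as the gain and a log2(rank + 1) discount. It is an illustration only: the function names, identifiers, and judgements are made up, and the package may use different conventions, so the numbers need not match its output.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        relevance / math.log2(rank + 1)
        for rank, relevance in enumerate(relevances, start=1)
    )


def ndcg_prime(ranking, judgements, topn=1000):
    """NDCG' for one topic: unjudged documents are dropped before scoring."""
    judged = [doc for doc in ranking if doc in judgements][:topn]
    gains = [judgements[doc] for doc in judged]
    ideal = sorted(judgements.values(), reverse=True)[:topn]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up graded judgements (0 = not relevant, 3 = highly relevant).
judgements = {'101': 3, '102': 0, '103': 2}
ranking = ['999', '103', '101', '102']  # '999' is unjudged and is ignored
score = ndcg_prime(ranking, judgements)
```

Averaging such per-topic scores over all topics gives a single figure for a run.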

Citing ARQMath-eval

Text

NOVOTNÝ, Vít, Petr SOJKA, Michal ŠTEFÁNIK and Dávid LUPTÁK. Three is Better than One: Ensembling Math Information Retrieval Systems. CEUR Workshop Proceedings. Thessaloniki, Greece: M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen., 2020, vol. 2020, No 2696, p. 1-30. ISSN 1613-0073.

BibTeX

@inproceedings{mir:mirmuARQMath2020,
  title = {{Three is Better than One}},
  author = {V\'{i}t Novotn\'{y} and Petr Sojka and Michal \v{S}tef\'{a}nik and D\'{a}vid Lupt\'{a}k},
  booktitle = {CEUR Workshop Proceedings: ARQMath task at CLEF conference},
  publisher = {CEUR-WS},
  address = {Thessaloniki, Greece},
  date = {22--25 September, 2020},
  year = 2020,
  volume = 2696,
  pages = {1--30},
  url = {http://ceur-ws.org/Vol-2696/paper_235.pdf},
}
