This repository contains code that you can use to evaluate your system runs from the ARQMath competitions.
It evaluates the performance of your information retrieval system on a number of tasks:
- `task1-example` – ARQMath Task1 example dataset,
- `task1-votes` – ARQMath Task1 Math StackExchange user votes,
- `task1`, `task1-2020` – ARQMath Task1 final dataset,
- `task1-2021` – ARQMath-2 Task1 final dataset,
- `ntcir-11-math-2-main` – NTCIR-11 Math-2 Task Main Subtask,
- `ntcir-12-mathir-arxiv-main` – NTCIR-12 MathIR Task ArXiv Main Subtask,
- `ntcir-12-mathir-math-wiki-formula` – NTCIR-12 MathIR Task MathWikiFormula Subtask,
- `task2`, `task2-2020` – ARQMath Task2 final dataset, and
- `task2-2021` – ARQMath-2 Task2 final dataset.
The main tasks are:

- `task1` – use this task to evaluate your ARQMath Task1 system, and
- `task2` – use this task to evaluate your ARQMath Task2 system.
Each task comes with three subsets:

- `train` – the training set, which you can use for supervised training of your system,
- `validation` – the validation set, which you can use to compare the performance of your system with different parameters; this subset is used to compute the leaderboards in this repository, and
- `test` – the test set, which you currently should not use at all; it will be used at the end to compare the systems that performed best on the validation set.
The `task1` and `task2` tasks also come with the `all` subset, which contains all relevance judgements. Use it to evaluate a system that has not been trained on any subset of the `task1` and `task2` tasks.
The `task1` and `task2` tasks also come with a different subset split used by the MIRMU and MSM teams in their ARQMath-2 competition submissions. This split is also used in the pv211-utils library:

- `train-pv211-utils` – the training set, which you can use for supervised training of your system, and
- `validation-pv211-utils` – the validation set, which you can use for hyperparameter optimization or model selection.
The training set is further split into the `smaller-train-pv211-utils` and `smaller-validation-pv211-utils` subsets in case you need two validation sets, e.g. one for hyperparameter optimization and one for model selection. If you use neither hyperparameter optimization nor model selection, you can use the `bigger-train-pv211-utils` subset, which combines the `train-pv211-utils` and `validation-pv211-utils` subsets.

- `test-pv211-utils` – the test set, which you currently should use only for the final performance estimation of your system.
```sh
$ pip install --force-reinstall git+https://github.com/MIR-MU/[email protected]
$ python
```

```python
>>> from arqmath_eval import get_topics, get_judged_documents, get_ndcg
>>>
>>> task = 'task1'
>>> subset = 'train'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         similarity_score = compute_similarity_score(topic, document)  # your system's own scoring function
...         results[topic][document] = similarity_score
...
>>> get_ndcg(results, task=task, subset=subset, topn=1000)
0.5876
```sh
$ pip install --force-reinstall git+https://github.com/MIR-MU/[email protected]
$ python
```

```python
>>> from arqmath_eval import get_topics, get_judged_documents
>>>
>>> task = 'task1'
>>> subset = 'validation'
>>> results = {}
>>> for topic in get_topics(task=task, subset=subset):
...     results[topic] = {}
...     for document in get_judged_documents(task=task, subset=subset, topic=topic):
...         similarity_score = compute_similarity_score(topic, document)  # your system's own scoring function
...         results[topic][document] = similarity_score
...
>>> user = 'xnovot32'
>>> description = 'parameter1=value_parameter2=value'
>>> filename = '{}/{}/{}.tsv'.format(task, user, description)
>>> with open(filename, 'wt') as f:
...     for topic, documents in results.items():
...         top_documents = sorted(documents.items(), key=lambda x: x[1], reverse=True)[:1000]
...         for rank, (document, score) in enumerate(top_documents):
...             line = '{}\txxx\t{}\t{}\t{}\txxx'.format(topic, document, rank + 1, score)
...             print(line, file=f)
```sh
$ git add task1-votes/xnovot32/result.tsv  # track your new result with Git
$ python -m arqmath_eval.evaluate          # run the evaluation
$ git add -u                               # add the updated leaderboard to Git
$ git push                                 # publish your new result and the updated leaderboard
```
```sh
$ pip install --force-reinstall git+https://github.com/MIR-MU/[email protected]
$ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all 2020
0.238, 95% CI: [0.198; 0.278]
NOVOTNÝ, Vít, Petr SOJKA, Michal ŠTEFÁNIK and Dávid LUPTÁK. Three is Better than One: Ensembling Math Information Retrieval Systems. CEUR Workshop Proceedings. Thessaloniki, Greece: M. Jeusfeld c/o Redaktion Sun SITE, Informatik V, RWTH Aachen., 2020, vol. 2020, No 2696, p. 1-30. ISSN 1613-0073.
@inproceedings{mir:mirmuARQMath2020,
title = {{Three is Better than One}},
author = {V\'{i}t Novotn\'{y} and Petr Sojka and Michal \v{S}tef\'{a}nik and D\'{a}vid Lupt\'{a}k},
booktitle = {CEUR Workshop Proceedings: ARQMath task at CLEF conference},
publisher = {CEUR-WS},
address = {Thessaloniki, Greece},
date = {22--25 September, 2020},
year = 2020,
volume = 2696,
pages = {1--30},
url = {http://ceur-ws.org/Vol-2696/paper_235.pdf},
}