
MIR-MU/ARQMath-data-preprocessing


ARQMath Data Preprocessing

This repository contains scripts for producing the preprocessed ARQMath competition datasets:

  • output_data/ARQMath_CLEF2020/Formulas/formula_*.V1.0.{tsv,failures}
    all formulae for the ARQMath competition,
  • output_data/ARQMath_CLEF2020/Task1/Sample Topics/Formula_topics_*_V2.0.{tsv,failures}
    formulae from sample topics for task 1 of the ARQMath competition,
  • output_data/ARQMath_CLEF2020/Task1/Topics/Formula_topics_*_V2.0.{tsv,failures}
    formulae from testing topics for task 1 of the ARQMath competition,
  • output_data/ARQMath_CLEF2020/Task2/Formula_topics_*_V2.0.{tsv,failures}
    formulae from testing topics for task 2 of the ARQMath competition,
  • output_data/ARQMath_CLEF2020/Collection/votes-qrels.V1.0.tsv
    our relevance judgements for task 1 of the ARQMath competition,
  • output_data/ARQMath_CLEF2020/Collection/Posts_V1_0_*.json.gz
    the document collection for the ARQMath competition,
  • output_data/arxiv-dataset-arXMLiv-08-2019/arxmliv_*_08_2019_*.json.gz.{json.gz,failures}
    tokenized documents and paragraphs from the arXMLiv 08.2019 dataset,
  • output_data/ntcir/NTCIR11-Math/NTCIR11-Math2-queries-*-participants.{json,failures}
    tokenized topics from the NTCIR-11 Math-2 Task Main Subtask,
  • output_data/ntcir/NTCIR12-Math/NTCIR12-Math-queries-*-participants.{json,failures}
    tokenized topics from the NTCIR-12 MathIR Task ArXiv Main Subtask,
  • output_data/ntcir/NTCIR12-Math-Wiki-Formula/NTCIR12-MathWikiFormula-queries-*-participants.{json,failures}
    tokenized topics from the NTCIR-12 MathIR Task MathWikiFormula Subtask, and
  • output_data/ntcir/NTCIR12-Math-Wiki-Formula/MathTagArticles_*.json.gz
    tokenized arXiv articles from the NTCIR-12 MathIR Task MathWikiFormula Subtask.

Downloading the preprocessed datasets

To download the preprocessed datasets, run the following commands:

$ pip install -r requirements.txt
$ dvc pull
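
Once the data have been pulled, the `.json.gz` files can be inspected with the Python standard library. The following is a minimal sketch, not part of this repository: the helper names `load_json_gz` and `summarize` are hypothetical, and it assumes that each file holds a single JSON document. The internal schema of the files is not described here, so the sketch only reports the type of the top-level value.

```python
import gzip
import json
from pathlib import Path


def load_json_gz(path):
    """Decompress a .json.gz file and parse it as one JSON document.

    Assumption: the file contains a single JSON value, not JSON Lines.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def summarize(directory, pattern):
    """Map each matching file name to the type name of its top-level JSON value."""
    return {
        p.name: type(load_json_gz(p)).__name__
        for p in sorted(Path(directory).glob(pattern))
    }


if __name__ == "__main__":
    # Directory and pattern are taken from the file list above.
    print(summarize("output_data/ARQMath_CLEF2020/Collection", "Posts_V1_0_*.json.gz"))
```

Reporting only the top-level type keeps the sketch independent of the actual file layout. Note that `json.load` reads the whole document into memory, which may be slow for the larger files.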

Producing the preprocessed datasets

To produce the preprocessed datasets yourself,
