This repository contains the code for recreating the Schema.org Table Annotation Benchmark (SOTAB).
SOTAB is based on the Schema.org Table Corpus. To run the code that creates SOTAB, download all zip files from the top100 and minimum3 subsets of the Schema.org Table Corpus and place them in the directory data/stc_zip_files/.
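Before running any notebook, it can help to confirm that the archives are where the code expects them. The following is a minimal sketch, assuming the downloaded files keep a `.zip` extension; the helper name `list_stc_archives` is hypothetical and not part of this repository.

```python
from pathlib import Path
import zipfile

# Directory named in this README; adjust if your checkout differs.
STC_ZIP_DIR = Path("data/stc_zip_files")


def list_stc_archives(zip_dir=STC_ZIP_DIR):
    """Return the readable .zip archives in zip_dir, sorted by name.

    Hypothetical helper for illustration, not part of the SOTAB code.
    """
    zip_dir = Path(zip_dir)
    if not zip_dir.is_dir():
        raise FileNotFoundError(f"Missing directory: {zip_dir}")
    # Keep only files that are valid zip archives (skips truncated downloads
    # that lack a readable zip structure).
    archives = sorted(p for p in zip_dir.glob("*.zip") if zipfile.is_zipfile(p))
    if not archives:
        raise FileNotFoundError(f"No readable .zip archives found in {zip_dir}")
    return archives
```

Calling `list_stc_archives()` from the repository root returns the archive paths, or raises `FileNotFoundError` if the directory is missing or contains no readable archives.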
Run download.sh to download the processed datasets for the VizNet corpus. It will also create the data directory.
$ bash download.sh
To create the SOTAB datasets for Column Type Annotation (CTA) and Column Property Annotation (CPA), run the notebooks in the order listed below:
- Language Detection
- MatchColumnNamesToSchema.org
- Expand properties-CreateTables
- AnnotatingTables
- TableSelection-CPA
- Different-Formats-CPA
- RandomColumns-CPA
- TableSelection-CTA
- CreatingSplits-CPA
- CreatingSplits-CTA
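
The order above could also be scripted rather than run by hand. The following is a rough sketch, assuming each notebook is stored as `<name>.ipynb` in a single directory (the exact file names on disk are an assumption) and that `jupyter nbconvert` is installed. The names `build_commands` and `run_pipeline` are hypothetical and not part of this repository.

```python
import subprocess
from pathlib import Path

# Notebook names in the order given in this README. The ".ipynb" suffix and
# the exact file names on disk are assumptions; check them against your checkout.
NOTEBOOK_ORDER = (
    "Language Detection",
    "MatchColumnNamesToSchema.org",
    "Expand properties-CreateTables",
    "AnnotatingTables",
    "TableSelection-CPA",
    "Different-Formats-CPA",
    "RandomColumns-CPA",
    "TableSelection-CTA",
    "CreatingSplits-CPA",
    "CreatingSplits-CTA",
)


def build_commands(notebook_dir=".", names=NOTEBOOK_ORDER):
    """Build one `jupyter nbconvert --execute` command per notebook, in order."""
    return [
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",
            "--inplace",
            str(Path(notebook_dir) / f"{name}.ipynb"),
        ]
        for name in names
    ]


def run_pipeline(notebook_dir="."):
    """Execute the notebooks sequentially, stopping at the first failure."""
    for cmd in build_commands(notebook_dir):
        print("Running:", cmd[-1])
        # check=True raises CalledProcessError if a notebook fails, so later
        # notebooks never run on incomplete intermediate output.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the directory that holds the notebooks would execute them one after another. Stopping at the first failure matters here because each notebook presumably depends on the output of the ones before it.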