This is the official GitHub repository for the paper "CT-BERT: Learning Better Tabular Representations Through Cross-Table Pre-training" by Chao Ye, Guoshan Lu, et al.
Tabular data, also known as structured data, is one of the most common data forms in existence, thanks to the stable development
and scaled deployment of database systems over the last few decades. At present, however, despite the breakthroughs brought by large pre-trained
models in other domains, such as ChatGPT and SAM, how to extract common knowledge across tables at a scale that may
eventually lead to generalizable representations for tabular data remains largely unexplored. In this project, we present CT-BERT, a generic and efficient cross-table pre-training solution.
The run.sh file demonstrates commands to perform supervised learning from scratch, fine-tuning, and cross-table pre-training with CT-BERT.
We have uploaded our pre-training corpus to Google Cloud Drive. You can download it from here and use this code to load all the datasets. We will soon open-source our cross-table pre-trained models (CT-BERT-v1) and datasets (TabPretNET) on Hugging Face.
Note that we have taken care to ensure that the tabular datasets used in pre-training and the downstream benchmark datasets do not overlap, so there is no data leakage.
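For reference, here is a minimal sketch of one way to load the downloaded corpus, assuming it is unpacked as a directory of CSV files with one table per file; the path and file layout here are assumptions on our part, so see the loading code linked above for the exact format.

```python
# Minimal sketch: iterate over a downloaded corpus directory.
# The "data/pretrain_corpus" path and the one-CSV-per-table layout
# are assumptions, not the repository's guaranteed format.
from pathlib import Path

import pandas as pd

corpus_dir = Path("data/pretrain_corpus")  # assumed unpack location
tables = {path.stem: pd.read_csv(path) for path in sorted(corpus_dir.glob("*.csv"))}
print(f"Loaded {len(tables)} tables")
```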
The downstream benchmark datasets are listed below:

| Dataset | Source |
| --- | --- |
| pc4 | https://www.openml.org/d/1049 |
| kc1 | https://www.openml.org/d/1067 |
| car | https://archive.ics.uci.edu/dataset/19/car+evaluation |
| wilt | https://archive.ics.uci.edu/dataset/285/wilt |
| higgs | https://www.openml.org/d/44422 |
| adult | https://archive.ics.uci.edu/dataset/2/adult |
| climate | https://archive.ics.uci.edu/dataset/252/climate+model+simulation+crashes |
| credit-g | https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data |
| vehicle | https://archive.ics.uci.edu/dataset/149/statlog+vehicle+silhouettes |
| segment | http://archive.ics.uci.edu/dataset/50/image+segmentation |
| amazon | https://www.openml.org/d/44712 |
| satimage | https://archive.ics.uci.edu/dataset/146/statlog+landsat+satellite |
| phishing | https://archive.ics.uci.edu/dataset/327/phishing+websites |
| mice-protein | https://archive.ics.uci.edu/dataset/342/mice+protein+expression |
| cylinder-bands | https://archive.ics.uci.edu/dataset/32/cylinder+bands |
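Many of these benchmarks are hosted on OpenML, so they can also be fetched programmatically. As a quick illustration (not part of our pipeline), the snippet below loads pc4 with scikit-learn's `fetch_openml`, using the `data_id` from the URL in the table above:

```python
# Load the pc4 benchmark from OpenML (https://www.openml.org/d/1049).
# data_id=1049 matches the numeric id in the dataset URL.
from sklearn.datasets import fetch_openml

pc4 = fetch_openml(data_id=1049, as_frame=True)
X, y = pc4.data, pc4.target
print(X.shape, y.value_counts())
```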
If you use CT-BERT in your work, please cite our paper:
@misc{ye2023ctbert,
      title={CT-BERT: Learning Better Tabular Representations Through Cross-Table Pre-training},
      author={Chao Ye and Guoshan Lu and Haobo Wang and Liyao Li and Sai Wu and Gang Chen and Junbo Zhao},
      year={2023},
      eprint={2307.04308},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
- DeepSpeed: We pre-train our models with the DeepSpeed framework. We are very grateful to Microsoft's DeepSpeed team for making large-scale pre-training deployments possible.
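For readers new to DeepSpeed, here is a minimal, hypothetical sketch of wrapping a PyTorch model with `deepspeed.initialize`; the placeholder model and configuration are illustrative assumptions, not the actual CT-BERT training setup, and the script is meant to be launched with the `deepspeed` launcher on a GPU machine.

```python
# Hypothetical sketch: wrap a placeholder PyTorch model with DeepSpeed.
# The model, batch size, and optimizer settings are illustrative only.
import torch
import torch.nn.functional as F

import deepspeed

model = torch.nn.Linear(128, 2)  # placeholder, not the CT-BERT architecture
ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One standard DeepSpeed training step: forward, backward, step.
inputs = torch.randn(32, 128).to(engine.device)
labels = torch.randint(0, 2, (32,)).to(engine.device)
loss = F.cross_entropy(engine(inputs), labels)
engine.backward(loss)
engine.step()
```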