This repo combines Transformer Grammars (TG) with the URNNG structure (i.e., its VAE-style structure).
python preprocess.py --trainfile data/train_02-21.LDC99T42 --valfile data/dev_24.LDC99T42 --testfile data/test_23.LDC99T42 --outputfile data/ptb_20231129 --vocabminfreq 1 --lowercase 0 --replace_num 0 --batchsize 16
Running this will save the following files in the data/ folder: ptb-train.pkl, ptb-val.pkl, ptb-test.pkl, ptb.dict. Here ptb.dict is the word-idx mapping, and you can change the output folder/name by changing the argument to outputfile. Also, the preprocessing here will replace singletons with a single <unk> rather than with Berkeley parser's mapping rules (see below for results using this setup).
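As a rough illustration of the singleton handling described above, the sketch below builds a word-idx mapping in which every word seen at most `min_freq` times falls back to a single `<unk>` entry. The function names `build_vocab` and `encode` are hypothetical and are not taken from preprocess.py; the real script may also add special tokens or order the indices differently.

```python
from collections import Counter

UNK = "<unk>"  # single token that stands in for all rare words


def build_vocab(sentences, min_freq=1):
    """Map every word seen more than `min_freq` times to an integer index.

    Hypothetical helper: with min_freq=1, singletons are left out of the
    mapping and therefore fall back to UNK when a sentence is encoded.
    """
    counts = Counter(word for sent in sentences for word in sent)
    word2idx = {UNK: 0}
    for word, count in sorted(counts.items()):
        if count > min_freq:
            word2idx[word] = len(word2idx)
    return word2idx


def encode(sentence, word2idx):
    """Convert a list of words to indices, using UNK for unknown words."""
    unk_id = word2idx[UNK]
    return [word2idx.get(word, unk_id) for word in sentence]


sentences = [
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
]
word2idx = build_vocab(sentences, min_freq=1)
encoded = encode(["the", "cat", "barks"], word2idx)
```

In this toy corpus "cat" and "barks" each occur once, so both map to the same `<unk>` index.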
We do not consider the nonterminal (NT) labels in the ground-truth trees.
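One way to picture ignoring the NT labels is to rewrite every bracket label in a tree string to the same placeholder. The sketch below does this with a regular expression; the function name `strip_labels` and the placeholder `X` are assumptions made for illustration, not names from this repo.

```python
import re


def strip_labels(tree, placeholder="X"):
    """Replace the label after every opening bracket with one placeholder.

    Hypothetical helper: "(NP the dog)" becomes "(X the dog)". Words are
    untouched because only the token directly after "(" is rewritten.
    """
    return re.sub(r"\(\s*[^\s()]+", "(" + placeholder, tree)


unlabeled = strip_labels("(S (NP the dog) (VP barks))")
```

After this step, two trees with the same bracketing but different labels become identical strings.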
- Install CMake
- Execute these commands:
mkdir .dependencies
cd .dependencies
git clone -b 20220623.1 https://github.com/abseil/abseil-cpp.git
git clone -b 3.4.0 https://gitlab.com/libeigen/eigen.git
git clone -b v2.10.2 https://github.com/pybind/pybind11.git
# Sentencepiece Building
# git clone -b v0.1.97 https://github.com/google/sentencepiece.git
# cd sentencepiece
# mkdir build
# cd build
# cmake ..
# make -j # if not available, that will not matter?
# Build the masking C++ code
# Make sure ./.dependencies exists first
# .dependencies should sit next to ./masking in the parent directory
cd ..
mkdir build
cd build
cmake ..
make -j
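Before running cmake, it may help to confirm that the three clones landed where the build expects them. The sketch below checks for the default directory names that `git clone` creates (abseil-cpp, eigen, pybind11) under .dependencies; the function name `missing_dependencies` is hypothetical, and the assumption is that the build only needs these directories to exist.

```python
from pathlib import Path

# Default directory names produced by the git clone commands above.
EXPECTED = ("abseil-cpp", "eigen", "pybind11")


def missing_dependencies(root="."):
    """Return the expected dependency directories not found under root/.dependencies."""
    deps_dir = Path(root) / ".dependencies"
    return [name for name in EXPECTED if not (deps_dir / name).is_dir()]
```

Calling `missing_dependencies(".")` from the directory that contains .dependencies should return an empty list once all three clones have finished.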
To train the U-TG:
python tg_train.py --train_file data/ptb-train.pkl --val_file data/ptb-val.pkl --save_path /ckpt/utg_ckpt.pt --mode unsupervised --gpu 0
# debug
python tg_train.py --train_file data/ptb_20231129-train.pkl --val_file data/ptb_20231129-val.pkl --save_path /ckpt/utg_ckpt_240103_1.pt --mode unsupervised --gpu 0 --samples 2 --lr 0.00005 --q_lr 0.0001
To evaluate perplexity on the test set:
python eval_ppl.py --model_file /ckpt/utg_ckpt.pt --test_file data/ptb-test.pkl --samples 1000 --is_temp 2 --gpu 0
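The `--samples` flag suggests that the sentence likelihood is estimated from several sampled trees. Assuming this is a standard importance-sampling estimate (an assumption, not confirmed by eval_ppl.py itself), the sketch below shows the arithmetic: average the weights p(x, z_k) / q(z_k | x) in log space, then convert the total log-likelihood into a perplexity. All function names here are hypothetical.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def importance_log_likelihood(log_joint, log_q):
    """Estimate log p(x) from K sampled trees z_k drawn from q(z | x).

    log_joint[k] is log p(x, z_k); log_q[k] is log q(z_k | x).
    """
    log_weights = [lj - lq for lj, lq in zip(log_joint, log_q)]
    return log_mean_exp(log_weights)


def perplexity(total_log_likelihood, num_tokens):
    """Perplexity from a summed log-likelihood over num_tokens tokens."""
    return math.exp(-total_log_likelihood / num_tokens)
```

For example, two samples with joint probabilities 0.1 and 0.2, each drawn with probability 0.5, give weights 0.2 and 0.4 and an estimated likelihood of 0.3.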
- Parse the test set
python parse.py --model_file /ckpt/utg_ckpt.pt --data_file data/ptb-test.txt --out_file pred-parse.txt --gold_out_file gold-parse.txt --gpu 0
- Evalb evaluation
evalb -p COLLINS.prm gold-parse.txt pred-parse.txt
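For a quick sanity check without evalb, a simplified unlabeled bracketing F1 can be computed directly from two tree strings. The sketch below is not evalb's algorithm: it ignores labels, skips single-word brackets, and applies none of the rules from COLLINS.prm. The function names are hypothetical.

```python
def tree_spans(tree):
    """Return the set of (start, end) word spans covered by brackets in a tree string."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans, stack, position = set(), [], 0
    expect_label = False
    for tok in tokens:
        if tok == "(":
            stack.append(position)
            expect_label = True
        elif tok == ")":
            start = stack.pop()
            if position - start > 1:  # ignore brackets over a single word
                spans.add((start, position))
            expect_label = False
        elif expect_label:
            expect_label = False  # token right after "(" is a label; skip it
        else:
            position += 1  # any other token is a word
    return spans


def unlabeled_f1(gold_tree, pred_tree):
    """Harmonic mean of bracket precision and recall, ignoring labels."""
    gold, pred = tree_spans(gold_tree), tree_spans(pred_tree)
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


score = unlabeled_f1("(S (NP the dog) (VP barks))", "(S the (VP dog barks))")
```

Here the two trees share only the whole-sentence bracket, so precision and recall are both 1/2.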
Troubleshooting: if the build or a run fails, the GPU may not be allocated or available, or the GCC version may be obsolete (it should be >= 4.9).
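The GCC requirement can be checked with a few lines of Python. This sketch covers only the compiler version, not GPU availability; it compares the output of `gcc -dumpversion` against the 4.9 minimum stated above. The helper names are hypothetical.

```python
import re
import shutil
import subprocess

MIN_GCC = (4, 9)  # minimum version stated in this README


def parse_version(text):
    """Turn a version string such as '4.8.5' or '11' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def gcc_is_recent_enough(version_text, minimum=MIN_GCC):
    """True if the given version string is at least `minimum`."""
    return parse_version(version_text) >= minimum


def installed_gcc_version():
    """Return the output of `gcc -dumpversion`, or None if gcc is not on PATH."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

A typical use is to call `installed_gcc_version()` and, if it is not None, pass the result to `gcc_is_recent_enough` before starting the build.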