This repository contains code used in the Language Model Adaptation for Low-Resource African Languages project.
The corresponding trained and adapted tokenizers as well as models can be found on the HuggingFace site of the project.
evaluation/
- Code used for model evaluation on downstream tasks. Also contains processed results.

modelling/
- Functions for model embedding matrix modifications.

scripts/
- Bash scripts for dataset processing, tokenizer and model training, and model adaptation. Scripts come with SGE scheduler flags.

tokenization/
- Functions for tokenizer adaptation.

training/
- Functions for training dataset pre-processing and model training.

fertility_analysis/
- Fertility evaluation results of selected tokenizers.

add_tokens.py
- Tokenizer adaptation through token addition.

replace_tokens.py
- Tokenizer adaptation through token replacement.

add_embeddings.py
- Model embedding matrix modification through embedding addition.

replace_embeddings.py
- Model embedding matrix modification through embedding replacement.

fertility_evaluation.py
- Script used for tokenizer fertility evaluation on WURA validation sets.

train_model.py
- Model training script.

train_wura_tokenizer.py
- Script used for training language-dedicated tokenizers using the WURA dataset.

requirements.txt
- A file containing a list of Python pip packages.

README.md
- This file :)

Download data:
- Download the WURA dataset and place it in a ./data/wura directory.

To reproduce the tokenizer fertility results, run the following scripts:
- Train language-dedicated tokenizers using scripts/train_wura_tokenizers_opt.qsub.sh.
- Run add_tokens.py and replace_tokens.py to produce adapted tokenizers.
- Specify paths to desired tokenizers and run fertility_evaluation.py.

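As a rough illustration of the metric behind the fertility results, the sketch below computes fertility as the average number of tokens produced per whitespace-separated word. This is a minimal sketch under stated assumptions, not the code in fertility_evaluation.py: the `fertility` function, the word-by-word tokenization, and the toy `char_pair_tokenize` tokenizer are all hypothetical and introduced here only for illustration.

```python
from typing import Callable, Iterable, List


def fertility(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word.

    Assumption: each word is tokenized on its own; the repository's script
    may tokenize whole sentences or define words differently.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        total_words += len(words)
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def char_pair_tokenize(word: str) -> List[str]:
    """Toy stand-in for a real tokenizer: split a word into 2-character pieces."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


if __name__ == "__main__":
    sample = ["abcd ef", "ghi"]
    print(fertility(char_pair_tokenize, sample))
```

Any callable that maps a string to a list of tokens can be passed in place of the toy tokenizer, for example the `tokenize` method of a loaded Hugging Face tokenizer. Lower fertility means a word is split into fewer pieces.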
To reproduce model adaptation results, run the above and the following:
- Run add_embeddings.py and replace_embeddings.py to create models with modified embeddings.
- Run all scripts from scripts/dataset_processing to pre-process, tokenize and group training samples.
- Train models using scripts from scripts/model_training.
- Run scripts/download_evaluation_data_repos.sh to download evaluation datasets.
- Evaluate models:
  - Generate model answers by running scripts in scripts/model_evaluation.
  - Aggregate and compute metrics using scripts from scripts/evaluation_results_processing.
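
To make the embedding addition step more concrete, the sketch below appends one embedding row per newly added token. It assumes that each new row is initialized as the mean of the embeddings of the pieces the original tokenizer split that token into; this is one common choice and not necessarily what add_embeddings.py does. The `add_embeddings` function and its arguments are hypothetical, and plain Python lists stand in for the model's embedding matrix.

```python
from typing import Dict, List, Sequence


def add_embeddings(
    embeddings: List[List[float]],
    vocab: Dict[str, int],
    new_tokens: Dict[str, Sequence[str]],
) -> None:
    """Append one embedding row per new token, modifying the inputs in place.

    `new_tokens` maps each new token to the old-vocabulary pieces it was
    previously split into. Assumption: the new row is the mean of those
    pieces' embeddings.
    """
    for token, pieces in new_tokens.items():
        if token in vocab:
            continue  # already present, nothing to add
        rows = [embeddings[vocab[piece]] for piece in pieces]
        if not rows:
            raise ValueError(f"no source pieces given for token {token!r}")
        dim = len(rows[0])
        mean = [sum(row[d] for row in rows) / len(rows) for d in range(dim)]
        vocab[token] = len(embeddings)  # new token takes the next free index
        embeddings.append(mean)


if __name__ == "__main__":
    vocab = {"a": 0, "b": 1}
    embeddings = [[1.0, 0.0], [3.0, 2.0]]
    add_embeddings(embeddings, vocab, {"ab": ["a", "b"]})
    print(vocab["ab"], embeddings[vocab["ab"]])
```

Embedding replacement would follow the same pattern but overwrite the row of a removed token instead of appending a new one.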