Skip to content

macchi-franco/latin-bert

 
 

Repository files navigation

latin-bert

Latin BERT is a contextual language model for the Latin language, described in more detail in the following:

David Bamman and Patrick J. Burns (2020), Latin BERT: A Contextual Language Model for Classical Philology, ArXiv.

Install

Tested on Python 3.8.12 and 3.7.12.

1.) Create a conda environment (optional):

conda create --name latinbert python=3
conda activate latinbert

2.) Install PyTorch according to your own system requirements (GPU vs. CPU, CUDA version): https://pytorch.org.

3.) Install the remaining libraries:

pip install -r requirements.txt

4.) Install Latin tokenizer models:

python3 -c "from cltk.data.fetch import FetchCorpus; corpus_downloader = FetchCorpus(language='lat');corpus_downloader.import_corpus('lat_models_cltk')"

5.) Download pre-trained BERT model for Latin:

./scripts/download.sh

Minimal example

For a minimal example of how to generate BERT representations for an input sentence, execute the following:

python3 scripts/gen_berts.py --bertPath models/latin_bert/ --tokenizerPath models/subword_tokenizer_latin/latin.subword.encoder > berts.output.txt

This generates BERT representations for two sentences and saves their output with one (token, 768-dimensional final BERT representation) tuple per line. For examples of how to fine-tune Latin BERT for a specific task, see the case studies on POS tagging and WSD.

Data

Latin BERT is pre-trained using data from the following sources.

Source Tokens
Corpus Thomisticum 14.1M
Internet Archive 561.1M
Latin Library 15.8M
Patrologia Latina 29.3M
Perseus 6.5M
Latin Wikipedia 15.8M
Total 642.7M

Texts from Perseus and the Latin Library are drawn from the corpora in the Classical Language Toolkit. Texts are tokenized for sentences and words using Latin-specific tokenizers in CLTK. We learn a Latin-specific WordPiece tokenizer using tensor2tensor from this training data.

Since the texts from the Internet Archive (IA) are the product of noisy OCR, we uniformly upsample all non-IA texts to train on a balance of approximately 50% IA texts and 50% non-IA texts.

Training

We pre-train Latin BERT using tensorflow on a TPU for 1M steps. Training took approximately 5 days on a TPU v2, and cost ~$540 on Google Cloud (at $4.50 per TPU v2 hour). We set the maximum sequence length to 256 WordPiece tokens.

We convert the resulting tensorflow checkpoint into a BERT model that can used by the HuggingFace library using the transformers-cli library. The model in model/latin_bert can be used with the HuggingFace transformers library.

Case studies

Bamman and Burns (2020) illustrates the affordances of Latin BERT with four case studies; here is a quick summary of them.

1. POS Tagging

Latin BERT demonstrates meaningful part-of-speech distinctions in its representations without further task-specific training.

drawing

When trained on POS tagging, Latin BERT achieves a new state of the art on all three Universal Dependency datasets for Latin.

Method Perseus PROIEL ITTB
Latin BERT 94.3 98.2 98.8
Straka et al. (2019) 90.0 97.2 98.4
Smith et al. (2018) 88.7 96.2 98.3
Straka (2018) 87.6 96.8 98.3
Static embeddings 87.6 95.2 97.6
Boros et al. (2018) 85.7 94.6 97.7

2. Text infilling

Latin BERT can be used to generate probabilites for lacunae and other missing words in context. For example, consider the following sentence:

dvces et reges carthaginiensivm hanno et mago qui ___ punico bello cornelium consulem aput liparas ceperunt

The words with the highest probabilities predicted to fill that slot are the following:

Word Probability
secundo 0.451
primo 0.385
tertio 0.093
altero 0.018
primi 0.012
priore 0.012
quarto 0.005
secundi 0.004
primum 0.002
superiore 0.002

(Note "primo" here is a textual critic's emendation.) Latin BERT is able to reconstruct an exact human-judged ementation 33.1% of the time; in 62.2% of cases, the human emendation is in the top 10 predictions.

3. Word sense disambiguation

Latin BERT is able to distinguish between senses of Latin words. We construct a new WSD dataset by mining citations from the Lewis and Short Latin Dictionary, and measure the ability of different methods to distinguish between them given the context of the sentence. In a balanced evaluation (where random choice yields 50% accuracy), Latin BERT outperforms static embeddings by over 8 absolute points.

Method Accuracy
Latin BERT 75.4
Static embeddings 67.3
Random 50.0

4. Contextual nearest neighbors

BERT representations are contextual embeddings, so the same word type (e.g., in) will have a different representation in each sentence in which it is used. While static embeddings like word2vec allow us to find words that are most similar to a given word type, BERT (and other contextual embeddings) allow us to find other words that are most similar to a given word token. For example, we can find tokens in context that are most similar to the representation for in within gallia est omnis divisa in partes tres:

Cosine text citation
0.835 ager romanus primum divisus in partis tris, a quo tribus appellata titiensium ... Varro, Ling.
0.834 in ea regna duodeviginti dividuntur in duas partes. Sol.
0.833 gallia est omnis divisa in partes tres, quarum unam incolunt belgae, aliam ... Caes., BGall.
0.824 is pagus appellabatur tigurinus; nam omnis civitas helvetia in quattuor pagos divisa est. Caes., BGall.
0.820 ea pars, quam africam appellavimus, dividitur in duas provincias, veterem et novam, discretas fossa ... Plin., HN
0.817 eam distinxit in partes quatuor. Erasmus, Ep.
0.812 hereditas plerumque dividitur in duodecim uncias, quae assis appellatione continentur. Justinian, Inst.

The most similar tokens not only capture the specific morphological constraints of this sense of in appearing with a noun in the accusative case (denoting into rather than within) but also broadly capture the more specific subsense of division into parts.

Releases

No releases published

Packages

No packages published

Languages

  • Shell 84.4%
  • Python 15.6%