This repository contains Jupyter notebooks for training and querying word2vec models for Ancient Greek.
You will need to install Gensim to use the notebooks in this repo. Both notebooks attempt to install it in case it is missing:
try:
    from collections.abc import Mapping
    from gensim.models.word2vec import Word2Vec
except:
    !pip install -I gensim
    from collections.abc import Mapping
    from gensim.models.word2vec import Word2Vec
Before running these notebooks, therefore, you should ensure you are working in a virtual environment created with Conda or some alternative.
The notebook query_w2v_model.ipynb
contains code to load an existing model and query it. Once the code has run through successfully, you can query additional terms by imitating the example code.
For example, querying 'ἀρχιερεύς' ('high priest') using model.most_similar('ἀρχιερεύς', topn=10)
generates the following results:
[('ἱερεύς', 0.6741834878921509), # 'priest' is 67% similar
('πατριάρχης', 0.6512661576271057), # 'patriarch' is 65% similar
('διάκονος', 0.6433295011520386), # 'deacon'/'servant' is 64% similar
('ἰούδας', 0.6175312399864197), # 'Jude'/'judas' is 61% similar, etc.
('ἀρχιερωσύνη', 0.6078990697860718),
('χρηματίζω', 0.5876879096031189),
('χειροτονέω', 0.577594518661499),
('ἀαρών', 0.5753183364868164),
('πιλᾶτος', 0.572640061378479),
('ἀναγορεύω', 0.5716601610183716)]
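Outside the notebook, loading and querying a saved model looks roughly like the sketch below. The model filename here is a placeholder, and note that gensim 4.x exposes similarity queries through model.wv, whereas older gensim versions also accepted model.most_similar(...) directly, as in the example above:

from gensim.models.word2vec import Word2Vec

# Placeholder path: substitute the actual model file used by the notebook.
model = Word2Vec.load("greek_w2v.model")

# gensim 4.x: similarity queries live on the KeyedVectors object (model.wv).
for lemma, score in model.wv.most_similar('ἀρχιερεύς', topn=10):
    print(lemma, round(score, 3))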
The notebook build_greek_w2v_model.ipynb
contains code to train your own model, including specifying a data directory and setting hyperparameters such as the vector size and the context window size.
You can also change the neural network type (either skip-gram or continuous bag-of-words). Anecdotally, I find CBOW to work better for lemmatized comparisons. For example, for the term λόγος ('word'/'reason'/'account'/etc.), compare the following results:
# Skip-gram
[('ἀξιομνημόνευτος', 0.6042230129241943),
('ἐφεκτέον', 0.6000658869743347),
('ψευδοδοξία', 0.589237630367279),
('λογοποιία', 0.5858883261680603),
('ἐξεταστικός', 0.5815737247467041),
('ἀκριβολογία', 0.5795413851737976),
('ὑφήγησις', 0.5784241557121277),
('ἀναμφίλεκτος', 0.5765267014503479),
('ἰσοσθένεια', 0.5729403495788574),
('εἰσαγωγικός', 0.5722008347511292)]
# CBOW
[('ἑρμηνεία', 0.5288925766944885),
('ἐξήγησις', 0.5065293312072754),
('θεολογία', 0.4956413805484772),
('διδασκαλία', 0.48970848321914673),
('ὑπόληψις', 0.47546130418777466),
('διήγησις', 0.47397667169570923),
('σκέψις', 0.47282832860946655),
('θεωρία', 0.47240814566612244),
('διαλέγω', 0.4657094478607178),
('φυσιολογία', 0.4602492153644562)]
The skip-gram approach may work better with a higher min_count_input
(10? 100?) to exclude rare lemmas.
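For orientation, here is a minimal sketch of what such a training step can look like. Parameter names follow gensim 4.x, and the values shown are illustrative rather than the notebook's defaults:

from gensim.models.word2vec import Word2Vec, PathLineSentences

# Stream the one-sentence-per-line corpus described below.
sentences = PathLineSentences("data/corpus")

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # dimensionality of the lemma vectors
    window=5,         # context window size
    min_count=5,      # raise this (10? 100?) to exclude rare lemmas, especially for skip-gram
    sg=0,             # 0 = CBOW, 1 = skip-gram
    workers=4,        # training threads
)
model.save("greek_w2v.model")  # placeholder filename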
Theoretically you could use just about any language data in plaintext format as input. The data is contained in the data/corpus
subdirectory, and consists of a set of plaintext files that end with the .txt
extension.
The format for the input data is one sentence per line.
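You can sanity-check this format from Python with gensim's PathLineSentences, which reads every file directly under a directory and splits each line on whitespace (a convenience sketch, not part of the notebooks):

from itertools import islice
from gensim.models.word2vec import PathLineSentences

# Print the first ten lemmas of each of the first three sentences streamed from the corpus.
for sentence in islice(PathLineSentences("data/corpus"), 3):
    print(sentence[:10])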
The data included in this repository is lemmatized, with 19,053,248 lemmas in data/corpus
(check using cat data/corpus/*.txt | wc -w
in the repository root).
The files under data/papyri
include 1,759,488 lemmas (check using find data/papyri -type f -name "*.txt" -exec cat {} + | wc -w
to avoid the 'argument list too long' error).
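If you prefer to run the same count from Python (a convenience sketch, not part of the notebooks):

from pathlib import Path

# Count whitespace-separated lemmas in every .txt file under a directory,
# recursing into subdirectories and avoiding shell argument-length limits.
def count_lemmas(directory):
    return sum(
        len(line.split())
        for path in Path(directory).rglob("*.txt")
        for line in path.open(encoding="utf-8")
    )

print(count_lemmas("data/corpus"))  # expected: 19,053,248
print(count_lemmas("data/papyri"))  # expected: 1,759,488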
This data has been extracted from LemmatizedAncientGreekXML and MALP. Thank you, Giuseppe!
TODO: There are some zero-byte files in the corpus. Presumably, no lemmas were extracted for these files. This should be investigated in order to increase the corpus size.
TODO: Add ngram detection preprocessing option
TODO: Add FastText sub-word model