The top-level folders and their purposes are as follows:
- `ak`, `akt`: Python modules constituting the language class modules for Akkadian normalization and transliteration, respectively. As explained in the documentation, the user should ultimately move these to the `/lang` subfolder of their spaCy installation.
- `ak_basic_model`, `ak_norm_model`, `akt_trans_model`: folders containing everything needed for training the language models. `ak_basic_model` is for the basic normalized model trained on SAA 1 and 5 and discussed in Ong and Gordin 2023 (to appear). The other two are for the expanded models trained on normalized and transliterated files, respectively.
- `anzu`, `barutu`, `riao_saa01`, `rinap4`, etc.: these contain the conllu files used for training the language models, organized by Oracc corpus. Some also contain the base text files in normalization and/or the `.spacy` binary files that result from converting the conllu files. They generally also contain the helper scripts `check_conllu.py` and `renumber_conllu.py`.
- `syntax_sentences`: similar to the above, except that it holds artificial training examples, either made completely by hand or modeled on actual training sentences.
- `utils`: contains helper scripts and data files needed for the pipeline. These include scripts for scraping and formatting Oracc corpora in either normalization or transliteration; scripts for merging the helper dictionaries and attribute ruler file that already accompany the models/treebank with those generated from whatever new Oracc corpus you want to integrate; and scripts for converting completely processed, normalized conllu files into their transliterated equivalents. The last of these are found under the `/merge` subfolder. There is also a PDF file listing the morphological feature strings I generally copy and paste when manually annotating morphology in the conllu files within Inception.
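
The step of placing `ak` and `akt` in the `/lang` subfolder of a spaCy installation can be sketched as follows. This is only an illustration, not a script shipped with the repository: the function names `find_spacy_lang_dir` and `install_language_modules` are hypothetical, and the sketch copies the folders rather than moving them so that the repository stays intact.

```python
import importlib.util
import shutil
from pathlib import Path


def find_spacy_lang_dir():
    """Return spaCy's lang/ directory, or None if spaCy is not installed.

    Hypothetical helper: it locates the package on disk without importing it.
    """
    spec = importlib.util.find_spec("spacy")
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at spacy/__init__.py; lang/ sits next to it.
    return Path(spec.origin).parent / "lang"


def install_language_modules(repo_root, lang_dir, names=("ak", "akt")):
    """Copy each language-class folder from repo_root into lang_dir.

    Hypothetical helper: assumes the folders named in `names` sit at the
    top level of the repository, as described in the list above.
    """
    repo_root, lang_dir = Path(repo_root), Path(lang_dir)
    installed = []
    for name in names:
        src = repo_root / name
        if not src.is_dir():
            raise FileNotFoundError(f"missing language module folder: {src}")
        dest = lang_dir / name
        # dirs_exist_ok lets a re-run overwrite an earlier copy (Python 3.8+).
        shutil.copytree(src, dest, dirs_exist_ok=True)
        installed.append(dest)
    return installed
```

To use it, call `find_spacy_lang_dir()` from the environment where spaCy is installed and, if it returns a path, pass that path together with the repository root to `install_language_modules`. Copying into a system-wide installation may require elevated permissions.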