This repository provides a command-line interface for the Minimal Generalization Learner (Albright & Hayes 2002, 2003, etc.) and slightly modified input files for English past-tense simulations. Send comments or questions to colin-dot-wilson-at-jhu-dot-edu.
The original version of the learner with a GUI interface, files downloaded [06/14/2021] from http://www.mit.edu/~albright/mgl/ and https://linguistics.ucla.edu/people/hayes/RulesVsAnalogy/index.html. MinGenLearner.jar seems to provide the most up-to-date compiled code. Run with: java -jar MinGenLearner.jar
The new version of the learner with a command-line interface. The (only) file added to the original code is src/LearnerCommandline.java. Arguments for the learner are specified with YAML files, see for example english.yaml. Run the English past-tense simulation with 00runme.sh
.
This version comes already compiled (with Java 16). Recompile with: ant -buildfile MinimalGeneralizationLearner.xml
. The one external dependency is SnakeYAML, included here as extern/snakeyaml-1.29.jar.
The new version also has the original GUI interace, run with: java -jar bin/mingenlearn.jar
English past-tense data provided by Albright & Hayes, with various transcriptions.
-
English1 and English2 have the original small and large data sets, respectively. Some erroneous past-tense forms in the original English2 data are listed in English2_errors.txt.
-
English2_unicode has transcriptions that are closer to IPA but adhere to the one-symbol-per-phoneme format required by the learner; these are the input files for the example simulation in
version2/00runme.sh
. Also in this folder are the English data from the SIGMORPHON 2021 Shared Task on Generalization in Morphological Inflection Generation. -
English2_IPA has the data in space-separated IPA transcription, which is incompatible with the learner.
-
English_phonemes.ods compares the transcription systems and provides a feature matrix.
The format of feature (.fea) files is quite strict, as follows:
ASCII<tab>Seg.<tab><long feature names, tab-separated>
<tab><tab><short feature names, tab-separated>
<id_1><tab>segment_1<tab><feature specifications (+1, 0, -1), tab-separated>
<id_2><tab>segment_2<tab><feature specifications (+1, 0, -1), tab-separated>
...
<id_m><tab>ˈ<tab><feature specifications of stress symbol, tab-separated>
<id_n><tab>~<tab><feature specifications of empty symbol, tab-separated>
The 'ASCII' values must be unique integers, but are otherwise arbitrary.
The stress symbol can be omitted if it does not appear in the transcriptions.