GitHub - dougalg/bunpou: FST morphological segmenter of Japanese adjectives and verbs using XFST

This repository has been archived by the owner on Sep 15, 2024. It is now read-only.



Repository contents:
- Converter
- lexicon
- rules
- Makefile
- README
- compile_fst.txt
- compile_fst_unclean.txt
- compile_lex.txt
- full.fst
- full_lex.txt
- full_unclean.fst
- gpl.txt
- lex.fst
- rules.fst


Bunpou Morphological Segmenter for Japanese
v0.2.0

WARNING: This is not a stable version and is not currently intended for production use.

This is a morphological segmenter for use with Xerox's XFST finite-state tool (fsmbook.com). It can be incorporated into C/C++/Objective-C programs with their XCFSM library, or into Python applications using the Python library. Please see their website for more information.

The words for the lexicon were all taken from the JMdict project from the Electronic Dictionary Research and Development Group (http://www.csse.monash.edu.au/~jwb/jmdict.html).

The Converter directory contains an FST for converting between hiragana and romaji.

Installation:

The FSTs should be usable as-is with XFST or any of the APIs available on the fsmbook.com website.

If you need to compile from source, use the included Makefile. Assuming you have XFST and LEXC installed, you need only run "make full.fst" to compile all three FSTs.

Usage Notes:

1) Romaji input/output:

The converter outputs romaji in the Kunrei-shiki system (http://en.wikipedia.org/wiki/Kunrei-shiki_Rōmaji), and romaji input to the converter and to the full and lexical FSTs should use the same system. This makes the morpho-phonological rules easier to apply.

2) Kanji input/output

You should be able to input pure romaji, or a mix of kanji and romaji, into full.fst and receive the same results. However, a word containing more than one kanji character must be entered with all of its kanji characters.
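
If your romaji is in Hepburn spelling, it needs to be normalized to Kunrei-shiki (note 1 above) before it is passed to the FSTs. The following Python sketch shows one way to do that. It is not part of this repository: the function name hepburn_to_kunrei and the replacement table are assumptions made for illustration, and the table covers only the common consonant differences (no macrons, no "n'" handling).

```python
# Hypothetical helper, not part of bunpou: normalize Hepburn romaji to
# Kunrei-shiki using ordered string replacements.
_HEPBURN_TO_KUNREI = [
    ("sha", "sya"), ("shu", "syu"), ("sho", "syo"), ("shi", "si"),
    ("cha", "tya"), ("chu", "tyu"), ("cho", "tyo"), ("chi", "ti"),
    ("tsu", "tu"), ("fu", "hu"),
    ("ja", "zya"), ("ju", "zyu"), ("jo", "zyo"), ("ji", "zi"),
]


def hepburn_to_kunrei(text: str) -> str:
    """Return `text` with Hepburn-only spellings rewritten as Kunrei-shiki."""
    text = text.lower()
    for hepburn, kunrei in _HEPBURN_TO_KUNREI:
        text = text.replace(hepburn, kunrei)
    return text


if __name__ == "__main__":
    for word in ("tabemashita", "tsukau", "jisho", "matcha"):
        print(word, "->", hepburn_to_kunrei(word))
```

Geminate spellings such as "ssh" and "tch" come out correctly without special cases, because only the second consonant is rewritten ("matcha" becomes "mattya").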

To Do:
- Predictive
- Suggest corrections to mis-spellings
- Wider range of converter options

UPDATES:

2011/11/19
- Added "=" symbol to connect semantic markers for portmanteau morphemes
- Added full_unclean.fst, which is the same as full.fst but without the cleanup rules. Passing in a root plus a set of semantic suffix tags returns a *segmented* word, rather than the unsegmented form that full.fst would provide

2010/02/26
- Added korareru and conjugations to irregular verb lexicon
- Fixed problems with kr_converter.fst where it would convert しゃ to siゃ rather than to sya
- Fixed the inverse issues with rk_converter.fst
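
The しゃ problem noted in the 2010/02/26 update is typical of a converter that maps kana one character at a time. The Python sketch below illustrates that failure and a longest-match fix. It is not the XFST source of kr_converter.fst: the tables are a small illustrative subset, and the function names are assumptions.

```python
# Illustrative subset of a hiragana -> Kunrei-shiki table (not the
# repository's data). Two-character sequences are kept separately so
# they can be matched before single characters.
DIGRAPHS = {
    "しゃ": "sya", "しゅ": "syu", "しょ": "syo",
    "ちゃ": "tya", "ちゅ": "tyu", "ちょ": "tyo",
    "じゃ": "zya", "じゅ": "zyu", "じょ": "zyo",
}
MONOGRAPHS = {
    "し": "si", "ち": "ti", "つ": "tu", "ふ": "hu", "じ": "zi",
    "か": "ka", "い": "i", "ん": "n",
}


def naive_kana_to_kunrei(text: str) -> str:
    """Character-by-character conversion; leaves small ゃ/ゅ/ょ unconverted."""
    return "".join(MONOGRAPHS.get(ch, ch) for ch in text)


def kana_to_kunrei(text: str) -> str:
    """Longest-match conversion: try a two-character sequence first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            ch = text[i]
            # Characters missing from the table are passed through unchanged.
            out.append(MONOGRAPHS.get(ch, ch))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(naive_kana_to_kunrei("しゃ"))  # siゃ
    print(kana_to_kunrei("しゃ"))        # sya
```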