Possibility of using regex for generating geminates #105

kudanai · 2021-12-29T15:44:53Z

I'm trying to write post processing rules for div-Thaa over on this fork.

The rules dictate that, in some situations, certain occurrences of the grapheme އް cause the next consonant to be geminated. I can't figure out whether this can be done with a single regex rule using a match group.

For the time being I've added the cases as individual rules here. The rules in question are the ones with <AS> in them.

TL;DR;

Is it possible to write gemination rules in regex?
If not, do the rules as written here make sense?

Apologies in advance if this is a redundant question and I missed something in the docs.


dmort27 · 2021-12-29T16:15:06Z

Here are some comments:

There are two ways of writing geminate consonants in the IPA:

Doubling the consonant (unless it is an affricate, in which case the plosive is doubled)
Using the long mark (ː).
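As a rough illustration of the two notations just listed, here is a minimal Python sketch. The helper names `double` and `long_mark` are hypothetical and not part of Epitran or PanPhon, and the sketch assumes an affricate is written as plosive + tie bar (U+0361) + fricative.

```python
TIE_BAR = "\u0361"  # combining double inverted breve, as in t͡ʃ

def double(segment: str) -> str:
    """Geminate by doubling; for an affricate, double only the plosive."""
    # Assumption: an affricate is exactly plosive + tie bar + fricative.
    if len(segment) == 3 and segment[1] == TIE_BAR:
        return segment[0] + segment
    return segment + segment

def long_mark(segment: str) -> str:
    """Geminate by appending the IPA long mark."""
    return segment + "ː"
```

Under these assumptions, `double("z")` gives `zz`, while `double("t͡ʃ")` gives `tt͡ʃ` rather than `t͡ʃt͡ʃ`.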

For reasons of parseability with PanPhon, the second solution is the approved Epitran solution (so <އް> could simply be mapped to /ː/). If you need doubling instead, you can achieve this with a regular expression and named groups, e.g.:

(?P<seg>(p|t|k)): -> \g<seg>\g<seg> / _

will change p:, t:, and k: to pp, tt, and kk.
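For comparison, the same named-group substitution can be tried directly with Python's standard `re` module. This is only a sketch of the regex mechanics, not of how Epitran applies its rule files; the function name `double_geminates` is hypothetical, and the ASCII colon stands in for the long mark as in the rule above.

```python
import re

# Match p, t, or k followed by a colon; capture the consonant as "seg".
GEMINATE = re.compile(r"(?P<seg>p|t|k):")

def double_geminates(ipa: str) -> str:
    """Replace each matched consonant + colon with the consonant doubled."""
    return GEMINATE.sub(r"\g<seg>\g<seg>", ipa)
```

Consonants outside the group (for example `m:`) are left untouched, so the alternation would need to be extended for a full inventory.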

The prefixed \s? in your rules is not doing any good since it doesn't rule anything out—a substring either is or is not preceded by a space. In any case, you should be using Epitran with already tokenized text rather than passages with internal whitespace. Otherwise, your rules look fine.

kudanai · 2021-12-29T17:32:57Z

Thank you for the comments.

First, on the \s: whitespace is a bit tricky in this script. The effect of the next consonant on އް can extend beyond the token boundary. What this probably means is that I need a better tokeniser than the ones currently available. I will investigate this further and sort it out before I request a merge.

On the geminates, I first tried the rule below, which did not seem to work (I'm not sure whether I'm writing the rule wrong or the \g syntax just isn't working for me). So, taking your suggestion to use the long mark, I rewrote it using swap groups.

## This did not work
<AS>(?P<seg>::consonant::) -> \g<seg>\g<seg> / _

## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋa\g<seg>\g<seg>afuŋɡe

but

## This works
(?P<sw1><AS>)\s?(?P<sw2>::consonant::) -> 0 / _
<AS> -> : / (::consonant::) _ (::vowel::)

## outputs
ނުގެންނަން މުވައްޒަފުންގެ
nuɡennammuʋaz:afuŋɡe
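The two working rules can be mimicked in plain Python to see the intermediate step. This is a sketch only: it assumes the first rule swaps `<AS>` with the following consonant (dropping an optional space) and the second turns `<AS>` into `:` between a consonant and a vowel. The `CONSONANTS` and `VOWELS` sets are hypothetical stand-ins for `::consonant::` and `::vowel::`, covering just this example.

```python
import re

CONSONANTS = "zmnfʋŋɡ"  # hypothetical stand-in for ::consonant::
VOWELS = "aeiou"        # hypothetical stand-in for ::vowel::

def geminate_with_long_mark(text: str) -> str:
    # Step 1: move <AS> after the following consonant, dropping any space.
    swapped = re.sub(
        rf"(?P<sw1><AS>)\s?(?P<sw2>[{CONSONANTS}])",
        r"\g<sw2>\g<sw1>",
        text,
    )
    # Step 2: <AS> between a consonant and a vowel becomes the length mark.
    return re.sub(rf"(?<=[{CONSONANTS}])<AS>(?=[{VOWELS}])", ":", swapped)
```

With these assumptions, `muʋa<AS>zafuŋɡe` first becomes `muʋaz<AS>afuŋɡe` and then `muʋaz:afuŋɡe`.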

Does this have an impact on affricates? We have two: /d͡ʒ/ and /t͡ʃ/.

Also, I'm hesitant to simply map އް to : because it would complicate the post-processing rules: the language uses a lot of long vowels, and އް can also cause pre-nasalisation or serve as a glottal stop depending on context.
