Arabic diacritics displaced by kashida.plain transforms #257

lueck · 2023-08-13T19:43:33Z

Thanks for your great effort on the kashida feature! I know that
it's still in an experimental state. I'd like to point out some issues.

Here's an MWE with an analytic tool for investigating the input
characters (to help people like me who have to typeset Arabic but
can't read it). It uses \makebox[LENGTH][s]{TESTCASE} to force
kashida elongation on single words for demonstration.

\documentclass{book}

\usepackage{luabidi}
\setRTLmain

\usepackage[ngerman,english,bidi=basic]{babel}[2021/05/16]% version 3.59 or later

\babelprovide[import,main,%
justification=kashida,%
transforms=kashida.plain%
]{arabic}

\babelfont{rm}[Scale=3]{FreeSerif} % {ArabicTypesetting} %

\usepackage{luacode}

\begin{filecontents*}[overwrite]{analyzestring.lua}
-- for a given string make a table about the characters it contains
function analyzestring(s)
   tex.print("\\begin{tabular}{rrrrc}\\\\")
   tex.sprint("bytes", "&unicode10", "&unicode16", "&utf8", "&char", "\\\\\\hline")
   for p, c in utf8.codes(s) do
      -- get the UTF8 byte representation
      if (c < 0x80) then
         byt = string.format("0x%02x", string.byte(utf8.char(c), 1))
         position = string.format("%d", p)
      elseif (c < 0x800) then
         byt = string.format("0x%04x", string.byte(utf8.char(c), 1) * 0x100 + string.byte(utf8.char(c), 2))
         position = string.format("%d..%d", p+1, p)
      elseif (c < 0x10000) then
         byt = string.format("0x%06x", string.byte(utf8.char(c), 1) * 0x10000 + string.byte(utf8.char(c), 2) * 0x100 + string.byte(utf8.char(c), 3))
         position = string.format("%d..%d", p+2, p)
      else
         byt = string.format("0x%08x", string.byte(utf8.char(c), 1) * 0x1000000 + string.byte(utf8.char(c), 2) * 0x10000 + string.byte(utf8.char(c), 3) * 0x100 + string.byte(utf8.char(c), 4))
         position = string.format("%d..%d", p+3, p)
      end
      tex.sprint(position, "&",
                 c, "&",
                 string.format("U+%04x", c), "&",
                 byt, "&",
                 utf8.char(c)
                 )
      tex.print("\\\\")
   end
   tex.print("\\end{tabular}")
end
\end{filecontents*}
\directlua{require "analyzestring.lua"}


% output a test case with \case{NUMBER}{WORD}{EXPECTATION}{DESCRIPTION}
\newcommand*{\case}[4]{%
  \noindent #1 %
  \directlua{Babel.arabic.justify_enabled=false}%
  #2 %
  -- #3 %
  \directlua{Babel.arabic.justify_enabled=true}%
  \hfill%
  \fbox{\makebox[5em][s]{#2}}%
  % table about the characters in the test case
  \\{\LTR\tiny%
    #4\\
    \directlua{analyzestring("\luaescapestring{#2}")}%
  }%
  \vskip 10mm%
}

\begin{document}

\case{1}{تَثَنَّى}{تَـثَـنَّى}{Kashidas should be inserted \emph{after} non-spacing marks like ARABIC FATHA, U+064e.}

\case{2}{تَـثَـنَّى}{تَــثَــنَّى}{Existing Kashidas should be further elongated.}

\case{3}{تَــــــثَـنَّى}{تَــثَــنَّى}{Existing Kashidas should be homogenized.}

\case{4}{بِأَبي}{بِـأَبي}{There should be no Kashida at end. But for Arabic Typesetting, there is.}

\end{document}

TEX engine: LuaHBTeX, Version 1.17.0 (TeX Live 2023)

babel version: 2023/08/09 v3.92.22182 The Babel package (from github)

The output per test case is (from right to left): number, input, expectation, result (box).

In test case 1 you can see that the diacritics (vowels) are
displaced horizontally from the letters (consonants) by
kashidas. As far as I know, the FATHA (U+064e) should stay above
the consonant instead of being pushed to the left. The kashida
should be inserted after all the diacritics that belong to a
consonant.
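
The placement described above can be sketched at the string level, independently of babel's node-based implementation. This is only a rough model: the helper name `kashida_after_marks` is hypothetical, the two letter classes are taken from the kashida.plain rule quoted in this issue, and the range U+064B..U+0652 is assumed to cover the diacritics of interest. It inserts one tatweel (U+0640) after each consonant and its trailing marks and does not model special cases such as lam-alef.

```latex
\begin{luacode*}
-- Hypothetical helper, not part of babel: builds an "expected" string with
-- one tatweel after each joining consonant *and its diacritics*.
local function set_of(s)
  local t = {}
  for _, c in utf8.codes(s) do t[c] = true end
  return t
end

-- Letter classes as they appear in the kashida.plain rule.
local LEFT  = set_of("يئهشسقفغعضصنمكلظطخحجثتب")
local RIGHT = set_of("يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة")

-- Assumption: FATHATAN (U+064B) .. SUKUN (U+0652) are the marks to skip.
local function is_mark(c) return c >= 0x064B and c <= 0x0652 end

function kashida_after_marks(s)
  local cps = {}
  for _, c in utf8.codes(s) do cps[#cps + 1] = c end
  local out, i = {}, 1
  while i <= #cps do
    local base = cps[i]
    out[#out + 1] = utf8.char(base)
    -- copy every mark that belongs to this base character
    local j = i + 1
    while j <= #cps and is_mark(cps[j]) do
      out[#out + 1] = utf8.char(cps[j])
      j = j + 1
    end
    -- only now, after the marks, add the tatweel
    if LEFT[base] and cps[j] and RIGHT[cps[j]] then
      out[#out + 1] = utf8.char(0x0640)
    end
    i = j
  end
  return table.concat(out)
end
\end{luacode*}
```

With the luacode package loaded as in the MWE, it could be called as `\directlua{tex.sprint(kashida_after_marks("\luaescapestring{WORD}"))}` to generate the expectation column instead of typing it by hand.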

I tried to fix this by changing

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[]*[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

to

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

where the second () is moved behind the regex for the diacritics
[]*. But this makes the diacritic disappear when a kashida is
inserted behind the consonant the FATHA refers to.

I also tried special rules for consonant+vowel combinations like

kashida.plain.2.0 = { ()ثَ()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

But again, the effect is that the FATHA disappears. So I guess we
need 2-letter and 3-letter rules to get this right, somewhat like
below, but I don't know the syntax for 2- and 3-letter rules.

; 3-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][][]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
; 2-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
; 1-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

In test case 2 you can see that kashidas that exist in the input
prevent further elongation. I think this is like issue #243, "Allow manually inserted tatweel to stretch with Arabic kashida".

This can be fixed by adding the kashida into the first regex character class:

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثبـ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

If you want to make the kashida insertion homogeneous, as
@amarakon would like to see it in #243 (Allow manually inserted tatweel to stretch with Arabic kashida), we could drop it in a
1-letter rule (the same way that makes the diacritic go away in my
attempts):

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثب][ـ]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

Or try this, which makes diacritics go away (too bad!) and kashidas homogeneous:

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثب][]*[ـ]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

For the font ArabicTypesetting, I get kashidas at the end of a word for some letters.

Could you point me to documentation of the transformation rules?


jbezos · 2023-08-14T08:07:22Z

I'm facing several technical issues/limitations and I'm a bit stuck. See for example https://tex.stackexchange.com/questions/686767/process-hbox-with-luaotfload and also harfbuzz/harfbuzz#3762 (comment). Some others are related to the fonts, which sometimes don't seem to take into account the kashida (with clearly misplaced diacritics).

I’ll read carefully your looong report (I wish they were all like that 🙂). There are some explanations here:

The horizontal placement of diacritics is under the direct control of babel, and I was working on an option to set it (start, center, end).

lueck · 2023-08-14T09:41:24Z

https://latex3.github.io/babel/guides/non-standard-hyphenation-with-luatex.html

Thanks! That enables me to make more informed experiments.

With the following transformation rules, the horizontal displacement of diacritics is solved using 1-letter rules:

; insert kashida into pattern with certain consonant combinations
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.1.1 =   { kashida = 500 }
; one diacritic mark: insert kashida behind it
kashida.plain.2.0 = { [يئهشسقفغعضصنمكلظطخحجثتب]()[ًٍَُِّ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.2.1 =   { kashida = 500 }
; two diacritic marks: insert kashida behind them
kashida.plain.3.0 = { [يئهشسقفغعضصنمكلظطخحجثتب][ًٍَُِّ]()[ًٍَُِّ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.3.1 =   { kashida = 500 }
kashida.plain.4.0 = { ()ل()[ًٍَُِّ]*[اأإآ] }
kashida.plain.4.1 =   { kashida = 0 }

But in the output, the kashida is displaced vertically:

lueck · 2023-08-14T09:55:57Z

... so the y-axis offset that results from lifting diacritics should be reset before inserting the kashida (and maybe restored afterwards). I can't guarantee that this is a TeX-like formulation of a fix...

Diacritics (vowels) should keep their horizontal position above the consonant they refer to. The old kashida insertion rule displaced them horizontally, because kashidas were inserted between the diacritic and the consonant they belong to. See latex3#257. These rules work for one or two diacritics on a consonant.

In order to make the change as small as possible, the reset is only performed if the preceding node is a character and not a diacritic mark. See latex3#257.

lueck · 2023-08-15T21:11:46Z

With the changes from my kashida-after-diacritics branch, I now get a result for my case 1, which I am happy with:

kashidas inserted after diacritics
kashidas at the baseline without y or x shift.

If you would rather keep the kashida.plain transform as it is, I would suggest making this an alternative transform called kashida.after.diacritics.

Should I open a PR?

lueck · 2023-08-15T22:30:44Z

Hm, with other fonts I still get bad results where the kashida is shifted above the baseline for some character combinations.

jbezos · 2023-08-18T15:10:19Z

I'm somewhat busy right now. Allow me a week or so.

…itics When a kashida is inserted behind a diacritic, the diacritic's glyph is not the node which should be used for creating the kashida's node (by copying), but the character's node. Copying the diacritic's node results in x and y offsets, which displace the kashida. By copying the node of the character (on which the diacritic is placed) or by copying a ghost node we get correct x and y offsets to adapt the kashida smoothly to the character. This feature can be enabled by `\directlua{Babel.arabic.kashida_after_diacritics=true}` or disabled by setting it to `false`. This fixes issue latex3#257. But setting `Babel.arabic.kashida_after_diacritics` from the name of the transformation rules would be great.
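
The idea in that commit message can be sketched roughly as follows. This is not the code in babel.dtx: the names `is_tashkil` and `kashida_template` are hypothetical, and the sketch assumes that glyph nodes still carry Unicode code points, which is not guaranteed once the font loader has remapped glyphs (for example to PUA slots). It only shows the principle of walking back from a diacritic to its base character and copying that node instead.

```latex
\begin{luacode*}
-- Hypothetical sketch, not babel's implementation.
local GLYPH = node.id("glyph")

-- Assumption: U+064B..U+0652 are the diacritics (tashkil) to skip.
local function is_tashkil(n)
  return n.id == GLYPH and n.char >= 0x064B and n.char <= 0x0652
end

-- Given the node after which a kashida is wanted, return a glyph node
-- to use as the kashida, copied from the base character rather than
-- from a diacritic, so that it inherits the base's x and y offsets.
function kashida_template(n)
  while n and is_tashkil(n) do
    n = n.prev                 -- step back over the marks
  end
  if not n or n.id ~= GLYPH then
    return nil                 -- no base glyph found
  end
  local k = node.copy(n)       -- same font, attributes and offsets as the base
  k.char = 0x0640              -- ARABIC TATWEEL
  return k
end
\end{luacode*}
```

Copying the base glyph rather than resetting `xoffset` and `yoffset` by hand keeps whatever positioning the font applied to the base character, which matches the behaviour described in the commit message.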

lueck · 2023-08-21T16:18:03Z

@jbezos No problem! Sorry for mixing in #243 and writing such a cumulative issue. Also my \case{4}... should be another issue, see #258.

I managed to get very fine results in the meantime.

In order to leave kashida.plain as it is, I made another branch where I added justification rules named kashida.afterdiacritics.plain. I also squashed my suggested changes to babel.dtx into one commit in order to make it more comprehensible.

By default, the logic of kashida insertion is unchanged. Only with \directlua{Babel.arabic.kashida_after_diacritics = true} the creation of the node for a kashida is changed, so that it can be placed correctly.
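
A minimal sketch of how this would be switched on, assuming the branch described above is installed. The transform name kashida.afterdiacritics.plain and the Lua flag exist only on that branch, not necessarily in a released babel, and setting the flag in the document body (as the MWE does for justify_enabled) is a cautious assumption so that the Babel.arabic table already exists.

```latex
\documentclass{book}

\usepackage[english,bidi=basic]{babel}

% Transform name as used on the branch discussed in this issue.
\babelprovide[import,main,%
justification=kashida,%
transforms=kashida.afterdiacritics.plain%
]{arabic}

\babelfont{rm}[Scale=3]{FreeSerif}

\begin{document}

% Create kashida nodes from the base character instead of the diacritic.
\directlua{Babel.arabic.kashida_after_diacritics = true}

% Force elongation on a single word, as in the MWE.
\fbox{\makebox[5em][s]{تَثَنَّى}}

\end{document}
```

Setting the flag back to `false` should restore the unchanged kashida logic, which makes it easy to compare both outputs in one document.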

This is the result I get with Babel.arabic.kashida_after_diacritics = false:

And this is the result I get with Babel.arabic.kashida_after_diacritics = true. If you look very carefully, you'll notice that the kashida is not always at the same y offset, which is a feature of this font.

New cases 5 and 6:

\case{5}{للشُهْبِ}{لـلــشُـهْبِ}{Kashida 3 displaced}

\case{6}{تَأَصَّلَ}{تَـأَصَّـلَ}{Kashidas with bad x and y offset.}

jbezos · 2023-08-29T14:56:34Z

With the changes from my kashida-after-diacritics branch,

Thanks. I’m reading the code and there is a point that can mislead and should be clarified. FreeSerif doesn’t use the PUA; luaotfload does, mainly as a trick to access glyphs without a Unicode code point. Relying on what luaotfload does internally isn’t safe.
The problem is that in the justification step, the node list often contains these PUA codes, the exact meaning of which is often unknown. This is one of the technical issues/limitations I was talking about.

jbezos · 2023-08-30T07:37:20Z

I’ll work on some of your ideas. The new transform can be useful in ‘plain’ fonts, not involving ligatures, but with the latter it’s still an unsolved issue, except by creating rules specific to a font. For the JALT table I devised a hack based on parsing twice some frequent cases, with the normal form and the elongated one, but it’s basically a proof of concept that can’t go very far (and it only works with Sakkal Majalla, and not quite – again diacritics is the problem).

The vertical positioning of tashkil is not (usually) fixed, and they are shifted by the font depending on the character. I was working on something similar to the JALT variants to catch the correct yoffset (and xoffset, actually) with kashida, but it seems some (many) fonts don’t bother to deal with kashida and they are clearly misplaced (kasrah is usually too low).

jbezos · 2023-09-21T17:18:59Z

Your transforms are now available (in version 3.94) under the name kashida.base:

https://latex3.github.io/babel/news/whats-new-in-babel-3.94.html#new-transform-for-kashida
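
Presumably the new transform is selected in the same way as kashida.plain in the MWE at the top of this issue. The following fragment is a sketch under that assumption; the linked news page is the authority on the exact options.

```latex
% Requires babel 3.94 or later. Assumption: kashida.base is selected
% like kashida.plain in the MWE above.
\babelprovide[import,main,%
justification=kashida,%
transforms=kashida.base%
]{arabic}
```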

jbezos pushed a commit that referenced this issue Aug 31, 2023

Start working on #257. Hebrew encodings in ini.

7a5f5f4
