Arabic diacritics displaced by kashida.plain transforms #257

Open
lueck opened this issue Aug 13, 2023 · 10 comments

lueck commented Aug 13, 2023

Thanks for your great effort on the kashida feature! I know that
it is still experimental, but I'd like to point out some issues.

Here's an MWE with a small analysis tool for inspecting the input
characters (to help people like me who have to typeset Arabic but
can't read it). It uses \makebox[LENGTH][s]{TESTCASE} to force
kashida elongation on single words for demonstration.

\documentclass{book}

\usepackage{luabidi}
\setRTLmain

\usepackage[ngerman,english,bidi=basic]{babel}[2021/05/16]% version 3.59 or later

\babelprovide[import,main,%
justification=kashida,%
transforms=kashida.plain%
]{arabic}

\babelfont{rm}[Scale=3]{FreeSerif} % {ArabicTypesetting} %

\usepackage{luacode}

\begin{filecontents*}[overwrite]{analyzestring.lua}
-- for a given string make a table about the characters it contains
function analyzestring(s)
   tex.print("\\begin{tabular}{rrrrc}\\\\")
   tex.sprint("bytes", "&unicode10", "&unicode16", "&utf8", "&char", "\\\\\\hline")
   for p, c in utf8.codes(s) do
      -- get the UTF8 byte representation
      if (c < 0x80) then
         byt = string.format("0x%02x", string.byte(utf8.char(c), 1))
         position = string.format("%d", p)
      elseif (c < 0x800) then
         byt = string.format("0x%04x", string.byte(utf8.char(c), 1) * 0x100 + string.byte(utf8.char(c), 2))
         position = string.format("%d..%d", p+1, p)
      elseif (c < 0x10000) then
         byt = string.format("0x%06x", string.byte(utf8.char(c), 1) * 0x10000 + string.byte(utf8.char(c), 2) * 0x100 + string.byte(utf8.char(c), 3))
         position = string.format("%d..%d", p+2, p)
      else
         byt = string.format("0x%08x", string.byte(utf8.char(c), 1) * 0x1000000 + string.byte(utf8.char(c), 2) * 0x10000 + string.byte(utf8.char(c), 3) * 0x100 + string.byte(utf8.char(c), 4))
         position = string.format("%d..%d", p+3, p)
      end
      tex.sprint(position, "&",
                 c, "&",
                 string.format("U+%04x", c), "&",
                 byt, "&",
                 utf8.char(c)
                 )
      tex.print("\\\\")
   end
   tex.print("\\end{tabular}")
end
\end{filecontents*}
\directlua{require "analyzestring.lua"}


% output a test case with \case{NUMBER}{WORD}{EXPECTATION}{DESCRIPTION}
\newcommand*{\case}[4]{%
  \noindent #1 %
  \directlua{Babel.arabic.justify_enabled=false}%
  #2 %
  -- #3 %
  \directlua{Babel.arabic.justify_enabled=true}%
  \hfill%
  \fbox{\makebox[5em][s]{#2}}%
  % table about the characters in the test case
  \\{\LTR\tiny%
    #4\\
    \directlua{analyzestring("\luaescapestring{#2}")}%
  }%
  \vskip 10mm%
}

\begin{document}

\case{1}{تَثَنَّى}{تَـثَـنَّى}{Kashidas should be inserted \emph{after} non-spacing marks like ARABIC FATHA, U+064e.}

\case{2}{تَـثَـنَّى}{تَــثَــنَّى}{Existing Kashidas should be further elongated.}

\case{3}{تَــــــثَـنَّى}{تَــثَــنَّى}{Existing Kashidas should be homogenized.}

\case{4}{بِأَبي}{بِـأَبي}{There should be no Kashida at end. But for Arabic Typesetting, there is.}

\end{document}

TEX engine: LuaHBTeX, Version 1.17.0 (TeX Live 2023)

babel version: 2023/08/09 v3.92.22182 The Babel package (from github)

[Image: false-kashida-0]

The output per test case is (from right to left): number, input, expectation, result (box).

  1. In test case 1 you can see that the diacritics (vowels) are
    displaced horizontally from their letters (consonants) by
    kashidas. As far as I know, the FATHA (U+064E) should stay above
    its consonant instead of being pushed to the left. The kashida
    should be inserted after all the diacritics that belong to a
    consonant.
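To make the intended placement concrete, here is a minimal sketch in plain Lua 5.3. It works on strings, not on babel's node lists, so it only models where the kashida should go. The function name `kashida_after_marks` and the mark range U+064B..U+0652 are assumptions made for illustration; the two letter classes are copied from the kashida.plain rule quoted in this report. It inserts a kashida at every candidate position, whereas babel only elongates where the line needs stretching, and it does not model the lam-alef exception.

```lua
-- Sketch only: plain Lua 5.3 string processing, not babel's node-based code.
local TATWEEL = utf8.char(0x0640)

-- Build a lookup set of code points from a UTF-8 string.
local function set_of(s)
  local t = {}
  for _, cp in utf8.codes(s) do t[cp] = true end
  return t
end

-- Letter classes copied from the kashida.plain rule in this report.
local JOINS_LEFT = set_of("يئهشسقفغعضصنمكلظطخحجثتب")
local FOLLOWER = set_of("يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة")

-- Assumption: only the harakat U+064B..U+0652 are treated as marks.
local function is_mark(cp) return cp >= 0x064B and cp <= 0x0652 end

-- Insert one tatweel after each joining letter *and its marks*,
-- whenever the next base letter belongs to FOLLOWER.
function kashida_after_marks(s)
  local cps = {}
  for _, cp in utf8.codes(s) do cps[#cps + 1] = cp end
  local out, i = {}, 1
  while i <= #cps do
    local base = cps[i]
    out[#out + 1] = utf8.char(base)
    -- Copy all marks attached to this base letter first.
    local j = i + 1
    while j <= #cps and is_mark(cps[j]) do
      out[#out + 1] = utf8.char(cps[j])
      j = j + 1
    end
    -- Only now, after the marks, is the kashida allowed.
    if JOINS_LEFT[base] and cps[j] and FOLLOWER[cps[j]] then
      out[#out + 1] = TATWEEL
    end
    i = j
  end
  return table.concat(out)
end
```

Applied to the input of test case 1, this yields the string given there as the expectation: the FATHA stays directly after its consonant and the tatweel follows it.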

I tried to fix this by changing

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[]*[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

to

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

where the second () is moved behind the character class for the
diacritics, []*. But this makes the diacritic disappear whenever a
kashida is inserted after the consonant the FATHA belongs to.

I also tried special rules for consonant+vowel combinations like

kashida.plain.2.0 = { ()ثَ()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

But again, the effect is that the FATHA disappears. So I guess we
need 2-letter and 3-letter rules to get this right, something like the
rules below, but I don't know the syntax for 2- and 3-letter rules.

; 3-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][][]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
; 2-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
; 1-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
  2. In test case 2 you can see that kashidas already present in the input
    prevent further elongation. I think this is the same problem as "Allow manually inserted tatweel to stretch with Arabic kashida" (#243).

This can be fixed by adding the kashida (tatweel, U+0640) to the first character class:

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثبـ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
  3. If you want kashida insertion to be homogeneous, as
    @amarakon asks for in "Allow manually inserted tatweel to stretch with Arabic kashida" (#243), we could drop
    the existing kashidas in a 1-letter rule (the same mechanism that makes the
    diacritic disappear in my attempts):
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثب][ـ]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

Or try this, which makes the kashidas homogeneous but also makes the diacritics disappear (too bad!):

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثب][]*[ـ]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
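The homogenizing idea from test case 3 can also be pictured as a preprocessing step: reduce every run of manually typed tatweels to a single one, so that the justification step is free to distribute the elongation evenly. The sketch below does this on a plain UTF-8 string in Lua 5.3. It is not how the transform rule above works internally, and the function name `collapse_tatweel` is hypothetical.

```lua
-- Sketch only: string-level model, not babel's transform engine.
local TATWEEL = 0x0640

-- Replace every run of consecutive tatweels with a single tatweel.
function collapse_tatweel(s)
  local out, prev = {}, nil
  for _, cp in utf8.codes(s) do
    -- Skip a tatweel that directly follows another tatweel.
    if not (cp == TATWEEL and prev == TATWEEL) then
      out[#out + 1] = utf8.char(cp)
    end
    prev = cp
  end
  return table.concat(out)
end
```

Unlike the rule variants shown here, this leaves the diacritics untouched, because it never replaces anything except surplus tatweels.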
  4. With the ArabicTypesetting font, I get kashidas at the end of a word for some letters.

Could you point me to documentation of the transformation rules?

Contributor

jbezos commented Aug 14, 2023

I'm facing several technical issues/limitations and I'm a bit stuck. See for example https://tex.stackexchange.com/questions/686767/process-hbox-with-luaotfload and also harfbuzz/harfbuzz#3762 (comment). Some others are related to the fonts, which sometimes don't seem to take into account the kashida (with clearly misplaced diacritics).

I’ll read your looong report carefully (I wish they were all like that 🙂). There are some explanations here:

The horizontal placement of diacritics is under the direct control of babel, and I was working on an option to set it (start, center, end).

Author

lueck commented Aug 14, 2023

https://latex3.github.io/babel/guides/non-standard-hyphenation-with-luatex.html

Thanks! That enables me to make more informed experiments.

With the following transformation rules, the horizontal displacement of diacritics is solved using 1-letter rules:

; insert kashida into pattern with certain consonant combinations
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.1.1 =   { kashida = 500 }
; one diacritic mark: insert kashida behind it
kashida.plain.2.0 = { [يئهشسقفغعضصنمكلظطخحجثتب]()[ًٍَُِّ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.2.1 =   { kashida = 500 }
; two diacritic marks: insert kashida behind them
kashida.plain.3.0 = { [يئهشسقفغعضصنمكلظطخحجثتب][ًٍَُِّ]()[ًٍَُِّ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.3.1 =   { kashida = 500 }
kashida.plain.4.0 = { ()ل()[ًٍَُِّ]*[اأإآ] }
kashida.plain.4.1 =   { kashida = 0 }

But in the output, the kashida is displaced vertically:

[Image: case1]

Author

lueck commented Aug 14, 2023

... so the y-axis offset that results from lifting the diacritics should be reset before inserting the kashida (and maybe restored afterwards). I can't guarantee that this is a TeX-like formulation of a fix ...

lueck added a commit to lueck/babel that referenced this issue Aug 15, 2023

Diacritics (vowels) should keep their horizontal position above the
consonant they refer to. The old kashida insertion rule displaced them
horizontally, because kashidas were inserted between the diacritic and
the consonant they belong to. See latex3#257.

These rules work for one or two diacritics on a consonant.

lueck added a commit to lueck/babel that referenced this issue Aug 15, 2023

In order to make the change as small as possible, the reset is only
performed if the preceding node is a character and not a diacritic
mark. See latex3#257.

Author

lueck commented Aug 15, 2023

With the changes from my kashida-after-diacritics branch, I now get a result for my case 1 that I am happy with:

  • kashidas inserted after diacritics
  • kashidas at the baseline without y or x shift.

[Image: case1-fixed-fs]

If you would rather keep the kashida.plain transform as it is, I would suggest making this an alternative transform called kashida.after.diacritics.

Should I open a PR?

Author

lueck commented Aug 15, 2023

Hm, with other fonts I still get bad results where the kashida is shifted above the baseline for some character combinations.

Contributor

jbezos commented Aug 18, 2023

I'm somewhat busy right now. Allow me a week or so.

lueck added a commit to lueck/babel that referenced this issue Aug 21, 2023
…itics

When a kashida is inserted behind a diacritic, the diacritic's glyph
is not the node that should be used for creating the kashida's
node (by copying); the base character's node is. Copying the diacritic's
node results in x and y offsets that displace the kashida. By copying
the node of the character (on which the diacritic is placed) or by
copying a ghost node, we get correct x and y offsets that fit the
kashida smoothly to the character.

This feature can be enabled by
`\directlua{Babel.arabic.kashida_after_diacritics=true}` or disabled
by setting it to `false`.

This fixes issue latex3#257. But setting
`Babel.arabic.kashida_after_diacritics` from the name of the
transformation rules would be great.

Author

lueck commented Aug 21, 2023

@jbezos No problem! Sorry for mixing in #243 and writing such a cumulative issue. Also, my \case{4}... should be a separate issue, see #258.

I managed to get very fine results in the meantime.

In order to leave kashida.plain as it is, I made another branch where I added justification rules named kashida.afterdiacritics.plain. I also squashed my suggested changes to babel.dtx into one commit in order to make it more comprehensible.

By default, the logic of kashida insertion is unchanged. Only with \directlua{Babel.arabic.kashida_after_diacritics = true} is the creation of the kashida node changed, so that it can be placed correctly.
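The effect of the switch can be modelled with plain Lua tables standing in for glyph nodes. This is only a sketch of the idea, not the code in the branch: the fields `char`, `xoffset` and `yoffset` mirror the LuaTeX glyph node fields of the same names, while `kashida_template`, `make_kashida` and the mark range U+064B..U+0652 are hypothetical. Real code would copy nodes through the LuaTeX node library instead of building tables.

```lua
-- Sketch only: tables stand in for LuaTeX glyph nodes (which are userdata).
-- Assumption: only the harakat U+064B..U+0652 are treated as marks.
local function is_mark(cp) return cp >= 0x064B and cp <= 0x0652 end

-- Pick the glyph whose offsets the new kashida should inherit.
-- With after_diacritics set, walk back from position i past any marks
-- to the base character; otherwise use the glyph at i as it is.
function kashida_template(glyphs, i, after_diacritics)
  if after_diacritics then
    while i > 1 and is_mark(glyphs[i].char) do i = i - 1 end
  end
  return glyphs[i]
end

-- Build the kashida (U+0640) with the template's x and y offsets.
function make_kashida(glyphs, i, after_diacritics)
  local t = kashida_template(glyphs, i, after_diacritics)
  return { char = 0x0640, xoffset = t.xoffset, yoffset = t.yoffset }
end
```

With a base letter at offsets (0, 0) followed by a raised FATHA, the kashida inherits the FATHA's lift when the flag is off and sits with the base letter when it is on.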

This is the result I get with Babel.arabic.kashida_after_diacritics = false:

[Image: displaced]

And this is the result I get with Babel.arabic.kashida_after_diacritics = true. If you look very carefully, you'll notice that the kashida is not always at the same y offset, which is a feature of this font.

[Image: corrected]

New cases 5 and 6:

\case{5}{للشُهْبِ}{لـلــشُـهْبِ}{Kashida 3 displaced}

\case{6}{تَأَصَّلَ}{تَـأَصَّـلَ}{Kashidas with bad x and y offset.}

Contributor

jbezos commented Aug 29, 2023

With the changes from my kashida-after-diacritics branch,

Thanks. I’m reading the code and there is a point that can be misleading and should be clarified. FreeSerif itself doesn’t use the PUA; luaotfload does, mainly as a trick to access glyphs without a Unicode code point. Relying on what luaotfload does internally isn’t safe.
The problem is that in the justification step, the node list often contains these PUA codes, the exact meaning of which is often unknown. This is one of the technical issues/limitations I was talking about.
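As a rough illustration of the limitation, the sketch below flags Private Use Area code points in a list of character codes, using the three PUA ranges defined by Unicode. The function names are hypothetical and the input is a plain Lua array rather than a node list. It can only tell that a code is private; it cannot recover which glyph the code stands for.

```lua
-- Private Use Area ranges as defined by Unicode:
-- BMP U+E000..U+F8FF, plane 15 U+F0000..U+FFFFD, plane 16 U+100000..U+10FFFD.
function is_pua(cp)
  return (cp >= 0xE000 and cp <= 0xF8FF)
      or (cp >= 0xF0000 and cp <= 0xFFFFD)
      or (cp >= 0x100000 and cp <= 0x10FFFD)
end

-- Return the positions of PUA code points in an array of code points.
function find_pua(cps)
  local hits = {}
  for i, cp in ipairs(cps) do
    if is_pua(cp) then hits[#hits + 1] = i end
  end
  return hits
end
```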

Contributor

jbezos commented Aug 30, 2023

I’ll work on some of your ideas. The new transform can be useful in ‘plain’ fonts, not involving ligatures, but with the latter it’s still an unsolved issue, except by creating rules specific to a font. For the JALT table I devised a hack based on parsing twice some frequent cases, with the normal form and the elongated one, but it’s basically a proof of concept that can’t go very far (and it only works with Sakkal Majalla, and not quite – again diacritics is the problem).

The vertical position of the tashkil is (usually) not fixed; the font shifts them depending on the character. I was working on something similar to the JALT variants to catch the correct yoffset (and xoffset, actually) with kashida, but it seems some (many) fonts don’t bother to deal with kashida, and the tashkil end up clearly misplaced (the kasrah is usually too low).

Contributor

jbezos commented Sep 21, 2023

Your transforms are now available (in version 3.94), under the name kashida.base:

https://latex3.github.io/babel/news/whats-new-in-babel-3.94.html#new-transform-for-kashida
